Log-Concave Sampling
(unfinished draft)
Sinho Chewi
Preface i
2 Functional Inequalities 65
2.1 Overview of the Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.2 Proofs via Markov Semigroup Theory . . . . . . . . . . . . . . . . . . . . 68
2.3 Operations Preserving Functional Inequalities . . . . . . . . . . . . . . . 77
2.4 Concentration of Measure . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.5 Isoperimetric Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.6 Metric Measure Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.7 Discrete Space and Time . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
References 277
Index 291
Preface
with graduate-level analysis and probability; (2) since much of the theory of sampling
is inspired by ideas from optimization, we highly recommend that the reader be familiar
with the latter, as treated in, e.g., [Bub15; Nes18].
normalizing factor (or the partition function in statistical physics). Although 𝑍 (and thus
𝜋) are explicitly given in terms of the known function 𝑉 , a naı̈ve evaluation of 𝑍 as a
high-dimensional integral is intractable. Indeed, the usual approach of approximating an
integral by a sum requires discretizing space via a fine grid whose size scales exponentially
in the dimension 𝑑. Moreover, even if we had access to 𝑍 , it is still not clear how we could
use this to sample from 𝜋. Therefore, the focus here is to develop direct methods for the
sampling task which bypass the computation of 𝑍 .1
Not only do we want to develop fast algorithms, we also want to understand the
inherent complexity of the sampling task, which in turn allows us to identify optimal
algorithms. By complexity, we do not mean computational complexity, since proving
lower bounds in that context is out of reach (besides, we would have to spend too much
time worrying about the bit representation of 𝑉 ). Instead, following the well-trodden
path of optimization, we adopt a model in which we only have access to 𝑉 through
queries made to an oracle, and our notion of complexity is the number of queries made.
This is known as oracle complexity or query complexity; see [NY83, §1] for a detailed
discussion. We will usually consider a first-order oracle, i.e. given a point 𝑥 ∈ R𝑑 , the
oracle returns (𝑉 (𝑥), ∇𝑉 (𝑥)). Since 𝑉 is only well-defined up to an additive constant, we
can equivalently imagine that the oracle returns (𝑉 (𝑥) − 𝑉 (0), ∇𝑉 (𝑥)).
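As an illustration of the oracle model (a sketch of my own, not code from the text; the Gaussian potential below is a hypothetical choice), a first-order oracle can be modeled as an object that answers each query at 𝑥 with the pair (𝑉(𝑥) − 𝑉(0), ∇𝑉(𝑥)) while counting the queries, which is the complexity measure just described:

```python
class FirstOrderOracle:
    """Query-counting first-order oracle for a potential V : R^d -> R.

    Each query at a point x returns (V(x) - V(0), grad V(x)); the number
    of queries made is the oracle (query) complexity discussed in the text.
    """

    def __init__(self, V, grad_V):
        self.V, self.grad_V = V, grad_V
        self.num_queries = 0

    def query(self, x):
        self.num_queries += 1
        zero = [0.0] * len(x)
        return self.V(x) - self.V(zero), self.grad_V(x)

# Hypothetical example: the Gaussian potential V(x) = ||x||^2 / 2.
V = lambda x: 0.5 * sum(xi**2 for xi in x)
grad_V = lambda x: list(x)

oracle = FirstOrderOracle(V, grad_V)
val, grad = oracle.query([1.0, 2.0])   # val = 2.5, grad = [1.0, 2.0]
```

An algorithm interacting with this oracle never sees 𝑉 itself, only the returned values, which is exactly the information model under which lower bounds are stated.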
Before considering the problem further, here are some important applications.
¹In fact, it goes the other way around: the state-of-the-art methods for approximately computing 𝑍 are based on the sampling methods we develop here.
\[
p_{\vartheta \mid X}(\theta \mid X) \propto \frac{p_\vartheta(\theta)\, p_{X \mid \vartheta}(X \mid \theta)}{\int_\Theta p_\vartheta(\mathrm d\theta')\, p_{X \mid \vartheta}(X \mid \theta')}\,,
\]
which encodes our new beliefs about 𝜗 after seeing the data.
Typically, we have access to the functional forms of 𝑝𝜗 and {𝑝𝑋 |𝜗 (· | 𝜃 )}𝜃 ∈Θ , so we
can evaluate these densities and compute gradients. However, the denominator
of 𝑝𝜗 |𝑋 is precisely the normalizing constant described previously and cannot be
naı̈vely evaluated. Moreover, even if we had the functional form of 𝑝𝜗 |𝑋 (· | 𝑋 ), we
would still not be able to compute expectations E[𝜑 (𝜗) | 𝑋 ] of test functions w.r.t.
the posterior without evaluating another high-dimensional integral. Instead, the
sampling methods we discuss in this book can output random variables 𝜗 1, . . . , 𝜗𝑛
whose distributions are approximately 𝑝𝜗 |𝑋 (· | 𝑋 ), and the expectation can be
approximated to arbitrary accuracy via the averages $n^{-1} \sum_{i=1}^{n} \varphi(\vartheta_i)$.
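To make the last step concrete, here is a minimal sketch (my illustration; since no sampler has been introduced yet, exact draws from a stand-in normal(1, 1) "posterior" play the role of the approximate samples 𝜗₁, …, 𝜗𝑛) of estimating a posterior expectation by an empirical average:

```python
import random

random.seed(0)

# Stand-in for approximate posterior samples: pretend the posterior is
# normal(1, 1), so that E[phi(theta)] with phi(t) = t^2 equals 1^2 + 1 = 2.
n = 200_000
samples = [random.gauss(1.0, 1.0) for _ in range(n)]

phi = lambda t: t**2
estimate = sum(phi(s) for s in samples) / n   # n^{-1} sum_i phi(theta_i)
```

By the law of large numbers, the error of the average decays like $n^{-1/2}$, so the accuracy is limited only by the number (and quality) of the samples.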
over states is the Boltzmann (or Gibbs) distribution whose density is proportional
to exp(−𝑉 /𝑇 ) (where 𝑇 is the temperature of the system). Naturally, sampling
provides a method for probing properties of the equilibrium distribution. More
subtly, the mixing time of specific sampling algorithms also provides information
about the system such as metastability phenomena; we revisit this in Chapter 11.
Due to this physical interpretation, we will often refer to 𝑉 as the potential energy.
With a pure gradient flow d𝑍𝑡 = −∇𝑉 (𝑍𝑡 ) d𝑡, we would expect the dynamics to converge to
stationary points of 𝑉 . The Brownian motion ensures that we fully explore the distribution
𝜋, as is required in sampling. Under mild conditions, the unique stationary distribution
of the Langevin diffusion is indeed 𝜋 ∝ exp(−𝑉 ), which makes this diffusion a good
candidate upon which to base a sampling algorithm.
Since the Langevin diffusion is “a gradient flow + noise”, it is no wonder that researchers
have drawn parallels between this diffusion and the gradient flow from optimization.
However, the connection actually lies much deeper than this superficial observation would
suggest. There is a natural geometry on the space of probability measures with finite
second moment, P2 (R𝑑 ), namely the 2-Wasserstein distance 𝑊2 from the theory of optimal
transport. The space (P2 (R𝑑 ),𝑊2 ) turns out to be much richer than a metric space; in fact,
it is almost a Riemannian manifold. In turn, the Riemannian structure allows us to define
gradient flows on this space. The punchline here is that if 𝜋𝑡 denotes the law of 𝑍𝑡 , then
the curve of measures 𝑡 ↦→ 𝜋𝑡 is the gradient flow of the Kullback–Leibler (KL) divergence
KL(· k 𝜋) with respect to the 𝑊2 geometry. Hence, at the level of the trajectory (𝑍𝑡 )𝑡 ≥0 ,
the Langevin diffusion is a noisy gradient flow, but at the level of measures (𝜋𝑡 )𝑡 ≥0 , it is
precisely a gradient flow! This remarkable connection was introduced in the seminal work
of Jordan, Kinderlehrer, and Otto [JKO98].
This perspective suggests that we can study the convergence of the Langevin diffusion
using tools from optimization. For example, a standard assumption in the optimization
literature which allows for fast rates of convergence is that of strong convexity of the
objective function. Hence, we can ask under what conditions the functional KL(· k 𝜋)
on the space of measures is strongly convex along 𝑊2 geodesics. Quite pleasingly, this
is equivalent to the (Euclidean) strong convexity of the potential 𝑉 . Consequently, the
assumption of strong convexity of 𝑉 , which is natural in optimization for studying gradient
flows, turns out to be natural in the sampling context as well.
Much of this book is devoted to the case when 𝑉 is strongly convex; we refer to 𝜋
as being strongly log-concave. Besides its naturality and simplicity, it is also a practical
assumption. For example, in the application to Bayesian statistics, the Bernstein–von
Mises theorem states that as the number of data points tends to infinity, the posterior
distribution closely resembles a Gaussian distribution and is thus (almost) strongly log-
concave, a fact which has already been exploited to give sampling guarantees in the
context of Bayesian inverse problems (see, e.g., [NW20]). However, some of the results
also apply to restricted classes of non-log-concave measures, and in Chapter 11 we will
see what can be said about non-log-concave sampling in general.
Before using the Langevin diffusion for sampling, however, it is first necessary to
discretize the process in time. The simplest discretization, known as the Euler–Maruyama
discretization, proceeds by fixing a step size ℎ > 0 and following the iteration
\[
X_{(k+1)h} := X_{kh} - h\, \nabla V(X_{kh}) + \sqrt 2\, (B_{(k+1)h} - B_{kh})\,.
\]
Since the Brownian increment 𝐵 (𝑘+1)ℎ − 𝐵𝑘ℎ has the normal(0, ℎ𝐼𝑑 ) distribution, this
iteration can be easily implemented once we have access to a gradient oracle for 𝑉 and
the ability to draw standard Gaussian variables. This iteration is commonly known as the
Langevin Monte Carlo (LMC) algorithm, or the unadjusted Langevin algorithm (ULA); in
this book, we stick to the former acronym.
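The iteration is short enough to state in code. The sketch below is my own illustration, not code from the book; the standard Gaussian target with ∇𝑉(𝑥) = 𝑥 is a stand-in choice, used here so that the stationary behavior can be checked against the known answer 𝜋 = normal(0, 𝐼_𝑑):

```python
import math
import random

random.seed(1)

def lmc(grad_V, x0, h, num_iters):
    """Langevin Monte Carlo: x <- x - h grad_V(x) + sqrt(2h) xi, xi ~ normal(0, I)."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad_V(x)
        x = [xi - h * gi + math.sqrt(2 * h) * random.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x

# Stand-in target: pi = normal(0, I_d), i.e. V(x) = ||x||^2 / 2, grad_V(x) = x.
grad_V = lambda x: x
d, h = 2, 0.01

# Run many independent chains; the first coordinate of the final iterates
# should be approximately normal(0, 1), up to O(h) discretization bias.
finals = [lmc(grad_V, [5.0] * d, h, 1500) for _ in range(1000)]
mean_est = sum(p[0] for p in finals) / len(finals)
var_est = sum(p[0] ** 2 for p in finals) / len(finals)
```

Note that even in this simple example the empirical variance is biased away from 1 by an amount of order ℎ; this is the discretization error whose quantitative analysis occupies much of the book.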
The LMC algorithm is the starting point of our study. As a result of research in the
last decade, we now have the following guarantee. For any of the common divergences $\mathsf d$ between probability measures, e.g., $\mathsf d(\mu, \pi) = W_2(\mu, \pi)$ or $\mathsf d(\mu, \pi) = \sqrt{\mathrm{KL}(\mu \,\|\, \pi)}$, and
with an appropriate choice of initialization and step size, the law 𝜇 𝑁ℎ of the 𝑁 -th iterate
of LMC satisfies d(𝜇 𝑁ℎ , 𝜋) ≤ 𝜀 with a number of iterations 𝑁 which is polynomial
in the problem parameters (the dimension $d$, the condition number $\kappa$ of $V$, and the inverse accuracy $1/\varepsilon$). For example, when $\mathsf d = \sqrt{\mathrm{KL}}$, the state-of-the-art guarantee reads $N = \widetilde O(\kappa d/\varepsilon^2)$. Whereas the convergence of the continuous-time diffusion is classical and
typically proven via abstract calculus, the quantitative non-asymptotic convergence of the
discretized algorithm necessitates the development of a new toolbox of analysis techniques.
A primary goal of this book is to make this toolbox more accessible to researchers who are
not yet acquainted with the field.
Beyond the standard LMC algorithm, there is now a rich variety of algorithms in
the sampling literature. Some algorithms are directly inspired by other optimization
algorithms (e.g., mirror descent), whereas other algorithms have their root in the classical
theory of Markov processes (e.g., the use of a Metropolis–Hastings filter). We will also
explore some of these more sophisticated algorithms in detail, as in many cases they
represent (provably) substantial improvements over standard LMC.
Finally, although we began this introduction by discussing the goal of understanding
the complexity of sampling, in fact the complexity is not yet well-understood. The issue
here is that there are currently very few lower bounds on the complexity of sampling. This
is in contrast with the field of optimization, in which oracle complexity lower bounds have
in most situations identified nearly optimal algorithms for optimizing various function
classes. In Chapter 9, we will explain the current progress towards achieving this goal for
sampling, but much work remains to be done.
For example, here is the precise statement of a fundamental open question about the
complexity of sampling.
extends to even recent theoretical works on log-concave sampling, for which we have
omitted any discussion of sampling from convex bodies or polytopes. Although these
works constitute fundamental developments in the field, here we chose to limit our focus
to the part of the literature which is more strongly inspired by optimization algorithms.
Naturally, the other topics we explore in the book are far from comprehensive.
Notational Conventions
The symbols ∧ and ∨ mean “minimum” and “maximum” respectively. We write 𝑎 . 𝑏 or
𝑎 = 𝑂 (𝑏) to mean that 𝑎 ≤ 𝐶𝑏 for a universal constant 𝐶 > 0. Similarly, 𝑎 & 𝑏 or 𝑎 = Ω(𝑏)
mean that 𝑎 ≥ 𝑐𝑏 for a universal constant 𝑐 > 0, and 𝑎 ≍ 𝑏 or 𝑎 = Θ(𝑏) mean that both
𝑎 . 𝑏 and 𝑎 & 𝑏.
For a function 𝑓 : R𝑑 → R, we write 𝜕𝑖 𝑓 to denote the 𝑖-th partial derivative of 𝑓 . The
gradient ∇𝑓 is the vector of partial derivatives (𝜕1 𝑓 , . . . , 𝜕𝑑 𝑓 ), and the Hessian ∇2 𝑓 is the
matrix (𝜕𝑖 𝜕 𝑗 𝑓 )𝑖,𝑗 ∈[𝑑] . For a vector field 𝑣 : R𝑑 → R𝑑 , we also use the notation ∇𝑣 to denote
the Jacobian matrix of 𝑣. The divergence of a vector field 𝑣 is $\operatorname{div} v = \nabla \cdot v = \sum_{i=1}^d \partial_i v_i$, and the Laplacian of 𝑓 is $\Delta f = \operatorname{tr} \nabla^2 f = \sum_{i=1}^d \partial_i^2 f$.
Acknowledgements
This book started off as a series of lecture notes during my visit to New York University
in Spring 2022, and owes much to the audience and hospitality there.
Before its inception, however, the idea of writing a book was first conceived during
the Simons Institute for the Theory of Computing program on Geometric Methods in
Sampling and Optimization (GMOS) in Fall 2021. There, I met a lot of my long-term
collaborators and learned much of what I currently know about sampling. Overall, I found
the program to be incredibly intellectually stimulating and I am grateful.
I am indebted to collaborators with whom I have had many fruitful conversations,
including (but not limited to): Kwangjun Ahn, Francis Bach, Krishnakumar Balasubrama-
nian, Silvère Bonnabel, Joan Bruna, Sébastien Bubeck, Sitan Chen, Yongxin Chen, Yuansi
Chen, Xiang Cheng, Jaume de Dios Pont, Alain Durmus, Ronen Eldan, Murat Erdogdu,
Patrik Gerber, Ramon van Handel, Ye He, Marc Lambert, Thibaut Le Gouic, Holden Lee,
Yin Tat Lee, Jerry Li, Mufan (Bill) Li, Yuanzhi Li, Chen Lu, Jianfeng Lu, Yian Ma, Tyler
Maunu, Shyam Narayanan, Philippe Rigollet, Lionel Riou-Durand, Adil Salim, Ruoqi Shen,
Austin Stromme, Kevin Tian, Jure Vogrinc, Lihan Wang, Andre Wibisono, Anru Zhang,
and Matthew Zhang.
CHAPTER 1

The Langevin Diffusion in Continuous Time
In Section 1.1.4, we discuss the rather technical construction of the Itô integral. For
the remainder of the book, the details of the construction are less important than the
calculation rules that follow. The reader who is unfamiliar with stochastic calculus is
encouraged to skip Section 1.1.4 upon first reading.
1. 𝐵 0 = 0.
2. (independence of increments) For all 0 < 𝑡 1 < · · · < 𝑡𝑘 , the random variables
(𝐵𝑡1 , 𝐵𝑡2 − 𝐵𝑡1 , . . . , 𝐵𝑡𝑘 − 𝐵𝑡𝑘−1 ) are mutually independent.
3. (stationary increments) For all 0 ≤ 𝑠 < 𝑡, 𝐵𝑡 − 𝐵𝑠 ∼ normal(0, 𝑡 − 𝑠).
Brownian motion was originally introduced over a century ago as a model for the
jittery path of a particle which is constantly colliding with surrounding molecules. Since
its inception, Brownian motion has been used to model the flow of heat, to price options
at the financial market, to solve partial differential equations, to tease out the geometry of
manifolds, and of course, to sample from probability distributions. It is perhaps not clear
at first sight that such a process even exists,1 but it would take us too far afield to give a
construction here.
Instead, our goal is to compute integrals involving Brownian motion: given a stochastic
process (𝜂𝑡)𝑡≥0, how do we make sense of an expression such as $\int_0^T \eta_t \,\mathrm dB_t$? Once we have
stochastic integration in hand, we can then formulate and solve stochastic differential
equations. The solution to such an equation is a diffusion process, no less jittery than the
Brownian motion which drives it, and yet in the right hands it becomes an incredible tool
for solving a plethora of disparate problems.
The main technical difficulty in defining the stochastic integral is that Brownian motion is an irregular process: for small 𝑡 > 0, by definition 𝐵𝑡 ∼ normal(0, 𝑡), which means that |𝐵𝑡| is typically of size $\sqrt t$. In particular, this prevents Brownian motion from being
¹Of course, this did not stop Einstein from using it to probe the microscopic structure of matter.
1.1. A PRIMER ON STOCHASTIC CALCULUS 5
Defining the Itô integral at a single time 𝑇 . Suppose first that (𝜂𝑡 )𝑡 ≥0 is a process of
the form
\[
\eta_t = \sum_{i=0}^{k-1} H_i\, \mathbf 1\{t \in (t_i, t_{i+1}]\} \tag{1.1.2}
\]
for some 0 ≤ 𝑡0 < 𝑡1 < · · · < 𝑡𝑘, where each 𝐻𝑖 is bounded and ℱ𝑡𝑖-measurable. We call 𝜂 an
elementary process. In this case, perhaps the only reasonable definition of the stochastic
integral is to take
\[
\int_0^T \eta_t \,\mathrm dB_t := \sum_{i=0}^{k-1} H_i\, (B_{t_{i+1} \wedge T} - B_{t_i \wedge T})\,. \tag{1.1.3}
\]
This is indeed what we shall do, but for the moment we will refrain from using the integral
symbol and write this as I[0,𝑇 ] (𝜂) to avoid confusion.
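Definition (1.1.3) is easy to simulate. The sketch below is my own illustration: it takes deterministic values 𝐻ᵢ (trivially bounded and ℱ_{𝑡ᵢ}-measurable) and checks by Monte Carlo that E[I_{[0,𝑇]}(𝜂)²] matches Σᵢ 𝐻ᵢ² (𝑡ᵢ₊₁ − 𝑡ᵢ), anticipating the variance computation carried out next:

```python
import random

random.seed(2)

# Elementary integrand eta_t = H_i on (t_i, t_{i+1}], with deterministic H_i.
t = [0.0, 0.5, 1.0, 1.5]
H = [1.0, 2.0, 3.0]
T = 1.5

def ito_integral_sample():
    """One sample of I_{[0,T]}(eta) = sum_i H_i (B_{t_{i+1} ^ T} - B_{t_i ^ T})."""
    total = 0.0
    for i in range(len(H)):
        dt = min(t[i + 1], T) - min(t[i], T)
        # Brownian increments over disjoint intervals are independent normal(0, dt).
        total += H[i] * random.gauss(0.0, dt ** 0.5)
    return total

n = 100_000
second_moment = sum(ito_integral_sample() ** 2 for _ in range(n)) / n
# Predicted by the isometry: sum_i H_i^2 (t_{i+1} - t_i) = 0.5 * (1 + 4 + 9) = 7.
```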
We would like to extend this definition to more general processes, but before doing so
we record two key properties of the stochastic integral. The first is that 𝑡 ↦→ I[0,𝑡] (𝜂) is a
continuous martingale, i.e., it is continuous and satisfies the following definition.
Indeed, we deduce that 𝑡 ↦→ I[0,𝑡] (𝜂) is a martingale from the fact that 𝐻𝑖 is ℱ𝑡𝑖 -
measurable for each 𝑖, and because (𝐵𝑡 )𝑡 ≥0 is a martingale.
The second key property is that we can compute the variance:
\[
\mathbb E[\mathcal I_{[0,T]}(\eta)^2] = \mathbb E\Big[\Big(\sum_{i=0}^{k-1} H_i\, (B_{t_{i+1} \wedge T} - B_{t_i \wedge T})\Big)^2\Big] = \sum_{i=0}^{k-1} \mathbb E[\lvert H_i\, (B_{t_{i+1} \wedge T} - B_{t_i \wedge T})\rvert^2] \tag{1.1.5}
\]
\[
= \sum_{i=0}^{k-1} \mathbb E[H_i^2]\, \{(t_{i+1} \wedge T) - (t_i \wedge T)\} = \mathbb E \int_0^T \eta_t^2 \,\mathrm dt\,. \tag{1.1.6}
\]
Here, we used the basic properties listed in the definition of Brownian motion, such as
independence of increments. This equation shows that if $\mathbb P_T := \mathbb P \otimes \mathfrak m|_{[0,T]}$, where $\mathfrak m|_{[0,T]}$
is the Lebesgue measure on [0,𝑇 ], then the mapping 𝜂 ↦→ I[0,𝑇 ] (𝜂) is an isometry from
𝐿 2 (P𝑇 ) to 𝐿 2 (P). We use this isometry to extend the definition of the stochastic integral
as follows.
For a more general process (𝜂𝑡 )𝑡 ≥0 , assume that it is progressive2 and satisfies the
integrability condition
\[
\|\eta\|_{L^2(\mathbb P_T)}^2 = \mathbb E \int_0^T \eta_t^2 \,\mathrm dt < \infty\,.
\]
Defining the Itô integral as a stochastic process. Although the procedure above
successfully defines I[0,𝑡] (𝜂) for a fixed time 𝑡 > 0, there is no guarantee of coherence
between different times 𝑡. The trouble arises because I[0,𝑡] (𝜂) is defined as a limit, but
this limit is only well-specified up to an event of measure zero, and these measure zero
events for different times 𝑡 might conceivably accumulate into something more. This
is undesirable because the true power of stochastic calculus comes from viewing the
stochastic integral as a time-indexed stochastic process in its own right.
The key insight is to go back to the approximating sequence {(𝜂𝑡(𝑘) )𝑡 ≥0 : 𝑘 ∈ N}. For
each 𝑘, the Itô integral is defined as an entire process 𝑡 ↦→ I[0,𝑡] (𝜂 (𝑘) ) via (1.1.27), and
moreover this process is a continuous martingale. We can then apply powerful results
on martingale convergence, which are developed in Section 1.1.4, in order to prove the
following theorem.
Theorem 1.1.7. Suppose that (𝜂𝑡)𝑡≥0 is progressive and satisfies $\mathbb E \int_0^T \eta_t^2 \,\mathrm dt < \infty$. Then,
²The process (𝜂𝑡)𝑡≥0 is progressive if for all 𝑇 ≥ 0, the mapping (𝜔, 𝑡) ↦→ 𝜂𝑡(𝜔) is measurable w.r.t. ℱ𝑇 ⊗ ℬ[0,𝑇], where ℬ[0,𝑇] is the Borel 𝜎-algebra on [0,𝑇].
and satisfies
\[
\mathbb E\Big[\Big(\int_0^t \eta_s \,\mathrm dB_s\Big)^2\Big] = \mathbb E \int_0^t \eta_s^2 \,\mathrm ds\,, \qquad \text{for all } t \in [0,T]\,. \tag{1.1.8}
\]
Extending the definition via localization. There is one final step which is traditionally
taken, namely to expand the class of allowable integrands to progressive processes 𝜂 with
\[
\int_0^T \eta_s^2 \,\mathrm ds < \infty \quad \text{almost surely}\,. \tag{1.1.9}
\]
Note that this condition is weaker than the condition $\mathbb E \int_0^T \eta_s^2 \,\mathrm ds < \infty$. Such an extension
Definition 1.1.10. A stopping time 𝜏 is a random variable such that for each 𝑡 ≥ 0,
the event {𝜏 ≤ 𝑡 } is ℱ𝑡 -measurable.
1. for all 𝑛 ∈ N, (𝜂𝑡 1{𝑡 ≤ 𝜏𝑛})𝑡≥0 has finite $\|\cdot\|_{L^2(\mathbb P_T)}$ norm, and
2. 𝜏𝑛 → 𝑇 almost surely.
The good news is that localizing sequences are easy to find, and the following proposi-
tion barely needs a proof (and so we omit it).
The idea now is simple: for each progressive process 𝜂 satisfying (1.1.9), let (𝜏𝑛 )𝑛∈N
be a localizing sequence for 𝜂 on [0,𝑇 ]. For each 𝑛 ∈ N, by the definition of a localizing
sequence, we can apply our existing definition of the Itô integral, which gives us a
continuous martingale
\[
t \mapsto \int_0^t \eta_s\, \mathbf 1\{s \le \tau_n\} \,\mathrm dB_s\,. \tag{1.1.13}
\]
Then, we can define the Itô integral of 𝜂 to be a limit of the processes (1.1.13). The details
are straightforward, and omitted.
We can also define an analogue of martingales using localizing sequences; these are
almost martingales, but lack the required integrability.
Proposition 1.1.15. If 𝜂 is a progressive process satisfying (1.1.9), then the Itô integral $t \mapsto \int_0^t \eta_s \,\mathrm dB_s$ is a continuous local martingale.
Looking forward. The construction of the Itô integral may seem quite abstract; indeed,
we are sorely lacking in examples. At this juncture, it is common to work out simple
exercises such as computing $\int_0^t B_s \,\mathrm dB_s$, and while this is pedagogically natural it is also
liable to mislead the reader into thinking that the main use of Itô integration is to solve
synthetic problems with no apparent purpose. As counterintuitive as it may seem, our
solution to the heavy amount of abstraction will be more abstraction. In the next section,
we will develop the single most important computation rule in stochastic calculus (along
with the Itô isometry (1.1.8)), called Itô’s formula, after which we will hardly need to
return to the definition of a stochastic integral ever again. And even Itô’s formula will be
abstracted out into the language of Markov semigroups in Section 1.2. The upshot is that
we introduced the Itô integral because it is the foundation of our field, but most of what
we have developed thus far is not necessary for the remainder of the book.
where (𝑏𝑡 )𝑡 ≥0 takes values in R𝑑 , (𝜎𝑡 )𝑡 ≥0 takes values in R𝑑×𝑁 , and (𝐵𝑡 )𝑡 ≥0 is a standard
Brownian motion in R𝑁 .
Implicit in the above definition is that the process should be well-defined: the coeffi-
cients (𝑏𝑡 )𝑡 ≥0 and (𝜎𝑡 )𝑡 ≥0 should be progressive processes for which the integrals exist.
Also, the random variable 𝑋 0 should be ℱ0 -measurable, in which case the process (𝑋𝑡 )𝑡 ≥0
is also progressive.
We refer to (𝑏𝑡 )𝑡 ≥0 as the drift coefficient and (𝜎𝑡 )𝑡 ≥0 as the diffusion coefficient.
When the drift coefficient is zero, then (𝑋𝑡 )𝑡 ≥0 is simply an Itô integral, and thus a
continuous local martingale (Proposition 1.1.15). Otherwise, for a non-zero drift coefficient,
the process (𝑋𝑡 )𝑡 ≥0 is no longer necessarily a local martingale. As a shorthand, we often
write the Itô process in differential form:
Our goal is to understand how the Itô process transforms when we compose it with a
smooth function 𝑓 : R𝑑 → R. This leads to Itô’s formula, which is the bread and butter of
stochastic calculus computations.
Although the notation (1.1.17) is informal, it conveys the main intuition. For ℎ > 0 small, we can approximate $X_{t+h} \approx X_t + h\, b_t + \sqrt h\, \sigma_t \xi$, where 𝜉 ∼ normal(0, 𝐼𝑑). Note that the $\sqrt h$ scaling comes from the fact that the Brownian increment 𝐵𝑡+ℎ − 𝐵𝑡 has the normal(0, ℎ𝐼𝑑) distribution. Now suppose that 𝑓 : R𝑑 → R is twice continuously
differentiable. Normally, to compute 𝑓 (𝑋𝑡+ℎ ) − 𝑓 (𝑋𝑡 ) up to order 𝑜 (ℎ), a first-order Taylor
expansion of 𝑓 suffices, but in stochastic calculus this would miss important terms arising
from the Brownian motion: indeed, second-order terms in 𝐵𝑡+ℎ − 𝐵𝑡 are of order ℎ and
hence not negligible.
Therefore, we carry out the Taylor expansion to an extra term:
\[
f(X_{t+h}) - f(X_t) \approx \langle \nabla f(X_t),\, h\, b_t + \sqrt h\, \sigma_t \xi\rangle + \tfrac 12\, \langle h\, b_t + \sqrt h\, \sigma_t \xi,\, \nabla^2 f(X_t)\, (h\, b_t + \sqrt h\, \sigma_t \xi)\rangle
\]
\[
= h\, \big\{\langle \nabla f(X_t), b_t\rangle + \tfrac 12\, \langle \sigma_t \xi, \nabla^2 f(X_t)\, \sigma_t \xi\rangle\big\} + \sqrt h\, \langle \sigma_t^{\mathsf T} \nabla f(X_t), \xi\rangle + o(h)\,.
\]
This expression suggests that (𝑓(𝑋𝑡))𝑡≥0 is also an Itô process. The third term, which is of order $\sqrt h$, turns into an Itô integral once integrated. Perhaps the most interesting term is the second term, $\frac h2\, \langle \sigma_t \sigma_t^{\mathsf T} \nabla^2 f(X_t),\, \xi\xi^{\mathsf T}\rangle$, which is a genuinely new feature of stochastic calculus. If we sum up many of these increments, we end up with an expression like $\frac 12 \sum_{k=0}^{K} \langle \sigma_{t+kh} \sigma_{t+kh}^{\mathsf T}\, \nabla^2 f(X_{t+kh}),\, h\, \xi_k \xi_k^{\mathsf T}\rangle$. If we replace each $h\, \xi_k \xi_k^{\mathsf T}$ by its expectation $h\, I_d$ (which must be carefully justified; see the calculation of quadratic variation in Section 3.1), then this resembles a Riemann sum, which converges to the integral of $\frac 12\, \langle \nabla^2 f(X_t),\, \sigma_t \sigma_t^{\mathsf T}\rangle$. This is formalized in the following theorem.
This is formalized in the following theorem.
Theorem 1.1.18 (Itô’s formula). Let (𝑋𝑡 )𝑡 ≥0 be an Itô process, d𝑋𝑡 = 𝑏𝑡 d𝑡 + 𝜎𝑡 d𝐵𝑡 , and
let 𝑓 ∈ C 2 (R𝑑 ). Then, (𝑓 (𝑋𝑡 ))𝑡 ≥0 is also an Itô process which satisfies, for 𝑡 ≥ 0:
\[
f(X_t) - f(X_0) = \int_0^t \big\{\langle \nabla f(X_s), b_s\rangle + \tfrac 12\, \langle \nabla^2 f(X_s), \sigma_s \sigma_s^{\mathsf T}\rangle\big\} \,\mathrm ds + \int_0^t \langle \sigma_s^{\mathsf T} \nabla f(X_s), \mathrm dB_s\rangle\,.
\]
We omit the proof, since the bulk of the intuition is carried in the informal Taylor
series argument described above. Observe that since Itô integrals are (under appropriate
integrability conditions) continuous martingales, the expectation of the last term in Itô’s
formula is typically zero. Therefore,3
\[
\mathbb E f(X_t) - \mathbb E f(X_0) = \mathbb E \int_0^t \big\{\langle \nabla f(X_s), b_s\rangle + \tfrac 12\, \langle \nabla^2 f(X_s), \sigma_s \sigma_s^{\mathsf T}\rangle\big\} \,\mathrm ds\,, \tag{1.1.19}
\]
or in differential form,
\[
\partial_t\, \mathbb E f(X_t) = \mathbb E\big[\langle \nabla f(X_t), b_t\rangle + \tfrac 12\, \langle \nabla^2 f(X_t), \sigma_t \sigma_t^{\mathsf T}\rangle\big]\,.
\]
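As a numerical illustration of this identity (my example, not the text's): for the Ornstein–Uhlenbeck process d𝑋𝑡 = −𝑋𝑡 d𝑡 + √2 d𝐵𝑡 and 𝑓(𝑥) = 𝑥², the formula gives ∂𝑡 E[𝑋𝑡²] = −2 E[𝑋𝑡²] + 2, whose solution is E[𝑋𝑡²] = 1 + (E[𝑋₀²] − 1) e^{−2𝑡}. A crude Euler–Maruyama simulation reproduces this:

```python
import math
import random

random.seed(3)

# Euler-Maruyama for dX = -X dt + sqrt(2) dB, started at X_0 = 2.
h, steps, n_paths = 0.005, 200, 10_000   # simulate up to t = 1
t_end = h * steps

xs = [2.0] * n_paths
for _ in range(steps):
    xs = [x - h * x + math.sqrt(2 * h) * random.gauss(0.0, 1.0) for x in xs]

m_emp = sum(x * x for x in xs) / n_paths
# Ito's formula in expectation: m'(t) = -2 m(t) + 2 with m(0) = 4, hence
m_exact = 1.0 + 3.0 * math.exp(-2.0 * t_end)
```

The empirical second moment matches the ODE prediction up to Monte Carlo noise and O(ℎ) discretization bias.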
Itô’s formula can also be extended to time-dependent functions via
\[
f(t, X_t) - f(0, X_0) = \int_0^t \big\{\partial_s f(s, X_s) + \langle \nabla f(s, X_s), b_s\rangle + \tfrac 12\, \langle \nabla^2 f(s, X_s), \sigma_s \sigma_s^{\mathsf T}\rangle\big\} \,\mathrm ds + \int_0^t \langle \sigma_s^{\mathsf T} \nabla f(s, X_s), \mathrm dB_s\rangle\,.
\]
³Even when the Itô integral is only a local martingale, one can usually justify the formula (1.1.19) anyway.
Suppose we are given a complete filtered probability space (Ω, ℱ, (ℱ𝑡 )𝑡 ≥0, P) which
supports a standard 𝑁 -dimensional adapted Brownian motion 𝐵. A solution to the SDE
is an adapted R𝑑 -valued process 𝑋 such that
\[
X_t = X_0 + \int_0^t b(s, X_s) \,\mathrm ds + \int_0^t \sigma(s, X_s) \,\mathrm dB_s\,.
\]
The question we address in this section is under what conditions there exists a unique
solution to the SDE. This is a question of a technical nature, but the answer is instructive
because the proof introduces standard arguments that recur frequently in stochastic
calculus. The main result is that if the coefficients 𝑏, 𝜎 are Lipschitz in space uniformly in
time, then the SDE admits a unique solution.
Before proceeding, we need the following lemma, which is used throughout the book.
Lemma 1.1.20 (Grönwall’s lemma). Let 𝑇 > 0 and let 𝑔 : [0,𝑇 ] → [0, ∞) be bounded
and measurable. Assume there exist $C_1, C_2 \ge 0$ such that
\[
g(t) \le C_1 + C_2 \int_0^t g\,, \qquad \forall\, t \in [0,T]\,.
\]
Then, for all $n \in \mathbb N$,
\[
g(t) \le C_1 \sum_{k=0}^{n} \frac{(C_2 t)^k}{k!} + C_2^{n+1} \int_0^t \cdots \int_0^{s_n} g(s_{n+1}) \,\mathrm ds_{n+1} \cdots \mathrm ds_1\,,
\]
and consequently $g(t) \le C_1 \exp(C_2 t)$ for all $t \in [0,T]$.
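A discretized sanity check of the lemma (my illustration): taking the extremal 𝑔 that satisfies the hypothesis with equality and discretizing the integral by a left-endpoint Riemann sum, the recursion stays below the exponential bound 𝐶₁ e^{𝐶₂𝑡} at every grid point and approaches it as the grid is refined:

```python
import math

# Extremal case of Gronwall's hypothesis: g(t) = C1 + C2 * int_0^t g,
# discretized with a left-endpoint Riemann sum for the integral.
C1, C2, T, n = 1.0, 3.0, 2.0, 10_000
h = T / n

g, running_int = [C1], 0.0
for _ in range(n):
    running_int += h * g[-1]
    g.append(C1 + C2 * running_int)

# Every grid value sits below the exponential bound C1 * exp(C2 * t_j),
# and the final value approaches C1 * exp(C2 * T) as h -> 0.
within_bound = all(g[j] <= C1 * math.exp(C2 * j * h) + 1e-9 for j in range(n + 1))
```

Indeed, the recursion gives $g_j = C_1 (1 + C_2 h)^j \le C_1 e^{C_2 h j}$, which is exactly the discrete form of the lemma's conclusion.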
Theorem 1.1.21 (existence and uniqueness of SDE solutions). Assume that 𝑏 and 𝜎 are continuous, and there exists 𝐶 > 0 such that for all 𝑡 ≥ 0 and 𝑥, 𝑦 ∈ R𝑑,
\[
\|b(t,x) - b(t,y)\| \vee \|\sigma(t,x) - \sigma(t,y)\|_{\mathrm{HS}} \le C\, \|x - y\|\,.
\]
Then, for any complete filtered probability space (Ω, ℱ, (ℱ𝑡)𝑡≥0, P) and 𝑥 ∈ R𝑑, there exists a unique solution 𝑋 of the SDE with 𝑋0 = 𝑥.
Proof. Uniqueness. Fix a time 𝑇 > 0 and suppose there exist two solutions $(X_t)_{t \in [0,T]}$ and $(\widetilde X_t)_{t \in [0,T]}$ of the SDE on [0,𝑇] with $X_0 = \widetilde X_0$. For 𝑡 ≥ 0, we compute the difference between $X_t$ and $\widetilde X_t$ using the Itô isometry (1.1.8), the Cauchy–Schwarz inequality, and the Lipschitz assumption:
\[
\mathbb E[\|X_t - \widetilde X_t\|^2] \le 2\, \mathbb E\Big[\Big\|\int_0^t \{b(s, X_s) - b(s, \widetilde X_s)\} \,\mathrm ds\Big\|^2 + \Big\|\int_0^t \{\sigma(s, X_s) - \sigma(s, \widetilde X_s)\} \,\mathrm dB_s\Big\|^2\Big]
\]
\[
\le 2\, \mathbb E\Big[T \int_0^t \|b(s, X_s) - b(s, \widetilde X_s)\|^2 \,\mathrm ds + \int_0^t \|\sigma(s, X_s) - \sigma(s, \widetilde X_s)\|_{\mathrm{HS}}^2 \,\mathrm ds\Big]
\]
\[
\le 2C^2\, (1 + T) \int_0^t \mathbb E[\|X_s - \widetilde X_s\|^2] \,\mathrm ds\,.
\]
From Grönwall's lemma (Lemma 1.1.20), we obtain $X_t = \widetilde X_t$ almost surely. Actually, this proof has a gap: we have not fully justified why we can apply the Itô isometry. To get around this, one may use the technique of localization, the details of which are omitted.
In other words, we “freeze” the coefficients of the SDE using the process from the previous
stage of the iteration. The stochastic integrals make sense because inductively, each 𝑋 (𝑛)
is adapted and has continuous sample paths. Thus, since
\[
X_t^{(n+1)} - X_t^{(n)} = \int_0^t \{b(s, X_s^{(n)}) - b(s, X_s^{(n-1)})\} \,\mathrm ds + \int_0^t \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \,\mathrm dB_s\,,
\]
we bound
\[
\mathbb E \sup_{[0,t]} \|X^{(n+1)} - X^{(n)}\|^2 \le 2\, \mathbb E\Big[\sup_{u \in [0,t]} \Big\|\int_0^u \{b(s, X_s^{(n)}) - b(s, X_s^{(n-1)})\} \,\mathrm ds\Big\|^2
\]
\[
\qquad + \sup_{u \in [0,t]} \Big\|\int_0^u \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \,\mathrm dB_s\Big\|^2\Big] =: \mathrm{I} + \mathrm{II}\,.
\]
For the second term, recall that $t \mapsto \int_0^t \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \,\mathrm dB_s$ is a martingale. By Doob's $L^2$ maximal inequality (Corollary 1.1.30 in Section 1.1.4) and the Itô isometry (1.1.8), we can bound
\[
\mathrm{II} \le 4\, \mathbb E\Big[\Big\|\int_0^t \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \,\mathrm dB_s\Big\|^2\Big] = 4\, \mathbb E \int_0^t \|\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\|_{\mathrm{HS}}^2 \,\mathrm ds\,.
\]
By induction (and using the fact that $\mathbb E \sup_{[0,T]} \|X^{(1)} - X^{(0)}\|^2 \le C_T$ for some constant $C_T < \infty$, which follows from similar arguments), we get
\[
\mathbb E \sup_{[0,T]} \|X^{(n)} - X^{(n-1)}\|^2 \le C_T\, \frac{\{2C^2 T\, (4 + T)\}^{n-1}}{(n-1)!}\,.
\]
almost surely. Passing to the limit in (1.1.22) shows that 𝑋 solves the SDE on [0,𝑇 ].
There is one last argument to make, which is to find a solution for the SDE on the
entire time interval [0, ∞). For each 𝑡 > 0, let 𝑇 > 𝑡 and define 𝑋𝑡 to be the solution of the
SDE on [0,𝑇 ] at time 𝑡. The uniqueness assertion in the first half of the theorem shows
that this is well-defined (regardless of the choice of 𝑇 ), and it also shows that the resulting
process 𝑋 is adapted, has continuous sample paths, and solves the SDE on [0, ∞).
Reflecting on the proof, the basic strategy is to couple together two diffusions (with the same driving Brownian motion), use Lipschitz bounds on the coefficients, and apply Grönwall's inequality. The same strategy will also be used to obtain convergence bounds for sampling algorithms, albeit with a more quantitative goal in mind.
A theorem similar in spirit to Theorem 1.1.21 can be established under the assumption
that 𝑏 and 𝜎 are only locally Lipschitz, but in this case the solution to the SDE is not
guaranteed to last for all time. The issue is that when the coefficients grow faster than
linearly, there can be a finite (random) time 𝔢, called the explosion time, such that k𝑋𝑡 k → ∞
as 𝑡 → 𝔢. This phenomenon is already present for ODEs (see Exercise 1.4). For the purposes
of this book, the assumption of Lipschitz coefficients suffices.
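The explosion phenomenon is easy to observe numerically for an ODE (my illustration): for ẋ = 𝑥² with 𝑥(0) = 1, the solution 𝑥(𝑡) = 1/(1 − 𝑡) blows up at the explosion time 𝔢 = 1, and an Euler scheme with a small step size tracks this:

```python
# Euler scheme for the ODE x' = x^2 (superlinear growth), with x(0) = 1.
# The exact solution x(t) = 1 / (1 - t) explodes at time t = 1.
h, x, t = 1e-5, 1.0, 0.0
while x < 1e6 and t < 2.0:
    x += h * x * x
    t += h

blowup_time = t   # close to the explosion time 1, well before the horizon 2
```

By contrast, under the (globally) Lipschitz assumption of Theorem 1.1.21 the coefficients grow at most linearly, which rules out this behavior.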
irregularity of Brownian motion this only works for integrands (𝜂𝑡 )𝑡 ≥0 which are “locally
of bounded variation”4 ; in particular not all continuous integrands are allowed. To get
around this, a standard idea is to approximate a continuous integrand (𝜂𝑡 )𝑡 ≥0 with a
sequence of integrands {(𝜂𝑡⁽ᵏ⁾)𝑡≥0 : 𝑘 ∈ N} which have bounded variation. For each 𝑘, it is no trouble to define $\int_0^T \eta_t^{(k)} \,\mathrm dB_t$, and then we can define $\int_0^T \eta_t \,\mathrm dB_t$ as a suitable limit.
This idea can be carried out, but the notion of limit that is used is 𝐿 2 . In particular, the
limit may not exist in an almost sure sense.
Now suppose that (𝜂𝑡 )𝑡 ≥0 is a stochastic process. Then, it becomes technically challeng-
ing to carry out the requisite approximations; another idea is required. The new insight
is that if we consider integrands which are adapted to a filtration, then the stochastic
integrals $t \mapsto \int_0^t \eta_s \,\mathrm dB_s$ are martingales, and we can leverage their powerful convergence
\[
\eta_t = \sum_{i=0}^{k-1} H_i\, \mathbf 1\{t \in (t_i, t_{i+1}]\}\,, \tag{1.1.24}
\]
where 0 ≤ 𝑡0 < 𝑡1 < · · · < 𝑡𝑘 and for each 𝑖, 𝐻𝑖 is bounded and ℱ𝑡𝑖-measurable.
Lemma 1.1.25 (approximation via elementary processes). For any square integrable
process (𝜂𝑡 )𝑡 ∈[0,𝑇 ] , there is a sequence {(𝜂𝑡(𝑘) )𝑡 ≥0 : 𝑘 ∈ N} of elementary processes with
\[
\|\eta - \eta^{(k)}\|_{L^2(\mathbb P_T)}^2 = \mathbb E \int_0^T |\eta_t - \eta_t^{(k)}|^2 \,\mathrm dt \to 0 \qquad \text{as } k \to \infty\,.
\]
1. If (𝜂𝑡 )𝑡 ≥0 is an elementary process of the form (1.1.24), we define the Itô integral
of (𝜂𝑡 )𝑡 ≥0 on [0,𝑇 ] via
\[
\mathcal I_{[0,T]}(\eta) := \sum_{i=0}^{k-1} H_i\, (B_{t_{i+1} \wedge T} - B_{t_i \wedge T})\,. \tag{1.1.27}
\]
The second part of the definition requires some justification. We checked in (1.1.5)
and (1.1.6) that the Itô isometry holds for elementary processes: the mapping I[0,𝑇 ] is an
isometry from elementary processes equipped with the 𝐿 2 (P𝑇 ) norm, to the space 𝐿 2 (P)
of square integrable random variables. Through this isometry, we deduce from the fact
that {𝜂 (𝑘) : 𝑘 ∈ N} is Cauchy that {I[0,𝑇 ] (𝜂 (𝑘) ) : 𝑘 ∈ N} is also Cauchy. Since 𝐿 2 (P) is a
complete metric space, there exists a limit I[0,𝑇 ] (𝜂) of the latter sequence. We can then
deduce that the Itô isometry (1.1.8) holds for square integrable processes as well.
Upon trying to view 𝑡 ↦→ I[0,𝑡] (𝜂) as a stochastic process, we encounter the usual
measure-theoretic difficulty: for fixed 𝑡, I[0,𝑡] (𝜂) is well-defined outside of a measure
zero event, but we have to contend with uncountably many values of 𝑡 and the measure
zero events may accumulate. Overcoming this issue requires some principle that holds
uniformly over 𝑡 ∈ [0,𝑇 ]; in our case, this principle is Doob’s maximal inequality from
the theory of martingales.
To set the stage, we can verify from the explicit formula (1.1.27) that 𝑡 ↦→ I[0,𝑡] (𝜂)
is a continuous martingale when (𝜂𝑡 )𝑡 ≥0 is an elementary process. Before proceeding
onwards, we need to further develop the theory of martingales.
The class of submartingales is far broader and more useful than the class of martingales.
For example, if (𝑀𝑡 )𝑡 ≥0 is a martingale and 𝜑 : R → R is any convex function with
E|𝜑 (𝑀𝑡 )| < ∞ for all 𝑡 ≥ 0, then Jensen’s inequality for conditional expectations implies
that (𝜑 (𝑀𝑡 ))𝑡 ≥0 is a submartingale.
One of the key facts about submartingales is that they converge under mild conditions; this is often deduced from Doob’s maximal inequality.
Theorem 1.1.29 (Doob’s maximal inequality). Let (𝑀𝑡 )𝑡 ≥0 be a continuous and non-
negative submartingale. Then, for all 𝜆,𝑇 > 0,
P( sup_{𝑡∈[0,𝑇]} 𝑀𝑡 ≥ 𝜆 ) ≤ E[𝑀𝑇 ]/𝜆 .
Proof. We prove the theorem for discrete-time submartingales (𝑀𝑛 )𝑛∈N . The result for
continuous-time submartingales can then be obtained via approximation.
Let 𝜏 B min{𝑘 ∈ N : 𝑀𝑘 ≥ 𝜆}. On the event {𝜏 ≤ 𝑁 }, we have 𝑀𝜏 ≥ 𝜆, so
𝜆 P( max_{𝑘=0,1,...,𝑁} 𝑀𝑘 ≥ 𝜆 ) = 𝜆 P(𝜏 ≤ 𝑁 ) ≤ E[𝑀𝜏 1{𝜏 ≤ 𝑁 }] = ∑_{𝑘=0}^{𝑁} E[𝑀𝑘 1{𝜏 = 𝑘 }] .
Next, since {𝜏 = 𝑘 } is ℱ𝑘 -measurable, the submartingale property yields
E[𝑀𝑘 1{𝜏 = 𝑘 }] ≤ E E[𝑀𝑁 | ℱ𝑘 ] 1{𝜏 = 𝑘 } = E[𝑀𝑁 1{𝜏 = 𝑘 }] .
Hence,
𝜆 P( max_{𝑘=0,1,...,𝑁} 𝑀𝑘 ≥ 𝜆 ) ≤ ∑_{𝑘=0}^{𝑁} E[𝑀𝑁 1{𝜏 = 𝑘 }] = E[𝑀𝑁 1{𝜏 ≤ 𝑁 }] ≤ E 𝑀𝑁 ,
where we used the assumption that the submartingale is non-negative.
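To see the inequality in action, a small Monte Carlo experiment (illustrative, not from the text) can compare both sides for the non-negative submartingale 𝑀𝑛 = |𝑆𝑛| built from a simple random walk:

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, N, lam = 20_000, 100, 12.0

# |S_n| for a simple random walk S_n is a non-negative submartingale
steps = rng.choice([-1.0, 1.0], size=(n_paths, N))
M = np.abs(np.cumsum(steps, axis=1))

lhs = np.mean(M.max(axis=1) >= lam)     # P(max_{n<=N} M_n >= lambda)
rhs = np.mean(M[:, -1]) / lam           # E[M_N] / lambda
assert lhs <= rhs + 5e-3                # Doob, up to Monte Carlo error
```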
Doob’s maximal inequality yields the following corollary (Corollary 1.1.30), which we leave as Exercise 1.1.
18 CHAPTER 1. THE LANGEVIN DIFFUSION IN CONTINUOUS TIME
‖ sup_{𝑡∈[0,𝑇]} 𝑀𝑡 ‖_{𝐿𝑝(P)} ≤ (𝑝/(𝑝 − 1)) ‖𝑀𝑇 ‖_{𝐿𝑝(P)} .
In particular, the paths {(𝑋𝑡^(𝑛ⱼ))𝑡≥0 : 𝑗 ∈ N} form a Cauchy sequence in C([0,𝑇]) (equipped with the supremum norm). As C([0,𝑇]) is complete, there is a limit (𝑋𝑡)𝑡≥0 which belongs to C([0,𝑇]). A similar argument, this time using Doob’s 𝐿² maximal inequality (Corollary 1.1.30), shows that 𝑋𝑡^(𝑛ⱼ) → 𝑋𝑡 in 𝐿²(P) as well. Since each 𝑡 ↦→ 𝑋𝑡^(𝑛ⱼ) is a continuous martingale, so is 𝑡 ↦→ 𝑋𝑡.
Finally, for any fixed 𝑡 ∈ [0,𝑇], on one hand we have I[0,𝑡](𝜂^(𝑛ⱼ)) = 𝑋𝑡^(𝑛ⱼ) → 𝑋𝑡 in 𝐿²(P) as noted above. On the other hand, the Itô isometry implies I[0,𝑡](𝜂^(𝑛ⱼ)) → I[0,𝑡](𝜂) in 𝐿²(P). Hence, almost surely, 𝑋𝑡 = I[0,𝑡](𝜂).
1.2. MARKOV SEMIGROUP THEORY 19
The Markov property and iterated conditioning yield the following lemma (exercise).
ℒ𝑓 B lim_{𝑡↘0} (𝑃𝑡 𝑓 − 𝑓 )/𝑡 ,
Here we pause to warn the reader of some technical issues. The mathematical diffi-
culties of Markov semigroup theory arise in trying to answer the following questions:
on what space of functions is the generator defined, and in what sense is the above limit
taken? As we shall see, a natural space of functions to consider is 𝐿 2 (𝜋), with 𝜋 denoting
the stationary distribution of the diffusion. However, the generator is usually a differential
operator, and not all functions in 𝐿 2 (𝜋) have enough regularity to lie in the domain of the
generator. The theory of unbounded linear operators on a Hilbert space was developed to
handle this situation, but it is rife with subtle distinctions such as the difference between
symmetric and self-adjoint operators. We will brush over these issues and focus on the
calculation rules.
Example 1.2.4. Let us compute the generator of the Langevin diffusion (1.E.1). In fact,
the following computation is simply a consequence of Itô’s formula (Theorem 1.1.18),
but it does not hurt to derive this result from scratch. We approximate
𝑍𝑡 = 𝑍0 − ∫₀^𝑡 ∇𝑉 (𝑍𝑠 ) d𝑠 + √2 𝐵𝑡 = 𝑍0 − 𝑡 ∇𝑉 (𝑍0 ) + √2 𝐵𝑡 + 𝑜 (𝑡) .
Hence,
ℒ𝑓 (𝑧) = lim_{𝑡↘0} (E[𝑓 (𝑍𝑡 ) | 𝑍0 = 𝑧] − 𝑓 (𝑧))/𝑡 = Δ𝑓 (𝑧) − ⟨∇𝑉 (𝑧), ∇𝑓 (𝑧)⟩ .
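For the special case 𝑉(𝑧) = 𝑧²/2 (the Ornstein–Uhlenbeck process), the transition kernel is Gaussian and known in closed form, so the limit defining ℒ can be checked numerically without any simulation. This is an illustrative sketch, not from the text; the test function 𝑓(𝑧) = 𝑧² is an arbitrary choice.

```python
import math

# Langevin diffusion for V(z) = z^2/2 (the OU process): the transition
# kernel is exactly Z_t | Z_0 = z ~ N(z e^{-t}, 1 - e^{-2t}).
def expected_f(z, t):
    m, s2 = z * math.exp(-t), 1.0 - math.exp(-2.0 * t)
    return m**2 + s2                      # E[Z_t^2] for f(z) = z^2

def generator_f(z):
    # L f = f'' - V' f' = 2 - 2 z^2 for f(z) = z^2, V(z) = z^2/2
    return 2.0 - 2.0 * z**2

z, t = 1.3, 1e-6
fd = (expected_f(z, t) - z**2) / t        # finite-difference approximation
assert abs(fd - generator_f(z)) < 1e-4
```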
The Markov semigroup and dynamics. As promised, the Markov semigroup captures
the information that was contained in the original Markov process. One way to demon-
strate this is to prove theorems which show that the Markov process can be completely
recovered from its Markov semigroup. Another approach, which we now take up, is to
show that the dynamics of the Markov process are captured via calculation rules involving
the Markov semigroup.
𝜕𝑡 𝑃𝑡 𝑓 = ℒ𝑃𝑡 𝑓 = 𝑃𝑡 ℒ𝑓 .
Since this has to hold for all functions 𝑓 , we conclude the following.
Here is another illuminating way to express these equations. Let 𝑢𝑡 B 𝑃𝑡 𝑓 , and let
𝜋𝑡 = 𝑃𝑡∗ 𝜋0 . Then:
𝜕𝑡 𝑢𝑡 = ℒ𝑢𝑡 , (Kolmogorov’s backward equation)
𝜕𝑡 𝜋𝑡 = ℒ ∗ 𝜋𝑡 . (Kolmogorov’s forward equation)
The terms “backward” and “forward” are rather confusing, so we will not use them. Instead,
we will refer to the evolution equation for the density (Kolmogorov’s forward equation)
as the Fokker–Planck equation.
Consequently, we obtain characterizations of stationarity. Recall that 𝜋 is stationary for the Markov process if, when 𝑋0 ∼ 𝜋, then 𝑋𝑡 ∼ 𝜋 for all 𝑡 ≥ 0. The following statements are then equivalent:
1. 𝜋 is stationary for the Markov process.
2. ℒ*𝜋 = 0.
3. ∫ ℒ𝑓 d𝜋 = 0 for all functions 𝑓 .
Proof. The equivalence between the first two statements is the Fokker–Planck equation.
The third statement is the dual of the second statement.
For the Langevin diffusion, the adjoint generator is
ℒ*𝑔 = Δ𝑔 + div(𝑔 ∇𝑉 ) .
Hence, 𝜋 is stationary if and only if
0 = ℒ*𝜋 = Δ𝜋 + div(𝜋 ∇𝑉 ) = div(𝜋 (∇ ln 𝜋 + ∇𝑉 )) .
Corollary 1.2.9. The stationary distribution of the Langevin diffusion (1.E.1) with
potential 𝑉 is 𝜋 ∝ exp(−𝑉 ).
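The computation 0 = ℒ*𝜋 for 𝜋 ∝ exp(−𝑉) can also be verified symbolically in one dimension; the quartic potential below is an arbitrary illustrative choice.

```python
import sympy as sp

x = sp.symbols('x', real=True)
V = x**4 / 4 + x**2 / 2          # an illustrative potential
pi = sp.exp(-V)                  # unnormalized density pi ∝ exp(-V)

# Adjoint generator of the 1D Langevin diffusion: L* g = g'' + (g V')'
Lstar_pi = sp.diff(pi, x, 2) + sp.diff(pi * sp.diff(V, x), x)
assert sp.simplify(Lstar_pi) == 0
```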
Definition 1.2.10. The Markov semigroup (𝑃𝑡 )𝑡 ≥0 is reversible w.r.t. 𝜋 if for all
𝑓 , 𝑔 ∈ 𝐿 2 (𝜋) and all 𝑡 ≥ 0,
∫ 𝑃𝑡 𝑓 · 𝑔 d𝜋 = ∫ 𝑓 · 𝑃𝑡 𝑔 d𝜋 .
follows from stationarity of 𝜋. This shows that 𝑃𝑡 is a contraction on 𝐿 2 (𝜋) (in fact, on
any 𝐿𝑝 (𝜋), 𝑝 ∈ [1, ∞]). Combining this with 𝑃𝑡 = exp(𝑡ℒ) leads us to predict that ℒ is a
negative operator. Below, we will give a direct proof of this fact; unsurprisingly, the proof
of negativity of ℒ still relies on the crucial fact (1.2.11).
Definition 1.2.12. The carré du champ is the bilinear operator Γ defined via
Γ(𝑓 , 𝑔) B ½ {ℒ(𝑓 𝑔) − 𝑓 ℒ𝑔 − 𝑔 ℒ𝑓 } .
The Dirichlet energy is the functional ℰ(𝑓 , 𝑔) B ∫ Γ(𝑓 , 𝑔) d𝜋.
Proof. Recall from (1.2.11) that 𝑃𝑡 (𝑓 ²) ≥ (𝑃𝑡 𝑓 )² for all 𝑡 > 0. In terms of ℒ,
𝑓 ² + 𝑡 ℒ(𝑓 ²) + 𝑜 (𝑡) ≥ [𝑓 + 𝑡 ℒ𝑓 + 𝑜 (𝑡)]² = 𝑓 ² + 2𝑡 𝑓 ℒ𝑓 + 𝑜 (𝑡) ,
and sending 𝑡 ↘ 0 yields ℒ(𝑓 ²) ≥ 2𝑓 ℒ𝑓 . (This proof does not require reversibility.)
Theorem 1.2.14 (fundamental integration by parts identity). Suppose that the genera-
tor ℒ and carré du champ Γ are associated with a Markov semigroup which is reversible
w.r.t. 𝜋. Then, for any functions 𝑓 and 𝑔,
∫ 𝑓 (−ℒ)𝑔 d𝜋 = ∫ (−ℒ)𝑓 · 𝑔 d𝜋 = ∫ Γ(𝑓 , 𝑔) d𝜋 = ℰ(𝑓 , 𝑔) .
Since the identity implies that ℒ is symmetric, the integration by parts identity is in
fact equivalent to reversibility.
Proof of Theorem 1.2.14. Since ∫ ℒℎ d𝜋 = 0 for all functions ℎ (due to stationarity of 𝜋),
expression which gives it its name. More generally, Γ(𝑓 , 𝑔) = ⟨∇𝑓 , ∇𝑔⟩.
The identity in Theorem 1.2.14 reads
∫ 𝑓 (−Δ𝑔 + ⟨∇𝑉 , ∇𝑔⟩) d𝜋 = ∫ 𝑔 (−Δ𝑓 + ⟨∇𝑉 , ∇𝑓 ⟩) d𝜋 = ∫ ⟨∇𝑓 , ∇𝑔⟩ d𝜋
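In one dimension, the identity Γ(𝑓, 𝑔) = ⟨∇𝑓, ∇𝑔⟩ for the Langevin generator can be checked symbolically straight from Definition 1.2.12 (an illustrative sketch, not from the text):

```python
import sympy as sp

x = sp.symbols('x', real=True)
V = sp.Function('V')(x)
f = sp.Function('f')(x)
g = sp.Function('g')(x)

# Generator of the 1D Langevin diffusion: L h = h'' - V' h'
L = lambda h: sp.diff(h, x, 2) - sp.diff(V, x) * sp.diff(h, x)

# Carre du champ from Definition 1.2.12; should reduce to f' g'
Gamma = sp.Rational(1, 2) * (L(f * g) - f * L(g) - g * L(f))
assert sp.simplify(Gamma - sp.diff(f, x) * sp.diff(g, x)) == 0
```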
Gradient flow of the Dirichlet energy. It turns out that a reversible Markov process
follows the steepest descent of the Dirichlet energy with respect to 𝐿 2 (𝜋). To justify this,
for a curve 𝑡 ↦→ 𝑢𝑡 in 𝐿²(𝜋), write 𝑢̇𝑡 B 𝜕𝑡 𝑢𝑡 for the time derivative. The 𝐿²(𝜋) gradient of
the functional 𝑓 ↦→ ℰ(𝑓 ) B ℰ(𝑓 , 𝑓 ) at 𝑓 is defined to be the element ∇𝐿2 (𝜋)ℰ(𝑓 ) ∈ 𝐿 2 (𝜋)
such that for all curves 𝑡 ↦→ 𝑢𝑡 with 𝑢 0 = 𝑓 , it holds that
𝜕𝑡 |𝑡=0 ℰ(𝑢𝑡 , 𝑢𝑡 ) = ∫ 𝑢̇0 ∇_{𝐿²(𝜋)}ℰ(𝑓 ) d𝜋 .
Spectral gap and convergence. Consider a reversible Markov semigroup (𝑃𝑡 )𝑡 ≥0 and
recall that 𝑃𝑡 𝑓 (𝑥) = E[𝑓 (𝑋𝑡 ) | 𝑋 0 = 𝑥]. We are interested in the long-term behavior of
𝑃𝑡 𝑓 . If the process mixes, then by definition it forgets its initial condition, so that 𝑃𝑡 𝑓 converges to a constant; moreover, this constant should be the average value ∫ 𝑓 d𝜋 at stationarity. How do we establish a rate of convergence for 𝑃𝑡 𝑓 → ∫ 𝑓 d𝜋? Since 𝑃𝑡 = exp(𝑡ℒ), if the operator −ℒ admits a spectral gap,
−ℒ ≥ 𝜆min > 0 ,
then we expect 𝑃𝑡 to contract at the exponential rate exp(−𝜆min 𝑡). This is indeed the case. However, observe that since 𝑃𝑡 1 = 1 for all 𝑡 ≥ 0 and hence ℒ1 = 0, the spectral gap condition is only supposed to hold on the subspace of 𝐿²(𝜋) which is orthogonal to constants.
Definition 1.2.18. The Markov process is said to satisfy a Poincaré inequality (PI)
with constant 𝐶 PI if for all functions 𝑓 ∈ 𝐿 2 (𝜋),
∫ 𝑓 (−ℒ)𝑓 d𝜋 = ℰ(𝑓 , 𝑓 ) ≥ (1/𝐶PI) ‖𝑓 − proj1⊥ 𝑓 ‖²_{𝐿²(𝜋)} = (1/𝐶PI) var𝜋 𝑓 .
The Poincaré constant 𝐶PI corresponds to the inverse of the spectral gap. Based on the calculus we have developed so far, it is not too difficult to prove the following result (differentiate 𝑡 ↦→ ‖𝑃𝑡 𝑓 ‖²_{𝐿²(𝜋)}), so we leave it as Exercise 1.7.
If the Poincaré inequality holds, then for all 𝑓 ∈ 𝐿²(𝜋) with ∫ 𝑓 d𝜋 = 0,
‖𝑃𝑡 𝑓 ‖²_{𝐿²(𝜋)} ≤ exp(−2𝑡/𝐶PI) ‖𝑓 ‖²_{𝐿²(𝜋)} .
In particular, we can apply this result to the semigroup corresponding to the Langevin
diffusion (1.E.1) to obtain a spectral gap criterion for quantitative convergence. However,
this result is mainly of use when we are interested in a specific test function 𝑓 . More
generally, it is useful to obtain bounds on the rate of convergence of the law 𝜋𝑡 of 𝑋𝑡 to
the stationary distribution 𝜋. Recall (from (1.2.16)) that the relative density 𝜌𝑡 B 𝜋𝑡 /𝜋
solves the equation 𝜕𝑡 𝜌𝑡 = ℒ𝜌𝑡 , i.e., 𝜌𝑡 is given by 𝜌𝑡 = 𝑃𝑡 𝜌 0 . We can therefore apply the
preceding result to 𝑓 B 𝜌 0 − 1.
For a probability measure 𝜇, we define the chi-squared divergence
𝜒²(𝜇 ‖ 𝜋) B ‖d𝜇/d𝜋 − 1‖²_{𝐿²(𝜋)} = var𝜋 (d𝜇/d𝜋) if 𝜇 ≪ 𝜋 ,
with 𝜒²(𝜇 ‖ 𝜋) B ∞ otherwise. The result can be formulated as follows.
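To make the convergence result concrete (an illustrative check, not from the text): for the OU semigroup with 𝜋 = N(0, 1) one has 𝐶PI = 1, a Gaussian initialization 𝜋0 stays Gaussian along the flow, and the predicted decay 𝜒²(𝜋𝑡 ‖ 𝜋) ≤ exp(−2𝑡/𝐶PI) 𝜒²(𝜋0 ‖ 𝜋) can be verified by quadrature. The initial parameters are arbitrary, subject to 𝑠0² < 2 so that the chi-squared divergence is finite.

```python
import numpy as np
from scipy.integrate import quad

def gauss(x, m, s2):
    return np.exp(-(x - m)**2 / (2*s2)) / np.sqrt(2*np.pi*s2)

def chi2(m, s2):
    # chi^2(N(m, s2) || N(0, 1)) by quadrature (finite for s2 < 2)
    val, _ = quad(lambda x: gauss(x, m, s2)**2 / gauss(x, 0.0, 1.0),
                  -20, 20)
    return val - 1.0

# OU / Langevin with pi = N(0,1): pi_t = N(m0 e^{-t}, 1 + (s0^2-1) e^{-2t})
m0, s02 = 0.5, 0.8
for t in [0.1, 0.5, 1.0, 2.0]:
    mt = m0 * np.exp(-t)
    s2t = 1.0 + (s02 - 1.0) * np.exp(-2.0 * t)
    # Poincare-based decay with C_PI = 1
    assert chi2(mt, s2t) <= np.exp(-2.0 * t) * chi2(m0, s02) + 1e-6
```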
Example 1.2.21. For the Langevin diffusion, the Poincaré inequality reads
var𝜋 𝑓 ≤ 𝐶PI E𝜋 [‖∇𝑓 ‖²]
The Kullback–Leibler (KL) divergence is defined via
KL(𝜇 ‖ 𝜋) B ∫ (d𝜇/d𝜋) ln(d𝜇/d𝜋) d𝜋 = ∫ ln(d𝜇/d𝜋) d𝜇 if 𝜇 ≪ 𝜋 ,
and KL(𝜇 ‖ 𝜋) B ∞ otherwise.
Recall the notation 𝜌𝑡 B 𝜋𝑡 /𝜋 for the relative density of the Markov process w.r.t. 𝜋.
Since 𝜕𝑡 𝜌𝑡 = ℒ𝜌𝑡 , we can calculate via the integration by parts identity that
𝜕𝑡 KL(𝜋𝑡 ‖ 𝜋) = 𝜕𝑡 ∫ 𝜌𝑡 ln 𝜌𝑡 d𝜋 = ∫ (ln 𝜌𝑡 + 1) ℒ𝜌𝑡 d𝜋 = −ℰ(𝜌𝑡 , ln 𝜌𝑡 ) . (1.2.22)
Hence, if ℰ(𝜌𝑡 , ln 𝜌𝑡 ) ≳ KL(𝜋𝑡 ‖ 𝜋), then we obtain exponential ergodicity of the diffusion in KL divergence.
By linearizing the LSI, i.e., by taking 𝜌 = 1 + 𝜀 𝑓 for small 𝜀 > 0 and expanding both
sides of the LSI in powers of 𝜀, one can prove that the LSI implies a Poincaré inequality
with constant 𝐶 PI ≤ 𝐶 LSI (Exercise 1.8).
For the Langevin diffusion, the LSI reads
(2/𝐶LSI) KL(𝜇 ‖ 𝜋) ≤ E𝜋 ⟨∇(𝜇/𝜋), ∇ ln(𝜇/𝜋)⟩ = E𝜇 [‖∇ ln(𝜇/𝜋)‖²] = 4 E𝜋 [‖∇√(𝜇/𝜋)‖²] .
The right-hand side of the above expression is important; it is known as the (relative) Fisher information FI(𝜇 ‖ 𝜋) B E𝜇 [‖∇ ln(𝜇/𝜋)‖²]. In particular, the Fisher information plays a central role in the study of non-log-concave sampling in Chapter 11.
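For Gaussian 𝜇 and 𝜋, both KL and FI are computable by quadrature, so the inequality KL(𝜇 ‖ 𝜋) ≤ (𝐶LSI/2) FI(𝜇 ‖ 𝜋) can be checked directly. The sketch below (illustrative, not from the text) assumes the standard fact that 𝜋 = N(0, 1) satisfies an LSI with 𝐶LSI = 1; for 𝜇 = N(𝑚, 1) the inequality is tight.

```python
import numpy as np
from scipy.integrate import quad

def gauss(x, m, s2):
    return np.exp(-(x - m)**2 / (2*s2)) / np.sqrt(2*np.pi*s2)

# mu = N(m, s2), pi = N(0, 1): log relative density and its derivative
def kl_and_fi(m, s2):
    logratio = lambda x: -(x - m)**2/(2*s2) + x**2/2 - 0.5*np.log(s2)
    dlogratio = lambda x: -(x - m)/s2 + x          # grad ln(mu/pi)
    kl, _ = quad(lambda x: gauss(x, m, s2) * logratio(x), -20, 20)
    fi, _ = quad(lambda x: gauss(x, m, s2) * dlogratio(x)**2, -20, 20)
    return kl, fi

# pi = N(0,1) is 1-strongly log-concave, so C_LSI = 1
for (m, s2) in [(0.0, 0.5), (1.0, 1.0), (2.0, 1.7)]:
    kl, fi = kl_and_fi(m, s2)
    assert kl <= fi / 2 + 1e-8
```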
The LSI often appears in many equivalent forms. For example, another formulation is that for all functions 𝑓 : R𝑑 → R, it holds that
ent𝜋 (𝑓 ²) ≤ 2𝐶LSI E𝜋 [‖∇𝑓 ‖²] , where ent𝜋 (𝑔) B E𝜋 [𝑔 ln 𝑔] − E𝜋 𝑔 · ln E𝜋 𝑔 .
Definition 1.2.26. The iterated carré du champ is the operator Γ2 defined via
Γ2 (𝑓 , 𝑔) B ½ {ℒΓ(𝑓 , 𝑔) − Γ(𝑓 , ℒ𝑔) − Γ(𝑔, ℒ𝑓 )} .
The semigroup is said to satisfy the curvature-dimension condition CD(𝛼, ∞) if for all functions 𝑓 ,
Γ2 (𝑓 , 𝑓 ) ≥ 𝛼 Γ(𝑓 , 𝑓 ) .
We have not explained yet what a diffusion Markov semigroup is, but for now we can
think of the Langevin diffusion as a fundamental example. The key point is that once
the (iterated) carré du champ operators are known, the curvature-dimension condition
amounts to an algebraic condition which can be easily checked, which in turn implies
the log-Sobolev inequality (and hence the Poincaré inequality by Exercise 1.8). For the
Langevin diffusion, this condition amounts to the following theorem.
Theorem 1.2.29. For the Langevin diffusion (1.E.1), the curvature-dimension condition
CD(𝛼, ∞) holds if and only if the potential 𝑉 is 𝛼-strongly convex.
Although we have deferred the Markov semigroup proofs of these results to Chapter 2,
we will shortly prove these results using the calculus of optimal transport.
Definition 1.3.1. Let X and Y be complete separable metric spaces, and consider a
cost functional 𝑐 : X × Y → [0, ∞]. The optimal transport cost from 𝜇 ∈ P (X) to
𝜈 ∈ P (Y) with cost 𝑐 is
T𝑐 (𝜇, 𝜈) B inf_{𝛾∈C(𝜇,𝜈)} ∫ 𝑐 (𝑥, 𝑦) d𝛾 (𝑥, 𝑦) , (1.3.2)
where C(𝜇, 𝜈) is the space of couplings of (𝜇, 𝜈), i.e. the space of probability measures
𝛾 ∈ P (X × Y) whose marginals are 𝜇 and 𝜈 respectively.
A minimizer in this problem is known as an optimal transport plan.
Theorem 1.3.3. If the cost 𝑐 is lower semicontinuous, then an optimal transport plan
always exists.
1.3. THE GEOMETRY OF OPTIMAL TRANSPORT 31
Proof. One can show that the functional 𝛾 ↦→ ∫ 𝑐 d𝛾 is lower semicontinuous and that C(𝜇, 𝜈) is compact, where we use the weak topology on P (X × Y). It is a general fact that lower semicontinuous functions attain their minima on compact sets.
Historically, optimal transport began with Monge, who considered the Euclidean cost 𝑐 (𝑥, 𝑦) B ‖𝑥 − 𝑦 ‖ on R𝑑 × R𝑑 . Moreover, he considered a slightly different problem
in which, rather than searching over all couplings in C(𝜇, 𝜈), he restricted attention to
couplings which are induced by a mapping 𝑇 : R𝑑 → R𝑑 satisfying 𝑇# 𝜇 = 𝜈; this is known
as the Monge problem. In the probabilistic interpretation, this corresponds to a pair of
random variables (𝑋,𝑇 (𝑋 )) with 𝑋 ∼ 𝜇 and 𝑇 (𝑋 ) ∼ 𝜈. The physical interpretation of this
additional constraint is that no mass from 𝜇 be split up before it is transported, which
may be reasonable from a modelling perspective but leads to an ill-posed mathematical
problem. Indeed, there may not even exist any such mappings 𝑇 , as is the case when
𝜇 = 𝛿𝑥 places all of its mass on a single point and 𝜈 does not. Consequently, the solution
to the Monge problem remained unknown for centuries.
The breakthrough arrived when Kantorovich formulated the relaxation of the Monge
problem introduced in Definition 1.3.1, which is therefore known as the Kantorovich
problem. As the product measure 𝜇 ⊗ 𝜈 always belongs to C(𝜇, 𝜈), we at least know that
the constraint set is non-empty, and Theorem 1.3.3 shows that the Kantorovich problem
is well-behaved. Moreover, the Kantorovich problem is actually a convex problem on
P (X × Y); indeed, the objective is linear and the constraint set C(𝜇, 𝜈) is convex. Hence,
one can bring to bear the power of convex duality to study the Kantorovich problem
(historically, this study was actually the origin of linear programming).
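Between finitely supported measures, the Kantorovich problem is literally a linear program and can be handed to an off-the-shelf LP solver. The sketch below (illustrative, not from the text) also checks a classical fact: in one dimension with quadratic cost, the monotone (sorted) matching is optimal.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n = 6
x = np.sort(rng.normal(size=n))          # support of mu (uniform weights)
y = np.sort(rng.normal(size=n))          # support of nu (uniform weights)
C = (x[:, None] - y[None, :])**2         # squared-distance cost matrix

# Marginal constraints: row sums of gamma equal mu, column sums equal nu
A_eq = np.zeros((2*n, n*n))
for i in range(n):
    A_eq[i, i*n:(i+1)*n] = 1.0           # sum_j gamma_ij = mu_i
    A_eq[n + i, i::n] = 1.0              # sum_i gamma_ij = nu_j
b_eq = np.full(2*n, 1.0/n)

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
assert res.status == 0
# In 1D with quadratic cost, the sorted matching attains the optimum:
assert abs(res.fun - np.mean((x - y)**2)) < 1e-7
```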
Although a large part of optimal transport theory can be developed in a general framework as above, for the rest of the section we will focus on the case 𝑐 (𝑥, 𝑦) B ‖𝑥 − 𝑦 ‖² on R𝑑 × R𝑑 for the sake of simplicity.
Definition 1.3.4. The 2-Wasserstein distance between 𝜇 and 𝜈, denoted 𝑊2 (𝜇, 𝜈), is defined via
𝑊2²(𝜇, 𝜈) B inf_{𝛾∈C(𝜇,𝜈)} ∫ ‖𝑥 − 𝑦 ‖² d𝛾 (𝑥, 𝑦) . (1.3.5)
Write P2 (R𝑑 ) B {𝜇 ∈ P (R𝑑 ) | ∫ ‖·‖² d𝜇 < ∞} for the space of probability measures with finite second moment.
Duality and optimality. For this section, it will actually be convenient to consider the cost 𝑐 (𝑥, 𝑦) B ½ ‖𝑥 − 𝑦 ‖² instead, i.e. we consider ½ 𝑊2²(𝜇, 𝜈) instead of 𝑊2²(𝜇, 𝜈).
The key to solving the Kantorovich problem is duality. First observe that the constraint
that the first marginal of 𝛾 is 𝜇 can be written as follows: for every function 𝑓 ∈ 𝐿¹(𝜇), it holds that ∫ 𝑓 (𝑥) d𝛾 (𝑥, 𝑦) = ∫ 𝑓 (𝑥) d𝜇 (𝑥). Doing the same for the constraint on the
second marginal of 𝛾, we can write the Kantorovich problem as an unconstrained min-max
problem
½ 𝑊2²(𝜇, 𝜈) = inf_{𝛾∈M+(R𝑑×R𝑑)} sup_{𝑓∈𝐿¹(𝜇), 𝑔∈𝐿¹(𝜈)} { ∫ (‖𝑥 − 𝑦 ‖²/2) d𝛾 (𝑥, 𝑦) + ∫ 𝑓 d𝜇 − ∫ 𝑓 (𝑥) d𝛾 (𝑥, 𝑦) + ∫ 𝑔 d𝜈 − ∫ 𝑔(𝑦) d𝛾 (𝑥, 𝑦) } .
Definition 1.3.6. Let 𝜇, 𝜈 ∈ P2 (R𝑑 ). The dual optimal transport problem from 𝜇 to 𝜈 is the optimization problem
sup_{(𝑓 ,𝑔)∈D(𝜇,𝜈)} { ∫ 𝑓 d𝜇 + ∫ 𝑔 d𝜈 } , (1.3.7)
where D(𝜇, 𝜈) is the set of pairs (𝑓 , 𝑔) ∈ 𝐿¹(𝜇) × 𝐿¹(𝜈) such that 𝑓 (𝑥) + 𝑔(𝑦) ≤ ‖𝑥 − 𝑦 ‖²/2 for all 𝑥, 𝑦 ∈ R𝑑 .
Since inf sup ≥ sup inf, the value of the dual problem is always at most ½ 𝑊2²(𝜇, 𝜈). On the other hand, if we find a transport plan 𝛾★ and feasible dual potentials 𝑓 ★, 𝑔★ such that ½ ∫ ‖𝑥 − 𝑦 ‖² d𝛾★(𝑥, 𝑦) = ∫ 𝑓 ★ d𝜇 + ∫ 𝑔★ d𝜈, it implies that the primal and dual values are equal, and that 𝛾★ and (𝑓 ★, 𝑔★) are both optimal. The fundamental theorem of optimal transport asserts that this favorable situation always holds.
1. (strong duality) The value of the dual optimal transport problem from 𝜇 to 𝜈 equals ½ 𝑊2²(𝜇, 𝜈).
2. (existence of optimal dual potentials) There exists an optimal pair (𝑓 ★, 𝑔★) for the
dual optimal transport problem.
3. (structure of optimal potentials) The optimal dual potentials take the form
𝑓 ★ = ‖·‖²/2 − 𝜑 , 𝑔★ = ‖·‖²/2 − 𝜑∗ , (1.3.9)
where 𝜑 : R𝑑 → R ∪ {∞} is a proper, convex, lower semicontinuous function and 𝜑∗ is its convex conjugate. If 𝛾★ denotes the optimal transport plan, then for 𝛾★-a.e. (𝑥, 𝑦) ∈ R𝑑 × R𝑑 , it holds that 𝜑 (𝑥) + 𝜑∗ (𝑦) = ⟨𝑥, 𝑦⟩, i.e., 𝛾★ is supported on the subdifferential of 𝜑.
Various parts of this theorem can be proven separately; for example, strong duality
can be established by rigorously justifying the interchange of infimum and supremum via
a high-powered minimax theorem. Instead, we will outline a proof of the theorem which
simultaneously establishes all of the above facts.
Outline. In the proof, we abbreviate “proper convex lower semicontinuous function” to
simply “closed convex function”.
1. Optimal transport plans are cyclically monotone. Let 𝛾 ★ be an optimal transport
plan, and suppose that the pairs (𝑥 1, 𝑦1 ), . . . , (𝑥𝑛 , 𝑦𝑛 ) lie in the support of 𝛾 ★. Then,
it should be the case that we cannot “rematch” these points to lower the optimal
transport cost, i.e. for every permutation 𝜎 of [𝑛] we should have
∑_{𝑖=1}^{𝑛} ‖𝑥𝑖 − 𝑦𝑖 ‖² ≤ ∑_{𝑖=1}^{𝑛} ‖𝑥𝑖 − 𝑦𝜎 (𝑖) ‖² .
Equivalently,
∑_{𝑖=1}^{𝑛} ⟨𝑥𝑖 , 𝑦𝑖 ⟩ ≥ ∑_{𝑖=1}^{𝑛} ⟨𝑥𝑖 , 𝑦𝜎 (𝑖) ⟩ . (1.3.10)
Indeed, if this condition fails, then it is possible to construct a new transport plan
from 𝛾 ★ by slightly rearranging the mass which has strictly smaller transport cost,
which is a contradiction; see [GM96, Theorem 2.3].
A subset 𝑆 ⊆ R𝑑 × R𝑑 is said to be cyclically monotone if for all 𝑛 ∈ N+, all pairs (𝑥1, 𝑦1 ), . . . , (𝑥𝑛 , 𝑦𝑛 ), and all permutations 𝜎 of [𝑛], the condition (1.3.10) holds. Thus, optimal transport plans are supported on cyclically monotone sets.
2. Cyclically monotone sets lie in subdifferentials. By Rockafellar’s theorem from convex analysis, every cyclically monotone subset of R𝑑 × R𝑑 is contained in the subdifferential 𝜕𝜑 of some closed convex function 𝜑.
3. Characterization of dual optimality. Now that we see the connection between con-
vexity and the primal problem, it is time to do the same for the dual problem.
𝑔(𝑦) ≤ ‖𝑥 − 𝑦 ‖²/2 − 𝑓 (𝑥) .
Hence, writing 𝜑 B ‖·‖²/2 − 𝑓 , the optimal choice is
𝑔(𝑦) = inf_{𝑥∈R𝑑} {‖𝑥 − 𝑦 ‖²/2 − 𝑓 (𝑥)} = ‖𝑦 ‖²/2 − sup_{𝑥∈R𝑑} {⟨𝑥, 𝑦⟩ − 𝜑 (𝑥)} .
The function 𝜑∗ defined by 𝜑∗ (𝑦) B sup_{𝑥∈R𝑑} {⟨𝑥, 𝑦⟩ − 𝜑 (𝑥)} is known as the convex conjugate of 𝜑. To summarize, we have shown that for fixed 𝑓 = ‖·‖²/2 − 𝜑, the optimal choice for 𝑔 is ‖·‖²/2 − 𝜑∗ . Similarly, if we fix 𝑔 = ‖·‖²/2 − 𝜑∗ , the optimal choice for 𝑓 is ‖·‖²/2 − 𝜑∗∗ .
We have not yet established existence, but suppose for the moment that an optimal dual pair (𝑓 ★, 𝑔★) exists. The preceding reasoning shows that 𝑓 ★ = ‖·‖²/2 − 𝜑 and 𝑔★ = ‖·‖²/2 − 𝜑∗ , where 𝜑∗∗ = 𝜑; otherwise the dual pair could be improved. Next, it is known from convex analysis that 𝜑 = 𝜑∗∗ if and only if 𝜑 is a closed convex function. Thus, optimal dual potentials have the representation (1.3.9).
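The conjugate 𝜑* and the biconjugation identity 𝜑** = 𝜑 (for closed convex 𝜑) can be explored via a discrete Legendre transform on a grid; the grid and the test functions below are arbitrary illustrative choices, not from the text.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 401)     # primal grid (also used as dual grid)

def conjugate(phi_vals, grid, out_grid):
    # Discrete Legendre transform: phi*(y) = sup_x { <x, y> - phi(x) }
    return np.max(out_grid[:, None] * grid[None, :] - phi_vals[None, :],
                  axis=1)

phi = xs**2 / 2                       # closed convex (and self-conjugate)
phi_bidual = conjugate(conjugate(phi, xs, xs), xs, xs)
assert np.max(np.abs(phi_bidual - phi)) < 1e-12   # phi** = phi

psi = -np.abs(xs)                     # concave, hence not convex
psi_bidual = conjugate(conjugate(psi, xs, xs), xs, xs)
assert np.all(psi_bidual <= psi + 1e-12)          # always psi** <= psi
assert np.max(psi - psi_bidual) > 0.5             # strict gap somewhere
```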
4. Proof of strong duality. Now consider the optimal transport plan 𝛾★ (which exists; see Theorem 1.3.3). We know that 𝛾★ is supported on a cyclically monotone set, which in turn is contained in the subdifferential of a closed convex function 𝜑. Define the functions 𝑓 ★ B ‖·‖²/2 − 𝜑 and 𝑔★ B ‖·‖²/2 − 𝜑∗ ; these are dual feasible potentials. Also, it is a standard fact of convex analysis that (𝑥, 𝑦) ∈ 𝜕𝜑 if and only if 𝜑 (𝑥) + 𝜑∗ (𝑦) = ⟨𝑥, 𝑦⟩. Since the support of 𝛾★ is contained in 𝜕𝜑,
½ ∫ ‖𝑥 − 𝑦 ‖² d𝛾★(𝑥, 𝑦) = ∫ (‖𝑥 ‖²/2 + ‖𝑦 ‖²/2 − ⟨𝑥, 𝑦⟩) d𝛾★(𝑥, 𝑦)
= ∫ (‖𝑥 ‖²/2 − 𝜑 (𝑥) + ‖𝑦 ‖²/2 − 𝜑∗ (𝑦)) d𝛾★(𝑥, 𝑦)
= ∫ (‖·‖²/2 − 𝜑) d𝜇 + ∫ (‖·‖²/2 − 𝜑∗) d𝜈
= ∫ 𝑓 ★ d𝜇 + ∫ 𝑔★ d𝜈 .
This simultaneously proves that strong duality holds and that (𝑓 ★, 𝑔★) is a pair of
optimal dual potentials.
5. For regular measures, optimal transport plans are induced by transport maps. An-
other fact from convex analysis is that convex functions enjoy some regularity:
a closed convex function 𝜑 is differentiable at Lebesgue-a.e. points of the interior of
its domain. Consequently, if 𝜇 is absolutely continuous w.r.t. Lebesgue measure,
then 𝜑 is differentiable 𝜇-a.e. This says that for 𝜇-a.e. 𝑥 ∈ R𝑑 , the gradient ∇𝜑 (𝑥)
exists and 𝜕𝜑 (𝑥) = {∇𝜑 (𝑥)}. Therefore, we can write 𝛾 ★ = (id, ∇𝜑) # 𝜇. In particular,
(∇𝜑) # 𝜇 = 𝜈, and ∇𝜑 is the optimal transport map from 𝜇 to 𝜈.
In our entire discussion so far, we started off with an arbitrary optimal transport
plan 𝛾 ★. Hence, we have shown that every optimal transport plan is of the form
(id, ∇𝜑) # 𝜇 for some closed convex function 𝜑.
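For one-dimensional Gaussians, everything is explicit: the optimal map is affine, it is the gradient of a convex quadratic, and it coincides with the increasing rearrangement 𝐹𝜈⁻¹ ∘ 𝐹𝜇 (a fact special to one dimension). An illustrative sketch, not from the text:

```python
import numpy as np
from scipy.stats import norm

m0, s0 = 1.0, 2.0      # mu = N(m0, s0^2)
m1, s1 = -0.5, 0.7     # nu = N(m1, s1^2)

# Candidate optimal map: the gradient of the convex function
# phi(x) = m1*x + (s1/s0)*(x - m0)^2/2 (+ const)
T = lambda x: m1 + (s1 / s0) * (x - m0)

# In 1D, the optimal (monotone) map is F_nu^{-1} o F_mu; T should match it
xs = np.linspace(-6, 8, 200)
rearr = norm.ppf(norm.cdf(xs, m0, s0), m1, s1)
assert np.max(np.abs(T(xs) - rearr)) < 1e-8
```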
6. Uniqueness of the optimal transport map. So far, we have not discussed uniqueness
of the solution to the Kantorovich problem, and in general uniqueness does not
hold. However, in the setting we are currently dealing with (the cost is the squared
Euclidean distance and 𝜇 is absolutely continuous), we can use additional arguments
to establish uniqueness. We will show that if 𝛾̄★ = (id, ∇𝜑̄)# 𝜇 is another optimal transport plan, where 𝜑̄ is a closed convex function, then ∇𝜑 = ∇𝜑̄ (𝜇-a.e.). In particular, this implies that there is only one gradient of a closed convex function which pushes forward 𝜇 to 𝜈.
From our above arguments, we see that (‖·‖²/2 − 𝜑̄, ‖·‖²/2 − 𝜑̄∗) is a dual optimal pair. Therefore,
∫ {𝜑̄ (𝑥) + 𝜑̄∗ (𝑦)} d𝛾★(𝑥, 𝑦) = ∫ 𝜑̄ d𝜇 + ∫ 𝜑̄∗ d𝜈 = ∫ 𝜑 d𝜇 + ∫ 𝜑∗ d𝜈 = ∫ {𝜑 (𝑥) + 𝜑∗ (𝑦)} d𝛾★(𝑥, 𝑦) = ∫ ⟨𝑥, 𝑦⟩ d𝛾★(𝑥, 𝑦) .
On the other hand, by the definition of 𝜑̄∗, we have 𝜑̄ (𝑥) + 𝜑̄∗ (𝑦) ≥ ⟨𝑥, 𝑦⟩ for all 𝑥, 𝑦 ∈ R𝑑 , with equality if and only if 𝑦 ∈ 𝜕𝜑̄ (𝑥). So, the integrand 𝜑̄ (𝑥) + 𝜑̄∗ (𝑦) − ⟨𝑥, 𝑦⟩ is always non-negative but its integral against 𝛾★ is zero, which combined with the previous fact shows that ∇𝜑 (𝑥) ∈ 𝜕𝜑̄ (𝑥) for 𝜇-a.e. 𝑥. But for 𝜇-a.e. 𝑥, we also know that 𝜕𝜑̄ (𝑥) = {∇𝜑̄ (𝑥)}, and we conclude that ∇𝜑 = ∇𝜑̄ (𝜇-a.e.).
We refer to 𝜑 as a Brenier potential. From convex duality, ∇𝜑 ∗ = (∇𝜑) −1 . So, if 𝜈 is
also absolutely continuous, then the optimal transport map from 𝜈 to 𝜇 is ∇𝜑 ∗ . We often
write 𝑇𝜇→𝜈 = ∇𝜑 for the optimal transport map from 𝜇 to 𝜈.
Definition 1.3.12. The space P2,ac (R𝑑 ) is the set of measures in P2 (R𝑑 ) which are
absolutely continuous w.r.t. Lebesgue measure.
Remarks on other costs. Many of the arguments can be generalized to other costs
𝑐. For example, the supports of optimal transport plans can be characterized via 𝑐-
cyclical monotonicity (generalizing cyclical monotonicity) and optimal dual potentials
can be characterized via 𝑐-concavity (generalizing convexity). Arguing that the optimal
transport plan is induced by a transport map requires additional information about the
differentiability of 𝑐.
Lemma 1.3.13 (gluing lemma). If 𝛾 1,2 ∈ P (X1 × X2 ) and 𝛾 2,3 ∈ P (X2 × X3 ) have the
same marginal distribution on X2 , then there exists 𝛾 ∈ P (X1 × X2 × X3 ) such that its
first two marginals are 𝛾 1,2 and its last two marginals are 𝛾 2,3 .
Proof. Let 𝜇 denote the common X2 -marginal of 𝛾 1,2 and 𝛾 2,3 . The idea is to first draw
𝑋 2 ∼ 𝜇. Then, draw 𝑋 1 from its conditional distribution given 𝑋 2 (according to 𝛾 1,2 ), and
similarly draw 𝑋 3 from its conditional distribution given 𝑋 2 (according to 𝛾 2,3 ). Then, take
𝛾 to be the law of the triple (𝑋 1, 𝑋 2, 𝑋 3 ).
The way to formalize this argument is via disintegration of measure.
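For finitely supported measures, the gluing construction is a one-line formula, 𝛾(𝑥1, 𝑥2, 𝑥3) = 𝛾1,2(𝑥1, 𝑥2) 𝛾2,3(𝑥2, 𝑥3)/𝜇(𝑥2), reflecting the conditional independence of 𝑋1 and 𝑋3 given 𝑋2. An illustrative sketch (not from the text) with randomly generated couplings:

```python
import numpy as np

rng = np.random.default_rng(7)

# Random couplings gamma12 (on X1 x X2) and gamma23 (on X2 x X3)
# sharing the common X2-marginal mu
n1, n2, n3 = 3, 4, 5
mu = rng.dirichlet(np.ones(n2))                   # common X2-marginal
g12 = rng.dirichlet(np.ones(n1), size=n2).T * mu  # column sums = mu
g23 = rng.dirichlet(np.ones(n3), size=n2) * mu[:, None]  # row sums = mu

# Glue: draw X2 ~ mu, then X1 | X2 and X3 | X2 independently given X2
g = g12[:, :, None] * g23[None, :, :] / mu[None, :, None]

assert np.allclose(g.sum(axis=2), g12)    # (X1, X2)-marginal is gamma12
assert np.allclose(g.sum(axis=0), g23)    # (X2, X3)-marginal is gamma23
```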
Proof. Clearly, 𝑊2 is symmetric in its two arguments. It is also clear that 𝜇 = 𝜈 implies 𝑊2 (𝜇, 𝜈) = 0. Conversely, if 𝑊2 (𝜇, 𝜈) = 0, then there exists a coupling (𝑋, 𝑌 ) of (𝜇, 𝜈) such that ‖𝑋 − 𝑌 ‖² = 0 a.s., or equivalently 𝑋 = 𝑌 a.s., which gives 𝜇 = 𝜈.
To verify the triangle inequality, we use the gluing lemma. Let 𝜇1, 𝜇2, 𝜇3 ∈ P2 (R𝑑 ), let 𝛾★₁,₂ be optimal for (𝜇1, 𝜇2), and let 𝛾★₂,₃ be optimal for (𝜇2, 𝜇3). Let 𝛾 be obtained by gluing 𝛾★₁,₂ and 𝛾★₂,₃, and let (𝑋1, 𝑋2, 𝑋3) ∼ 𝛾. Since (𝑋1, 𝑋3) is a coupling of (𝜇1, 𝜇3), the triangle inequality for the 𝐿²(P) norm yields
𝑊2 (𝜇1, 𝜇3) ≤ E[‖𝑋1 − 𝑋3 ‖²]^{1/2} ≤ E[‖𝑋1 − 𝑋2 ‖²]^{1/2} + E[‖𝑋2 − 𝑋3 ‖²]^{1/2} = 𝑊2 (𝜇1, 𝜇2) + 𝑊2 (𝜇2, 𝜇3) .
Proposition 1.3.15. The metric space (P2 (R𝑑 ),𝑊2 ) is complete and separable. Also, we have 𝑊2 (𝜇𝑛 , 𝜇) → 0 if and only if 𝜇𝑛 → 𝜇 weakly and ∫ ‖·‖² d𝜇𝑛 → ∫ ‖·‖² d𝜇.
The continuity equation. Next, we are going to consider dynamics in the space of
measures, i.e., curves of measures 𝑡 ↦→ 𝜇𝑡 . Throughout, we assume these curves are
sufficiently nice, in the following sense.
More generally, the metric derivative can be defined on any metric space and represents
the magnitude of the velocity of the curve, see [AGS08].
It is helpful to adopt a fluid dynamics analogy in which we think of 𝜇𝑡 as the mass
density of a fluid at time 𝑡. There are two complementary perspectives on fluid flows:
the Lagrangian perspective which emphasizes particle trajectories, and the Eulerian
perspective which tracks the evolution of the fluid density.
Suppose that 𝑋0 ∼ 𝜇0 and that 𝑡 ↦→ 𝑋𝑡 evolves according to the ODE 𝑋̇𝑡 = 𝑣𝑡 (𝑋𝑡 ). Here, (𝑣𝑡 )𝑡≥0 is a family of vector fields, i.e. mappings R𝑑 → R𝑑 . Since the ODE describes the evolution of the particle trajectory, it is the Lagrangian description of the dynamics. The corresponding Eulerian description is the continuity equation.
Theorem 1.3.17. Let 𝑡 ↦→ 𝑣𝑡 be a family of vector fields and suppose that the random variables 𝑡 ↦→ 𝑋𝑡 evolve according to 𝑋̇𝑡 = 𝑣𝑡 (𝑋𝑡 ). Then, the law 𝜇𝑡 of 𝑋𝑡 evolves according to the continuity equation
𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 . (1.3.18)
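The Lagrangian/Eulerian correspondence can be verified symbolically for a simple linear flow (an illustrative sketch, not from the text; the vector field 𝑣𝑡(𝑥) = −𝑥 and Gaussian initialization are arbitrary choices):

```python
import sympy as sp

x, t = sp.symbols('x t', real=True)

# Lagrangian picture: X_t = X_0 e^{-t} solves dX_t/dt = v_t(X_t) for the
# linear vector field v_t(x) = -x; if X_0 ~ N(0, 1), then mu_t = N(0, e^{-2t}).
s2 = sp.exp(-2 * t)
mu = sp.exp(-x**2 / (2 * s2)) / sp.sqrt(2 * sp.pi * s2)
v = -x

# Eulerian picture: mu_t should solve the continuity equation (1.3.18)
residual = sp.diff(mu, t) + sp.diff(mu * v, x)
assert sp.simplify(residual) == 0
```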
1. For any family of vector fields 𝑡 ↦→ 𝑣̃𝑡 such that the continuity equation (1.3.18) holds, we have |𝜇̇|(𝑡) ≤ ‖𝑣̃𝑡 ‖_{𝐿²(𝜇𝑡)} for all 𝑡.
2. Conversely, there exists a unique choice of vector fields 𝑡 ↦→ 𝑣𝑡 such that the continuity equation (1.3.18) holds and ‖𝑣𝑡 ‖_{𝐿²(𝜇𝑡)} ≤ |𝜇̇|(𝑡) for all 𝑡. The choice of vector fields is also characterized by the following property: the continuity equation (1.3.18) holds and for each 𝑡, 𝑣𝑡 = ∇𝜓𝑡 for a function 𝜓𝑡 : R𝑑 → R.
𝑣𝑡 = lim_{𝛿↘0} (𝑇𝜇𝑡→𝜇𝑡+𝛿 − id)/𝛿 . (1.3.20)
Proof. 1. Proof of the first statement. Let 𝛿 > 0 and consider the flow map 𝐹𝑡,𝑡+𝛿 defined as follows. Given any initial point 𝑥𝑡 ∈ R𝑑 , consider the ODE 𝑥̇𝑡 = 𝑣̃𝑡 (𝑥𝑡 ) started at 𝑥𝑡 . Then, 𝐹𝑡,𝑡+𝛿 maps 𝑥𝑡 to the solution 𝑥𝑡+𝛿 of the ODE at time 𝑡 + 𝛿. If 𝑋𝑡 ∼ 𝜇𝑡 , then the continuity equation implies 𝐹𝑡,𝑡+𝛿 (𝑋𝑡 ) ∼ 𝜇𝑡+𝛿 , i.e., 𝐹𝑡,𝑡+𝛿 is a valid transport map from 𝜇𝑡 to 𝜇𝑡+𝛿 . Hence, we can estimate
𝑊2 (𝜇𝑡 , 𝜇𝑡+𝛿)/𝛿 ≤ √( ∫ (‖𝐹𝑡,𝑡+𝛿 − id‖²/𝛿²) d𝜇𝑡 ) .
Since (𝐹𝑡,𝑡+𝛿 − id)/𝛿 → 𝑣̃𝑡 as 𝛿 ↘ 0, taking the limit yields |𝜇̇|(𝑡) ≤ ‖𝑣̃𝑡 ‖_{𝐿²(𝜇𝑡)}.
2. Uniqueness of the optimal vector field. Suppose we find 𝑡 ↦→ 𝑣𝑡 satisfying the continuity equation and such that ‖𝑣𝑡 ‖_{𝐿²(𝜇𝑡)} ≤ |𝜇̇|(𝑡). In light of the first statement, it implies that the zero vector field is the minimizer of 𝑤𝑡 ↦→ ‖𝑣𝑡 + 𝑤𝑡 ‖_{𝐿²(𝜇𝑡)} among all vector fields 𝑤𝑡 such that div(𝜇𝑡 𝑤𝑡 ) = 0. This is a strictly convex problem, so the minimizer is unique, meaning that the family 𝑡 ↦→ 𝑣𝑡 is uniquely determined.
3. Gradient vector fields are optimal. Here, we show that if the continuity equation
holds for the family of vector fields 𝑡 ↦→ 𝑣𝑡 and that 𝑣𝑡 = ∇𝜓𝑡 for all 𝑡, then the
vector fields are optimal.
There are at least two ways of seeing why gradient vector fields should be optimal. First, the continuity equation is equivalent to requiring that for all test functions 𝜑 : R𝑑 → R, it holds that 𝜕𝑡 ∫ 𝜑 d𝜇𝑡 = ∫ ⟨∇𝜑, 𝑣𝑡 ⟩ d𝜇𝑡 . In this expression, the vector field 𝑣𝑡 only enters through inner products with gradients. To put it another way, if we consider the space 𝑆 B {∇𝜓 | 𝜓 : R𝑑 → R} of gradients, viewed as a subspace
we consider the space 𝑆 B {∇𝜓 | 𝜓 : R𝑑 → R} of gradients, viewed as a subspace
of 𝐿²(𝜇𝑡 ), then we can write 𝐿²(𝜇𝑡 ) = 𝑆 ⊕ 𝑆⊥ (actually, to make this valid we should take the closure of 𝑆, but we will ignore this detail). If we decompose 𝑣𝑡 according to this direct sum, then 𝑣𝑡 = ∇𝜓𝑡 + 𝑤𝑡 for some function 𝜓𝑡 and some 𝑤𝑡 which is orthogonal (in 𝐿²(𝜇𝑡 )) to 𝑆. If we replace 𝑣𝑡 by ∇𝜓𝑡 , then the continuity equation continues to hold, but we have only made the norm ‖𝑣𝑡 ‖_{𝐿²(𝜇𝑡)} smaller; hence the optimal choice of 𝑣𝑡 should be a gradient.
The second line of reasoning comes from the proof of the first statement: the reason why the metric derivative |𝜇̇|(𝑡) was upper bounded by ‖𝑣̃𝑡 ‖_{𝐿²(𝜇𝑡)} is that the flow map corresponding to 𝑡 ↦→ 𝑣̃𝑡 furnishes a possibly suboptimal transport map. To fix this, the flow map for the optimal 𝑡 ↦→ 𝑣𝑡 should be approximately equal to the optimal transport map, i.e. 𝑣𝑡 ≈ (𝑇𝜇𝑡→𝜇𝑡+𝛿 − id)/𝛿. From the fundamental theorem of optimal transport, however, 𝑇𝜇𝑡→𝜇𝑡+𝛿 is the gradient of a convex function, so in the limit 𝑣𝑡 should be as well.
Instead of using these arguments, we will provide a proof based on direct computation. If 𝑣𝑡 = ∇𝜓𝑡 , the continuity equation shows that
( ∫ 𝜓𝑡 d𝜇𝑡+𝛿 − ∫ 𝜓𝑡 d𝜇𝑡 ) / 𝛿 = ∫ 𝜓𝑡 𝜕𝑡 𝜇𝑡 + 𝑜 (1) = − ∫ 𝜓𝑡 div(𝜇𝑡 ∇𝜓𝑡 ) + 𝑜 (1) = ∫ ‖∇𝜓𝑡 ‖² d𝜇𝑡 + 𝑜 (1) .
On the other hand, by the Cauchy–Schwarz inequality,
( ∫ 𝜓𝑡 d𝜇𝑡+𝛿 − ∫ 𝜓𝑡 d𝜇𝑡 ) / 𝛿 = ∫ ⟨∇𝜓𝑡 , (𝑇𝜇𝑡→𝜇𝑡+𝛿 − id)/𝛿⟩ d𝜇𝑡 + 𝑜 (1) ≤ √( ∫ ‖∇𝜓𝑡 ‖² d𝜇𝑡 ) · 𝑊2 (𝜇𝑡 , 𝜇𝑡+𝛿)/𝛿 + 𝑜 (1) .
Comparing the two displays, √( ∫ ‖∇𝜓𝑡 ‖² d𝜇𝑡 ) ≤ 𝑊2 (𝜇𝑡 , 𝜇𝑡+𝛿)/𝛿 + 𝑜 (1), which proves that ‖𝑣𝑡 ‖_{𝐿²(𝜇𝑡)} ≤ |𝜇̇|(𝑡), as desired.
Our next goal is to interpret 𝑣𝑡 as the velocity vector to the curve, and ‖𝑣𝑡 ‖_{𝐿²(𝜇𝑡)} as its norm, all through the lens of Riemannian geometry.
Here, 𝛾̇ (𝑡) denotes the tangent vector to the curve at time 𝑡. Note that the norm of the tangent vector is measured w.r.t. the inner product on the tangent space 𝑇𝛾(𝑡) M, hence we write ‖𝛾̇ (𝑡)‖_{𝛾(𝑡)} . If the infimum is achieved by a curve 𝛾, then 𝛾 is referred to as a geodesic (a shortest path); if 𝑡 ↦→ ‖𝛾̇ (𝑡)‖_{𝛾(𝑡)} is constant, then it is called a constant-speed geodesic. From now on, we will only consider constant-speed geodesics, and the words “constant speed” will be dropped for brevity.
Given a functional F : M → R, the gradient of F at 𝑝 is defined to be the unique element ∇F(𝑝) ∈ 𝑇𝑝 M such that for all curves (𝑝𝑡 )𝑡∈R passing through 𝑝 at time 0 with velocity 𝑣 ∈ 𝑇𝑝 M, it holds that 𝜕𝑡 |𝑡=0 F(𝑝𝑡 ) = ⟨∇F(𝑝), 𝑣⟩𝑝 .
𝑇𝜇 P2,ac (R𝑑 ) B closure of {∇𝜓 | 𝜓 ∈ Cc∞ (R𝑑 )} in 𝐿²(𝜇) ,
and it can equivalently be described as
𝑇𝜇 P2,ac (R𝑑 ) = closure of {𝜆 (𝑇 − id) | 𝜆 > 0, 𝑇 is an optimal transport map} in 𝐿²(𝜇) .
We equip the tangent space 𝑇𝜇 P2,ac (R𝑑 ) with the 𝐿 2 (𝜇) norm, which gives a Riemannian
metric. This does not define a genuine Riemannian manifold (e.g., it is not locally homeomorphic to a Euclidean space or even a Hilbert space), but we will treat it as one for the
purpose of developing calculation rules.
If the continuity equation 𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 holds and 𝑣𝑡 ∈ 𝑇𝜇𝑡 P2,ac (R𝑑 ), then 𝑣𝑡 is
the tangent vector to the curve at time 𝑡. The condition 𝑣𝑡 ∈ 𝑇𝜇𝑡 P2,ac (R𝑑 ) is equivalent to
saying that 𝑣𝑡 is the optimal vector field considered in Theorem 1.3.19.
There are two questions to address. First, is this Riemannian structure compatible
with the 2-Wasserstein distance? In other words, we know that a Riemannian metric
induces a distance function; is the distance function induced by the Riemannian structure
of P2,ac (R𝑑 ) equal to 𝑊2 ? Second, what are the geodesics? We answer both questions via
the following theorem.
Proof. Suppose that 𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0. Then, ∫₀¹ ‖𝑣𝑡 ‖_{𝐿²(𝜇𝑡)} d𝑡 ≥ ∫₀¹ |𝜇̇|(𝑡) d𝑡. For a partition 0 ≤ 𝑡0 < 𝑡1 < · · · < 𝑡𝑘 ≤ 1,
𝑊2 (𝜇0, 𝜇1) ≤ ∑_{𝑖=1}^{𝑘} 𝑊2 (𝜇𝑡𝑖−1 , 𝜇𝑡𝑖 ) = ∑_{𝑖=1}^{𝑘} (𝑊2 (𝜇𝑡𝑖−1 , 𝜇𝑡𝑖 )/(𝑡𝑖 − 𝑡𝑖−1)) (𝑡𝑖 − 𝑡𝑖−1) .
As the mesh of the partition tends to zero, we obtain 𝑊2 (𝜇0, 𝜇1) ≤ ∫₀¹ |𝜇̇|(𝑡) d𝑡. This shows that 𝑊2 (𝜇0, 𝜇1) is at most the value of the infimum.
To show that equality holds, let 𝑋𝑡 be defined as in the theorem statement and note that E[‖𝑋̇𝑡 ‖²] = ‖𝑣𝑡 ‖²_{𝐿²(𝜇𝑡)} by the correspondence of the Lagrangian and Eulerian perspectives. (This can be verified by writing the vector field explicitly as 𝑣𝑡 = (𝑇1 − id) ◦ 𝑇𝑡⁻¹, where 𝑇𝑡 B (1 − 𝑡) id + 𝑡 𝑇𝜇0→𝜇1 .) Since E[‖𝑋̇𝑡 ‖²] = E[‖𝑋1 − 𝑋0 ‖²] = 𝑊2²(𝜇0, 𝜇1) does not depend on time, the curve has constant speed, and ∫₀¹ ‖𝑣𝑡 ‖_{𝐿²(𝜇𝑡)} d𝑡 = 𝑊2 (𝜇0, 𝜇1).
To show uniqueness, again work in the Lagrangian perspective: suppose we have an evolution 𝑡 ↦→ 𝑋𝑡 of random variables such that 𝑡 ↦→ E[‖𝑋̇𝑡 ‖²] is constant, and 𝑋0 ∼ 𝜇0 ,
𝑋1 ∼ 𝜇1. Then, we have
𝑊2²(𝜇0, 𝜇1) ≤ E[‖𝑋1 − 𝑋0 ‖²] = E[ ‖ ∫₀¹ 𝑋̇𝑡 d𝑡 ‖² ] ≤ E ∫₀¹ ‖𝑋̇𝑡 ‖² d𝑡 = ∫₀¹ ‖𝑣𝑡 ‖²_{𝐿²(𝜇𝑡)} d𝑡 ,
where the last equality follows from the constant-speed assumption. In order for the first inequality to be an equality, (𝑋0, 𝑋1) must be an optimal coupling. In order for the second inequality to be an equality, strict convexity of ‖·‖² implies that 𝑋̇𝑡 is constant in time and equal to its average ∫₀¹ 𝑋̇𝑡 d𝑡 = 𝑋1 − 𝑋0.
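For one-dimensional Gaussians, both the displacement interpolation and 𝑊2 are explicit, so the constant-speed property 𝑊2(𝜇0, 𝜇𝑡) = 𝑡 𝑊2(𝜇0, 𝜇1) can be checked directly. The sketch below (illustrative, not from the text) assumes the closed form 𝑊2² = (𝑚0 − 𝑚1)² + (𝜎0 − 𝜎1)² for 1D Gaussians.

```python
import numpy as np

# W2 between 1D Gaussians: W2(N(m0,s0^2), N(m1,s1^2)) = hypot(m0-m1, s0-s1)
def w2(m0, s0, m1, s1):
    return np.hypot(m0 - m1, s0 - s1)

m0, s0 = 0.0, 1.0
m1, s1 = 3.0, 2.5

# Geodesic mu_t = ((1-t) id + t T)# mu_0, which here is again Gaussian,
# N((1-t) m0 + t m1, ((1-t) s0 + t s1)^2)
for t in [0.0, 0.25, 0.5, 0.9, 1.0]:
    mt = (1-t)*m0 + t*m1
    st = (1-t)*s0 + t*s1
    assert abs(w2(m0, s0, mt, st) - t * w2(m0, s0, m1, s1)) < 1e-12
    assert abs(w2(mt, st, m1, s1) - (1-t) * w2(m0, s0, m1, s1)) < 1e-12
```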
F(𝑝𝑡 ) ≤ (1 − 𝑡) F(𝑝0) + 𝑡 F(𝑝1) − (𝛼 𝑡 (1 − 𝑡)/2) d(𝑝0, 𝑝1)² ,
where d is the induced Riemannian distance (1.3.21).
2. For all 𝑝, 𝑞 ∈ M,
F(𝑞) ≥ F(𝑝) + ⟨∇F(𝑝), log𝑝 (𝑞)⟩𝑝 + (𝛼/2) d(𝑝, 𝑞)² .
Here, ∇ denotes the Riemannian gradient.
any constant to the first variation. This does not cause any ambiguity, as we now see.
Recall that 𝑣𝑡 is the tangent vector to the curve of measures at time 𝑡 if the continuity
equation 𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 holds and 𝑣𝑡 ∈ 𝑇𝜇𝑡 P2,ac (R𝑑 ). Using the continuity equation
with a curve such that 𝜇0 = 𝜇,
$$\partial_t\big|_{t=0}\,F(\mu_t) \;=\; \int \delta F(\mu)\,\partial_t\big|_{t=0}\,\mu_t \;=\; -\int \delta F(\mu)\,\operatorname{div}(v_0\,\mu) \;=\; \int \langle \nabla\delta F(\mu),\,v_0\rangle\,\mathrm{d}\mu\,.$$
(Here, the ∇ is the Euclidean gradient.) Since ∇𝛿F(𝜇) is the gradient of a function, from
our characterization of the tangent space we know that ∇𝛿F(𝜇) ∈ 𝑇𝜇 P2,ac (R𝑑 ). Therefore,
the equation above says that the Wasserstein gradient of F at 𝜇 is ∇𝛿F(𝜇).
Theorem 1.4.1. Let F : P2,ac (R𝑑 ) → R ∪ {∞} be a functional. Then, its Wasserstein
gradient at 𝜇 is
$$\nabla_{W_2} F(\mu) \;=\; \nabla\,\delta F(\mu)\,.$$
Since we take the Euclidean gradient of the first variation, the fact that the first variation
is only defined up to additive constant does not bother us.
The Wasserstein gradient flow of F is by definition a curve of measures 𝑡 ↦→ 𝜇𝑡 such
that its tangent vector 𝑣𝑡 at time 𝑡 is 𝑣𝑡 = −∇𝑊2 F(𝜇𝑡 ). Substituting this into the continuity
equation (1.3.18), we obtain the gradient flow equation
$$\partial_t \mu_t \;=\; \operatorname{div}\bigl(\mu_t\,\nabla_{W_2} F(\mu_t)\bigr)\,.$$
For the functional F = KL(· k 𝜋) with 𝜋 ∝ exp(−𝑉 ), the first variation is
$$\delta F(\mu) \;=\; V + \ln\mu + \text{constant}$$
1.4. THE LANGEVIN SDE AS A WASSERSTEIN GRADIENT FLOW 47
and therefore
$$\nabla_{W_2} F(\mu) \;=\; \nabla V + \nabla\ln\mu \;=\; \nabla\ln\frac{\mu}{\pi}\,.$$
The Wasserstein gradient flow of F satisfies
$$\partial_t \mu_t \;=\; \operatorname{div}\Bigl(\mu_t\,\nabla\ln\frac{\mu_t}{\pi}\Bigr)\,.$$
Comparing with the Fokker–Planck equation 𝜕𝑡 𝜋𝑡 = ℒ ∗ 𝜋𝑡 and the form of the
adjoint generator ℒ ∗ for the Langevin diffusion (see Example 1.2.8), we obtain a
truly remarkable fact: the law 𝑡 ↦→ 𝜋𝑡 of the Langevin diffusion with potential 𝑉 is the
Wasserstein gradient flow of KL(· k 𝜋).
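This remarkable fact can be sanity-checked in closed form. For the quadratic potential 𝑉 (𝑥) = 𝑥²/2 (so 𝜋 is the standard Gaussian), the Langevin diffusion is the Ornstein–Uhlenbeck process, whose law remains Gaussian with explicit mean and variance; a minimal numerical sketch (the helper names are our own) verifies that KL(𝜋𝑡 k 𝜋) decays at least as fast as exp(−2𝑡), the rate corresponding to 𝛼 = 1:

```python
import math

def kl_gauss(m, s2):
    # KL(N(m, s2) || N(0, 1)) in closed form
    return 0.5 * (s2 + m * m - 1.0 - math.log(s2))

def ou_law(m0, s20, t):
    # Law of the Langevin diffusion for V(x) = x^2/2 (the OU process),
    # started at N(m0, s20): it stays Gaussian with these parameters.
    m = m0 * math.exp(-t)
    s2 = 1.0 + (s20 - 1.0) * math.exp(-2.0 * t)
    return m, s2

m0, s20 = 3.0, 0.25
kl0 = kl_gauss(m0, s20)
# Check KL(pi_t || pi) <= exp(-2 t) KL(pi_0 || pi) along a grid of times
decay_ok = all(
    kl_gauss(*ou_law(m0, s20, t)) <= math.exp(-2.0 * t) * kl0 + 1e-12
    for t in [0.1 * k for k in range(1, 51)]
)
```

The closed-form OU law is exactly the solution of the Fokker–Planck equation here, so no discretization error enters the check.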
The calculus of optimal transport was introduced by Otto in [Ott01], and it is often
known as Otto calculus; the interpretation of the Langevin diffusion in this context
was put forth in the seminal work [JKO98]. The paper [Ott01] also raises and answers a
salient question: given that we can view dynamics as gradient flows in different ways (e.g.
the Langevin diffusion can be either viewed as the gradient flow of the Dirichlet energy
in 𝐿 2 (𝜋), or the gradient flow of the KL divergence in P2,ac (R𝑑 )), what makes us prefer
one gradient flow structure over another? Otto argues that the Wasserstein geometry
is particularly natural because it cleanly separates out two aspects of the problem: the
geometry of the ambient space, which is reflected in the metric on P2,ac (R𝑑 ), and the
objective functional. Moreover, the objective functional in the Wasserstein perspective is
physically intuitive because it has an interpretation in thermodynamics. From a sampling
standpoint, the Wasserstein geometry is undoubtedly more compelling and useful.
In our exposition, we focused on the calculation rules for Wasserstein gradient flows,
but this is not how they are normally defined. Instead, one usually considers a sequence
of discrete approximations to the gradient flow and proves that there is a limiting curve;
this is called the minimizing movements scheme and it is developed in detail in [AGS08].
Recall that for a function 𝑓 : R𝑑 → R on Euclidean space and a curve 𝑡 ↦→ 𝑥𝑡 , we have 𝜕𝑡 𝑓 (𝑥𝑡 ) = h∇𝑓 (𝑥𝑡 ), 𝑥¤𝑡 i and
$$\partial_t^2\,f(x_t) \;=\; \langle \dot x_t,\,\nabla^2 f(x_t)\,\dot x_t\rangle + \langle \nabla f(x_t),\,\ddot x_t\rangle\,.$$
Here, instead of just obtaining the Hessian, we have an additional term. However, if
𝑡 ↦→ 𝑥𝑡 is a constant-speed geodesic, then it has no acceleration (𝑥¥𝑡 = 0), and the extra
term vanishes.
In the same way, let (𝜇𝑡 )𝑡 ∈[0,1] denote a Wasserstein geodesic. Explicitly, if 𝑇 denotes
the optimal transport map from 𝜇0 to 𝜇 1 , then 𝜇𝑡 = [(1 − 𝑡) id + 𝑡 𝑇 ] # 𝜇 0 . We will calculate
$\partial_t^2|_{t=0}\,F(\mu_t)$, as a function of the tangent vector $T - \mathrm{id} \in T_{\mu_0}\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$; this is interpreted
as $\langle \nabla^2_{W_2} F(\mu_0)\,(T-\mathrm{id}),\,T-\mathrm{id}\rangle_{\mu_0}$. If we can lower bound this by $\alpha\,\|T-\mathrm{id}\|_{\mu_0}^2$, for all $\mu_0$ and
all optimal transport maps $T$, it means that F is 𝛼-strongly convex.
Write $\mathcal{E}(\mu) \coloneqq \int V\,\mathrm{d}\mu$ for the energy and $\mathcal{H}(\mu) \coloneqq \int \mu\ln\mu$ for the entropy. We deal
with each term in turn. For the energy, $\mathcal{E}(\mu_t) = \int V\circ T_t\,\mathrm{d}\mu_0$, so
$$\partial_t^2\,\mathcal{E}(\mu_t) \;=\; \int \langle T-\mathrm{id},\,(\nabla^2 V\circ T_t)\,(T-\mathrm{id})\rangle\,\mathrm{d}\mu_0\,.$$
If 𝑉 is 𝛼-strongly convex, then this is lower bounded by 𝛼 k𝑇 − idk 2𝜇0 , which means that E
is 𝛼-strongly convex.
The entropy is slightly trickier. Write 𝑇𝑡 B (1 − 𝑡) id + 𝑡 𝑇 . Since (𝑇𝑡 ) # 𝜇 0 = 𝜇𝑡 , the
change of variables formula shows that
$$\frac{\mu_0}{\mu_t\circ T_t} \;=\; \det\nabla T_t\,.$$
Therefore,
$$\mathcal{H}(\mu_t) \;=\; \int \mu_t\ln\mu_t \;=\; \int \mu_0\ln(\mu_t\circ T_t) \;=\; \int \mu_0\ln\frac{\mu_0}{\det\nabla T_t} \;=\; \mathcal{H}(\mu_0) - \int \mu_0\ln\det\bigl((1-t)\,I_d + t\,\nabla T\bigr)\,.$$
Already from the fact that − ln det is convex on the space of positive semidefinite matrices,
we can see that 𝜕𝑡2 H(𝜇𝑡 ) ≥ 0. A more careful computation based on the derivatives of
− ln det shows that (exercise!)
$$\partial_t^2\big|_{t=0}\,\mathcal{H}(\mu_t) \;=\; \int \|\nabla T - I_d\|_{\mathrm{HS}}^2\,\mathrm{d}\mu_0 \;\ge\; 0\,. \qquad(1.4.3)$$
We now explore the implications of this fact for the gradient flow. If 𝑡 ↦→ 𝜇𝑡 is the gradient
flow for a functional F with inf F = 0, then
$$\partial_t F(\mu_t) \;=\; \bigl\langle \nabla_{W_2}F(\mu_t),\ \underbrace{-\nabla_{W_2}F(\mu_t)}_{\text{tangent vector of the curve}}\bigr\rangle_{\mu_t} \;=\; -\|\nabla_{W_2}F(\mu_t)\|_{\mu_t}^2\,.$$
If F is moreover 𝛼-strongly convex, it satisfies the gradient domination (Polyak–Łojasiewicz, or PL)
inequality $\|\nabla_{W_2}F(\mu)\|_{\mu}^2 \ge 2\alpha\,F(\mu)$, and Grönwall's lemma then gives $F(\mu_t) \le \exp(-2\alpha t)\,F(\mu_0)$.
For F = KL(· k 𝜋), the gradient domination inequality
is precisely the log-Sobolev inequality (see Example 1.2.25); we have recovered the
Bakry–Émery theorem (Theorem 1.2.28) that CD(𝛼, ∞) implies an LSI, as well as Theo-
rem 1.2.24 which asserted that an LSI yields exponentially fast decay in the KL divergence.
Next, starting from the strong convexity inequality (1.4.6), we take 𝜇 = 𝜇★ so that
∇𝑊2 F(𝜇★) = 0, and hence
$$F(\nu) \;\ge\; \frac{\alpha}{2}\,W_2^2(\nu,\mu_\star) \qquad\text{for all } \nu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)\,.$$
This is a quadratic growth inequality; as the name suggests, it asserts that F must grow
quadratically away from its minimizer. For the Langevin diffusion, it says
$$\mathrm{KL}(\mu \,\|\, \pi) \;\ge\; \frac{\alpha}{2}\,W_2^2(\mu,\pi) \qquad\text{for all } \mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)\,. \qquad(1.4.8)$$
This is known as Talagrand’s T2 inequality and it is an example of a transportation
inequality. Such inequalities have been closely studied in relation to the concentration of
measure phenomenon (see [Han16, Chapter 4]).
It is a general fact that the PL inequality implies the quadratic growth inequality.
When applied to the Langevin diffusion, it says that the LSI implies the T2 inequality,
which is known as the Otto–Villani theorem [OV00]. See Exercise 1.16 for a proof.
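Both sides of the T2 inequality (1.4.8) are explicit when 𝜇 and 𝜋 are one-dimensional Gaussians (𝑊2 then couples quantiles monotonically), which makes for a quick numerical check; here 𝜋 = N(0, 1), so 𝛼 = 1, and the helper names below are our own illustrative choices:

```python
import math

def kl_gauss(m, s):
    # KL(N(m, s^2) || N(0, 1)) in closed form
    return 0.5 * (s * s + m * m - 1.0 - 2.0 * math.log(s))

def w2sq_gauss(m, s):
    # Squared 2-Wasserstein distance between N(m, s^2) and N(0, 1):
    # in one dimension, W2^2 = (difference of means)^2 + (difference of stddevs)^2
    return m * m + (s - 1.0) ** 2

# Talagrand T2 with alpha = 1: KL >= (1/2) W2^2 on a grid of Gaussians
t2_ok = all(
    kl_gauss(m, s) >= 0.5 * w2sq_gauss(m, s) - 1e-12
    for m in [-2.0, 0.0, 1.5]
    for s in [0.3, 1.0, 2.5]
)
```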
Strong convexity also implies another fact: the gradient flow contracts exponentially
fast. Namely, if we have two gradient flows 𝑡 ↦→ 𝜇𝑡 and 𝑡 ↦→ 𝜈𝑡 for an 𝛼-strongly convex
functional F, then $W_2(\mu_t,\nu_t) \le \exp(-\alpha t)\,W_2(\mu_0,\nu_0)$.
Theorem 1.4.10. Suppose that ∇²𝑉 ⪰ 𝛼𝐼𝑑 for some 𝛼 ∈ R. If (𝑍𝑡 )𝑡 ≥0 and (𝑍𝑡′ )𝑡 ≥0
denote two copies of the Langevin diffusion (1.E.1) with potential 𝑉 and driven by the
same Brownian motion, then, almost surely, $\|Z_t - Z_t'\| \le \exp(-\alpha t)\,\|Z_0 - Z_0'\|$ for all 𝑡 ≥ 0.
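The contraction under this synchronous coupling is easy to observe in simulation. The following Euler–Maruyama sketch uses our own illustrative choices of step size, horizon, and potential; for the quadratic 𝑉 the shared noise cancels exactly, so the gap contracts deterministically by the factor (1 − ℎ) per step:

```python
import math
import random

random.seed(0)

def grad_V(x):
    # Potential V(x) = x^2 / 2, so grad V(x) = x  (alpha = 1)
    return x

h, n_steps = 0.01, 500
x, y = 5.0, -3.0
gap0 = abs(x - y)
for _ in range(n_steps):
    # Euler-Maruyama for dX = -grad V(X) dt + sqrt(2) dB, with the SAME
    # Brownian increment driving both copies (synchronous coupling)
    noise = math.sqrt(2.0 * h) * random.gauss(0.0, 1.0)
    x = x - h * grad_V(x) + noise
    y = y - h * grad_V(y) + noise
gap = abs(x - y)
# For this quadratic potential the gap contracts by (1 - h) each step,
# mirroring the continuous-time rate exp(-alpha t)
contracts = gap <= gap0 * (1.0 - h) ** n_steps + 1e-9
```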
Finally, in the case 𝛼 = 0, so that F is weakly convex, we can also obtain a convergence
result by considering the Lyapunov functional $\mathsf{L}_t \coloneqq t\,F(\mu_t) + \tfrac12\,W_2^2(\mu_t,\mu_\star)$. In order to
differentiate this Lyapunov functional, we need the following theorem.
1.5. OVERVIEW OF THE CONVERGENCE RESULTS 51
Theorem 1.4.11. For 𝜈 ∈ P2,ac (R𝑑 ), the Wasserstein gradient of 𝜇 ↦→ 𝑊22 (𝜇, 𝜈) at 𝜇 is
given by −2 (𝑇𝜇→𝜈 − id).
• The target 𝜋 satisfies the log-Sobolev inequality (LSI) with constant 1/𝛼 if and only
if for all 𝜋0 ∈ P2,ac (R𝑑 ), along the Langevin dynamics 𝑡 ↦→ 𝜋𝑡 started at 𝜋 0 it holds
that KL(𝜋𝑡 k 𝜋) ≤ exp(−2𝛼𝑡) KL(𝜋0 k 𝜋). The LSI is a gradient domination condition
in Wasserstein space.
• The target 𝜋 satisfies the Poincaré inequality (PI) with constant 1/𝛼 if and only if
for all 𝜋0 ∈ P2,ac (R𝑑 ), along the Langevin dynamics 𝑡 ↦→ 𝜋𝑡 started at 𝜋0 it holds
that 𝜒 2 (𝜋𝑡 k 𝜋) ≤ exp(−2𝛼𝑡) 𝜒 2 (𝜋 0 k 𝜋). The Poincaré inequality is a spectral gap
condition for the generator of the Langevin diffusion.
The conditions are listed from strongest to weakest: 𝛼-strong log-concavity implies
𝛼⁻¹-LSI, which implies 𝛼⁻¹-Poincaré. In addition:
When we turn towards discretization analysis, there are two main ways in which the
continuous-time result affects the analysis: the strength of the continuous-time result,
and the metric in which we must perform the analysis.
Regarding the first point, the first two results are generally more useful because at
initialization, we typically have 𝑊22 (𝜋0, 𝜋), KL(𝜋0 k 𝜋) = 𝑂 (𝑑) (at least when 𝜋 is strongly
log-concave). Hence, exponential convergence in 𝑊22 and KL both imply that the amount
of time it takes for the Langevin diffusion to reach 𝜀 error is 𝑂 (log(𝑑/𝜀)). In contrast, the
chi-squared divergence is typically much larger at initialization: 𝜒 2 (𝜋 0 k 𝜋) = exp(𝑂 (𝑑)).
Therefore, the chi-squared result implies that the Langevin diffusion takes 𝑂 (𝑑 ∨ log(1/𝜀))
time to reach 𝜀 error.
Regarding the second point, the 𝑊2 contraction under strong log-concavity is the
easiest to turn into a sampling guarantee for the discretized algorithm. This is because
to bound the 𝑊2 distance, we can use straightforward coupling techniques. On the other
hand, a continuous-time result in KL or 𝜒 2 often requires the discretization analysis to
also be carried out in KL or 𝜒 2 , which is substantially trickier.
Definition 1.5.1. The total variation (TV) distance between probability measures
𝜇, 𝜈 ∈ P (X) is defined via
$$\|\mu-\nu\|_{\mathrm{TV}} \;\coloneqq\; \sup_{A\subseteq X}\,|\mu(A)-\nu(A)| \;=\; \sup_{f:X\to[0,1]}\,\Bigl|\int f\,\mathrm{d}\mu - \int f\,\mathrm{d}\nu\Bigr|$$
$$=\; \inf_{\gamma\in C(\mu,\nu)} \int \mathbf{1}\{x\ne y\}\,\mathrm{d}\gamma(x,y) \;=\; \frac{1}{2}\int \Bigl|\frac{\mathrm{d}\mu}{\mathrm{d}\lambda} - \frac{\mathrm{d}\nu}{\mathrm{d}\lambda}\Bigr|\,\mathrm{d}\lambda\,,$$
where 𝜆 is any common dominating measure for 𝜇 and 𝜈.
The TV metric is indeed a metric on the space P (X) (in fact, it can be extended to a
norm on the space M (X) of signed measures). The TV distance can be thought of as both
a transport metric (with cost (𝑥, 𝑦) ↦→ 1{𝑥 ≠ 𝑦}; in fact, the TV distance is a special case
of the 𝑊1 metric introduced in Exercise 1.11) and an information divergence.
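On a finite space all of these formulas can be evaluated exactly, which makes the three characterizations of TV easy to compare; the distributions below are arbitrary illustrative choices:

```python
from itertools import chain, combinations

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.2, 0.6]

# Half the L1 distance between densities (lambda = counting measure)
tv_l1 = 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

# sup over events A of |mu(A) - nu(A)|, by enumerating all subsets
idx = range(len(mu))
subsets = chain.from_iterable(combinations(idx, r) for r in range(len(mu) + 1))
tv_sup = max(abs(sum(mu[i] for i in A) - sum(nu[i] for i in A)) for A in subsets)

# Optimal coupling for the cost 1{x != y}: keep mass min(mu_i, nu_i) in place,
# so the off-diagonal (moved) mass is 1 - sum_i min(mu_i, nu_i)
tv_coupling = 1.0 - sum(min(a, b) for a, b in zip(mu, nu))
```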
The family of information divergences can be further expanded by introducing the
following definition.
$$D_f(\mu\,\|\,\nu) \;\coloneqq\; \int f\Bigl(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\Bigr)\,\mathrm{d}\nu\,, \qquad \text{if } \mu \ll \nu\,.$$
In general, if 𝜇 is not absolutely continuous w.r.t. 𝜈, we let 𝑝𝜇 , 𝑝𝜈 denote the respective densities of 𝜇 and 𝜈 w.r.t. a
common dominating measure. Then,
$$D_f(\mu\,\|\,\nu) \;\coloneqq\; \int_{p_\nu>0} f\Bigl(\frac{p_\mu}{p_\nu}\Bigr)\,\mathrm{d}\nu \;+\; f'(\infty)\,\mu\{p_\nu = 0\}\,.$$
For example, the TV distance corresponds to 𝑓 (𝑥) = ½ |𝑥 − 1|, the KL divergence corre-
sponds to 𝑓 (𝑥) = 𝑥 ln 𝑥, and the 𝜒² divergence corresponds to 𝑓 (𝑥) = (𝑥 − 1)². When 𝑓
has superlinear growth, then 𝑓 ′(∞) = ∞ and hence D𝑓 (𝜇 k 𝜈) = ∞ unless 𝜇 ≪ 𝜈, but the
second more general definition given above is necessary to recover the TV distance.
We always have $\chi^2 \ge \ln(1+\chi^2) \ge \mathrm{KL} \ge 2\,\|\cdot\|_{\mathrm{TV}}^2$ (the last inequality is Pinsker's
inequality, see Exercise 2.10), and under a T2 transport inequality with constant 𝛼⁻¹
(which is implied by 𝛼⁻¹-LSI) we have $\mathrm{KL} \ge \frac{\alpha}{2}\,W_2^2$. This chain of inequalities helps to
explain why, if the KL divergence is of order 𝑑, then the 𝜒² divergence is of order exp 𝑑.
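The chain of inequalities is easy to verify numerically on a finite space; the pairs of distributions below are arbitrary illustrative choices:

```python
import math

pairs = [
    ([0.5, 0.3, 0.2], [0.2, 0.2, 0.6]),
    ([0.1, 0.1, 0.8], [0.3, 0.3, 0.4]),
]

def divergences(mu, nu):
    # chi-squared, KL, and TV for discrete distributions
    chi2 = sum((a - b) ** 2 / b for a, b in zip(mu, nu))
    kl = sum(a * math.log(a / b) for a, b in zip(mu, nu) if a > 0)
    tv = 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))
    return chi2, kl, tv

# chi^2 >= ln(1 + chi^2) >= KL >= 2 TV^2
chain_ok = all(
    chi2 >= math.log(1.0 + chi2) - 1e-12
    and math.log(1.0 + chi2) >= kl - 1e-12
    and kl >= 2.0 * tv * tv - 1e-12
    for chi2, kl, tv in (divergences(m, n) for m, n in pairs)
)
```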
In Section 2.2.4, we will also introduce the closely related family of divergences known
as Rényi divergences.
We conclude by stating a few key facts (without complete proofs) about 𝑓 -divergences.
The first is the data-processing inequality: for any Markov kernel 𝑃,
$$D_f(\mu P\,\|\,\nu P) \;\le\; D_f(\mu\,\|\,\nu)\,.$$
Proof sketch. To simplify, we will abuse notation and identify all probability measures
with densities. Then, by Jensen’s inequality,
$$D_f(\mu\,\|\,\nu) \;=\; \int f\Bigl(\frac{\mu(x)\,P(x,y)}{\nu(x)\,P(x,y)}\Bigr)\,\nu(\mathrm{d}x)\,P(x,\mathrm{d}y) \;=\; \int\!\!\int f\Bigl(\frac{\mu(x)\,P(x,y)}{\nu(x)\,P(x,y)}\Bigr)\,\frac{\nu(\mathrm{d}x)\,P(x,y)}{\nu P(y)}\,\nu P(\mathrm{d}y)$$
$$\ge\; \int f\Bigl(\int \frac{\mu(x)\,P(x,y)}{\nu(x)\,P(x,y)}\,\frac{\nu(\mathrm{d}x)\,P(x,y)}{\nu P(y)}\Bigr)\,\nu P(\mathrm{d}y) \;=\; \int f\Bigl(\frac{\mu P(y)}{\nu P(y)}\Bigr)\,\nu P(\mathrm{d}y) \;=\; D_f(\mu P\,\|\,\nu P)\,.$$
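On a finite space, a Markov kernel is a row-stochastic matrix, and the data-processing inequality can be checked directly; the kernel and distributions below are arbitrary illustrative choices, with 𝑓 (𝑥) = (𝑥 − 1)² giving the 𝜒² divergence:

```python
# A Markov kernel on a three-point space, as a row-stochastic matrix
P = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.2, 0.6]

def push(rho, P):
    # (rho P)(y) = sum_x rho(x) P(x, y)
    return [sum(rho[x] * P[x][y] for x in range(len(rho)))
            for y in range(len(P[0]))]

def chi2(mu, nu):
    # f-divergence with f(x) = (x - 1)^2
    return sum((a - b) ** 2 / b for a, b in zip(mu, nu))

# Data processing: applying the kernel can only decrease the divergence
dpi_holds = chi2(push(mu, P), push(nu, P)) <= chi2(mu, nu) + 1e-12
```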
The remaining facts are specific to the KL divergence. The Donsker–Varadhan theorem
expresses the KL divergence via a variational principle:
$$\mathrm{KL}(\mu\,\|\,\nu) \;=\; \sup_{g}\,\bigl\{\mathbb{E}_\mu g - \ln \mathbb{E}_\nu \exp g\bigr\}\,,$$
where the supremum ranges over (say) bounded measurable functions 𝑔.
The theorem asserts that the functionals 𝜇 ↦→ KL(𝜇 k𝜈) and 𝑔 ↦→ ln E𝜈 exp 𝑔 are convex
conjugates of each other. See [DZ10, Lemma 6.2.13] or [RS15, Theorem 5.4] for careful
proofs, or see the remark after Lemma 2.3.4.
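On a finite space the Donsker–Varadhan principle can be checked by hand: the supremum is attained at 𝑔 = ln(d𝜇/d𝜈) (up to additive constants), while any other 𝑔 gives a smaller value. A short sketch with our own helper names:

```python
import math

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.2, 0.6]

def dv_value(g):
    # The Donsker-Varadhan objective: E_mu[g] - ln E_nu[exp g]
    return (sum(a * gi for a, gi in zip(mu, g))
            - math.log(sum(b * math.exp(gi) for b, gi in zip(nu, g))))

kl = sum(a * math.log(a / b) for a, b in zip(mu, nu))
g_opt = [math.log(a / b) for a, b in zip(mu, nu)]  # optimizer g = ln(dmu/dnu)
g_other = [1.0, -0.5, 0.3]  # an arbitrary competitor

attains = abs(dv_value(g_opt) - kl) < 1e-12
below = dv_value(g_other) <= kl + 1e-12
```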
Lastly, we have the chain rule for the KL divergence.
Lemma 1.5.5 (chain rule for KL divergence). Let X1 , X2 be Polish spaces and suppose
we are given two probability measures 𝜇, 𝜈 ∈ P (X1 × X2 ) with 𝜇 ≪ 𝜈. Let 𝜇1 be the X1 -
marginal of 𝜇, and let 𝜇2|1 (· | ·) be the conditional distribution for 𝜇 on X2 conditioned
on the first coordinate; define 𝜈1 and 𝜈2|1 analogously. Then,
$$\mathrm{KL}(\mu\,\|\,\nu) \;=\; \mathrm{KL}(\mu_1\,\|\,\nu_1) + \mathbb{E}_{x\sim\mu_1}\,\mathrm{KL}\bigl(\mu_{2|1}(\cdot\mid x)\,\big\|\,\nu_{2|1}(\cdot\mid x)\bigr)\,.$$
We invite the reader to prove the chain rule in the discrete case (X1 and X2 are finite
sets), free of measure-theoretic guilt.
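In the discrete case the chain rule is a finite computation; the following sketch (the joint tables are arbitrary illustrative choices) checks it for two binary coordinates:

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Joint distributions on {0,1} x {0,1}, stored as mu[x1][x2]
mu = [[0.3, 0.2], [0.1, 0.4]]
nu = [[0.25, 0.25], [0.25, 0.25]]

mu1 = [sum(row) for row in mu]  # X1-marginals
nu1 = [sum(row) for row in nu]
kl_joint = kl([p for row in mu for p in row], [q for row in nu for q in row])

# Marginal term plus the expected conditional term
kl_chain = kl(mu1, nu1) + sum(
    mu1[x] * kl([p / mu1[x] for p in mu[x]], [q / nu1[x] for q in nu[x]])
    for x in range(2)
)
```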
Bibliographical Notes
Much of the material in this chapter is foundational, with entire textbooks giving com-
prehensive treatments of the topics. For stochastic calculus, there is of course a long list
of textbooks, but as a starting place we suggest [Ste01; Le 16]. For Markov semigroup
theory, see [BGL14; Han16]. For optimal transport, the core theory is developed in [Vil03;
San15], and for a rigorous development of Otto calculus see [AGS08; Vil09].
The notion of solution used in Section 1.1.3 is more typically called a strong solution to
the SDE, because given any Brownian motion 𝐵 we can find a process 𝑋 which is driven
by 𝐵 and which satisfies the SDE. There is also a notion of weak solution, in which we are
allowed to construct the probability space (Ω, ℱ, (ℱ𝑡 )𝑡 ≥0, P) and the Brownian motion 𝐵
together with the solution 𝑋 . We will not worry about the distinction in this book, since
strong solutions suffice for our purposes.
See [San15, §1.6.3] for an elegant proof of strong duality for the optimal transport
problem via convex duality.
The perspective of the Langevin diffusion as a Wasserstein gradient flow was intro-
duced in [JKO98]; the application of Otto calculus to functional inequalities was given
in [OV00]; and the calculation rules for Otto calculus were set out in [Ott01]. These three
papers are seminal and are worth reading carefully. An alternative (but related) approach
to functional inequalities via optimal transport is given in [Cor02]. The formal proof of
the Otto–Villani theorem in Exercise 1.16 was made rigorous via entropic interpolations
in [Gen+20]; see [BB18] for a generalization.
The Efron–Stein inequality in Exercise 1.2 is just one example of the use of martingales
to derive concentration inequalities; see [BLM13; Han16] for more on this topic. We will
also revisit the martingale method in the next chapter; see Exercise 2.12.
The upper bound (1.E.2) in Exercise 1.10 is surprisingly sharp: it holds that
$$\frac{1}{2}\,\|\Sigma_0^{1/2} - \Sigma_1^{1/2}\|_{\mathrm{HS}}^2 \;\le\; W_2^2(\mu_0,\mu_1) - \|m_0 - m_1\|^2 \;\le\; \|\Sigma_0^{1/2} - \Sigma_1^{1/2}\|_{\mathrm{HS}}^2\,,$$
see [CV21, Lemma 3.5].
The proof of the dynamical formulation of dual optimal transport in Exercise 1.12 is
carried out rigorously in [Vil03, §8.1]. We mention that the Hamilton–Jacobi equation
has a close connection with classical mechanics; in particular, the characteristics of the
Hamilton–Jacobi equation are precisely Hamilton’s equations of motion [Eva10, §3.3].
In the context of optimal transport, the Hamiltonian consists only of kinetic energy (no
potential energy) and hence the characteristics are straight lines traversed at constant
speed; this is of course consistent with the description of Wasserstein geodesics. The
Hamilton–Jacobi equation, the Hopf–Lax semigroup, and their connection with optimal
transport can also be generalized to other costs; see [Vil03, §5.4].
Exercises
A Primer on Stochastic Calculus
2. Let (𝑋𝑖 )𝑛𝑖=1 be independent random variables taking values in some space X, and
suppose that the function 𝑓 : X𝑛 → R is bounded and measurable. Check that if
𝑀𝑘 B E[𝑓 (𝑋 1, . . . , 𝑋𝑛 ) | 𝑋 1, . . . , 𝑋𝑘 ], then the Doob martingale (𝑀𝑘 )𝑛𝑘=1 is indeed
a martingale. Then, using the previous part, prove the following tensorization
property of the variance:
$$\operatorname{var} f(X_1,\dots,X_n) \;\le\; \sum_{k=1}^n \mathbb{E}\operatorname{var}\bigl(f(X_1,\dots,X_n) \bigm| X_{-k}\bigr)\,,$$
where 𝑋−𝑘 denotes all of the coordinates except the 𝑘-th. Next, define the oscillation of 𝑓 in the 𝑘-th coordinate,
$$D_k f(x_1,\dots,x_n) \;\coloneqq\; \sup_{x_k'\in X} f(x_1,\dots,x_{k-1},x_k',x_{k+1},\dots,x_n) \;-\; \inf_{x_k'\in X} f(x_1,\dots,x_{k-1},x_k',x_{k+1},\dots,x_n)\,,$$
and deduce that
$$\operatorname{var} f(X_1,\dots,X_n) \;\le\; \frac{1}{4}\sum_{k=1}^n \mathbb{E}\bigl[\{D_k f(X_1,\dots,X_n)\}^2\bigr]\,.$$
This inequality, known as the Efron–Stein inequality, expresses the fact that
a function 𝑓 of independent random variables which is not too sensitive to any
individual coordinate has controlled variance. This is a concentration inequality
which has useful consequences in many probabilistic settings, see, e.g., [BLM13].
Hint: First prove that a random variable which takes values in [𝑎, 𝑏] has variance
bounded by ¼ (𝑏 − 𝑎)².
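Since everything is a finite sum for functions of independent fair coins, the Efron–Stein inequality can be verified exactly by enumeration; the function 𝑓 below is an arbitrary illustrative choice:

```python
from itertools import product

n = 4

def f(bits):
    # An illustrative function of n independent fair coins: it involves all
    # coordinates but moves by at most 1 when a single bit flips
    return max(bits.count(1), bits.count(0))

points = list(product([0, 1], repeat=n))
prob = 1.0 / len(points)

mean = sum(f(list(b)) for b in points) * prob
var_f = sum((f(list(b)) - mean) ** 2 for b in points) * prob

# Efron-Stein right-hand side: (1/4) sum_k E[(D_k f)^2], where D_k f is the
# oscillation of f in the k-th coordinate with the others held fixed
rhs = 0.0
for b in points:
    for k in range(n):
        vals = [f(list(b[:k]) + [x] + list(b[k + 1:])) for x in (0, 1)]
        rhs += (max(vals) - min(vals)) ** 2
rhs *= 0.25 * prob

efron_stein_ok = var_f <= rhs + 1e-12
```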
Hint: For the a.s. convergence, it suffices to show that for all 𝜀 > 0, the event
has probability zero. To do so, use Doob’s maximal inequality (Theorem 1.1.29) and
orthogonality of martingale increments (Exercise 1.2).
1. when 0 < 𝛼 < 1, there are multiple solutions to the ODE (with initial condition
𝑥 0 = 0) so that uniqueness fails;
3. when 𝛼 > 1, then the solution to the ODE blows up in finite time.
Give an explicit expression for 𝑋𝑡 in terms of 𝑋 0 and an Itô integral involving (𝐵𝑡 )𝑡 ≥0 .
From this expression, can you read off the stationary distribution of this process?
Hint: Apply Itô's formula to 𝑓 (𝑡, 𝑋𝑡 ) = 𝑋𝑡 exp 𝑡. To find the stationary distribution,
justify the following fact: if (𝜂𝑡 )𝑡 ≥0 is a deterministic function, then $\int_0^T \eta_t\,\mathrm{d}B_t$ is a Gaussian
random variable with mean zero and variance $\int_0^T \eta_t^2\,\mathrm{d}t$.
3. Prove that for 𝑓 , 𝑔 ∈ 𝐿²(𝜋) in the domain of the carré du champ, we have the
Cauchy–Schwarz inequality $\Gamma(f,g) \le \sqrt{\Gamma(f,f)\,\Gamma(g,g)}$.
$$\mathrm{KL}\bigl((1+\varepsilon f)\,\pi \,\big\|\, \pi\bigr) \;=\; \frac{\varepsilon^2}{2}\int f^2\,\mathrm{d}\pi + o(\varepsilon^2)\,. \qquad(1.\mathrm{E}.1)$$
Using this expression for the semigroup, compute the generator by hand and check
that it agrees with the general formula obtained in Example 1.2.4.
2. Show that for the OU process, ∇𝑃𝑡 𝑓 = exp(−𝑡) 𝑃𝑡 ∇𝑓 . Next, by differentiating the
Dirichlet energy 𝑡 ↦→ ℰ(𝑃𝑡 𝑓 , 𝑃𝑡 𝑓 ), show that ℰ(𝑃𝑡 𝑓 , 𝑃𝑡 𝑓 ) ≤ exp(−2𝑡) ℰ(𝑓 , 𝑓 ). Ex-
plain why this implies a Poincaré inequality for the standard Gaussian distribution.
Hint: For a general Markov semigroup, show that $\operatorname{var}_\pi f = 2\int_0^\infty \mathcal{E}(P_t f, P_t f)\,\mathrm{d}t$ by
differentiating 𝑡 ↦→ var𝜋 (𝑃𝑡 𝑓 ).
In Chapter 2, we will generalize these calculations to prove Theorem 1.2.28.
Finally, suppose that 𝜈 0 , 𝜈 1 are probability measures, and suppose that 𝜇 0 , 𝜇 1 are
Gaussians whose means and covariances match those of 𝜈 0 and 𝜈 1 respectively. Then,
prove that 𝑊2 (𝜈 0, 𝜈 1 ) ≥ 𝑊2 (𝜇 0, 𝜇1 ).
Hint: For the last statement, use the fact that the dual potentials for optimal transport
between Gaussians are quadratic functions.
and that
$$T_c(\mu,\nu) \;\ge\; \sup_{f\in L^1(\mu)}\,\Bigl\{\int f\,\mathrm{d}\mu + \int f^c\,\mathrm{d}\nu\Bigr\}\,.$$
2. Let X = Y be a metric space with metric d. For all 𝑝 ≥ 1, we can define the
𝑝-Wasserstein distance
$$W_p(\mu,\nu)^p \;=\; \inf_{\gamma\in C(\mu,\nu)} \int d(x,y)^p\,\mathrm{d}\gamma(x,y)\,.$$
Let P𝑝 (X) denote the space of probability measures 𝜇 over X such that for some
𝑥0 ∈ X, $\int d(x_0,\cdot)^p\,\mathrm{d}\mu < \infty$. Show that (P𝑝 (X),𝑊𝑝 ) is a metric space.
Although the problems (1.3.23) and (1.E.5) both identify geodesics in the Wasserstein space,
the latter problem has more favorable properties. Namely, the minimizing curves in (1.3.23)
are geodesics, but they may not have constant speed (indeed, the arc length functional
is invariant under time reparameterization of the curve); in contrast, minimizing curves
in (1.E.5) necessarily have constant speed. Also, we can reparameterize problem (1.E.5)
by introducing the momentum density 𝑝𝑡 B 𝜇𝑡 𝑣𝑡 and write
$$W_2^2(\mu_0,\mu_1) \;=\; \inf\Bigl\{\int_0^1\!\!\int \frac{\|p_t\|^2}{\mu_t}\,\mathrm{d}t \;\Bigm|\; \partial_t\mu_t + \operatorname{div} p_t = 0\Bigr\}\,, \qquad(1.\mathrm{E}.6)$$
which is now a strictly convex problem in the variables (𝜇, 𝑝). This convenient reformula-
tion is known as the Benamou–Brenier formula [BB99].
Just as (1.E.5) describes the dynamical version of the static optimal transport prob-
lem (1.3.5), there is a dynamical formulation of the dual optimal transport problem (1.3.7),
in which the dual potential evolves according to the Hamilton–Jacobi equation
$$\partial_t u_t + \frac{1}{2}\,\|\nabla u_t\|^2 \;=\; 0\,. \qquad(1.\mathrm{E}.7)$$
Then, it holds that
$$\frac{1}{2}\,W_2^2(\mu_0,\mu_1) \;=\; \sup\Bigl\{\int u_1\,\mathrm{d}\mu_1 - \int u_0\,\mathrm{d}\mu_0 \;\Bigm|\; \partial_t u_t + \frac{1}{2}\,\|\nabla u_t\|^2 = 0\Bigr\}\,. \qquad(1.\mathrm{E}.8)$$
The goal of this exercise is to justify and understand these facts.
1. Show that the mapping R>0 × R𝑑 → R, (𝜇, 𝑝) ↦→ k𝑝 k 2 /𝜇 is strictly convex. Also,
compute the convex conjugate of this mapping. Deduce that the Benamou–Brenier
reformulation (1.E.6) is a strictly convex problem.
2. Ignoring issues of regularity, show that the solution 𝑢𝑡 of the Hamilton–Jacobi
equation with initial condition 𝑢 0 = 𝑓 is described by the Hopf–Lax semigroup
$$u_t(x) \;=\; Q_t f(x) \;\coloneqq\; \inf_{y\in\mathbb{R}^d}\,\Bigl\{f(y) + \frac{1}{2t}\,\|y-x\|^2\Bigr\}\,.$$
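The Hopf–Lax formula is concrete enough to evaluate numerically. For the illustrative choice 𝑓 = |·| the infimum can be computed in closed form (it is the Huber function), and a brute-force minimization over a grid of candidate 𝑦's reproduces it:

```python
def hopf_lax(f, t, x, ys):
    # Discrete Hopf-Lax: infimum over a grid of candidate y's
    return min(f(y) + (y - x) ** 2 / (2.0 * t) for y in ys)

def huber(t, x):
    # Closed-form Q_t f for f = |.|: quadratic near 0, linear far away
    return x * x / (2.0 * t) if abs(x) <= t else abs(x) - t / 2.0

ys = [-5.0 + 0.001 * k for k in range(10001)]  # grid on [-5, 5]
t = 0.7
max_err = max(
    abs(hopf_lax(abs, t, x, ys) - huber(t, x))
    for x in [-2.0, -0.5, 0.0, 0.3, 1.4, 3.0]
)
close = max_err < 1e-6
```

The minimizer of the convex objective lands on a grid point for these test values of 𝑥, so the discrete infimum matches the closed form to floating-point accuracy.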
3. Following the proof of Theorem 1.3.8, show that the dual optimal transport prob-
lem (1.3.7) can be written
$$\frac{1}{2}\,W_2^2(\mu_0,\mu_1) \;=\; \sup_{f\in L^1(\mu_0)}\,\Bigl\{\int Q_1 f\,\mathrm{d}\mu_1 - \int f\,\mathrm{d}\mu_0\Bigr\}\,,$$
where 𝑄 1 denotes the Hopf–Lax semigroup at time 1. From this, deduce that the
formula (1.E.8) holds.
4. Although the previous part gives a proof of the dynamical formulation (1.E.8), it
is unsatisfactory because it only involves an analysis of the static primal and dual
problems. Here, we present a purely dynamical proof. The continuity constraint
𝜕𝑡 𝜇𝑡 + div 𝑝𝑡 = 0 in (1.E.6) can be reformulated as follows: for any curve of functions
[0, 1] × R𝑑 → R, (𝑡, 𝑥) ↦→ 𝑢𝑡 (𝑥),
$$\int u_1\,\mathrm{d}\mu_1 - \int u_0\,\mathrm{d}\mu_0 \;=\; \int_0^1 \partial_t\!\int u_t\,\mathrm{d}\mu_t\,\mathrm{d}t \;=\; \int_0^1\!\!\int (\partial_t u_t\,\mu_t + u_t\,\partial_t\mu_t)\,\mathrm{d}t \;=\; \int_0^1\!\!\int \Bigl(\partial_t u_t + \Bigl\langle \nabla u_t,\,\frac{p_t}{\mu_t}\Bigr\rangle\Bigr)\,\mathrm{d}\mu_t\,\mathrm{d}t\,.$$
Hence,
$$\frac{1}{2}\,W_2^2(\mu_0,\mu_1) \;=\; \inf_{\substack{\mu:[0,1]\times\mathbb{R}^d\to\mathbb{R}_+\\ p:[0,1]\times\mathbb{R}^d\to\mathbb{R}^d}}\ \sup_{u:[0,1]\times\mathbb{R}^d\to\mathbb{R}}\ \Bigl\{\int_0^1\!\!\int \frac{\|p_t\|^2}{2\mu_t}\,\mathrm{d}t + \int u_1\,\mathrm{d}\mu_1 - \int u_0\,\mathrm{d}\mu_0 - \int_0^1\!\!\int \Bigl(\partial_t u_t + \Bigl\langle\nabla u_t,\,\frac{p_t}{\mu_t}\Bigr\rangle\Bigr)\,\mathrm{d}\mu_t\,\mathrm{d}t\Bigr\}\,.$$
Assume that the infimum and supremum can be interchanged (here, we are invoking
an abstract minimax theorem, which crucially relies on the convexity of the problem
established in the first part). Use this to prove that
$$\frac{1}{2}\,W_2^2(\mu_0,\mu_1) \;=\; \sup\Bigl\{\int u_1\,\mathrm{d}\mu_1 - \int u_0\,\mathrm{d}\mu_0 \;\Bigm|\; \partial_t u_t + \frac{1}{2}\,\|\nabla u_t\|^2 \le 0\Bigr\}$$
and that equality holds only if the Hamilton–Jacobi equation (1.E.7) holds, and
if ∇𝑢𝑡 = 𝑣𝑡 = 𝑝𝑡 /𝜇𝑡 . Note that this also establishes that the optimal vector fields
(𝑣𝑡 )𝑡 ∈[0,1] are gradients of functions.
3. Let F : P2,ac (R𝑑 ) → R be an 𝛼-convex functional, and let (𝜇𝑡 )𝑡 ≥0 , (𝜈𝑡 )𝑡 ≥0 be gradient
flows for F. Prove that
$$W_2(\mu_t,\nu_t) \;\le\; \exp(-\alpha t)\,W_2(\mu_0,\nu_0)\,,$$
using the following steps. First, compute the derivative of 𝑡 ↦→ 𝑊2²(𝜇𝑡 , 𝜈𝑡 ) using
Theorem 1.4.11. Next, apply the strong convexity inequality (1.4.6) to obtain
two inequalities F(𝜇𝑡 ) ≥ F(𝜈𝑡 ) + · · · and F(𝜈𝑡 ) ≥ F(𝜇𝑡 ) + · · · . Adding these two
inequalities, deduce that 𝜕𝑡𝑊22 (𝜇𝑡 , 𝜈𝑡 ) ≤ −2𝛼 𝑊22 (𝜇𝑡 , 𝜈𝑡 ).
Functional Inequalities
In this chapter, we explore the connection between functional inequalities, such as the
Poincaré and log-Sobolev inequalities, and the concentration of measure phenomenon.
• the log-Sobolev inequality (LSI), as specialized to the Langevin diffusion (see Ex-
ample 1.2.25):
$$\operatorname{ent}_\pi(f^2) \;\le\; 2\,C_{\mathrm{LSI}}\,\mathbb{E}_\pi[\|\nabla f\|^2]\,;$$
• Talagrand’s T1 inequality
$$\mathrm{KL}(\mu\,\|\,\pi) \;\ge\; \frac{1}{2\,C_{\mathrm{T}_1}}\,W_1^2(\mu,\pi)\,, \qquad\text{for all } \mu\in\mathcal{P}_1(\mathbb{R}^d)\,.$$
In many cases, arguments involving Poincaré and log-Sobolev inequalities hold more
generally in the context of reversible Markov processes, and when this is the case we
use notation from Markov semigroup theory (e.g., writing E𝜋 Γ(𝑓 , 𝑓 ) or ℰ(𝑓 , 𝑓 ) instead
of E𝜋 [k∇𝑓 k 2 ]) to indicate it. However, for clarity of exposition, we do
not dwell on this point, and we urge readers to focus on the case in which the Markov
process is the Langevin diffusion.
Although the Poincaré and log-Sobolev inequalities are stated above for smooth func-
tions 𝑓 : R𝑑 → R, once established they can be extended to a wider class of functions (e.g.,
locally Lipschitz functions) by arguing that smooth functions are dense w.r.t. appropriate
norms. Throughout the chapter, we omit mention of such approximation arguments.
Write PI(𝐶) to denote that the Poincaré inequality holds with constant 𝐶, and similarly
for the other inequalities. We have the following relationships.
• The Otto–Villani theorem (Exercise 1.16) shows that LSI(𝐶) implies T2 (𝐶).
• Since 𝑊1 ≤ 𝑊2 , then T2 (𝐶) obviously implies T1 (𝐶). On the other hand, we will
show below that T2 (𝐶) implies PI(𝐶) as well. Combined with the previous point,
this shows that LSI(𝐶) implies PI(𝐶), which was shown directly in Exercise 1.8.
Then, for any 𝜇 ∈ P (R𝑑 ) and 𝑓 ∈ Cc∞ (R𝑑 ) with $\int f\,\mathrm{d}\mu = 0$, it holds that
$$\liminf_{\varepsilon\searrow 0}\,\frac{1}{\varepsilon^2}\,T_c\bigl(\mu,\,(1+\varepsilon f)\,\mu\bigr) \;\ge\; \frac{\bigl(\int f^2\,\mathrm{d}\mu\bigr)^2}{2\int \langle \nabla f(x),\,H_x^{-1}\,\nabla f(x)\rangle\,\mathrm{d}\mu(x)}\,.$$
Proof sketch. Fix 𝜆 ∈ R. Using the dual formulation given in Exercise 1.11,
$$T_c(\mu,\nu) \;\ge\; \int \lambda\varepsilon f\,\mathrm{d}\mu + \int (\lambda\varepsilon f)^c\,\mathrm{d}\nu \;=\; \int (\lambda\varepsilon f)^c\,\mathrm{d}\nu\,.$$
Here, (𝜆𝜀 𝑓 )𝑐 (𝑥) = inf ℎ∈R𝑑 {𝑐 (𝑥 +ℎ, 𝑥) −𝜆𝜀 𝑓 (𝑥 +ℎ)}, and using the assumption on 𝑐 together
with the compact support of 𝑓 , one can justify that the infimum is attained at a point ℎ
with kℎk = 𝑂 (𝜀). Then,
$$(\lambda\varepsilon f)^c(x) \;=\; \inf_{h\in\mathbb{R}^d}\,\Bigl\{\frac{1}{2}\,\langle h, H_x h\rangle - \lambda\varepsilon f(x) - \lambda\varepsilon\,\langle\nabla f(x), h\rangle\Bigr\} + o(\varepsilon^2)$$
$$\ge\; -\lambda\varepsilon f(x) - \frac{\lambda^2\varepsilon^2}{2}\,\langle\nabla f(x),\,H_x^{-1}\,\nabla f(x)\rangle + o(\varepsilon^2)\,.$$
Hence,
$$T_c\bigl(\mu,\,(1+\varepsilon f)\,\mu\bigr) \;\ge\; -\lambda\varepsilon^2\int f^2\,\mathrm{d}\mu - \frac{\lambda^2\varepsilon^2}{2}\int \langle\nabla f(x),\,H_x^{-1}\,\nabla f(x)\rangle\,\mathrm{d}\mu(x) + o(\varepsilon^2)\,,$$
and the result follows by optimizing over 𝜆.
Corollary 2.1.2 (T2 implies PI). If 𝜋 satisfies T2 (𝐶), then it satisfies PI(𝐶).
Proof. Let 𝑓 ∈ Cc∞ (R𝑑 ) and apply the linearization in the preceding proposition to the
quadratic cost 𝑐 (𝑥, 𝑦) = ½ k𝑥 − 𝑦 k² with 𝐻𝑥 = 𝐼𝑑 for all 𝑥 ∈ R𝑑 . Then, T2 (𝐶) yields
$$2C\,\mathrm{KL}\bigl((1+\varepsilon f)\,\pi\,\big\|\,\pi\bigr) \;\ge\; W_2^2\bigl(\pi,\,(1+\varepsilon f)\,\pi\bigr) \;\ge\; \frac{\varepsilon^2\,(\int f^2\,\mathrm{d}\pi)^2}{\int \|\nabla f\|^2\,\mathrm{d}\pi} + o(\varepsilon^2)\,.$$
On the other hand, the linearization (1.E.1) of the KL divergence in Exercise 1.8 yields
$$\mathrm{KL}\bigl((1+\varepsilon f)\,\pi\,\big\|\,\pi\bigr) \;=\; \frac{\varepsilon^2}{2}\int f^2\,\mathrm{d}\pi + o(\varepsilon^2)\,.$$
Comparing terms proves the result.
In Exercise 2.1, we explore a perhaps more intuitive approach to linearizing the 2-
Wasserstein distance via the Monge–Ampère equation.
We will defer a fuller discussion of Riemannian geometry for later, but for now we can
get a hint at the role of the curvature by observing that the Bochner identity (2.2.2) on R𝑑
follows from the commutation relation
$$\nabla \mathscr{L} f \;=\; \mathscr{L}(\nabla f) - \nabla^2 V\,\nabla f\,,$$
where ℒ acts on the vector field ∇𝑓 coordinatewise.
Hence, the commutator of ∇ and ℒ brings out the curvature of the measure 𝜋 ∝ exp(−𝑉 ),
and the plan is to exploit this in order to prove functional inequalities. The identity (2.2.4)
then yields the following formula for the iterated carré du champ:
$$\Gamma_2(f,f) \;=\; \|\nabla^2 f\|_{\mathrm{HS}}^2 + \langle\nabla f,\,\nabla^2 V\,\nabla f\rangle\,.$$
$$\mathbb{E}_\pi[f^2] \;=\; 2\,\mathbb{E}_\pi[f\,(-\mathscr{L})u] - \mathbb{E}_\pi[(\mathscr{L}u)^2] \;\le\; 2\,\mathbb{E}_\pi\langle\nabla f,\nabla u\rangle - \mathbb{E}_\pi\langle\nabla u, A\,\nabla u\rangle$$
$$\le\; 2\sqrt{\mathbb{E}_\pi\langle\nabla f, A^{-1}\nabla f\rangle\;\mathbb{E}_\pi\langle\nabla u, A\,\nabla u\rangle} - \mathbb{E}_\pi\langle\nabla u, A\,\nabla u\rangle \;\le\; \mathbb{E}_\pi\langle\nabla f, A^{-1}\nabla f\rangle\,.$$
The point is that the condition (2.2.7) can now be checked with the help of curvature.
Suppose that 𝜋 ∝ exp(−𝑉 ) where 𝑉 is twice continuously differentiable and strictly
convex. Then, using integration by parts (Theorem 1.2.14),
$$\mathbb{E}_\pi[(\mathscr{L}u)^2] \;=\; -\mathbb{E}_\pi\langle\nabla u, \nabla\mathscr{L}u\rangle \;=\; \mathbb{E}_\pi\Bigl[\Gamma_2(u,u) - \frac{1}{2}\,\mathscr{L}(\|\nabla u\|^2)\Bigr] \;=\; \mathbb{E}_\pi\,\Gamma_2(u,u)\,,$$
where the second equality uses (2.2.1) and the third uses $\mathbb{E}_\pi\,\mathscr{L} = 0$.
When ∇²𝑉 ⪰ 𝛼𝐼𝑑 ≻ 0, then this implies that a Poincaré inequality holds for 𝜋 with
constant 𝐶PI ≤ 1/𝛼. However, the Brascamp–Lieb inequality is much stronger, as it allows
us to take advantage of non-uniform convexity.
In Exercise 2.3, we give another proof of Theorem 2.2.8 by linearizing a transport
inequality. First, we introduce the transport cost.
Definition 2.2.9. The Bregman transport cost for the potential 𝑉 , denoted D𝑉 , is
the transport cost associated with the Bregman divergence
$$D_V(x,y) \;\coloneqq\; V(x) - V(y) - \langle\nabla V(y),\,x-y\rangle\,,$$
i.e., we set
$$\mathsf{D}_V(\mu,\nu) \;\coloneqq\; \inf_{\gamma\in C(\mu,\nu)}\int D_V(x,y)\,\mathrm{d}\gamma(x,y)\,.$$
The Bregman transport cost will also play a key role in Section 10.2, in which we will
prove the following transport inequality.
D𝑉 (𝜇, 𝜋) ≤ KL(𝜇 k 𝜋) .
Actually, convexity of 𝑉 is not necessary for the theorem to hold, although the Bregman
transport cost D𝑉 is only guaranteed to be non-negative when 𝑉 is convex. Notice that
when 𝑉 is strongly convex, ∇²𝑉 ⪰ 𝛼𝐼𝑑 ≻ 0, then $D_V(x,y) \ge \frac{\alpha}{2}\,\|x-y\|^2$, so the Bregman
transport inequality implies T2 (𝛼⁻¹).
identity ∇𝑃𝑡 𝑓 = exp(−𝑡) 𝑃𝑡 ∇𝑓 . In the next result, we show that more generally, CD(𝛼, ∞)
implies Γ(𝑃𝑡 𝑓 , 𝑃𝑡 𝑓 ) ≤ exp(−2𝛼𝑡) 𝑃𝑡 Γ(𝑓 , 𝑓 ).
Theorem 2.2.11 (local Poincaré inequality). Assume the Markov semigroup (𝑃𝑡 )𝑡 ≥0 is
reversible and let 𝛼 ∈ R. Then, the following are equivalent.
1. The curvature-dimension condition CD(𝛼, ∞) holds.
2. For all 𝑓 and all 𝑡 ≥ 0, $\Gamma(P_t f, P_t f) \le \exp(-2\alpha t)\,P_t\,\Gamma(f,f)$.
3. For all 𝑓 and all 𝑡 ≥ 0,
$$P_t(f^2) - (P_t f)^2 \;\le\; \frac{1-\exp(-2\alpha t)}{\alpha}\,P_t\,\Gamma(f,f)\,.$$
On the other hand, recall the definition of the carré du champ: ℒ(𝑓 2 ) − 2𝑓 ℒ𝑓 = 2Γ(𝑓 , 𝑓 ).
Using this along with (2),
$$\partial_s\,\bigl[P_s\bigl((P_{t-s}f)^2\bigr)\bigr] \;=\; 2\,P_s\,\Gamma(P_{t-s}f,\,P_{t-s}f) \;\le\; 2\exp\bigl(-2\alpha\,(t-s)\bigr)\,P_t\,\Gamma(f,f)\,.$$
In what follows, we assume that the carré du champ satisfies the chain rule: for smooth 𝜙,
$$\Gamma(\phi\circ f,\,g) \;=\; \phi'(f)\,\Gamma(f,g)\,.$$
The chain rule is satisfied for the Langevin diffusion whose carré du champ is given by
Γ(𝑓 , 𝑔) = h∇𝑓 , ∇𝑔i, and more generally this assumption encodes the fact that the Markov
process is a diffusion. Since we are mainly interested in diffusion processes, this is not
a restrictive assumption, but it indicates that the following proof will fail for Markov
processes on discrete state spaces.
Proof of the Bakry–Émery theorem (Theorem 1.2.28). Given a smooth positive function 𝑓
and a function 𝜙 : R+ → R, we differentiate $t \mapsto \int \phi(P_t f)\,\mathrm{d}\pi$. We are primarily interested
in the case 𝜙 (𝑥) B 𝑥 ln 𝑥, but carrying out the calculation for a general 𝜙 clarifies the
structure of the argument. Using the Markov semigroup calculus,
$$\partial_t \int \phi(P_t f)\,\mathrm{d}\pi \;=\; \int \phi'(P_t f)\,\mathscr{L}P_t f\,\mathrm{d}\pi \;=\; -\mathcal{E}\bigl(\phi'\circ P_t f,\,P_t f\bigr)\,.$$
We now specialize our calculations to the entropy function 𝜙 (𝑥) = 𝑥 ln 𝑥 and use
reversibility of the semigroup.
$$\mathcal{E}(\phi'\circ P_t f,\,P_t f) \;=\; \int (\ln P_t f)\,(-\mathscr{L})P_t f\,\mathrm{d}\pi \;=\; \int (\ln P_t f)\,P_t(-\mathscr{L}f)\,\mathrm{d}\pi$$
$$=\; \int P_t \ln P_t f\;(-\mathscr{L})f\,\mathrm{d}\pi \;=\; \int \Gamma\bigl(P_t\ln P_t f,\,f\bigr)\,\mathrm{d}\pi \;\le\; \sqrt{\int \frac{\Gamma(f,f)}{f}\,\mathrm{d}\pi\,\int f\,\Gamma\bigl(P_t\ln P_t f,\,P_t\ln P_t f\bigr)\,\mathrm{d}\pi}\,,$$
where the last line uses the Cauchy–Schwarz inequality (Exercise 1.6). By the chain rule
for the carré du champ, we have
$$\Gamma(\ln f,\,f) \;=\; \frac{\Gamma(f,f)}{f}\,.$$
By applying the local Poincaré inequality (Theorem 2.2.11) and the chain rule,
$$\mathcal{E}(\ln P_t f,\,P_t f) \;\le\; \exp(-\alpha t)\,\sqrt{\int \Gamma(\ln f, f)\,\mathrm{d}\pi \int f\,P_t\,\Gamma(\ln P_t f, \ln P_t f)\,\mathrm{d}\pi}$$
$$=\; \exp(-\alpha t)\,\sqrt{\int \Gamma(\ln f, f)\,\mathrm{d}\pi \int P_t f\;\Gamma(\ln P_t f, \ln P_t f)\,\mathrm{d}\pi} \;=\; \exp(-\alpha t)\,\sqrt{\int \Gamma(\ln f, f)\,\mathrm{d}\pi \int \Gamma(\ln P_t f,\,P_t f)\,\mathrm{d}\pi}\,.$$
Definition 2.2.13. For 𝑞 > 1, the Rényi divergence of order 𝑞 between 𝜇 and 𝜋 is
defined by
$$R_q(\mu\,\|\,\pi) \;\coloneqq\; \frac{1}{q-1}\,\ln\int \Bigl(\frac{\mathrm{d}\mu}{\mathrm{d}\pi}\Bigr)^q\,\mathrm{d}\pi \qquad(2.2.14)$$
if 𝜇 ≪ 𝜋, and R𝑞 (𝜇 k 𝜋) B +∞ otherwise.
Rényi divergences are monotonic in the order: if 1 < 𝑞 ≤ 𝑞0 < ∞, then R𝑞 ≤ R𝑞 0 (this
follows from Jensen’s inequality). Some notable special cases include:
1. For 𝑞 ↘ 1, we have R𝑞 → KL.
2. For 𝑞 = 2, we have R𝑞 = ln(1 + 𝜒 2 ).
3. For 𝑞 ↗ ∞, we have R𝑞 ↗ R∞ , where $R_\infty(\mu\,\|\,\pi) \coloneqq \ln \bigl\|\frac{\mathrm{d}\mu}{\mathrm{d}\pi}\bigr\|_{L^\infty(\pi)}$.
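On a finite space, the Rényi divergence is a one-line computation, and the special cases and the monotonicity in 𝑞 can be verified directly; the distributions below are arbitrary illustrative choices:

```python
import math

mu = [0.5, 0.3, 0.2]
pi = [0.2, 0.2, 0.6]

def renyi(q, mu, pi):
    # R_q(mu || pi) = (1/(q-1)) ln sum_i pi_i (mu_i / pi_i)^q
    return math.log(sum(b * (a / b) ** q for a, b in zip(mu, pi))) / (q - 1.0)

chi2 = sum((a - b) ** 2 / b for a, b in zip(mu, pi))
kl = sum(a * math.log(a / b) for a, b in zip(mu, pi))

qs = [1.2, 1.5, 2.0, 3.0, 6.0]
monotone = all(renyi(p, mu, pi) <= renyi(q, mu, pi) + 1e-12
               for p, q in zip(qs, qs[1:]))
r2_matches = abs(renyi(2.0, mu, pi) - math.log(1.0 + chi2)) < 1e-12
limit_is_kl = abs(renyi(1.0 + 1e-6, mu, pi) - kl) < 1e-4
```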
Remarkably, Vempala and Wibisono [VW19] show that a Poincaré inequality or a log-
Sobolev inequality imply convergence of the Langevin diffusion in every Rényi divergence.
We will prove the following theorem.
Theorem 2.2.15 ([VW19]). Let (𝑃𝑡 )𝑡 ≥0 be a reversible diffusion Markov semigroup, and
let (𝜋𝑡 )𝑡 ≥0 denote the law of the Markov process associated with the semigroup.
1. Suppose that a log-Sobolev inequality holds with constant 𝐶 LSI . Then, for all 𝑞 ≥ 1,
$$R_q(\pi_t\,\|\,\pi) \;\le\; \exp\Bigl(-\frac{2t}{q\,C_{\mathrm{LSI}}}\Bigr)\,R_q(\pi_0\,\|\,\pi)\,.$$
2. Suppose that a Poincaré inequality holds with constant 𝐶 PI . Then, for all 𝑞 ≥ 2,
$$R_q(\pi_t\,\|\,\pi) \;\le\; \begin{cases} R_q(\pi_0\,\|\,\pi) - \dfrac{2t}{q\,C_{\mathrm{PI}}}\,, & \text{if } R_q(\pi_t\,\|\,\pi) \ge 1\,,\\[1ex] \exp\Bigl(-\dfrac{2t}{q\,C_{\mathrm{PI}}}\Bigr)\,R_q(\pi_0\,\|\,\pi)\,, & \text{if } R_q(\pi_0\,\|\,\pi) \le 1\,. \end{cases}$$
Proof. Write 𝜌𝑡 B d𝜋𝑡 /d𝜋 for the density of 𝜋𝑡 w.r.t. 𝜋. Then,
$$\partial_t\,R_q(\pi_t\,\|\,\pi) \;=\; \frac{q}{q-1}\,\frac{\int \rho_t^{q-1}\,\partial_t\rho_t\,\mathrm{d}\pi}{\int \rho_t^q\,\mathrm{d}\pi} \;=\; \frac{q}{q-1}\,\frac{\int \rho_t^{q-1}\,\mathscr{L}\rho_t\,\mathrm{d}\pi}{\int \rho_t^q\,\mathrm{d}\pi} \;=\; -\frac{q}{q-1}\,\frac{\int \Gamma(\rho_t^{q-1},\,\rho_t)\,\mathrm{d}\pi}{\int \rho_t^q\,\mathrm{d}\pi} \;=\; -\frac{4}{q}\,\frac{\mathcal{E}(\rho_t^{q/2},\,\rho_t^{q/2})}{\int \rho_t^q\,\mathrm{d}\pi}\,.$$
Log-Sobolev case. Applied to 𝑓 = 𝜌^{𝑞/2}, the log-Sobolev inequality reads (due to the chain rule) as
$$\operatorname{ent}_\pi(\rho^q) \;\le\; 2\,C_{\mathrm{LSI}}\,\mathcal{E}(\rho^{q/2},\,\rho^{q/2})\,,$$
and hence
$$\frac{4}{q}\,\frac{\mathcal{E}(\rho^{q/2},\rho^{q/2})}{\int\rho^q\,\mathrm{d}\pi} \;\ge\; \frac{2}{q\,C_{\mathrm{LSI}}}\,\frac{\operatorname{ent}_\pi(\rho^q)}{\int\rho^q\,\mathrm{d}\pi} \;=\; \frac{2}{C_{\mathrm{LSI}}}\,\partial_q\bigl[(q-1)\,R_q(\rho\pi\,\|\,\pi)\bigr] - \frac{2\,(q-1)}{q\,C_{\mathrm{LSI}}}\,R_q(\rho\pi\,\|\,\pi)$$
$$=\; \frac{2}{C_{\mathrm{LSI}}}\,R_q(\rho\pi\,\|\,\pi) + \frac{2\,(q-1)}{C_{\mathrm{LSI}}}\,\underbrace{\partial_q R_q(\rho\pi\,\|\,\pi)}_{\ge 0} - \frac{2\,(q-1)}{q\,C_{\mathrm{LSI}}}\,R_q(\rho\pi\,\|\,\pi) \;\ge\; \frac{2}{q\,C_{\mathrm{LSI}}}\,R_q(\rho\pi\,\|\,\pi)\,,$$
where we used the fact that the Rényi divergence is monotonic in the order, so that $\partial_q R_q \ge 0$. Combined
with the derivative formula above, this gives $\partial_t R_q(\pi_t\,\|\,\pi) \le -\frac{2}{q\,C_{\mathrm{LSI}}}\,R_q(\pi_t\,\|\,\pi)$, and Grönwall's
inequality yields the claim.
Poincaré case. Next, applying a Poincaré inequality to 𝑓 = 𝜌 𝑞/2 ,
$$C_{\mathrm{PI}}\,\mathcal{E}(\rho^{q/2},\,\rho^{q/2}) \;\ge\; \operatorname{var}_\pi(\rho^{q/2}) \;=\; \int \rho^q\,\mathrm{d}\pi - \Bigl(\int \rho^{q/2}\,\mathrm{d}\pi\Bigr)^2 \;=\; \int \rho^q\,\mathrm{d}\pi\,\Bigl[1 - \frac{\exp\bigl((q-2)\,R_{q/2}(\rho\pi\,\|\,\pi)\bigr)}{\exp\bigl((q-1)\,R_q(\rho\pi\,\|\,\pi)\bigr)}\Bigr]$$
$$\ge\; \int \rho^q\,\mathrm{d}\pi\,\bigl[1 - \exp\bigl(-R_q(\rho\pi\,\|\,\pi)\bigr)\bigr]\,,$$
where we used the monotonicity of the Rényi divergence in the order. Hence,
$$\partial_t\,R_q(\pi_t\,\|\,\pi) \;\le\; -\frac{4}{q\,C_{\mathrm{PI}}}\,\bigl(1 - \exp(-R_q(\pi_t\,\|\,\pi))\bigr)\,.$$
When $R_q \ge 1$, the right-hand side is at most $-\frac{4\,(1-e^{-1})}{q\,C_{\mathrm{PI}}} \le -\frac{2}{q\,C_{\mathrm{PI}}}$, which gives the first
assertion; when $R_q \le 1$, we use $1 - \exp(-R_q) \ge (1-e^{-1})\,R_q \ge R_q/2$, which gives the second.
To interpret this theorem, the Poincaré result states that after an initial waiting period
of time 𝑂 (𝑞𝐶 PI R𝑞 (𝜋0 k 𝜋)), the Rényi divergence starts decaying exponentially fast. On
the other hand, the log-Sobolev inequality implies exponentially fast convergence from
the outset. In particular, for 𝑞 = 2, we see that whereas a Poincaré inequality implies
exponential decay of 𝜒 2 , a log-Sobolev inequality implies exponential decay of ln(1 + 𝜒 2 ),
which is substantially stronger.
Proof. The key is to find a variational principle for the variance or for the entropy. For
the variance, for any 𝜈 ∈ P (R𝑑 ) and 𝑓 : R𝑑 → R,
$$\operatorname{var}_\nu f \;=\; \inf_{a\in\mathbb{R}}\,\mathbb{E}_\nu[(f-a)^2]\,.$$
Similarly, for the entropy,
$$\operatorname{ent}_\nu f \;=\; \inf_{t>0}\,\mathbb{E}_\nu\Bigl[\,\underbrace{f\,\ln\frac{f}{t} - f + t}_{\ge 0}\,\Bigr]\,.$$
One can also state a perturbation principle for the T2 inequality, but it is more involved
and the constants are less precise, see [BGL14, Proposition 9.6.3].
respectively.
2.3.3 Convolution
Next, we show that if 𝜋 = 𝜋1 ∗ 𝜋2 is a convolution of two measures, where both 𝜋1 and 𝜋2
satisfy a functional inequality, then so does 𝜋. This is a consequence of the subadditivity
of variance and entropy. We begin with a variational principle for the entropy.
Proof. We may assume that E_𝜋 exp 𝑔 = 1, and define 𝜇 via d𝜇/d𝜋 := exp 𝑔. Then,
\[
\operatorname{ent}_\pi f = \mathbb{E}_\pi\Bigl[f \ln \frac{f}{\mathbb{E}_\pi f}\Bigr]
= \underbrace{\mathbb{E}_\mu\Bigl[f \exp(-g) \ln \frac{f \exp(-g)}{\mathbb{E}_\mu[f \exp(-g)]}\Bigr]}_{= \operatorname{ent}_\mu(f \exp(-g)) \ge 0} + \mathbb{E}_\pi[f g] \,.
\]
\[ \operatorname{ent} f(X_1, \dots, X_n) \le \sum_{i=1}^n \mathbb{E} \operatorname{ent}\bigl(f(X_1, \dots, X_n) \bigm| X_{-i}\bigr) \,. \]
Here, var(· | 𝑋₋ᵢ) and ent(· | 𝑋₋ᵢ) denote the conditional variance and entropy respectively, when all variables except 𝑋ᵢ are held fixed, i.e.,
\[ \operatorname{ent}(Z \mid X_{-i}) := \mathbb{E}[Z \ln Z \mid X_{-i}] - \mathbb{E}[Z \mid X_{-i}] \ln \mathbb{E}[Z \mid X_{-i}] \,, \]
and similarly for the variance.
Proof. The subadditivity of the variance was established in Exercise 1.2, so we turn towards
the entropy. Let 𝑍 B 𝑓 (𝑋 1, . . . , 𝑋𝑛 ) and
Δ𝑖 = ln E[𝑍 | 𝑋 1, . . . , 𝑋𝑖 ] − ln E[𝑍 | 𝑋 1, . . . , 𝑋𝑖−1 ] ,
so that
\[ \operatorname{ent} Z = \mathbb{E}[Z\,(\ln Z - \ln \mathbb{E} Z)] = \mathbb{E}\Bigl[Z \sum_{i=1}^n \Delta_i\Bigr] \,, \]
since the sum of the Δᵢ telescopes to ln 𝑍 − ln E 𝑍.
Since
\[ \mathbb{E}[\exp \Delta_i \mid X_{-i}] = \frac{\mathbb{E}\bigl[\mathbb{E}[Z \mid X_1, \dots, X_i] \bigm| X_{-i}\bigr]}{\mathbb{E}[Z \mid X_1, \dots, X_{i-1}]} = 1 \,,
\]
the variational principle yields
\[ \mathbb{E}[Z \Delta_i] = \mathbb{E}\, \mathbb{E}[Z \Delta_i \mid X_{-i}] \le \mathbb{E} \operatorname{ent}(Z \mid X_{-i}) \,. \]
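The subadditivity inequality can be checked exactly by enumeration for a small discrete example. The following sketch (ours, not from the text) verifies it for 𝑛 = 3 independent fair bits and an arbitrary positive function 𝑓 of our choosing:

```python
import itertools
import math

# ent_mu(g) = E[g ln g] - E[g] ln E[g] for a finitely supported distribution
def ent(weights, vals):
    m = sum(w * v for w, v in zip(weights, vals))
    return sum(w * v * math.log(v) for w, v in zip(weights, vals)) - m * math.log(m)

n = 3
f = lambda bits: 1.0 + sum(bits) ** 2            # arbitrary positive function
cube = list(itertools.product([0, 1], repeat=n))
total = ent([1 / len(cube)] * len(cube), [f(b) for b in cube])

rhs = 0.0
for i in range(n):
    # condition on X_{-i}: average the conditional entropy over the rest
    for rest in itertools.product([0, 1], repeat=n - 1):
        vals = [f(rest[:i] + (xi,) + rest[i:]) for xi in (0, 1)]
        rhs += ent([0.5, 0.5], vals) / 2 ** (n - 1)

assert total <= rhs + 1e-12                      # subadditivity of entropy
```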
2.3.4 Mixtures
Suppose that 𝜋 is a mixture, 𝜋 = 𝜇𝑃 := ∫ 𝑃ₓ d𝜇(𝑥), where 𝜇 ∈ P(X) is the mixing measure and (𝑃ₓ)_{𝑥∈X} is a family of probability measures.
Proposition 2.3.8 (PI for mixtures, [Bar+18]). Let 𝜇𝑃 be a mixture and assume that each 𝑃ₓ satisfies a Poincaré inequality with constant 𝐶_PI(𝑃). Also, assume that 𝜒²(𝑃ₓ ‖ 𝑃ₓ′) ≤ 𝐶_{𝜒²} for all 𝑥, 𝑥′ ∈ X. Then, 𝜇𝑃 satisfies a Poincaré inequality with constant
\[ C_{\mathsf{PI}}(\mu P) \le \Bigl(1 + \frac{C_{\chi^2}}{2}\Bigr)\, C_{\mathsf{PI}}(P) \,. \]

Proof. Let 𝑓 : R^𝑑 → R, and let 𝑋, 𝑋′ ∼ 𝜇 be i.i.d. By the law of total variance,
\[ \operatorname{var}_{\mu P} f = \mathbb{E} \operatorname{var}_{P_X} f + \operatorname{var} \mathbb{E}_{P_X} f \,. \]
The first term is easy to control, because we can apply the Poincaré inequality for 𝑃𝑋
inside the expectation, so the main difficulty lies in the second term. Here,
\[
\operatorname{var} \mathbb{E}_{P_X} f = \frac12\, \mathbb{E}\bigl[|\mathbb{E}_{P_X} f - \mathbb{E}_{P_{X'}} f|^2\bigr]
= \frac12\, \mathbb{E}\Bigl[\Bigl| \int f\, \Bigl(\frac{\mathrm{d}P_X}{\mathrm{d}P_{X'}} - 1\Bigr) \,\mathrm{d}P_{X'} \Bigr|^2\Bigr]
\le \frac12\, \mathbb{E}\bigl[(\operatorname{var}_{P_{X'}} f)\, \chi^2(P_X \,\|\, P_{X'})\bigr]
\le \frac{C_{\chi^2}}{2}\, \mathbb{E} \operatorname{var}_{P_X} f \,.
\]
Hence,
\[ \operatorname{var}_{\mu P} f \le \Bigl(1 + \frac{C_{\chi^2}}{2}\Bigr)\, \mathbb{E} \operatorname{var}_{P_X} f \le \Bigl(1 + \frac{C_{\chi^2}}{2}\Bigr)\, C_{\mathsf{PI}}(P)\, \underbrace{\mathbb{E}\, \mathbb{E}_{P_X} \Gamma(f, f)}_{= \mathbb{E}_{\mu P} \Gamma(f, f)} \,. \]
Our aim is to extend this idea to the log-Sobolev inequality, which will require a few preliminaries. Rather than aiming to directly prove a log-Sobolev inequality, we will instead prove a defective log-Sobolev inequality: for all 𝑓 : R^𝑑 → R,
\[ \operatorname{ent}_\pi(f^2) \le 2C\, \mathbb{E}_\pi \Gamma(f, f) + D\, \mathbb{E}_\pi[f^2] \,. \tag{2.3.10} \]
Although the defective LSI involves an extra term on the right-hand side of the inequality, the extra term can be removed via a Poincaré inequality. The following two results show how this is achieved.

Lemma (Rothaus lemma). For all 𝑓 ∈ 𝐿²(𝜋) and 𝑐 ∈ R,
\[ \operatorname{ent}_\pi\bigl((f + c)^2\bigr) \le \operatorname{ent}_\pi(f^2) + 2\, \mathbb{E}_\pi[f^2] \,. \]
Lemma 2.3.12 (tightening a defective LSI). Suppose that 𝜋 satisfies the defective log-Sobolev inequality (2.3.10), together with a Poincaré inequality. Then, 𝜋 satisfies a log-Sobolev inequality with constant
\[ C_{\mathsf{LSI}} \le C + C_{\mathsf{PI}}\, \Bigl(1 + \frac{D}{2}\Bigr) \,. \]
Proof. Let 𝑓̃ := 𝑓 − E_𝜋 𝑓. Using the Rothaus lemma, the defective log-Sobolev inequality, and the Poincaré inequality, we obtain
\[ \operatorname{ent}_\pi(f^2) \le \operatorname{ent}_\pi(\tilde f^2) + 2\, \mathbb{E}_\pi[\tilde f^2] \le 2C\, \mathbb{E}_\pi \Gamma(f, f) + (2 + D) \operatorname{var}_\pi f \le \{2C + C_{\mathsf{PI}}\, (2 + D)\}\, \mathbb{E}_\pi \Gamma(f, f) \,. \]
Lemma 2.3.13 ([CCN21]). Suppose that 𝜇, 𝜈 are probability measures and 𝑓 is a positive function. Then,
\[ \mathbb{E}_\mu f \, \ln \frac{\mathbb{E}_\mu f}{\mathbb{E}_\nu f} \le \operatorname{ent}_\mu f + \mathbb{E}_\mu f \, \ln\bigl(1 + \chi^2(\mu \,\|\, \nu)\bigr) \,. \]
Proof. By homogeneity, we may assume that E_𝜇 𝑓 = 1. By Jensen's inequality applied under the probability measure 𝑓 d𝜇,
\[ \ln \frac{1}{\mathbb{E}_\nu f} = -\ln \mathbb{E}_\mu\Bigl[f\, \frac{\mathrm{d}\nu}{\mathrm{d}\mu}\Bigr] \le \mathbb{E}_\mu\Bigl[f \ln \frac{\mathrm{d}\mu}{\mathrm{d}\nu}\Bigr] \,. \]
Next, applying the Donsker–Varadhan principle with 𝜂 = 𝑓𝜇, 𝜂′ = 𝜇, and 𝑔 = ln(d𝜇/d𝜈) yields
\[ \mathbb{E}_\mu\Bigl[f \ln \frac{\mathrm{d}\mu}{\mathrm{d}\nu}\Bigr] = \mathbb{E}_\eta \ln \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \le \operatorname{KL}(\eta \,\|\, \mu) + \ln \mathbb{E}_\mu \frac{\mathrm{d}\mu}{\mathrm{d}\nu} = \operatorname{ent}_\mu f + \ln\bigl(1 + \chi^2(\mu \,\|\, \nu)\bigr) \,, \]
which is what we wanted to show.
We can now prove the log-Sobolev inequality for mixtures.
Proposition 2.3.14 (LSI for mixtures, [CCN21]). Let 𝜇𝑃 be a mixture and assume that each 𝑃ₓ satisfies a log-Sobolev inequality with constant 𝐶_LSI(𝑃). Also, assume that 𝜒²(𝑃ₓ ‖ 𝑃ₓ′) ≤ 𝐶_{𝜒²} for all 𝑥, 𝑥′ ∈ X. Then, 𝜇𝑃 satisfies a log-Sobolev inequality.

Proof. We begin, as in the proof of Proposition 2.3.8, with a decomposition of the entropy:
\[ \operatorname{ent}_{\mu P}(f^2) = \mathbb{E} \operatorname{ent}_{P_X}(f^2) + \operatorname{ent} \mathbb{E}_{P_X}(f^2) \,. \]
By Lemma 2.3.13 (applied with 𝜇 = 𝑃_𝑋 and 𝜈 = 𝜇𝑃) and convexity of 𝜒²,
\[
\operatorname{ent} \mathbb{E}_{P_X}(f^2) = \mathbb{E}\Bigl[\mathbb{E}_{P_X}(f^2) \ln \frac{\mathbb{E}_{P_X}(f^2)}{\mathbb{E}_{\mu P}(f^2)}\Bigr]
\le \mathbb{E}\bigl[\operatorname{ent}_{P_X}(f^2) + \mathbb{E}_{P_X}(f^2) \ln\bigl(1 + \chi^2(P_X \,\|\, \mu P)\bigr)\bigr]
\le \mathbb{E} \operatorname{ent}_{P_X}(f^2) + \ln(1 + C_{\chi^2})\, \mathbb{E}_{\mu P}(f^2) \,.
\]
Hence,
\[ \operatorname{ent}_{\mu P}(f^2) \le 2\, \mathbb{E} \operatorname{ent}_{P_X}(f^2) + \ln(1 + C_{\chi^2})\, \mathbb{E}_{\mu P}(f^2) \le 4\, C_{\mathsf{LSI}}(P)\, \mathbb{E}_{\mu P} \Gamma(f, f) + \ln(1 + C_{\chi^2})\, \mathbb{E}_{\mu P}(f^2) \,, \]
where we have applied the log-Sobolev inequality for 𝑃_𝑋. This is a defective log-Sobolev inequality for 𝜇𝑃; by applying the Poincaré inequality from Proposition 2.3.8 and tightening the inequality via Lemma 2.3.12, we conclude the proof.
Example 2.3.15 (LSI for Gaussian mixtures). Suppose that 𝜇 is supported on a ball B(0, 𝑅), and that for each 𝑥 ∈ R^𝑑, 𝑃ₓ = normal(𝑥, 𝜎²𝐼_𝑑). Then, 𝜇𝑃 is the convolution 𝜇 ∗ normal(0, 𝜎²𝐼_𝑑). Since 𝐶_LSI(𝑃) = 𝜎² and
\[ \chi^2(P_x \,\|\, P_{x'}) = \exp\Bigl(\frac{\|x - x'\|^2}{\sigma^2}\Bigr) - 1 \le \exp\Bigl(\frac{4R^2}{\sigma^2}\Bigr) - 1 \]
for 𝑥, 𝑥′ ∈ B(0, 𝑅), we deduce that 𝜇𝑃 satisfies a log-Sobolev inequality with constant
\[ C_{\mathsf{LSI}}(\mu P) \lesssim \sigma^2 \vee R^2 \exp\Bigl(\frac{4R^2}{\sigma^2}\Bigr) \,. \]
Hence, Gaussian convolutions of measures with bounded support satisfy a log-Sobolev
inequality. The exponential dependence on 𝑅 2 /𝜎 2 is unavoidable in general.
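The closed-form expression for the chi-squared divergence between two Gaussians with the same covariance can be verified by Monte Carlo. The following sketch is ours (the text contains no code); the dimension, centers, and sample size are arbitrary choices:

```python
import numpy as np

# Monte Carlo check of chi^2(P_x || P_x') = exp(||x - x'||^2 / sigma^2) - 1
# for P_x = N(x, sigma^2 I_d), using chi^2 = E_{P_x'}[(dP_x/dP_x')^2] - 1.
rng = np.random.default_rng(0)
d, sigma = 3, 1.5
x, xp = np.zeros(d), np.array([0.6, -0.3, 0.2])
Z = xp + sigma * rng.standard_normal((1_000_000, d))         # Z ~ P_x'
log_ratio = (np.sum((Z - xp) ** 2, axis=1)
             - np.sum((Z - x) ** 2, axis=1)) / (2 * sigma**2)
est = np.mean(np.exp(2 * log_ratio)) - 1
exact = np.exp(np.sum((x - xp) ** 2) / sigma**2) - 1
assert abs(est - exact) / exact < 0.05
```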
2.3.5 Tensorization
A key feature of these functional inequalities, which makes them crucial for the study of high-dimensional (or even infinite-dimensional) phenomena, is that they often hold with dimension-free constants, as demonstrated in the next result.
Proof. The proof goes by induction on 𝑁 , with 𝑁 = 1 being trivial. So, assume that the
result is true in dimension 𝑁 , and let us prove it for dimension 𝑁 + 1.
Let 𝜈 ∈ P (X) = P (X1 × · · · × X𝑁 +1 ), let 𝜈 1:𝑁 denote its X1 × · · · × X𝑁 marginal,
and let 𝜈 𝑁 +1|1:𝑁 denote the corresponding conditional kernel (and similarly for 𝜋). Let
K denote the set of conditional kernels 𝑦1:𝑁 ↦→ 𝛾 𝑁 +1|1:𝑁 (· | 𝑦1:𝑁 ) such that for 𝜈 1:𝑁 -a.e.
𝑦1:𝑁 ∈ X1 × · · · × X𝑁 , it holds that 𝛾 𝑁 +1|1:𝑁 (· | 𝑦1:𝑁 ) ∈ C(𝜋 𝑁 +1, 𝜈 𝑁 +1|1:𝑁 (· | 𝑦1:𝑁 )). Instead
of minimizing over all 𝛾 ∈ C(𝜈, 𝜋), we can minimize over couplings 𝛾 such that for all
bounded 𝑓 ∈ C(X × X),
\[ \int f \,\mathrm{d}\gamma = \iint f(x_{1:N+1}, y_{1:N+1})\, \gamma_{N+1|1:N}(\mathrm{d}x_{N+1}, \mathrm{d}y_{N+1} \mid y_{1:N})\, \gamma_{1:N}(\mathrm{d}x_{1:N}, \mathrm{d}y_{1:N}) \,, \]
4 Suppose 𝑁 = 2 and (𝑋 1, 𝑋 2 ) ∼ 𝜋 and (𝑌1, 𝑌2 ) ∼ 𝜈. Observe that a general coupling 𝑝 ∈ C(𝜋, 𝜈) factorizes
as 𝑝 (𝑥 1, 𝑥 2, 𝑦1, 𝑦2 ) = 𝑝𝑋1 (𝑥 1 ) 𝑝𝑋2 (𝑥 2 ) 𝑝𝑌1,𝑌2 |𝑋1,𝑋2 (𝑦1, 𝑦2 | 𝑥 1, 𝑥 2 ). In contrast, we are restricting to couplings
of the form 𝑝 (𝑥 1, 𝑥 2, 𝑦1, 𝑦2 ) = 𝑝𝑋1 (𝑥 1 ) 𝑝𝑌1 |𝑋1 (𝑦1 | 𝑥 1 ) 𝑝𝑋2 (𝑥 2 ) 𝑝𝑌2 |𝑋2,𝑌1 (𝑦2 | 𝑥 2, 𝑦1 ).
\[
= \int \inf_{\gamma_{N+1|1:N} \in C(\pi_{N+1},\, \nu_{N+1|1:N}(\cdot \mid y_{1:N}))} \Bigl[\int c_{N+1} \,\mathrm{d}\gamma_{N+1|1:N}(\cdot \mid y_{1:N})\Bigr] \,\mathrm{d}\gamma_{1:N}(x_{1:N}, y_{1:N})
\]
\[
\le 2\sigma^2 \int \operatorname{KL}\bigl(\nu_{N+1|1:N}(\cdot \mid y_{1:N}) \bigm\| \pi_{N+1}\bigr) \,\mathrm{d}\gamma_{1:N}(x_{1:N}, y_{1:N})
= 2\sigma^2 \int \operatorname{KL}\bigl(\nu_{N+1|1:N}(\cdot \mid y_{1:N}) \bigm\| \pi_{N+1}\bigr) \,\mathrm{d}\nu_{1:N}(y_{1:N}) \,,
\]
where we used the assumption. On the other hand, the inductive hypothesis is
\[ \inf_{\gamma_{1:N} \in C(\pi_{1:N},\, \nu_{1:N})} \sum_{i=1}^N \int c_i(x_i, y_i) \,\mathrm{d}\gamma_{1:N}(x_{1:N}, y_{1:N}) \le 2\sigma^2 \operatorname{KL}(\nu_{1:N} \,\|\, \pi_{1:N}) \,. \]
\[ \ge \inf_{\gamma \in C(\nu, \pi)} \int \sum_{i=1}^N \alpha_i\, \mathsf{d}_i(x_i, y_i) \,\mathrm{d}\gamma(x_{1:N}, y_{1:N}) \,, \]
where we used the Cauchy–Schwarz inequality. This is a T1 inequality for the weighted distance 𝖽_𝛼(𝑥_{1:𝑁}, 𝑦_{1:𝑁}) := Σ_{𝑖=1}^𝑁 𝛼ᵢ 𝖽ᵢ(𝑥ᵢ, 𝑦ᵢ).
Together with results from the next section, this tensorization result is already powerful
enough to recover the bounded differences concentration inequality (see Exercise 2.12),
but it is not fully satisfactory as it yields a transport inequality for a weighted metric.
A more satisfactory result is obtained by applying Marton’s tensorization (Theo-
rem 2.3.17) with the convex function 𝜑 (𝑥) B 𝑥 and cost functions 𝑐𝑖 B d𝑖2 , with d𝑖 a lower
semicontinuous metric on X𝑖 . This immediately yields the following corollary.
Definition 2.4.1. For a Borel subset 𝐴 ⊆ X and 𝜀 > 0, we let 𝐴_𝜀 denote the 𝜀-blow-up of 𝐴, defined by
\[ A_\varepsilon := \{x \in X \mid \mathsf{d}(x, A) < \varepsilon\} \,. \]
Theorem 2.4.2 (blow-up and Lipschitz functions). Suppose that (X, 𝖽, 𝜋) has concentration function 𝛼_𝜋. Then, for any 1-Lipschitz function 𝑓 : X → R and 𝜀 ≥ 0,
\[ \pi\{f \ge \operatorname{med} f + \varepsilon\} \le \alpha_\pi(\varepsilon) \,. \]
Conversely, if 𝛽 is a function such that every 1-Lipschitz 𝑓 : X → R satisfies, for all 𝜀 ≥ 0,
\[ \pi\{f \ge \operatorname{med} f + \varepsilon\} \le \beta(\varepsilon) \,, \]
then 𝛼_𝜋 ≤ 𝛽.
Proof. ( ⟹ ) Consider the set 𝐴 := {𝑓 ≤ med 𝑓}. We claim that 𝐴_𝜀 ⊆ {𝑓 − med 𝑓 < 𝜀}. To prove this, let 𝑥 ∈ 𝐴_𝜀. By definition, there exists 𝑦 ∈ 𝐴 such that 𝖽(𝑥, 𝑦) < 𝜀, so 𝑓(𝑥) − med 𝑓 = 𝑓(𝑦) − med 𝑓 + 𝑓(𝑥) − 𝑓(𝑦) ≤ 𝖽(𝑥, 𝑦) < 𝜀. Hence, since 𝜋(𝐴) ≥ ½,
\[ \pi\{f \ge \operatorname{med} f + \varepsilon\} \le \pi\bigl((A_\varepsilon)^{\mathrm{c}}\bigr) \le \alpha_\pi(\varepsilon) \,. \]
More broadly, it is a general principle that many statements about sets have an
equivalent reformulation in terms of functions. We will see more instances of this idea
throughout the book.
Some statements regarding concentration, such as the theorem above, are more easily
phrased in terms of concentration around the median rather than around the mean. The
following result shows that, up to numerical constants, the mean and the median are
equivalent. To state the result in generality, we introduce the idea of an Orlicz norm: given a convex increasing function 𝜓 : R₊ → R₊ with 𝜓(0) = 0 and 𝜓(𝑥) → ∞ as 𝑥 → ∞ (an Orlicz function), the Orlicz norm of a random variable 𝑋 is
\[ \|X\|_\psi := \inf\{t > 0 \mid \mathbb{E}\, \psi(|X|/t) \le 1\} \,. \]
Examples of Orlicz functions include 𝜓(𝑥) = 𝑥^𝑝 for 𝑝 ≥ 1, for which the corresponding Orlicz norm is the 𝐿^𝑝(P) norm, and 𝜓₂(𝑥) := exp(𝑥²) − 1, for which the Orlicz norm ‖𝑋‖_{𝜓₂} captures the sub-Gaussianity of 𝑋.
Lemma 2.4.4 (mean and median). Let 𝜓 be an Orlicz function and let 𝑋 be a real-valued random variable. Then,
\[ \tfrac12\, \|X - \mathbb{E} X\|_\psi \le \|X - \operatorname{med} X\|_\psi \le 3\, \|X - \mathbb{E} X\|_\psi \,. \]
Proof. We can assume that 𝑋 is not constant; from the properties of Orlicz functions, 𝜓^{−1}(𝑡) is well-defined for any 𝑡 > 0. Then,
\[
\|X - \mathbb{E} X\|_\psi \le \|X - \operatorname{med} X\|_\psi + \|\operatorname{med} X - \mathbb{E} X\|_\psi
= \|X - \operatorname{med} X\|_\psi + |\operatorname{med} X - \mathbb{E} X|\, \|1\|_\psi
\le \|X - \operatorname{med} X\|_\psi + \mathbb{E}|X - \operatorname{med} X|\, \|1\|_\psi \,.
\]
Since, by Jensen's inequality,
\[ \mathbb{E}\, \psi\Bigl(\frac{|X - \operatorname{med} X|}{\mathbb{E}|X - \operatorname{med} X|\, \|1\|_\psi}\Bigr) \ge \psi\Bigl(\frac{\mathbb{E}|X - \operatorname{med} X|}{\mathbb{E}|X - \operatorname{med} X|\, \|1\|_\psi}\Bigr) = \psi\Bigl(\frac{1}{\|1\|_\psi}\Bigr) = 1 \,, \]
this implies E|𝑋 − med 𝑋| ‖1‖_𝜓 ≤ ‖𝑋 − med 𝑋‖_𝜓, and the first inequality of the lemma follows.
Next, assume that med 𝑋 ≥ E 𝑋 (or else replace 𝑋 by −𝑋). Then, by a Markov-type inequality for Orlicz norms,
\[ \frac12 \le \mathbb{P}\{X \ge \operatorname{med} X\} \le \mathbb{P}\{|X - \mathbb{E} X| \ge \operatorname{med} X - \mathbb{E} X\} \le \frac{1}{\psi\bigl((\operatorname{med} X - \mathbb{E} X)/\|X - \mathbb{E} X\|_\psi\bigr)} \,, \]
so that
\[ |\operatorname{med} X - \mathbb{E} X| \le \psi^{-1}(2)\, \|X - \mathbb{E} X\|_\psi \,. \]
Therefore,
\[ \|X - \operatorname{med} X\|_\psi \le \|X - \mathbb{E} X\|_\psi + |\operatorname{med} X - \mathbb{E} X|\, \|1\|_\psi \le \bigl(1 + \psi^{-1}(2)\, \|1\|_\psi\bigr)\, \|X - \mathbb{E} X\|_\psi \,. \]
Note, however, that ‖1‖_𝜓 = 1/𝜓^{−1}(1). Since 𝜓(𝜓^{−1}(2)/2) ≤ 1 by convexity (and the property 𝜓(0) = 0), this implies 𝜓^{−1}(2) ≤ 2𝜓^{−1}(1), and we obtain the result.
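The lemma can be illustrated numerically with the Orlicz function 𝜓(𝑥) = 𝑥², for which ‖·‖_𝜓 is simply the 𝐿² norm. The following sketch (ours, not from the text) checks the two-sided bound for a skewed distribution, the standard exponential, whose mean 1 and median ln 2 differ:

```python
import numpy as np

# L^2 mean/median comparison for X ~ Exp(1): mean 1, median ln 2.
rng = np.random.default_rng(2)
X = rng.exponential(size=1_000_000)
l2_mean = np.sqrt(np.mean((X - X.mean()) ** 2))      # ||X - E X||_2
l2_med = np.sqrt(np.mean((X - np.median(X)) ** 2))   # ||X - med X||_2
assert 0.5 * l2_mean <= l2_med <= 3 * l2_mean        # Lemma 2.4.4 with psi(x) = x^2
```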
Lemma 2.4.6 (Herbst argument). Let 𝑋 be a random variable such that
\[ \operatorname{ent} \exp(\lambda X) \le \frac{\lambda^2 \sigma^2}{2}\, \mathbb{E} \exp(\lambda X) \qquad \text{for all } \lambda \ge 0 \,. \]
Then, it holds that
\[ \mathbb{E} \exp\{\lambda\, (X - \mathbb{E} X)\} \le \exp\Bigl(\frac{\lambda^2 \sigma^2}{2}\Bigr) \qquad \text{for all } \lambda \ge 0 \,. \]
In particular, by the Chernoff bound,
\[ \mathbb{P}\{X \ge \mathbb{E} X + t\} \le \exp\Bigl(-\frac{t^2}{2\sigma^2}\Bigr) \qquad \text{for all } t \ge 0 \,. \]
Proof. Let 𝜏(𝜆) := 𝜆^{−1} ln E exp{𝜆 (𝑋 − E 𝑋)}. We leave it to the reader to check the calculus identity
\[ \tau'(\lambda) = \frac{1}{\lambda^2}\, \frac{\operatorname{ent} \exp(\lambda X)}{\mathbb{E} \exp(\lambda X)} \,. \tag{2.4.7} \]
Since 𝜏(𝜆) → 0 as 𝜆 ↘ 0, the assumption of the lemma yields 𝜏(𝜆) ≤ 𝜆𝜎²/2.
The calculation in (2.4.5) shows that the assumption of the Herbst argument is satisfied
for all 1-Lipschitz functions 𝑓 , with 𝜎 2 = 𝐶 LSI . Hence, we deduce a concentration inequality
for Lipschitz functions, which we formally state in the next theorem together with the
corresponding result under a Poincaré inequality. The Poincaré case is left as Exercise 2.9.
Example 2.4.9. Suppose that 𝛾 is the standard Gaussian measure on R^𝑑. From the Bakry–Émery theorem (Theorem 1.2.28), 𝛾 satisfies the log-Sobolev inequality with 𝐶_LSI = 1. For 𝑍 ∼ 𝛾, since E[‖𝑍‖²] = 𝑑, the Poincaré inequality applied to the norm ‖·‖ shows that var ‖𝑍‖ ≤ 1, i.e., √(𝑑 − 1) ≤ E‖𝑍‖ ≤ √𝑑. The concentration result above now shows that the standard Gaussian "lives" on a thin spherical shell of radius √𝑑 and width 𝑂(1).
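The thin-shell phenomenon is easy to observe by simulation. The sketch below is ours (dimensions and sample sizes are arbitrary choices); it checks that the sample mean of ‖𝑍‖ sits between √(𝑑 − 1) and √𝑑 up to sampling error, and that the variance of ‖𝑍‖ stays bounded by 1 uniformly in the dimension:

```python
import numpy as np

# Thin shell: for Z ~ N(0, I_d), E||Z|| is within [sqrt(d-1), sqrt(d)]
# and var ||Z|| <= C_PI = 1, uniformly in d.
rng = np.random.default_rng(1)
for d in [10, 100, 400]:
    norms = np.linalg.norm(rng.standard_normal((20_000, d)), axis=1)
    assert np.sqrt(d - 1) - 0.05 <= norms.mean() <= np.sqrt(d) + 0.05
    assert norms.var() < 1.0
```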
Proof. Let Lip₁(X) denote the space of 1-Lipschitz and mean-zero functions on X. Lipschitz concentration can be stated as
\[ \sup_{\lambda \in \mathbb{R}} \sup_{f \in \operatorname{Lip}_1(X)} \Bigl\{ \ln \int \exp(\lambda f) \,\mathrm{d}\pi - \frac{\lambda^2 \sigma^2}{2} \Bigr\} \le 0 \,, \]
where we recall that ∫ 𝑓 d𝜋 = 0 for 𝑓 ∈ Lip₁(X). By the Donsker–Varadhan variational principle, ln ∫ exp(𝜆𝑓) d𝜋 = sup_{𝜈∈P(X)} {𝜆 ∫ 𝑓 d𝜈 − KL(𝜈 ‖ 𝜋)}. If we first evaluate the supremum over 𝜆 ∈ R, we obtain
\[ \sup_{f \in \operatorname{Lip}_1(X)} \sup_{\nu \in \mathcal{P}(X)} \Bigl\{ \frac{(\int f \,\mathrm{d}\nu)^2}{2\sigma^2} - \operatorname{KL}(\nu \,\|\, \pi) \Bigr\} \le 0 \,. \]
If we next evaluate the supremum over functions 𝑓 ∈ Lip₁(X) using the Kantorovich duality formula (1.E.4) for 𝑊₁ from Exercise 1.11, we obtain
\[ \sup_{\nu \in \mathcal{P}(X)} \Bigl\{ \frac{W_1^2(\nu, \pi)}{2\sigma^2} - \operatorname{KL}(\nu \,\|\, \pi) \Bigr\} \le 0 \,, \]
which is the T1 inequality. Incidentally, this dual perspective shows that two cornerstone results of probability theory, Hoeffding's inequality and Pinsker's inequality, are in fact equivalent to each other (see Exercise 2.10).
Although the T1 inequality implies sub-Gaussian concentration for all Lipschitz func-
tions, it is in fact equivalent to sub-Gaussian concentration of a single function, the
distance function d(·, 𝑥 0 ) for some 𝑥 0 ∈ X. The next theorem is not used often because
the quantitative dependence of the equivalence can be crude, but it is worth knowing.
1. 𝜋 satisfies a T1 inequality.
2. For some (equivalently, for all) 𝑥₀ ∈ X, the random variable 𝖽(𝑋, 𝑥₀), 𝑋 ∼ 𝜋, is sub-Gaussian.
Transport inequalities offer a flexible and powerful method for characterizing and
proving concentration inequalities, as we will see in the next section. Before doing so,
however, we wish to also demonstrate how concentration of measure, formulated via
blow-up of sets, can be deduced directly from a T1 inequality.
Suppose that T1(𝜎²) holds, i.e.,
\[ W_1^2(\mu, \pi) \le 2\sigma^2 \operatorname{KL}(\mu \,\|\, \pi) \qquad \text{for all } \mu \in \mathcal{P}_1(X),\ \mu \ll \pi \,. \]
For any disjoint sets 𝐴, 𝐵 with 𝜋(𝐴) 𝜋(𝐵) > 0, if we let 𝜋(· | 𝐴) (resp. 𝜋(· | 𝐵)) denote the distribution 𝜋 conditioned on 𝐴 (resp. 𝐵), then
\[
\mathsf{d}(A, B) \le W_1\bigl(\pi(\cdot \mid A), \pi(\cdot \mid B)\bigr) \le W_1\bigl(\pi(\cdot \mid A), \pi\bigr) + W_1\bigl(\pi(\cdot \mid B), \pi\bigr)
\le \sqrt{2\sigma^2 \operatorname{KL}\bigl(\pi(\cdot \mid A) \bigm\| \pi\bigr)} + \sqrt{2\sigma^2 \operatorname{KL}\bigl(\pi(\cdot \mid B) \bigm\| \pi\bigr)} \,.
\]
However,
\[ \operatorname{KL}\bigl(\pi(\cdot \mid A) \bigm\| \pi\bigr) = \int_A \frac{1}{\pi(A)} \ln \frac{1}{\pi(A)} \,\pi(\mathrm{d}x) = \ln \frac{1}{\pi(A)} \,, \]
so that
\[ \mathsf{d}(A, B) \le \sqrt{2\sigma^2 \ln \frac{1}{\pi(A)}} + \sqrt{2\sigma^2 \ln \frac{1}{\pi(B)}} \,. \]
In particular, if we take 𝐵 = (𝐴_𝜀)^c where 𝜋(𝐴) ≥ ½, then 𝖽(𝐴, 𝐵) ≥ 𝜀. Hence, for all 𝜀 ≥ 2√(2𝜎² ln 2), it holds that 𝜀/2 ≤ √(2𝜎² ln(1/𝜋(𝐵))), or
\[ \pi\bigl((A_\varepsilon)^{\mathrm{c}}\bigr) \le \exp\Bigl(-\frac{\varepsilon^2}{8\sigma^2}\Bigr) \qquad \text{for all } \varepsilon \ge \sqrt{8 \ln 2}\, \sigma \,. \tag{2.4.12} \]
2. (Wasserstein law of large numbers) Suppose that (𝑋ᵢ)_{𝑖=1}^∞ ∼ 𝜋 i.i.d., and that for some 𝑥 ∈ X and some 𝑝 > 2 we have E[𝖽(𝑥, 𝑋₁)^𝑝] < ∞. Then,
\[ \mathbb{E}\, W_2\Bigl(\frac{1}{N} \sum_{i=1}^N \delta_{X_i},\, \pi\Bigr) \to 0 \qquad \text{as } N \to \infty \,. \]

6 Here, the word "stronger" is not precisely defined but it means something akin to "more useful" or "produces more surprising consequences".
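The Wasserstein law of large numbers is easy to visualize in one dimension, where 𝑊₂ between the empirical measure and 𝜋 can be computed via the quantile coupling. The sketch below is ours (the distribution Unif[0, 1] and sample sizes are arbitrary choices):

```python
import numpy as np

# 1D Wasserstein LLN for pi = Unif[0,1]: W_2^2(pi_N, pi) via order statistics,
# using W_2^2 = int_0^1 (F_N^{-1}(u) - F^{-1}(u))^2 du with F^{-1}(u) = u.
rng = np.random.default_rng(3)

def w2_sq(N):
    X = np.sort(rng.uniform(size=N))
    u = (np.arange(N) + 0.5) / N        # midpoint quantile levels
    return np.mean((X - u) ** 2)        # ~ W_2^2(pi_N, pi)

est = [np.mean([w2_sq(N) for _ in range(50)]) for N in [100, 1_000, 10_000]]
assert est[0] > est[1] > est[2]         # E W_2^2 decreases as N grows
```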
The second result, Sanov’s theorem, is a foundational theorem from large deviations.
Although Sanov’s theorem is of fundamental importance in its own right, it would take
us too far afield to develop large deviations theory here, so we invoke it as a black box.
Theorem 2.4.14 (Sanov's theorem). Let (𝑋ᵢ)_{𝑖=1}^∞ ∼ 𝜋 i.i.d. and let 𝜋_𝑁 := 𝑁^{−1} Σ_{𝑖=1}^𝑁 𝛿_{𝑋ᵢ} denote the empirical measure. Then, for any Borel set 𝐴 ⊆ P(X), it holds that
\[ -\inf_{\operatorname{int} A} \operatorname{KL}(\cdot \,\|\, \pi) \le \liminf_{N \to \infty} \frac{1}{N} \ln \mathbb{P}\{\pi_N \in A\} \le \limsup_{N \to \infty} \frac{1}{N} \ln \mathbb{P}\{\pi_N \in A\} \le -\inf_{\operatorname{cl} A} \operatorname{KL}(\cdot \,\|\, \pi) \,. \]
Proof. It remains to prove the converse implication. Fix 𝑡 > 0 and apply the assumption to the 𝑁^{−1/2}-Lipschitz function (𝑥₁, …, 𝑥_𝑁) ↦ 𝑊₂(𝑁^{−1} Σ_{𝑖=1}^𝑁 𝛿_{𝑥ᵢ}, 𝜋). This implies
\[ \mathbb{P}\{W_2(\pi_N, \pi) > t\} \le \exp\Bigl(-\frac{N\, \{t - \mathbb{E} W_2(\pi_N, \pi)\}^2}{2\sigma^2}\Bigr) \,, \]
where 𝜋_𝑁 := 𝑁^{−1} Σ_{𝑖=1}^𝑁 𝛿_{𝑋ᵢ}, with (𝑋ᵢ)_{𝑖∈N₊} ∼ 𝜋 i.i.d. On the other hand, the lower semicontinuity of 𝑊₂ implies that {𝜈 ∈ P(X) | 𝑊₂(𝜈, 𝜋) > 𝑡} is open. By Sanov's theorem (Theorem 2.4.14), we obtain
\[
-\inf\{\operatorname{KL}(\nu \,\|\, \pi) \mid W_2(\nu, \pi) > t\} \le \liminf_{N \to \infty} \frac{1}{N} \ln \mathbb{P}\{W_2(\pi_N, \pi) > t\}
\le -\limsup_{N \to \infty} \frac{\{t - \mathbb{E} W_2(\pi_N, \pi)\}^2}{2\sigma^2} = -\frac{t^2}{2\sigma^2} \,,
\]
where the last equality comes from the Wasserstein law of large numbers (our assumption implies that 𝜋 has sub-Gaussian tails, which in particular means E[𝖽(𝑥, 𝑋₁)^𝑝] < ∞ for any 𝑥 ∈ X and any 𝑝 ≥ 1).
We have proven that 𝑊2 (𝜈, 𝜋) > 𝑡 implies KL(𝜈 k 𝜋) ≥ 𝑡 2 /(2𝜎 2 ), which is seen to be
equivalent to the T2 inequality.
Observe in particular that this theorem implies the Otto–Villani theorem (Exercise 1.16):
due to tensorization (Theorem 2.3.16) and the Herbst argument (Lemma 2.4.6), a log-
Sobolev inequality implies high-dimensional sub-Gaussian concentration of Lipschitz
functions, which by Gözlan’s theorem is equivalent to a T2 inequality.
\[ \alpha_\pi(\varepsilon) \le \exp\Bigl(-\frac{\varepsilon^2}{8\sigma^2}\Bigr) \qquad \text{for all } \varepsilon \ge \sqrt{8 \ln 2}\, \sigma \,. \]
Observe that this provides no information when 𝜀 is small, whereas by definition we know that 𝛼_𝜋(0) = ½. In this section, we will study finer questions about the concentration function. More generally, for 𝑝 ∈ (0, 1) and 𝜀 > 0, we introduce the quantity
\[ \omega_\pi(p, \varepsilon) := \inf\{\pi(A_\varepsilon) \mid A \subseteq X \text{ Borel},\ \pi(A) \ge p\} \,, \]
and we can ask about the deviation of 𝜔_𝜋(𝑝, 𝜀) from 𝑝 when 𝜀 is small. In some special cases, we can even determine the function 𝜔_𝜋 exactly. The study of this question will bring us to the classical geometric problem of isoperimetry. In its simplest guise, it asks: among all plane curves which enclose a region of prescribed area, which ones have the least perimeter? It is well-known that, among regular curves, circles provide the answer to this question. As we shall see, the isoperimetric question, once generalized to abstract spaces, contains a wealth of information about concentration phenomena.
Theorem 2.5.1 (spherical isoperimetry). Let 𝜎_𝑑 denote the uniform measure on the 𝑑-dimensional unit sphere S^𝑑, and let 𝐴 ⊆ S^𝑑 be a Borel subset with 𝜎_𝑑(𝐴) ∉ {0, 1}. Let 𝐶 be a spherical cap with the same measure as 𝐴. Then, for all 𝜀 > 0,
\[ \sigma_d(A_\varepsilon) \ge \sigma_d(C_\varepsilon) \,. \]
We give a few reminders about spherical geometry. We equip S𝑑 with its geodesic
metric d, so that the distance between two points 𝑥, 𝑦 ∈ S𝑑 is equal to the angle between
𝑥 and 𝑦. A spherical cap is a geodesic ball, that is, it is a set of the form B(𝑥 0, 𝑟 ) for some
𝑥 0 ∈ S𝑑 and 𝑟 > 0, where the balls are defined w.r.t. d.
Note that Theorem 2.5.1 identifies the exact function 𝜔 𝜎𝑑 .
To see how an isoperimetric result naturally leads to a concentration result, suppose
that 𝐴 has measure ½. Then, the corresponding spherical cap 𝐶 can be taken to be half of the sphere, 𝐶 = B(𝑥₀, π/2), and so
\[ \sigma_d(A_\varepsilon) \ge \sigma_d(C_\varepsilon) = \sigma_d\Bigl(\mathrm{B}\Bigl(x_0,\, \frac{\pi}{2} + \varepsilon\Bigr)\Bigr) \,. \]
To obtain an upper bound on the concentration function 𝛼_{𝜎_𝑑}, it therefore suffices to lower bound the volume of the spherical cap. This leads to the following result, which we leave as an exercise (Exercise 2.14).
Theorem 2.5.2 (concentration on the sphere). Let 𝜎_𝑑 be the uniform measure on the unit sphere S^𝑑 in dimension 𝑑 ≥ 2, equipped with the geodesic distance 𝖽. Then,
\[ \alpha_{\sigma_d}(\varepsilon) \le \exp\Bigl(-\frac{(d-1)\, \varepsilon^2}{2}\Bigr) \qquad \text{for all } \varepsilon > 0 \,. \]
For 𝑥₀ ∈ R^𝑑 and 𝑡 ∈ R, define the half-space
\[ H_{x_0, t} := \{x \in \mathbb{R}^d \mid \langle x, x_0 \rangle \le t\} \,. \]

Theorem 2.5.3 (Gaussian isoperimetry). Let 𝛾_𝑑 denote the standard Gaussian measure on R^𝑑, and let 𝐴 ⊆ R^𝑑 be a Borel subset with 𝛾_𝑑(𝐴) ∉ {0, 1}. Let 𝐻_{𝑥₀,𝑡} be a half-space with the same measure as 𝐴. Then, for all 𝜀 > 0,
\[ \gamma_d(A_\varepsilon) \ge \gamma_d\bigl((H_{x_0, t})_\varepsilon\bigr) \,. \tag{2.5.4} \]
We can write this result more explicitly as follows. By rotational invariance of the Gaussian, we can take 𝑥₀ to be any unit vector 𝑒, in which case the measure of 𝐻_{𝑒,𝑡} is 𝛾_𝑑(𝐻_{𝑒,𝑡}) = Φ(𝑡), where Φ is the Gaussian CDF. Since (𝐻_{𝑒,𝑡})_𝜀 = 𝐻_{𝑒,𝑡+𝜀}, it follows that
\[ \alpha_{\gamma_d}(\varepsilon) \le \Phi(-\varepsilon) \le \frac12 \exp\Bigl(-\frac{\varepsilon^2}{2}\Bigr) \,. \]
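The elementary Gaussian tail bound Φ(−𝜀) ≤ ½ exp(−𝜀²/2) used in the last display can be checked numerically on a grid. The sketch below is ours (the text contains no code):

```python
import math

def Phi(t):
    # standard Gaussian CDF via the error function
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

for k in range(0, 1000):
    eps = k / 100
    assert Phi(-eps) <= 0.5 * math.exp(-eps**2 / 2) + 1e-15
```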
We now pause to give a remark on proofs. Since these isoperimetric inequalities require a detailed understanding of the measure (including the optimal sets in the inequality), they are considerably more difficult to prove than the other results we have seen so far (e.g., a log-Sobolev inequality). In particular, they can usually only be established for measures which are simple in some regard, e.g., measures which enjoy many symmetries. Hence, we will not prove them here.
It is often convenient to pass to a differential form of the isoperimetric inequality, which is obtained by sending 𝜀 ↘ 0. This is formalized as follows.

Definition 2.5.5 (Minkowski content). Given a non-empty Borel set 𝐴 and a measure 𝜋 on a Polish space (X, 𝖽), the Minkowski content of 𝐴 under 𝜋 is
\[ \pi^+(A) := \liminf_{\varepsilon \searrow 0} \frac{\pi(A_\varepsilon) - \pi(A)}{\varepsilon} \,. \]
Definition 2.5.6 (isoperimetric profile). For a measure 𝜋 on a Polish space (X, 𝖽), the isoperimetric profile of 𝜋, denoted I_𝜋 : [0, 1] → R₊, is the function
\[ \mathcal{I}_\pi(p) := \inf\{\pi^+(A) \mid A \subseteq X \text{ Borel},\ \pi(A) = p\} \,. \]
Sending 𝜀 ↘ 0 in the Gaussian isoperimetric inequality yields
\[ \gamma_d^+(A) \ge \varphi\bigl(\Phi^{-1}(\gamma_d(A))\bigr) \,, \tag{2.5.7} \]
where 𝜑 = Φ′ is the Gaussian density. Hence, we can identify the isoperimetric profile of the standard Gaussian as
\[ \mathcal{I}_{\gamma_d} = \varphi \circ \Phi^{-1} \,. \]
Actually, the result (2.5.7) is equivalent to the Gaussian isoperimetric inequality (2.5.4).
Here, (2.5.4) is called the integral form of the inequality, whereas (2.5.7) is called the
differential form. The following theorem shows how to convert between the two forms.
Theorem 2.5.8 ([BH97]). Let I : (0, 1) → R_{>0}. Define the increasing function 𝐹 such that 𝐹(0) = ½ and 𝑓 ∘ 𝐹^{−1} = I, where 𝑓 = 𝐹′; equivalently, we can take 𝐹^{−1}(𝑝) = ∫_{1/2}^𝑝 I(𝑡)^{−1} d𝑡. Then, the following statements are equivalent.
1. For all 𝜀 > 0 and all Borel 𝐴 with 𝜋(𝐴) ∉ {0, 1},
\[ \pi(A_\varepsilon) \ge F\bigl(F^{-1}(\pi(A)) + \varepsilon\bigr) \,. \]
2. For all Borel 𝐴 with 𝜋(𝐴) ∉ {0, 1},
\[ \pi^+(A) \ge \mathcal{I}\bigl(\pi(A)\bigr) \,. \]
Proof sketch. Let 𝜔̄(𝑝, 𝜀) := 𝐹(𝐹^{−1}(𝑝) + 𝜀). Then, 𝜔̄ satisfies the semigroup property 𝜔̄(𝜔̄(𝑝, 𝜀), 𝜀′) = 𝜔̄(𝑝, 𝜀 + 𝜀′), and using this one can show that to prove 𝜋(𝐴_𝜀) ≥ 𝜔̄(𝜋(𝐴), 𝜀) it suffices to consider 𝜀 ↘ 0. A Taylor expansion yields
\[ \bar\omega(p, \varepsilon) = p + \mathcal{I}(p)\, \varepsilon + o(\varepsilon) \,, \]
from which we deduce that 𝜋⁺(𝐴) ≥ I(𝜋(𝐴)) for all 𝐴 if and only if 𝜋(𝐴_𝜀) ≥ 𝜔̄(𝜋(𝐴), 𝜀) for all 𝐴 and all 𝜀 > 0.
For the two-sided exponential density 𝑥 ↦ 𝜇(𝑥) := ½ exp(−|𝑥|), the isoperimetric profile is known to be I_𝜇(𝑝) = min(𝑝, 1 − 𝑝). Hence, the Cheeger isoperimetric inequality roughly asserts that the isoperimetric properties of 𝜋 are at least as good as those of 𝜇.
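For the two-sided exponential, half-lines (−∞, 𝑥] attain the profile: their Minkowski content is the boundary density ½e^{−|𝑥|}, which equals min(𝐹(𝑥), 1 − 𝐹(𝑥)). This closed-form fact can be checked on a grid; the sketch below is ours, not from the text:

```python
import math

def F(x):
    # CDF of the two-sided exponential mu(x) = (1/2) exp(-|x|)
    return 0.5 * math.exp(x) if x <= 0 else 1 - 0.5 * math.exp(-x)

for k in range(-50, 51):
    x = k / 10
    dens = 0.5 * math.exp(-abs(x))   # Minkowski content of the half-line (-inf, x]
    assert abs(dens - min(F(x), 1 - F(x))) < 1e-12
```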
The inequality in (2.5.10) is the differential form of the inequality. By applying Theorem 2.5.8, one shows that the inequality (2.5.10) implies, for any 𝜀 ∈ [0, Ch],
\[ \pi(A_\varepsilon) - \pi(A) \ge \frac{\varepsilon}{2\,\mathrm{Ch}}\, \pi(A)\, \pi(A^{\mathrm{c}}) \,. \tag{2.5.11} \]
For all 𝜀 > 0, Theorem 2.5.8 also implies that
\[ \alpha_\pi(\varepsilon) \le \exp\Bigl(-\frac{\varepsilon}{\mathrm{Ch}}\Bigr) \,, \]
so 𝜋 enjoys at least subexponential concentration.
Such isoperimetric inequalities will play a key role when we study Metropolis-adjusted
sampling algorithms in Chapter 7. For now, however, our goal is to establish an equivalence
between the Cheeger isoperimetric inequality and a functional version of it.
To pass from a functional inequality to an inequality involving sets, we can usually
apply the functional inequality to the indicator of a set. To go the other way around, we
need to represent a function via its level sets, which is achieved via the coarea inequality.
For 𝑓 : X → R, define the local Lipschitz constant
\[ \|\nabla f\|(x) := \limsup_{y \in X,\; \mathsf{d}(x, y) \searrow 0} \frac{|f(x) - f(y)|}{\mathsf{d}(x, y)} \,. \]
In “nice” spaces, the coarea inequality is actually an equality, but we will not need this.
Proof. By an approximation argument we may assume that 𝑓 is bounded, and by adding a constant to 𝑓 we may suppose 𝑓 ≥ 0. Let 𝑓_𝜀(𝑥) := sup_{𝖽(𝑥,·)<𝜀} 𝑓 and 𝐴_𝑡 := {𝑓 > 𝑡}. We can check that (𝐴_𝑡)_𝜀 = {𝑓_𝜀 > 𝑡}, and for 𝑔 ≥ 0 we have the formula ∫ 𝑔 d𝜋 = ∫₀^∞ 𝜋{𝑔 > 𝑡} d𝑡.
Theorem 2.5.14. Let 𝜋 ∈ P₁(X) and let Ch > 0. The following are equivalent.
1. (Cheeger isoperimetric inequality) For all Borel sets 𝐴 ⊆ X,
\[ \pi^+(A) \ge \frac{\pi(A)\, \pi(A^{\mathrm{c}})}{\mathrm{Ch}} \,. \]
2. (𝐿¹ Poincaré inequality) For all 𝑓,
\[ \mathbb{E}_\pi|f - \mathbb{E}_\pi f| \le 2\,\mathrm{Ch}\, \mathbb{E}_\pi \|\nabla f\| \,. \tag{2.5.15} \]
Proof sketch. (2) =⇒ (1): Apply (2.5.15) to an approximation 𝑓 of the indicator function
1𝐴 , so that E𝜋 |𝑓 − E𝜋 𝑓 | ≈ 2 𝜋 (𝐴) (1 − 𝜋 (𝐴)) and E𝜋 k∇𝑓 k ≈ 𝜋 + (𝐴).
(1) ⟹ (2): Let 𝐴_𝑡 := {𝑓 > 𝑡}. Applying the coarea inequality and the Cheeger isoperimetric inequality,
\[
2\,\mathrm{Ch}\, \mathbb{E}_\pi\|\nabla f\| \ge 2\,\mathrm{Ch} \int_{-\infty}^\infty \pi^+\{f > t\} \,\mathrm{d}t
\ge 2 \int_{-\infty}^\infty \pi(A_t)\, \pi(A_t^{\mathrm{c}}) \,\mathrm{d}t
= \int_{-\infty}^\infty \mathbb{E}_\pi|\mathbb{1}_{A_t} - \pi(A_t)| \,\mathrm{d}t
\]
\[
\ge \sup_{\|g\|_{L^\infty(\pi)} \le 1} \int_{-\infty}^\infty \int g\, \{\mathbb{1}_{A_t} - \pi(A_t)\} \,\mathrm{d}\pi \,\mathrm{d}t
= \sup_{\|g\|_{L^\infty(\pi)} \le 1} \int_{-\infty}^\infty \int \{g - \mathbb{E}_\pi g\}\, \mathbb{1}_{A_t} \,\mathrm{d}\pi \,\mathrm{d}t
= \sup_{\|g\|_{L^\infty(\pi)} \le 1} \int \{g - \mathbb{E}_\pi g\}\, f \,\mathrm{d}\pi
= \mathbb{E}_\pi|f - \mathbb{E}_\pi f| \,.
\]
Definition 2.5.16 (𝐿^𝑝–𝐿^𝑞 Poincaré inequality). For 𝑝, 𝑞 ∈ [1, ∞] with 𝑞 ≥ 𝑝, the 𝐿^𝑝–𝐿^𝑞 Poincaré inequality asserts that for all smooth 𝑓 : R^𝑑 → R,
\[ \|f - \mathbb{E}_\pi f\|_{L^p(\pi)} \le C_{p,q}\, \bigl\| \|\nabla f\| \bigr\|_{L^q(\pi)} \,. \]
In this new notation, the usual Poincaré inequality is an 𝐿²–𝐿² Poincaré inequality with 𝐶_{2,2} = √𝐶_PI, whereas the inequality (2.5.15) is an 𝐿¹–𝐿¹ Poincaré inequality.
These inequalities form a hierarchy via Hölder's inequality: if 𝜋 satisfies an 𝐿^𝑝–𝐿^𝑞 Poincaré inequality, then for any 𝑝̂ ≥ 𝑝, with 𝑞̂ determined by 1/𝑞 = 1/𝑞̂ + (𝑝̂ − 𝑝)/(𝑝 𝑝̂), 𝜋 satisfies an 𝐿^𝑝̂–𝐿^𝑞̂ Poincaré inequality with
\[ C_{\hat p, \hat q} \lesssim \frac{\hat p}{p}\, C_{p,q} \,. \]
Proof. Let 𝑓 satisfy med_𝜋 𝑓 = 0, which we can arrange by adding a constant. Define the function 𝑔 := (sgn 𝑓) |𝑓|^{𝑝̂/𝑝}, which still satisfies med_𝜋 𝑔 = 0. By using the equivalence between the mean and the median (Lemma 2.4.4) and applying the 𝐿^𝑝–𝐿^𝑞 Poincaré inequality to 𝑔 together with Hölder's inequality,
\[
\|f - \operatorname{med}_\pi f\|_{L^{\hat p}(\pi)}^{\hat p/p} = \|g - \operatorname{med}_\pi g\|_{L^p(\pi)} \lesssim \|g - \mathbb{E}_\pi g\|_{L^p(\pi)} \le C_{p,q}\, \bigl\| \|\nabla g\| \bigr\|_{L^q(\pi)}
\]
\[
= \frac{\hat p}{p}\, C_{p,q}\, \bigl\| |f|^{\hat p/p - 1}\, \|\nabla f\| \bigr\|_{L^q(\pi)}
\le \frac{\hat p}{p}\, C_{p,q}\, \|f - \operatorname{med}_\pi f\|_{L^{\hat p}(\pi)}^{\hat p/p - 1}\, \bigl\| \|\nabla f\| \bigr\|_{L^{\hat q}(\pi)} \,,
\]
where we leave it to the reader to check that the exponents work out correctly. If we rearrange this inequality and apply Lemma 2.4.4 again, then
\[ \|f - \mathbb{E}_\pi f\|_{L^{\hat p}(\pi)} \lesssim \|f - \operatorname{med}_\pi f\|_{L^{\hat p}(\pi)} \lesssim \frac{\hat p}{p}\, C_{p,q}\, \bigl\| \|\nabla f\| \bigr\|_{L^{\hat q}(\pi)} \,. \]
Thus, we have the following implications: for any 𝑝 ∈ (2, ∞),
\[ (L^1\text{–}L^1) \implies (L^2\text{–}L^2) \implies \cdots \implies (L^p\text{–}L^p) \,. \]
In particular, the first implication together with the equivalence in Theorem 2.5.14 shows
that the Cheeger isoperimetric inequality implies the Poincaré inequality.
Also, given any 𝐿𝑝 –𝐿𝑞 Poincaré inequality, by Jensen’s inequality we can trivially
make it weaker by decreasing 𝑝 or increasing 𝑞; hence, every 𝐿𝑝 –𝐿𝑞 Poincaré inequality
implies an 𝐿 1 –𝐿 ∞ Poincaré inequality. On the other hand, for any 1 ≤ 𝑝 ≤ 𝑞 < ∞, an
𝐿 1 –𝐿 1 Poincaré implies an 𝐿𝑝 –𝐿𝑝 Poincaré, which trivially implies an 𝐿𝑝 –𝐿𝑞 Poincaré
inequality. We conclude that among these inequalities, the 𝐿 1 –𝐿 1 inequality is the strongest
and the 𝐿 1 –𝐿 ∞ inequality is the weakest.
We now wish to sketch the proof of a deep result by E. Milman, which states that for log-concave measures, the hierarchy can be reversed. The formal statement is as follows: if 𝜋 is a log-concave probability measure on R^𝑑, then
\[ C_{1,1} \lesssim C_{1,\infty} \,. \]
The proof of Milman’s theorem will require some preparations. The first fact that
we need is a deep result in its own right. Typically it is proven with geometric measure
theory by studying the isoperimetric problem, and we omit the proof.
1
I𝜋 (𝑝) ≥ I𝜋 𝑝. (2.5.21)
2
Hence, in order to prove Cheeger’s isoperimetric inequality, we need only find a suitable
lower bound for I𝜋 ( 21 ).
The next idea is that instead of applying the 𝐿¹–𝐿^∞ Poincaré inequality directly to an indicator function 𝟙_𝐴, we will first regularize 𝟙_𝐴 using the Langevin semigroup (𝑃_𝑡)_{𝑡≥0} with stationary distribution 𝜋. We start with a semigroup calculation.
Proposition 2.5.22. Assume that the Markov semigroup (𝑃_𝑡)_{𝑡≥0} is reversible and satisfies the curvature-dimension condition CD(𝛼, ∞) for some 𝛼 ∈ R. Then,
\[ P_t(f^2) - (P_t f)^2 \ge \frac{\exp(2\alpha t) - 1}{\alpha}\, \Gamma(P_t f, P_t f) \,, \tag{2.5.23} \]
where we interpret (exp(2𝛼𝑡) − 1)/𝛼 = 2𝑡 when 𝛼 = 0.
Proof. We recall the calculation that we performed for the local Poincaré inequality (Theorem 2.2.11):
\[ P_t(f^2) - (P_t f)^2 \ge 2\, \Gamma(P_t f, P_t f) \int_0^t \exp(2\alpha s) \,\mathrm{d}s = \frac{\exp(2\alpha t) - 1}{\alpha}\, \Gamma(P_t f, P_t f) \,. \]
This establishes the first inequality. In particular, for 𝛼 = 0,
\[ \sqrt{\Gamma(P_t f, P_t f)} \le \frac{1}{\sqrt{2t}}\, \sqrt{P_t(f^2)} \,, \]
so that
\[ \bigl\| \sqrt{\Gamma(P_t f, P_t f)} \bigr\|_{L^p(\pi)} \le \frac{1}{\sqrt{2t}}\, \{\mathbb{E}_\pi P_t(|f|^p)\}^{1/p} \le \frac{1}{\sqrt{2t}}\, \|f\|_{L^p(\pi)} \,. \]
The last inequality is the dual of the 𝑝 = ∞ case. Indeed, for 𝑔 with ‖𝑔‖_{𝐿^∞(𝜋)} ≤ 1,
\[ \partial_t \int (f - P_t f)\, g \,\mathrm{d}\pi = -\int P_t \mathscr{L} f\; g \,\mathrm{d}\pi = \int \Gamma(f, P_t g) \,\mathrm{d}\pi \,, \]
and hence, by the Cauchy–Schwarz inequality for the carré du champ (Exercise 1.6),
\[
\int (f - P_t f)\, g \,\mathrm{d}\pi = \int_0^t \int \Gamma(f, P_s g) \,\mathrm{d}\pi \,\mathrm{d}s
\le \int_0^t \int \sqrt{\Gamma(f, f)}\, \sqrt{\Gamma(P_s g, P_s g)} \,\mathrm{d}\pi \,\mathrm{d}s
\]
\[
\le \bigl\| \sqrt{\Gamma(f, f)} \bigr\|_{L^1(\pi)} \int_0^t \bigl\| \sqrt{\Gamma(P_s g, P_s g)} \bigr\|_{L^\infty(\pi)} \,\mathrm{d}s
\le \bigl\| \sqrt{\Gamma(f, f)} \bigr\|_{L^1(\pi)}\, \|g\|_{L^\infty(\pi)} \int_0^t \frac{1}{\sqrt{2s}} \,\mathrm{d}s
\le \sqrt{2t}\, \bigl\| \sqrt{\Gamma(f, f)} \bigr\|_{L^1(\pi)} \,.
\]
Remark 2.5.26. The inequality (2.5.23) can be regarded as a “reverse Poincaré” inequality
because it upper bounds the size of the gradient via a variance term. Similarly to Theo-
rem 2.2.11, the inequality (2.5.23) can also be shown to be equivalent to CD(𝛼, ∞).
We are now ready to prove Milman's theorem.

Proof. Let 𝐴 be a Borel set. Applying the 𝐿¹–𝐿^∞ Poincaré inequality to 𝑃_𝑡 𝟙_𝐴 together with the preceding proposition,
\[ \|P_t \mathbb{1}_A - \pi(A)\|_{L^1(\pi)} \le C_{1,\infty}\, \bigl\| \|\nabla P_t \mathbb{1}_A\| \bigr\|_{L^\infty(\pi)} \le \frac{C_{1,\infty}}{\sqrt{2t}}\, \|\mathbb{1}_A\|_{L^\infty(\pi)} \le \frac{C_{1,\infty}}{\sqrt{2t}} \,. \]
Since ‖𝟙_𝐴 − 𝜋(𝐴)‖_{𝐿¹(𝜋)} = 2 𝜋(𝐴) 𝜋(𝐴^c) and ‖𝟙_𝐴 − 𝑃_𝑡 𝟙_𝐴‖_{𝐿¹(𝜋)} ≤ √(2𝑡) 𝜋⁺(𝐴), the triangle inequality hence yields
\[ \sqrt{2t}\, \pi^+(A) \ge 2\, \pi(A)\, \pi(A^{\mathrm{c}}) - \frac{C_{1,\infty}}{\sqrt{2t}} \,. \]
Now choose √(2𝑡) = 2𝐶_{1,∞}/(𝜋(𝐴) 𝜋(𝐴^c)) to obtain
\[ \pi^+(A) \ge \frac{1}{2\, C_{1,\infty}}\, \pi(A)^2\, \pi(A^{\mathrm{c}})^2 \,. \]
This inequality is not fully satisfactory, but if we take 𝜋(𝐴) = ½ then we deduce from this that I_𝜋(½) ≥ 1/(32 𝐶_{1,∞}), and from (2.5.21) we conclude.
As with the Cheeger isoperimetric inequality, isoperimetry of Gaussian type can also be captured via a functional inequality. The following result, due to Bobkov, asserts that for all locally Lipschitz 𝑓 : R^𝑑 → [0, 1],
\[ \mathcal{I}_{\gamma_d}\bigl(\mathbb{E}_{\gamma_d} f\bigr) \le \mathbb{E}_{\gamma_d} \sqrt{\mathcal{I}_{\gamma_d}(f)^2 + \|\nabla f\|^2} \,. \]
As this formulation suggests, Theorem 2.5.27 has a proof via Markov semigroup theory, for which we refer readers to [BGL14, §8.5.2]. By converting the functional inequality back into an isoperimetric statement, one can deduce the following comparison theorem.
define the gradient flow of 𝑓 to be a curve (𝑝_𝑡)_{𝑡≥0} such that the velocity 𝑝̇_𝑡 of the curve equals −∇𝑓(𝑝_𝑡) for all 𝑡 ≥ 0.

We pause to give a simple example. Suppose that M = R^𝑑, and we pick a smooth mapping 𝑝 ↦ 𝐴_𝑝, where 𝐴_𝑝 is a positive definite 𝑑 × 𝑑 matrix for each 𝑝 ∈ R^𝑑. This induces a Riemannian metric via ⟨𝑢, 𝑣⟩_𝑝 := ⟨𝑢, 𝐴_𝑝 𝑣⟩, where ⟨·, ·⟩ (without a subscript) denotes the usual Euclidean inner product. If (𝑝_𝑡)_{𝑡∈R} is a smooth curve in R^𝑑 and its usual time derivative is (𝑝̇_𝑡)_{𝑡∈R}, then we know that
\[ \partial_t f(p_t) = \langle \nabla f(p_t), \dot p_t \rangle = \langle A_{p_t}^{-1} \nabla f(p_t), \dot p_t \rangle_{p_t} \,. \]
Hence, the manifold gradient ∇_M 𝑓 is given by ∇_M 𝑓(𝑝) = 𝐴_𝑝^{−1} ∇𝑓(𝑝), where ∇𝑓 is the Euclidean gradient. When 𝐴_𝑝 = ∇²𝜙(𝑝) is obtained as the Hessian of a mapping 𝜙, then M is called a Hessian manifold.
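The formula ∇_M 𝑓(𝑝) = 𝐴_𝑝^{−1} ∇𝑓(𝑝) can be illustrated with a tiny computation (ours, not from the text; the quadratic 𝑓, the metric 𝐴, and the step size ℎ are hypothetical choices). When 𝜙 = 𝑓, the manifold gradient is the Newton direction, and one discrete gradient-flow step decreases 𝑓:

```python
import numpy as np

# Hessian-manifold metric A_p = Hess(phi) with phi = f (constant here, f quadratic).
A = np.diag([1.0, 10.0])
f = lambda p: 0.5 * p @ A @ p
grad_f = lambda p: A @ p
p = np.array([1.0, 1.0])
h = 0.1
# one Euler step of the manifold gradient flow: p - h * A^{-1} grad f(p)
p_new = p - h * np.linalg.solve(A, grad_f(p))
assert f(p_new) < f(p)                  # the flow decreases f
```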
A vector field on M is a mapping 𝑋 : M → 𝑇 M such that 𝑋 (𝑝) ∈ 𝑇𝑝 M for all 𝑝 ∈ M.7
A single vector 𝑣 ∈ 𝑇𝑝 M can be thought of as a differential operator; for 𝑓 ∈ C ∞ (M), we
can define the action of 𝑣 on 𝑓 via 𝑣 (𝑓 ) B (𝑑 𝑓 ) 𝑝 𝑣. Similarly, a vector field 𝑋 acts on 𝑓 and
produces a function 𝑋 𝑓 : M → R, defined by 𝑋 𝑓 (𝑝) = 𝑋 (𝑝)𝑓 for all 𝑝 ∈ M. For example,
on R𝑑 , a vector field can be identified with a mapping R𝑑 → R𝑑 , and it differentiates
functions via 𝑋 𝑓 (𝑝) = h∇𝑓 (𝑝), 𝑋 (𝑝)i.
We would also like to differentiate vector fields along other vector fields, and there are
two main ways of doing so. The first is called the Lie derivative, and it can be defined on
any smooth manifold without the need for a Riemannian metric, and is consequently less
important for our discussion. The second is the Levi–Civita connection, which given
vector fields 𝑋 and 𝑌 , outputs another vector field ∇𝑋 𝑌 . This connection is characterized
by various properties, including compatibility with the Riemannian metric: for all vector
fields 𝑋, 𝑌, and 𝑍, we have the chain rule
\[ Z \langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla_Z Y \rangle \,. \tag{2.6.1} \]
Here, the vector field 𝑍 is differentiating the scalar function 𝑝 ↦ ⟨𝑋(𝑝), 𝑌(𝑝)⟩_𝑝. Since we
do not aim to perform many Riemannian calculations here, we omit most of the other
properties for simplicity. However, we mention one key fact, which is that for any smooth
function 𝑓 , it holds that ∇ 𝑓 𝑋 𝑌 = 𝑓 ∇𝑋 𝑌 , where 𝑓 𝑋 is the vector field (𝑓 𝑋 )(𝑝) = 𝑓 (𝑝) 𝑋 (𝑝).
This property implies that the mapping (𝑋, 𝑌 ) ↦→ ∇𝑋 𝑌 is tensorial in its first argument,
that is, (∇𝑋 𝑌 )(𝑝) only depends on the value 𝑋 (𝑝) of 𝑋 at 𝑝.
The tensorial property of the Levi–Civita connection allows us to compute the deriva-
tive of a vector field 𝑌 along a curve 𝑐 : R → M. Namely, for 𝑡 ∈ R, we can define
𝐷𝑐 𝑌 (𝑡) B (∇𝑐¤(𝑡) 𝑌 )(𝑐 (𝑡)), which makes sense because we can extend 𝑐¤ to a vector field
𝑋 on M and deduce that (∇𝑋 𝑌 )(𝑐 (𝑡)) only depends on 𝑋 (𝑐 (𝑡)) = 𝑐¤ (𝑡) (and not on the
choice of extension 𝑋 ). Then, 𝐷𝑐 𝑌 is called the covariant derivative of 𝑌 along the curve
7 Geometers would say that 𝑋 is a section of the tangent bundle.
𝑐. From there, we can define the parallel transport of a vector 𝑣 0 ∈ 𝑇𝑐 (0) M along the
curve 𝑐 to be the unique vector field (𝑣 (𝑡))𝑡 ∈R defined along the curve 𝑐 with 𝑣 (0) = 𝑣 0
such that the covariant derivative vanishes: 𝐷𝑐 𝑣 = 0. The parallel transport is a canonical
way of identifying two different tangent spaces on M. Due to compatibility with the
metric, it has the property that if 𝑐 (0) = 𝑝, 𝑐 (1) = 𝑞, and 𝑃𝑐 𝑣 ∈ 𝑇𝑞 M denotes the parallel
transport of 𝑣 ∈ 𝑇𝑝 M along 𝑐 for time 1, then 𝑃𝑐 : 𝑇𝑝 M → 𝑇𝑞 M is an isometry.
We have already seen the idea of a length-minimizing curve, or a geodesic. Recall that the Riemannian metric induces a distance on M via
\[ \mathsf{d}(p, q) = \inf\Bigl\{ \int_0^1 \|\dot\gamma(t)\|_{\gamma(t)} \,\mathrm{d}t \Bigm| \gamma(0) = p,\ \gamma(1) = q \Bigr\} \,. \]
Even though we have not given all of the definitions precisely, we will now work through one example to give the reader the flavor of the computations. Suppose that (𝑝_𝑡)_{𝑡∈R} is a curve on M; then we know that ∂_𝑡 𝑓(𝑝_𝑡) = ⟨∇𝑓(𝑝_𝑡), 𝑝̇_𝑡⟩_{𝑝_𝑡}. If we set 𝑔(𝑝_𝑡) := ⟨∇𝑓(𝑝_𝑡), 𝑝̇_𝑡⟩_{𝑝_𝑡} along the curve, then by definition we have ∂_𝑡² 𝑓(𝑝_𝑡) = 𝑝̇_𝑡(𝑔)(𝑝_𝑡). By compatibility of the Levi–Civita connection with the metric (2.6.1), this equals ⟨∇_{𝑝̇_𝑡} ∇𝑓(𝑝_𝑡), 𝑝̇_𝑡⟩_{𝑝_𝑡} + ⟨∇𝑓(𝑝_𝑡), ∇_{𝑝̇_𝑡} 𝑝̇_𝑡⟩_{𝑝_𝑡}. The first term is
8 The trace is defined as usual: if 𝐴 is a linear mapping on 𝑇_𝑝 M, then after choosing an arbitrary orthonormal basis 𝑒₁, …, 𝑒_𝑑 of 𝑇_𝑝 M (w.r.t. the Riemannian metric), we have tr 𝐴 = Σ_{𝑖=1}^𝑑 ⟨𝑒ᵢ, 𝐴 𝑒ᵢ⟩_𝑝.
∇2 𝑓 (𝑝𝑡 ) [𝑝¤𝑡 , 𝑝¤𝑡 ]. For the second term, if 𝑝 is a geodesic, then the term ∇𝑝¤𝑡 𝑝¤𝑡 vanishes. This
shows that it is convenient to pick geodesic curves when computing Hessians.9
For 𝑓 ∈ C^∞(M), the Laplacian of 𝑓 is the function Δ𝑓 : M → R defined by Δ𝑓 := tr ∇²𝑓.⁸ In the Riemannian setting, Δ is usually called the Laplace–Beltrami operator. The Riemannian metric induces a volume measure, which we always denote by 𝔪. Throughout, when we abuse notation to refer to the density of an absolutely continuous measure 𝜇 ∈ P(M), we always refer to the density w.r.t. the volume measure, i.e., d𝜇/d𝔪. We have the integration by parts formula
\[ \int \Delta f\; g \,\mathrm{d}\mathfrak{m} = \int f\, \Delta g \,\mathrm{d}\mathfrak{m} = -\int \langle \nabla f, \nabla g \rangle \,\mathrm{d}\mathfrak{m} \,. \]
Riem(𝑊 , 𝑋, 𝑌, 𝑍 ) B h∇𝑋 ∇𝑊 𝑌 − ∇𝑊 ∇𝑋 𝑌 + ∇ [𝑊 ,𝑋 ] 𝑌, 𝑍 i .
Here, [𝑊 , 𝑋 ] is the Lie bracket of 𝑊 and 𝑋 , which is the vector field 𝑈 defined as the
commutator: 𝑈 𝑓 B 𝑊 𝑋 𝑓 − 𝑋𝑊 𝑓 . This tensor is obviously an unwieldy object, and it is
unclear whether anyone fully understands its complexities. Nevertheless, we may begin
to get a handle on it by observing that at its core, it measures the lack of commutativity of
certain differential operators, which we stated was the basis for curvature in Section 2.2.1.
9 When computing first-order derivatives, it is only important that the first-order behavior of the curve is
correct (i.e., the curve has the correct tangent vector). When computing second-order derivatives, it should
come at no surprise that the second-order behavior of the curve begins to matter.
Incidentally, if ∇𝑓 (𝑝𝑡 ) = 0, i.e., we are at a stationary point, then the second term vanishes regardless
of the curve 𝑝. Hence, the Hessian of 𝑓 can be defined on any smooth manifold without the need for a
Riemannian metric, provided that we restrict ourselves to stationary points of 𝑓 . This observation is used
heavily in Morse theory.
On Euclidean space, it vanishes: Riem = 0. Also, the Riemann curvature tensor is fully
determined by the sectional curvatures of M: given a two-dimensional subspace 𝑆 of
𝑇𝑝 M, the sectional curvature of 𝑆 can be defined as the Gaussian curvature of the two-
dimensional surface obtained by following geodesics with directions in 𝑆. Thus, we can
view the Riemann curvature tensor as collecting together all of the curvature information
from two-dimensional slices.
Fortunately, we will not need the Riemann curvature tensor in its full detail for our
purposes. With an eye towards probabilistic applications, we focus mainly on properties
such as the distortion of volumes of balls along geodesics, which only requires looking at
certain averages of the Riemann curvature. More specifically, for 𝑢, 𝑣 ∈ 𝑇𝑝 M, let

Ric(𝑢, 𝑣) B Σ𝑑𝑖=1 Riem(𝑢, 𝑒𝑖 , 𝑣, 𝑒𝑖 ) ,

where 𝑒 1, . . . , 𝑒𝑑 is an orthonormal basis of 𝑇𝑝 M.
The tensor Ric is called the Ricci curvature tensor. It is a powerful fact that many
useful geometric and probabilistic consequences, such as diameter bounds and functional
inequalities, are consequences of lower bounds on the Ricci curvature.
We also mention that one can further take the trace of the Ricci curvature tensor to
arrive at a single scalar function, known as the scalar curvature, but we shall not use it
in this book.
Recall from Exercise 1.11 that the optimal transport problem can be posed with other costs;
in particular, we take the cost to be 𝑐 (𝑥, 𝑦) = d(𝑥, 𝑦) 2 , where d is the distance induced by
the Riemannian metric. Suppose, for simplicity, that M is compact and that 𝜇 is absolutely
continuous (w.r.t. the volume measure). Then, there is a unique optimal transport map 𝑇
from 𝜇 to 𝜈, which is of the form 𝑇 (𝑥) = exp𝑥 (∇𝜓 (𝑥)), where −𝜓 is d2 /2-concave.
Moreover, there is a formal Riemannian structure on P2,ac (M). We can formally define
the tangent space at 𝜇 to be the 𝐿 2 (𝜇)-closure

𝑇𝜇 P2,ac (M) B {∇𝜓 | 𝜓 ∈ C ∞ (M)} ,

equipped with the norm k∇𝜓 k 𝜇 B ( ∫ k∇𝜓 k 2 d𝜇) 1/2 . Also, curves of measures are again
characterized by the continuity equation
𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 ,
where the equation is to be interpreted in a weak sense: for any test function 𝜑 ∈ C ∞ (M),
for a.e. 𝑡, it holds that
𝜕𝑡 ∫ 𝜑 d𝜇𝑡 = ∫ h∇𝜑, 𝑣𝑡 i d𝜇𝑡 .
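As a sanity check of the weak formulation, here is a small numerical sketch in one dimension (the translated-Gaussian curve and the constant vector field below are illustrative choices, not from the text):

```python
import numpy as np

# Sketch: check the weak continuity equation for the curve mu_t = N(t, 1) on R,
# i.e. a standard Gaussian translated at unit speed, driven by v_t(x) = 1.
# The weak formulation then reads  d/dt ∫ phi d(mu_t) = ∫ phi' d(mu_t).
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)          # samples from mu_0 = N(0, 1)

phi = np.sin                               # an arbitrary smooth test function
dphi = np.cos                              # its derivative

t, h = 0.5, 1e-4
# LHS: finite-difference approximation of d/dt E[phi(X + t)]
lhs = (phi(x + t + h).mean() - phi(x + t - h).mean()) / (2 * h)
# RHS: E[<grad phi, v_t>(X + t)] with v_t = 1
rhs = dphi(x + t).mean()
```

Both sides agree up to the finite-difference error, as expected.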
In short, aside from new technicalities introduced in the Riemannian setting (such as
the presence of a cut locus10 ), most of the facts familiar to us from the Euclidean setting
continue to hold when generalized appropriately. We refer to [Vil09] for more details.
We can check that this definition agrees with the usual notion of length on R𝑑 . By the
triangle inequality, if 𝛾 (0) = 𝑝 and 𝛾 (1) = 𝑞, then d(𝑝, 𝑞) ≤ len 𝛾.
10 Loosely speaking, the presence of a cut locus means that there are multiple minimizing geodesics
connecting two points. Think for instance of the two poles of a sphere.
Definition 2.6.3. We say that (X, d) is a geodesic space if for all 𝑝, 𝑞 ∈ X, there is a
constant-speed curve 𝛾 : [0, 1] → X such that 𝛾 (0) = 𝑝, 𝛾 (1) = 𝑞, and d(𝑝, 𝑞) = len 𝛾.
Here, “constant speed” means that for all 𝑠, 𝑡 ∈ [0, 1], d(𝛾 (𝑠), 𝛾 (𝑡)) = |𝑡 − 𝑠| d(𝑝, 𝑞).
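In Euclidean space this is easy to visualize: straight lines are constant-speed geodesics. A quick numerical check (a sketch):

```python
import numpy as np

# Sketch: in (R^3, Euclidean distance) the straight line gamma(t) = (1-t) p + t q
# is a constant-speed geodesic: d(gamma(s), gamma(t)) = |t - s| d(p, q).
p = np.array([0.0, 1.0, 2.0])
q = np.array([3.0, -1.0, 0.5])

def gamma(t):
    return (1 - t) * p + t * q

d_pq = np.linalg.norm(p - q)
max_err = max(
    abs(np.linalg.norm(gamma(s) - gamma(t)) - abs(t - s) * d_pq)
    for s in np.linspace(0, 1, 11)
    for t in np.linspace(0, 1, 11)
)
```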
Geodesic spaces are a broader class of spaces than Riemannian manifolds. In particular,
they do not have to have a smooth structure, and they can have “kinks”. For example,
the Wasserstein space (P2,ac (R𝑑 ),𝑊2 ) is not truly a Riemannian manifold, as it is infinite-
dimensional (along with other issues, e.g., it is not locally homeomorphic to a Hilbert
space), but from the considerations in Section 1.3.2 it follows that the Wasserstein space
is a geodesic space. The study of geodesic spaces is called metric geometry, and a
comprehensive treatment of this subject can be found in [BBI01].
There is a way to generalize the idea of a uniform bound on the sectional curvature to
the setting of geodesic spaces. It is based on comparing the sizes of triangles in X with
the corresponding sizes in a model space.
Definition 2.6.4 (model space). Let 𝜅 ∈ R. The model space M𝜅2 of curvature 𝜅 is the
standard two-dimensional Riemannian manifold with constant sectional curvature
equal to 𝜅, that is:
1. the hyperbolic plane H2 of curvature 𝜅 (that is, the usual hyperbolic plane but
with metric rescaled by 1/√−𝜅) if 𝜅 < 0;

2. the Euclidean plane R2 if 𝜅 = 0;

3. the sphere S2 of radius 1/√𝜅 if 𝜅 > 0.
Definition 2.6.5 (Alexandrov curvature). Let (X, d) be a geodesic space and let 𝜅 ∈ R.
We say that (X, d) has Alexandrov curvature bounded from below by 𝜅 (resp.
from above by 𝜅) if the following holds. For any triple of points 𝑎, 𝑏, 𝑐 ∈ X, and any
corresponding triple of points 𝑎¯, 𝑏¯, 𝑐¯ in the model space M𝜅2 such that

d(𝑎, 𝑏) = d(𝑎¯, 𝑏¯) , d(𝑎, 𝑐) = d(𝑎¯, 𝑐¯) , d(𝑏, 𝑐) = d(𝑏¯, 𝑐¯) ,
for any 𝑝 ∈ X in the geodesic joining 𝑎 to 𝑐, and any 𝑝¯ ∈ M𝜅2 in the geodesic joining
𝑎¯ to 𝑐¯ with d(𝑎, 𝑝) = d(𝑎¯, 𝑝¯), it holds that d(𝑏, 𝑝) ≥ d(𝑏¯, 𝑝¯) (resp. d(𝑏, 𝑝) ≤ d(𝑏¯, 𝑝¯)).
If such a curvature bound holds, then (X, d) is called an Alexandrov space.
Thus, triangles in X are thicker (resp. thinner) than their counterparts in M𝜅2 . The
advantage of this definition is that it can be stated using only the metric (and geodesic)
structure of X. For the case when 𝜅 = 0, there is another useful reformulation.
Proposition 2.6.6. Let (X, d) be a geodesic space. Then, (X, d) has Alexandrov curva-
ture bounded below by 0 (resp. bounded above by 0) if and only if the following holds.
For any constant-speed geodesic (𝑝𝑡 )𝑡 ∈[0,1] in X, any 𝑞 ∈ X, and any 𝑡 ∈ [0, 1],

d(𝑝𝑡 , 𝑞) 2 ≥ (1 − 𝑡) d(𝑝 0, 𝑞) 2 + 𝑡 d(𝑝 1, 𝑞) 2 − 𝑡 (1 − 𝑡) d(𝑝 0, 𝑝 1 ) 2

(resp. the same inequality with ≤).
We saw in Exercise 1.13 that (P2,ac (R𝑑 ),𝑊2 ) has non-negative Alexandrov curvature.
One can show that a Riemannian manifold has sectional curvature bounded by 𝜅 if and
only if the corresponding Alexandrov curvature bound holds.
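The sphere gives a concrete instance of the non-negative case; here is a numerical sketch (great-circle arcs are the constant-speed geodesics of the unit sphere):

```python
import numpy as np

# Sketch: the unit sphere S^2 has curvature 1 >= 0, so the kappa = 0 comparison
#   d(p_t, q)^2 >= (1-t) d(p_0, q)^2 + t d(p_1, q)^2 - t (1-t) d(p_0, p_1)^2
# should hold along constant-speed geodesics (great-circle arcs).

def dist(u, v):
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def slerp(u, v, t):
    # constant-speed geodesic from u to v (assumed not antipodal)
    th = dist(u, v)
    w = (np.sin((1 - t) * th) * u + np.sin(t * th) * v) / np.sin(th)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
worst = np.inf
for _ in range(200):
    p0, p1, q = (x / np.linalg.norm(x) for x in rng.standard_normal((3, 3)))
    if dist(p0, p1) > 3.0:        # avoid nearly antipodal endpoints
        continue
    for t in np.linspace(0.0, 1.0, 21):
        pt = slerp(p0, p1, t)
        lhs = dist(pt, q) ** 2
        rhs = ((1 - t) * dist(p0, q) ** 2 + t * dist(p1, q) ** 2
               - t * (1 - t) * dist(p0, p1) ** 2)
        worst = min(worst, lhs - rhs)
```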
Alexandrov curvature bounds enforce enough regularity that a satisfactory infinites-
imal theory can be developed for Alexandrov spaces. For instance, one can define the
notion of a tangent cone11 , and in the case of the Wasserstein space, its tangent cone
coincides with the definition of the tangent space that we gave in Section 1.3.2; see [AGS08,
§12.4] for details.
Γ2 (𝑓 ) = ½ {Δ(k∇𝑓 k 2 ) − 2 h∇𝑓 , ∇Δ𝑓 i} + h∇𝑓 , ∇2𝑉 ∇𝑓 i .
Unlike in Section 2.2.1, however, we now have to apply the Bochner identity
½ Δ(k∇𝑓 k 2 ) = h∇𝑓 , ∇Δ𝑓 i + k∇2 𝑓 k 2HS + Ric(∇𝑓 , ∇𝑓 ) , (2.6.7)
which shows that

Γ2 (𝑓 ) = k∇2 𝑓 k 2HS + (Ric + ∇2𝑉 )(∇𝑓 , ∇𝑓 ) .
Observe that in this formula, the curvature of the ambient space and the curvature of the
measure are placed on an equal footing through the tensor Ric + ∇2𝑉 . If Ric + ∇2𝑉 ⪰ 𝛼,
in the sense that Ric(𝑋, 𝑋 ) + h𝑋, ∇2𝑉 𝑋 i ≥ 𝛼 k𝑋 k 2 for any vector field 𝑋 on M, then
the curvature-dimension condition Γ2 ≥ 𝛼 Γ holds. Since the proof of the Bakry–Émery
theorem (Theorem 1.2.28) only relied on the CD(𝛼, ∞) condition (together with calculus
rules for the Markov semigroup, such as the chain rule), the theorem continues to hold in
the setting of weighted Riemannian manifolds.
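As a consistency check (a sketch), on Euclidean space, where Ric = 0, the Bochner identity (2.6.7) can be verified by direct differentiation in coordinates:

```latex
\frac{1}{2}\,\Delta(\|\nabla f\|^2)
  = \frac{1}{2} \sum_{i,j} \partial_{jj} (\partial_i f)^2
  = \sum_{i,j} \bigl\{ (\partial_{ij} f)^2 + \partial_i f \, \partial_{ijj} f \bigr\}
  = \|\nabla^2 f\|_{\mathrm{HS}}^2 + \langle \nabla f, \nabla \Delta f \rangle .
```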
Actually, we can refine the condition further as follows. If dim M = 𝑑, then
k∇2 𝑓 k 2HS ≥ (1/𝑑) (tr ∇2 𝑓 ) 2 = (1/𝑑) (Δ𝑓 ) 2 .
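This is just the Cauchy–Schwarz inequality between ∇2 𝑓 and the identity; a quick numerical check on a random symmetric matrix (illustrative sketch):

```python
import numpy as np

# Sketch: for any symmetric d x d matrix H (a "Hessian"), Cauchy-Schwarz gives
# ||H||_HS^2 >= (tr H)^2 / d, with equality iff H is a multiple of the identity.
rng = np.random.default_rng(2)
d = 6
A = rng.standard_normal((d, d))
H = (A + A.T) / 2
gap = np.sum(H ** 2) - np.trace(H) ** 2 / d                      # >= 0
gap_id = np.sum(np.eye(d) ** 2) - np.trace(np.eye(d)) ** 2 / d   # = 0
```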
This observation motivates the following definition.
1. CD(𝛼, 𝑑) holds.
As an example, one can show that the unit sphere S𝑑 satisfies Ric = 𝑑 − 1, so that the
CD(𝑑 − 1, 𝑑) condition holds. Then, using the Bakry–Émery theorem (Theorem 1.2.28), or
by using Markov semigroup calculus to prove that the curvature-dimension condition
implies Bobkov’s functional form of the Gaussian isoperimetric inequality (Theorem 2.5.27;
see [BGL14, Corollary 8.5.4]), one can now deduce results such as concentration on the
sphere (Theorem 2.5.2).
Besides providing an abstract framework for deriving functional inequalities, it is
worth noting that the condition (2.6.9) no longer makes any mention of the ambient space
except through the Markov semigroup (𝑃𝑡 )𝑡 ≥0 and its associated operators ℒ, Γ, and Γ2 .
This has led to a line of research investigating to what extent we can study the intrinsic
geometry associated with a Markov semigroup. Although we do not intend
to survey the literature here, we show one illustrative example to give the flavor of the
results. First, one shows that the CD(𝛼, 𝑑) condition implies a Sobolev inequality.
Theorem 2.6.11. Consider a diffusion Markov semigroup satisfying the CD(𝛼, 𝑑)
condition for some 𝛼 > 0 and 𝑑 > 2. Then, for all 𝑝 ∈ [1, 2𝑑/(𝑑 − 2)] and all functions 𝑓 ,

1/(𝑝 − 2) { ( ∫ |𝑓 | 𝑝 d𝜋 ) 2/𝑝 − ∫ 𝑓 2 d𝜋 } ≤ (𝑑 − 1)/(𝛼𝑑) ∫ Γ(𝑓 , 𝑓 ) d𝜋 . (2.6.12)
From this Sobolev inequality, one can then deduce a diameter bound for the Markov
semigroup. Here, the diameter is defined as follows:

diam (𝑃𝑡 )𝑡 ≥0 B sup {ess sup 𝑓 − ess inf 𝑓 | Γ(𝑓 , 𝑓 ) ≤ 1} .
Theorem 2.6.13. Suppose that the Markov semigroup (𝑃𝑡 )𝑡 ≥0 satisfies the Sobolev
inequality (2.6.12). Then,
diam (𝑃𝑡 )𝑡 ≥0 ≤ π √((𝑑 − 1)/𝛼) .
The diameter bound is sharp, as it is attained by the sphere, and together with Theo-
rem 2.6.11 it recovers the classical Bonnet–Myers diameter bound from Riemannian
geometry. Other geometric results obtained in this fashion include volume growth com-
parison results and heat kernel bounds.
Definition 2.6.14. Let (X, d, 𝜋) be a metric measure space, where (X, d) is a geodesic
space. Then, we say that (X, d, 𝜋) satisfies the CD(𝛼, ∞) condition if for all measures
𝜇0, 𝜇1 ∈ P2 (X), there exists a constant-speed geodesic (𝜇𝑡 )𝑡 ∈[0,1] joining 𝜇0 to 𝜇1 with
KL(𝜇𝑡 k 𝜋) ≤ (1 − 𝑡) KL(𝜇 0 k 𝜋) + 𝑡 KL(𝜇 1 k 𝜋) − (𝛼 𝑡 (1 − 𝑡)/2) 𝑊2 (𝜇 0, 𝜇 1 ) 2 ,
for all 𝑡 ∈ [0, 1].
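For a concrete instance (a sketch, not from the text): 𝜋 = N(0, 1) on R corresponds to 𝑉 (𝑥) = 𝑥 2 /2 and satisfies CD(1, ∞). Using the standard closed forms for one-dimensional Gaussians, we can verify the inequality numerically:

```python
import numpy as np

# Sketch: pi = N(0, 1), for which Ric + Hess V = 1, so alpha = 1. For
# one-dimensional Gaussians we use the standard closed forms:
#   KL(N(m, s^2) || N(0, 1)) = (m^2 + s^2 - 1)/2 - ln s,
#   the W2 geodesic interpolates mean and standard deviation linearly, and
#   W2^2 = (m0 - m1)^2 + (s0 - s1)^2.

def kl_to_std(m, s):
    return 0.5 * (m ** 2 + s ** 2 - 1.0) - np.log(s)

m0, s0, m1, s1 = -1.0, 0.5, 2.0, 1.5
w2_sq = (m0 - m1) ** 2 + (s0 - s1) ** 2
alpha = 1.0

slack = []
for t in np.linspace(0, 1, 51):
    m, s = (1 - t) * m0 + t * m1, (1 - t) * s0 + t * s1   # W2 geodesic
    lhs = kl_to_std(m, s)
    rhs = ((1 - t) * kl_to_std(m0, s0) + t * kl_to_std(m1, s1)
           - 0.5 * alpha * t * (1 - t) * w2_sq)
    slack.append(rhs - lhs)
min_slack = min(slack)   # nonnegative iff the CD(1, infinity) inequality holds
```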
We now pause to discuss the motivation behind the introduction of this definition.
Unlike the statement Ric ⪰ 𝛼, which only makes sense on Riemannian manifolds (and
hence requires a smooth structure), the above definition makes sense on a wider class
of spaces, including non-smooth spaces. The question of to what extent the concept of
curvature makes sense on non-smooth spaces is perhaps an interesting question in its
own right, but it also arises even when one is solely interested in smooth Riemannian
manifolds. Suppose, for instance, that we have a sequence of Riemannian manifolds
(M𝑘 )𝑘∈N that is converging in some sense to a limit space M; what properties of the
sequence are preserved in the limit?
If we want to pass to the limit in the condition RicM𝑘 ⪰ 𝛼, then typically we would
need the Ricci curvature tensors RicM𝑘 to be converging in the limit. Since curvature
involves two derivatives of the metric, this holds if the sequence converges in a C 2 sense.
However, for some applications, this notion of convergence is too strong. Instead, it is
common to work with Gromov–Hausdorff convergence, which is based on a notion
of distance between metric spaces. More specifically, it metrizes the space12 of compact
metric spaces. Moreover, this notion of convergence is weak enough that it admits a
useful compactness theorem.
As a consequence of the compactness theorem, a sequence of Riemannian manifolds
(M𝑘 )𝑘∈N with a uniform upper bound on the diameter and a uniform lower bound on
the Ricci curvature admits a subsequence converging to a limit space M in the Gromov–Hausdorff topology.
However, in this topology, the space of Riemannian manifolds with diameter ≤ 𝐷 and
with Ric ⪰ 𝛼 is not closed; the limit space M is not necessarily a Riemannian manifold.
So what then is M? It is a geodesic space, but understanding whether it can be said to
satisfy “RicM ⪰ 𝛼” requires developing a theory of Ricci curvature lower bounds that
makes sense on such spaces.
An analogy is in order. For a function 𝑓 : R𝑑 → R, convexity can be described via the
Hessian, ∇2 𝑓 ⪰ 0, or via the property

𝑓 ((1 − 𝑡) 𝑥 + 𝑡 𝑦) ≤ (1 − 𝑡) 𝑓 (𝑥) + 𝑡 𝑓 (𝑦) for all 𝑥, 𝑦 ∈ R𝑑 , 𝑡 ∈ [0, 1] .
The former definition only makes sense for C 2 functions, whereas the latter definition
makes sense for any function. The former is called the analytic definition, whereas the
latter is called the synthetic definition. Although the analytic definition is often more
intuitive, the synthetic definition is more general and more useful for technical arguments.
For example, from the synthetic definition it is apparent that convexity is preserved under
pointwise convergence, whereas from the analytic definition one needs the stronger
notion of C 2 convergence.
From this perspective, the definition of Alexandrov curvature bounds in Section 2.6.2
is the synthetic counterpart to sectional curvature bounds from Riemannian geometry.
However, as we have already seen, sectional curvature bounds are often too strong for
geometric purposes, as we can obtain a wide array of geometric consequences (spectral
gap estimates, log-Sobolev and Sobolev inequalities, diameter bounds, volume growth
estimates, heat kernel bounds, etc.) from Ricci curvature lower bounds. Here, the curvature-
dimension condition provides us with synthetic Ricci curvature lower bounds.
By deducing geometric facts from the CD(𝛼, ∞) condition, one shows that spaces
satisfying the CD(𝛼, ∞) condition, despite the lack of smoothness, enjoy many of the
good properties shared by Riemannian manifolds satisfying Ric ⪰ 𝛼. To complete the
program described in this section, we should ask whether synthetic Ricci curvature lower
bounds are preserved under a weak notion of convergence. The correct notion to consider
12 The space of all compact metric spaces is too large to be a set (it is a proper class). However, if we
choose one representative from each isometry class of metric spaces, then this is a bona fide set.
3. (𝑓𝑘 ) # 𝜋𝑘 → 𝜋 weakly.
The following stability result is a key achievement of the theory of synthetic Ricci
curvature, arrived at simultaneously by Lott and Villani [LV09] and Sturm [Stu06a; Stu06b].
Theorem 2.6.16 (stability of synthetic Ricci curvature bounds). Let (X𝑘 , d𝑘 , 𝜋𝑘 )𝑘∈N →
(X, d, 𝜋) in the measured Gromov–Hausdorff topology. Let 𝛼 ∈ R and 𝑑 ≥ 1. If each
(X𝑘 , d𝑘 , 𝜋𝑘 ) satisfies CD(𝛼, 𝑑), then so does (X, d, 𝜋).
Note that we have not defined the CD(𝛼, 𝑑) condition for 𝑑 < ∞ in this context; we
refer readers to the original sources for the full treatment.
2.6.5 Discussion
A remark on the settings of the results. Throughout this chapter, we have not been
careful to state in what generality the various results hold. Certainly the results hold
on the Euclidean space R𝑑 , and with appropriate modifications they continue to hold on
weighted Riemannian manifolds.
The results based on optimal transport (e.g., results on transport inequalities) typically
hold on general Polish spaces. The theory of synthetic Ricci curvature makes sense on
geodesic spaces (with mild regularity conditions).
The results based on Markov semigroup theory only require an abstract space X on
which there is a Markov semigroup (𝑃𝑡 )𝑡 ≥0 satisfying various properties (e.g., a chain rule
for the carré du champ). Although this usually arises from a diffusion on a Riemannian
manifold, one can also start with a Dirichlet energy functional on a metric space and
develop a theory of non-smooth analysis. See [AGS15] for further discussion on how the
two approaches may be reconciled in a quite general setting.
Comparison between the two approaches. The discussion thus far has been rather
abstract, and it may be difficult to grasp how the two main approaches (Bakry–Émery
theory and optimal transport) can capture geometric information such as the curvature.
Here, we will briefly provide some intuition for this connection following [Vil09, §14].
Starting with the optimal transport perspective, fix 𝑥 0 ∈ M and a mapping ∇𝜓 . For
𝑡 ≥ 0, let 𝑥𝑡 B exp𝑥 0 (𝑡 ∇𝜓 (𝑥 0 )), and let 𝛿 > 0. Let 𝑒 1, . . . , 𝑒𝑑 be an orthonormal basis of 𝑇𝑥 0 M;
in an abuse of notation let 𝑥 0 + 𝛿𝑒𝑖 denote a point obtained by travelling along a curve
emanating from 𝑥 0 with velocity 𝑒𝑖 for time 𝛿. The points (𝑥 0 + 𝛿𝑒𝑖 )𝑑𝑖=1 form the vertices
of a parallelepiped 𝐴𝛿0 . On the other hand, for 𝑡 > 0, we can consider pushing the point
𝑥 0 + 𝛿𝑒𝑖 along the exponential map to obtain a new point exp𝑥 0 +𝛿𝑒𝑖 (𝑡 ∇𝜓 (𝑥 0 + 𝛿𝑒𝑖 )). These
points form the vertices of a new parallelepiped 𝐴𝛿𝑡 .
In terms of measures, let 𝜇𝛿0 denote the uniform measure on 𝐴𝛿0 , and 𝜇𝑡𝛿 = exp(𝑡 ∇𝜓 ) # 𝜇𝛿0 ,
so that 𝜇𝑡𝛿 is approximately the uniform measure on 𝐴𝛿𝑡 . Then, the displacement convexity
of entropy states that
ln (1/𝔪(𝐴𝛿𝑡 )) ≤ (1 − 𝑡) ln (1/𝔪(𝐴𝛿0 )) + 𝑡 ln (1/𝔪(𝐴𝛿1 )) + 𝑜 (1)

as 𝛿 ↘ 0. On the other hand, the infinitesimal change in volume is governed by the
Jacobian determinant
𝔪(𝐴𝛿𝑡 )
→ J (𝑡, 𝑥) B det 𝐽 (𝑡, 𝑥) ,
𝔪(𝐴𝛿0 )
where 𝐽𝑖 (𝑡, 𝑥) B 𝜕𝛿 |𝛿=0 exp𝑥 0 +𝛿𝑒𝑖 (𝑡 ∇𝜓 (𝑥 0 +𝛿𝑒𝑖 )). Hence, the displacement convexity yields
In Euclidean space, we have the formula J (𝑡, 𝑥) = |det(𝐼𝑑 + 𝑡 ∇2𝜓 (𝑥))|, but the situation
is more complicated on a Riemannian manifold because there is also a change of volume
due to curvature. To account for this, one can derive an equation for 𝐽 , known as the
Jacobi equation:

𝜕𝑡2 𝐽 (𝑡, 𝑥) + 𝑅(𝑡, 𝑥) 𝐽 (𝑡, 𝑥) = 0 ,

where 𝑅(𝑡, 𝑥) B Riem𝑥𝑡 (𝑥¤𝑡 , ·, 𝑥¤𝑡 , ·). By taking the trace and performing some computations,
we arrive at
𝜕𝑡2 ln J (𝑡, 𝑥) = −k𝐽 −1 (𝑡, 𝑥) 𝐽¤(𝑡, 𝑥)k 2HS − Ric𝑥𝑡 (𝑥¤𝑡 , 𝑥¤𝑡 ) . (2.6.18)
By comparing (2.6.17) and (2.6.18), we now obtain a hint as to how optimal transport
captures curvature: displacement convexity of the entropy is related to concavity of the
Jacobian determinant, which in turn is tied to Ricci curvature lower bounds.
The calculations above are performed with the Lagrangian description of fluid flows,
as they follow a single trajectory 𝑡 ↦→ 𝑥𝑡 . If we switch to the Eulerian perspective, then
we are led to define the vector field ∇𝜓𝑡 as follows: ∇𝜓𝑡 (𝑥) is the velocity 𝑥¤𝑡 of the
curve 𝑡 ↦→ exp𝑥 (𝑡 ∇𝜓 (𝑥)) at time 𝑡. By reformulating the Jacobi equation in the Eulerian
perspective, we arrive precisely at the Bochner identity (2.6.7) for 𝜓 which, as we saw
in Section 2.6.3, underlies the curvature-dimension condition from the Bakry–Émery
perspective. In this sense, the two approaches to curvature are dual.
Discrete space. For Markov processes on a discrete space, we can still define the
Markov semigroup, generator, carré du champ, and Dirichlet form. The main difference
is that the carré du champ is now a finite difference operator, rather than a differential
operator, and consequently it fails to satisfy a chain rule.
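The failure of the chain rule is easy to see on a small example (a sketch; the two-point chain below is our illustrative choice). For the generator ℒ = 𝑃 − 𝐼, the carré du champ is Γ(𝑓 , 𝑔)(𝑥) = ½ Σ𝑦 𝑃 (𝑥, 𝑦) (𝑓 (𝑦) − 𝑓 (𝑥)) (𝑔(𝑦) − 𝑔(𝑥)):

```python
import numpy as np

# Sketch: on the two-point space {0, 1} with generator L = P - I, the carre du
# champ is Gamma(f, g)(x) = (1/2) sum_y P(x, y) (f(y) - f(x)) (g(y) - g(x)).
# Unlike for diffusions, there is no chain rule: Gamma(f^2) != (2f)^2 Gamma(f).
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])

def gamma(f, g):
    df = f[None, :] - f[:, None]     # f(y) - f(x)
    dg = g[None, :] - g[:, None]
    return 0.5 * np.sum(P * df * dg, axis=1)

f = np.array([1.0, 3.0])
lhs = gamma(f ** 2, f ** 2)          # Gamma(f^2) = [16, 16]
rhs = (2 * f) ** 2 * gamma(f, f)     # chain-rule prediction = [4, 36]
```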
Crucially, this difference manifests itself for the log-Sobolev inequality, which we have
written in this chapter as
ent𝜋 (𝑓 2 ) ≤ 2𝐶 ℰ(𝑓 , 𝑓 ) for all 𝑓 . (2.7.1)
On the other hand, recall from Theorem 1.2.20 (which still holds for discrete state spaces)
that the exponential decay of the KL divergence is equivalent to the inequality
ent𝜋 (𝑓 ) ≤ (𝐶/2) ℰ(𝑓 , ln 𝑓 ) for all 𝑓 ≥ 0 . (2.7.2)
When the carré du champ satisfies a chain rule, then (2.7.1) and (2.7.2) are equivalent, but
in general the first inequality (2.7.1) is strictly stronger; see Exercise 2.16. The first
inequality (2.7.1) is often simply called the log-Sobolev
inequality, whereas the second inequality (2.7.2) is called a modified log-Sobolev in-
equality (MLSI). In many cases, the log-Sobolev inequality is too strong in that it does
not hold with a good constant 𝐶; hence, the modified log-Sobolev inequality is often the
more appropriate inequality for the discrete setting.
We have already seen a concentration inequality for discrete spaces in Exercise 2.12.
In general, concentration of measure on discrete spaces is a rich subject, with many
applications to computer science and probability, and at the same time subtle, involving
new ideas such as asymmetric transport inequalities or careful use of hypercontractivity.
See, e.g., [BLM13; Han16] for more detailed treatments.
Discrete time. Similarly, for discrete-time Markov chains we can no longer use semi-
group calculus, although the basic principles of Poincaré inequalities (spectral gap inequal-
ities) and modified log-Sobolev inequalities can be adapted to this setting. In addition,
there are new techniques based on the notion of conductance. As we shall need to study
discrete-time Markov chains in detail for sampling algorithms, we defer a fuller discussion
of this theory to Chapter 7.
Discrete curvature. Inspired by the geometric connections in Section 2.6, many re-
searchers have attempted to define notions of curvature on discrete spaces. We do not
attempt to survey this literature here, but we give a few pointers to the literature.
Ollivier [Oll07; Oll09] introduced the following notion of curvature.
Definition 2.7.4. A metric space (X, d) equipped with a Markov kernel 𝑃 is said to
have coarse Ricci curvature bounded below by 𝜅 ∈ [0, 1] if for all 𝑥, 𝑦 ∈ X,
𝑊1 (𝑃 (𝑥, ·), 𝑃 (𝑦, ·)) ≤ (1 − 𝜅) d(𝑥, 𝑦) .
In other words, the Markov chain with kernel 𝑃 is a 𝑊1 contraction. The definition is
motivated by the following observation: on a 𝑑-dimensional Riemannian manifold with
Ric ⪰ 𝛼, let 𝑃 (𝑥, ·) be the uniform measure on B(𝑥, 𝜀). Then, provided that d(𝑥, 𝑦) = 𝑂 (𝜀),
it holds that

𝑊1 (𝑃 (𝑥, ·), 𝑃 (𝑦, ·)) ≤ (1 − 𝛼𝜀 2 /(2 (𝑑 + 2)) + 𝑂 (𝜀 3 )) d(𝑥, 𝑦) .
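The definition can also be explored on a small discrete example (a sketch; the lazy Ehrenfest chain below is an illustrative choice, not from the text). On {0, …, 𝑛} with the path metric, 𝑊1 between two distributions can be computed exactly from their CDFs:

```python
import numpy as np

def w1_on_line(mu, nu, xs):
    # W1 between two distributions on ordered points xs: integral of |CDF gap|
    gaps = np.cumsum(mu - nu)[:-1]
    return float(np.sum(np.abs(gaps) * np.diff(xs)))

n = 5
xs = np.arange(n + 1)
P = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    P[k, k] += 0.5                        # lazy half-step
    if k < n:
        P[k, k + 1] += (n - k) / (2 * n)  # flip a 0 to a 1
    if k > 0:
        P[k, k - 1] += k / (2 * n)        # flip a 1 to a 0

# coarse Ricci curvature along each edge {k, k+1} of the path metric |j - k|;
# for this chain every edge gives W1 = 1 - 1/n, hence kappa = 1/n
kappa = 1.0 - max(w1_on_line(P[k], P[k + 1], xs) for k in range(n))
```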
A lower bound on the coarse Ricci curvature is often too strong of an assumption for
the purpose of studying mixing times of Markov chains, although there are refinements
in [Oll07; Oll09]. However, when a lower bound on the coarse Ricci curvature holds,
then it implies a number of useful consequences, such as concentration estimates and
functional inequalities. We mention the following result in particular.
Theorem 2.7.5 ([Oll09]). Suppose that 𝑃 is a Markov kernel on a metric space (X, d),
and that 𝑃 has coarse Ricci curvature bounded below by 𝜅. Then, 𝑃 satisfies a Poincaré
inequality with constant at most 1/𝜅.
Refer to Chapter 7 for a precise definition of the Poincaré inequality used here.
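As an illustrative sketch of the theorem (the lazy Ehrenfest chain on {0, …, 𝑛} is our choice of example, not from the text): one can check that this chain has coarse Ricci curvature 𝜅 = 1/𝑛 w.r.t. the path metric, and its spectral gap is exactly 1/𝑛, so the Poincaré bound 1/𝜅 is attained here.

```python
import numpy as np

# Sketch: build the lazy Ehrenfest chain on {0, ..., n} and compute its
# spectral gap 1 - lambda_2(P); it equals 1/n, matching the coarse Ricci
# curvature kappa = 1/n of this chain.
n = 5
P = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    P[k, k] += 0.5
    if k < n:
        P[k, k + 1] += (n - k) / (2 * n)
    if k > 0:
        P[k, k - 1] += k / (2 * n)

eigs = np.sort(np.linalg.eigvals(P).real)[::-1]
gap = 1.0 - eigs[1]                  # spectral gap of the chain
```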
Other approaches for studying the curvature of discrete Markov processes include:
studying the displacement convexity of entropy (using different interpolating curves rather
than𝑊2 geodesics) [OV12; Goz+14; Léo17]; using ideas from Bakry–Émery theory [Kla+16;
FS18]; and defining a modified 𝑊2 distance for which the Markov process becomes a
gradient flow of the KL divergence [Maa11; EM12; Mie13].
We emphasize that although we only described the coarse Ricci curvature approach
in any detail, there is not a single approach which supersedes the others in the discrete
setting. Each approach has its own merits and shortcomings.
Bibliographical Notes
The monographs [BGL14; Han16] are excellent sources to learn more about Markov
semigroup theory.
The Monge–Ampère equation introduced in Exercise 2.1, being a fully non-linear
PDE, is fairly difficult to study. See [Vil03, §4] for an overview of rigorous results on the
Monge–Ampère equation, including the celebrated regularity theory of Caffarelli. The
proofs of Proposition 2.1.1 and Exercise 2.3 are taken from [Cor17].
In the proof of Lemma 2.2.6, we assumed the solvability of the Poisson equation; this
can be avoided via a density argument, see [CFM04; BC13]. The proof of the dimensional
Brascamp–Lieb inequality in Exercise 2.4 is taken from the paper [BGG18]. The bound on
var𝜋 𝑉 obtained in the exercise was used in [Che21b] to show that the entropic barrier is an
optimal self-concordant barrier. Note that the Brascamp–Lieb inequality in Theorem 2.2.8
should not be confused with another family of inequalities, which are unfortunately also
known as Brascamp–Lieb inequalities, described in, e.g., [Vil03, §6.3].
Although the device in the proof of Theorem 2.2.11 of differentiating 𝑠 ↦→ 𝑃𝑠 ((𝑃𝑡−𝑠 𝑓 ) 2 )
may seem mysterious at first glance, it forms the basis for a great number of useful
inequalities. The key is that the chain rule for the carré du champ also implies a chain rule
for the generator: ℒ(𝜙 ◦ 𝑓 ) = 𝜙 0 (𝑓 ) ℒ𝑓 + 𝜙 00 (𝑓 ) Γ(𝑓 , 𝑓 ). Using this, one can differentiate
𝑠 ↦→ 𝑃𝑠 𝜙 (𝑃𝑡−𝑠 𝑓 ) for a general function 𝜙 : R → R, and obtain the nice identity
The convergence in Rényi divergence of the Langevin diffusion was obtained earlier,
under stronger assumptions in [CLL19]. A natural question to ask is whether there are
functional inequalities that interpolate between the Poincaré and log-Sobolev inequalities,
which imply intermediate rates of convergence for the Langevin diffusion. One answer is
given by the family of Latała–Oleszkiewicz inequalities (LOI) [LO00]. The convergence
of the Langevin diffusion under an LO inequality is given in [Che+21a]. One can also
consider variants of Sobolev inequalities [Cha04].
In [KLS95], Kannan, Lovász, and Simonovits conjectured that any log-concave mea-
sure 𝜋 on R𝑑 which is isotropic (i.e., if 𝑋 ∼ 𝜋 then cov 𝑋 = 𝐼𝑑 ) satisfies a Poincaré
inequality with a dimension-free constant 𝐶 PI . 1. This is known as the Kannan–
Lovász–Simonovits (KLS) conjecture. By considering linear test functions of the form
𝑥 ↦→ h𝑎, 𝑥i, one has 𝐶 PI ≥ 1, so the conjecture asserts that linear functions nearly saturate
the spectral gap inequality for log-concave measures. The KLS conjecture has inspired
a considerable amount of research (including Theorem 2.5.18), see [GM11; Eld13; LV17;
Che21a], culminating in the current state-of-the-art result of [KL22] which asserts that
𝐶 PI ≤ polylog(𝑑).
Exercise 2.8 essentially contains the main results of [CCN21] (actually the paper
assumes a slightly weaker condition than (2.E.1), namely that the 𝑝-th moment of the
chi-squared divergence is bounded for some 𝑝 > 1, but this is handled with the same
arguments as in Exercise 2.8).
There are many treatments on concentration of measure, e.g., [Led01; BLM13; BGL14;
Han16; Ver18]. The proof of Lemma 2.4.4 is from [Mil09]. A proof of the characterization
of the T1 inequality in Theorem 2.4.11 can be found in, e.g., [BV05].
The proof of Sanov’s theorem can be found in many textbooks on large deviations,
e.g., [DZ10; RS15].
The monographs [BH97; Led01; BGL14] are excellent sources to learn about isoperime-
try. The exposition of the functional form of Cheeger’s inequality (Theorem 2.5.14) as well
as Milman’s theorem (Theorem 2.5.18) were inspired by the treatment in [AB15]. It would
be hard to survey the various developments on this subject here, but we would like to
mention a few nice additions to the story. First, as we saw in Theorem 2.5.14 and Proposi-
tion 2.5.17, isoperimetric inequalities are typically stronger than their functional inequality
counterparts, and often strictly so. In order to obtain inequalities involving sets which
are equivalent to, say, the Poincaré and log-Sobolev inequalities, one should turn towards
measure capacity inequalities, for which we refer the reader to [BGL14, §8]. Also, more
refined “two-level” isoperimetric inequalities have been pioneered by Talagrand in [Tal91],
which has applications in its own right.
The Gaussian isoperimetric inequality is due to Sudakov and Tsirelson [SC74] and
Borell [Bor75]. It has since been extended and refined in various ways, e.g., in the context
Exercises
Overview of the Inequalities
Here, ∫ k∇𝑢 k 2 d𝜇 = ∫ 𝑢 (−ℒ) 𝑢 d𝜇 is the squared Sobolev norm k𝑢 k 2𝐻¤ 1 (𝜇) , where the dot
is used to distinguish this from the usual Sobolev norm k𝑢 k 2𝐻 1 (𝜇) = k𝑢 k 2𝐿 2 (𝜇) + k𝑢 k 2𝐻¤ 1 (𝜇) .
Similarly, ∫ 𝑓 (−ℒ) −1 𝑓 d𝜇 is the squared inverse Sobolev norm k𝑓 k 2𝐻¤ −1 (𝜇) . Therefore,
In light of the spectral gap interpretation of the Poincaré inequality, why does the above
inequality suggest that T2 (𝐶) implies PI(𝐶)?
The astute reader should also work out how the Poisson equation can be obtained
starting with the continuity equation (1.3.18).
var𝜋 𝑓 ≤ E𝜋 h∇𝑓 , (∇2𝑉 ) −1 ∇𝑓 i − cov𝜋 (𝑓 , 𝑉 ) 2 /(𝑑 − var𝜋 𝑉 ) .

var𝜋 𝑉 ≤ 𝑑 E𝜋 h∇𝑉 , (∇2𝑉 ) −1 ∇𝑉 i / (𝑑 + E𝜋 h∇𝑉 , (∇2𝑉 ) −1 ∇𝑉 i) ≤ 𝑑 .
for all 𝑓 is equivalent to the following hypercontractivity statement: for all functions 𝑓 ,
𝑡 ≥ 0, and 𝑝 ≥ 1, if we set 𝑝 (𝑡) B 1 + (𝑝 − 1) exp(2𝑡/𝐶 LSI ), then

k𝑃𝑡 𝑓 k 𝐿𝑝 (𝑡 ) (𝜋) ≤ k𝑓 k 𝐿𝑝 (𝜋) .

This is a strengthening of the fact that the semigroup is a contraction on any 𝐿𝑝 (𝜋) space,
and shows that in fact the semigroup maps 𝐿𝑝 (𝜋) into 𝐿𝑝′ (𝜋) for some 𝑝′ > 𝑝.
Hint: Differentiate 𝑡 ↦→ ln k𝑃𝑡 𝑓 k 𝐿𝑝 (𝑡 ) (𝜋) .
2. Next, consider the setting of Proposition 2.3.8 except that we replace the assump-
tion (2.3.9) with the weaker condition
𝐶 𝜒 2,2 B √(E[𝜒 2 (𝑃𝑋 k 𝑃𝑋 0 ) 2 ]) < ∞ , (2.E.1)

where 𝑋, 𝑋 0 ∼ 𝜇 are i.i.d. Now, rather than writing var E𝑃𝑋 𝑓 = ½ E[|E𝑃𝑋 𝑓 − E𝑃𝑋 0 𝑓 | 2 ],
instead write var E𝑃𝑋 𝑓 = E[|E𝑃𝑋 𝑓 − E𝜇𝑃 𝑓 | 2 ]. By bounding this quantity in two
different ways, deduce that
Use this to prove that a Poincaré inequality holds for 𝜇𝑃, and give an upper bound
on 𝐶 PI (𝜇𝑃).
3. Now consider the setting of Proposition 2.3.14 except that we again assume the
weaker condition (2.E.1). Previously, we bounded
which relies on 𝐿 1 –𝐿 ∞ duality. This time, we want to use duality between “𝐿 log 𝐿”
and “exp 𝐿”. Namely, use the variational principle for the entropy (Lemma 2.3.4) to
prove that for a suitable constant 𝐶 > 0 (depending on 𝐶 𝜒 2,2 ),
Use this to prove that a log-Sobolev inequality holds for 𝜇𝑃, and give an upper
bound on 𝐶 LSI (𝜇𝑃).
∬ exp(k𝑥 − 𝑥 0 k 2 / 𝜎 2sG ) d𝜇 (𝑥) d𝜇 (𝑥 0 ) ≤ 𝐶 sG .
Prove that if 𝜎 & 𝜎sG for a sufficiently large implied constant, then the Gaussian
mixture 𝜇𝑃 satisfies a log-Sobolev inequality, and give an upper bound on 𝐶 LSI (𝜇𝑃).
Also, show how this can recover the result of Example 2.3.15.
Concentration of Measure
var𝜋 exp(𝜆𝑋 /2) ≤ (𝜆 2𝜎 2 /4) E𝜋 exp(𝜆𝑋 ) .
Let 𝜂 (𝜆) B E exp(𝜆𝑋 ) and deduce an inequality for 𝜂 (𝜆) in terms of 𝜂 (𝜆/2). Solve
this recursion to prove that for 𝜆 < 2/𝜎,
E exp{𝜆 (𝑋 − E 𝑋 )} ≤ (2 + 𝜆𝜎)/(2 − 𝜆𝜎) .
2. Pinsker’s inequality states that for any two probability measures 𝜇 and 𝜈 on the
same space, k𝜇 − 𝜈 k 2TV ≤ ½ KL(𝜇 k 𝜈). Prove this inequality as follows. First, by the
data-processing inequality (Theorem 1.5.3), for any event 𝐴,
3. Apply the Bobkov–Götze theorem (Theorem 2.4.10) to show that Hoeffding’s lemma
and Pinsker’s inequality are equivalent to each other.
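The moment generating function bound in part 1 above can be sanity-checked numerically (a sketch; we take 𝑋 standard Gaussian, which satisfies a Poincaré inequality with constant 1, and assume 𝜎 2 there denotes the Poincaré constant):

```python
import numpy as np

# Sketch: for X ~ N(0, 1) (Poincare constant 1, so sigma = 1 here) the exact
# MGF is E exp(lambda (X - E X)) = exp(lambda^2 / 2); check it stays below the
# bound (2 + lambda sigma) / (2 - lambda sigma) on a grid of lambda < 2/sigma.
sigma = 1.0
lams = np.linspace(0.01, 1.99, 199)
mgf = np.exp(lams ** 2 / 2)
bound = (2 + lams * sigma) / (2 - lams * sigma)
margin = float(np.min(bound - mgf))   # positive iff the bound holds on the grid
```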
2. For the converse direction, let 𝜇 be the exponential distribution on R, so that the
density is 𝜇 (𝑥) = exp(−𝑥) 1{𝑥 > 0}. Let 𝑓 : R+ → R; we may assume that 𝑓 (0) = 0.
Now apply the identity 𝑓 (𝑥) 2 = 2 ∫₀ˣ 𝑓 (𝑠) 𝑓 ′(𝑠) d𝑠 to the integral ∫ 𝑓 2 d𝜇 and prove
Hint: Recall the proof of the Efron–Stein inequality from Exercise 1.2.
3. Next, apply Marton’s tensorization (Theorem 2.3.17) to Pinsker’s inequality from Ex-
ercise 2.10 (see Example 2.3.18) to obtain a transport inequality for the product
space X𝑁 . Using the Bobkov–Götze equivalence (Theorem 2.4.10), give a second
proof of the bounded differences inequality.
Isoperimetric Inequalities
In this chapter, we further expand our toolbox of stochastic analysis. Namely, we introduce
Girsanov’s theorem, which furnishes a formula for the Radon–Nikodym derivative of
the laws of two SDEs w.r.t. each other, and we discuss the time reversal of an SDE. In
order to highlight the flexibility and power of these ideas, we then study some interesting
applications, not all of which are directly relevant to log-concave sampling but nevertheless
fit within the broader themes of this book.
136 CHAPTER 3. ADDITIONAL TOPICS IN STOCHASTIC CALCULUS
The above limit is called the total variation of 𝐴 on [0,𝑇 ]. Under this condition, there is
a signed measure 𝜇𝐴 such that for all 𝑡 ∈ [0,𝑇 ], we have 𝜇𝐴 ([0, 𝑡]) = 𝐴𝑡 − 𝐴0 . Moreover,
we can define a norm k·k TV on the space of signed measures, called the total variation
norm, for which ‖𝜇_𝐴‖TV equals the total variation of 𝐴 as defined above. In this case,
we can simply define the integral ∫_[0,𝑇] 𝜂_𝑡 d𝐴_𝑡 ≔ ∫_[0,𝑇] 𝜂_𝑡 d𝜇_𝐴(𝑡).
Note that if 𝑡 ↦→ 𝐴_𝑡 is differentiable, then the total variation of 𝐴 equals ∫_[0,𝑇] |𝐴̇_𝑡| d𝑡.
On the other hand, if we change the definition slightly, then we expect (heuristically) that
$$\lim_{n\to\infty} \sum_{i=1}^{n} \underbrace{|B_{t_i} - B_{t_{i-1}}|^2}_{\approx\, T/n} \;\lesssim\; \lim_{n\to\infty} n \cdot \frac{T}{n} < \infty\,.$$
We say that Brownian motion has finite quadratic variation. We will show in fact that
the above limit is well-defined almost surely.
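A small simulation (an illustration, not part of the text) makes the contrast concrete: on a grid of mesh 𝑇/𝑛, the sum of squared Brownian increments concentrates near 𝑇, while the sum of absolute increments grows like √𝑛:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 100_000
dB = rng.standard_normal(n) * np.sqrt(T / n)  # increments B_{t_i} - B_{t_{i-1}}
quad_var = np.sum(dB ** 2)                    # sum of squared increments, approx T
total_var = np.sum(np.abs(dB))                # first-order variation, grows like sqrt(n)

assert abs(quad_var - T) < 0.05               # quadratic variation concentrates near T
assert total_var > 100.0                      # first-order variation is far from any finite limit
```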
More generally, for a process of the form
$$X_t = X_0 + \int_0^t b_s\,\mathrm{d}s + \int_0^t \sigma_s\,\mathrm{d}B_s\,, \qquad t \in [0,T]\,,$$
the second term is a process of finite variation (provided that $\int_{[0,T]} |b_t|\,\mathrm{d}t < \infty$ almost surely).
Definition of the quadratic variation. More formally, we have the following theorem.
Theorem 3.1.1 (quadratic variation). Let (𝑀_𝑡)_{𝑡∈[0,𝑇]} be a continuous local martingale;
then, there is an a.s. unique increasing process 𝑡 ↦→ ⟨𝑀, 𝑀⟩_𝑡 such that 𝑡 ↦→ 𝑀_𝑡² − ⟨𝑀, 𝑀⟩_𝑡
is a continuous local martingale. Also, suppose that for each 𝑛 ∈ N₊, (𝑡_𝑖 : 𝑖 = 0, 1, . . . , 𝑛)
is a partition of [0, 𝑡], with mesh tending to zero as 𝑛 → ∞. Then,
$$\langle M, M\rangle_t = \lim_{n\to\infty} \sum_{i=1}^{n} (M_{t_i} - M_{t_{i-1}})^2 \quad \text{in probability}\,.$$
Definition 3.1.2 (quadratic variation). The process ⟨𝑀, 𝑀⟩ of Theorem 3.1.1 is called
the quadratic variation of 𝑀.
We will not prove Theorem 3.1.1 in full generality. However, we will verify that the
quadratic variation of one-dimensional Brownian motion (𝐵𝑡 )𝑡 ∈[0,𝑇 ] is h𝐵, 𝐵i𝑇 = 𝑇 , which
gives an idea of the general result. By independence of the Brownian increments,
$$\mathrm{E}\Bigl[\Bigl(\sum_{i=1}^{n} \{(B_{t_i} - B_{t_{i-1}})^2 - (t_i - t_{i-1})\}\Bigr)^2\Bigr] = \sum_{i=1}^{n} \mathrm{E}[|(B_{t_i} - B_{t_{i-1}})^2 - (t_i - t_{i-1})|^2]$$
$$\le \sum_{i=1}^{n} \mathrm{E}[(B_{t_i} - B_{t_{i-1}})^4] = 3 \sum_{i=1}^{n} (t_i - t_{i-1})^2 \le 3 \operatorname{mesh}(t_i : i = 0, 1, \dots, n)\, \underbrace{\sum_{i=1}^{n} (t_i - t_{i-1})}_{=\,T} \to 0$$
as the mesh size tends to zero. This shows that ⟨Δ, Δ⟩ = 0, and thus Δ² is a continuous
local martingale. If we knew that Δ² were a genuine martingale, then together with Δ₀ = 0
it would imply that Δ = 0, establishing uniqueness of the semimartingale decomposition.
We omit the localization argument required to finish the proof.
We can also define the quadratic variation of the semimartingale 𝑋 as ⟨𝑋, 𝑋⟩ ≔ ⟨𝑀, 𝑀⟩.
To see why this makes sense, observe that
$$\sum_{i=1}^{n} (X_{t_i} - X_{t_{i-1}})^2 - \sum_{i=1}^{n} (M_{t_i} - M_{t_{i-1}})^2 = \sum_{i=1}^{n} (A_{t_i} - A_{t_{i-1}})^2 + 2 \sum_{i=1}^{n} (A_{t_i} - A_{t_{i-1}})\,(M_{t_i} - M_{t_{i-1}})$$
$$\le \sum_{i=1}^{n} (A_{t_i} - A_{t_{i-1}})^2 + 2 \sqrt{\sum_{i=1}^{n} (A_{t_i} - A_{t_{i-1}})^2}\, \sqrt{\sum_{i=1}^{n} (M_{t_i} - M_{t_{i-1}})^2}\,,$$
which tends to zero using the same argument as in the uniqueness of the semimartingale
decomposition: finite variation processes have zero quadratic variation.
where we assume that $\int_{[0,T]} \|b_t^X\|\,\mathrm{d}t$, $\int_{[0,T]} \|b_t^Y\|\,\mathrm{d}t$, $\int_{[0,T]} \|\sigma_t^X\|_{\mathrm{HS}}^2\,\mathrm{d}t$, and $\int_{[0,T]} \|\sigma_t^Y\|_{\mathrm{HS}}^2\,\mathrm{d}t$ are
all finite almost surely. Then, 𝑋 and 𝑌 are continuous semimartingales, and
$$\langle X, Y\rangle_t = \int_0^t \langle \sigma_s^X, \sigma_s^Y\rangle\,\mathrm{d}s\,, \qquad \text{for } t \in [0,T]\,.$$
Itô’s formula revisited. Finally, we conclude this section by revisiting Itô’s formula
(Theorem 1.1.18) using our new calculus.
$$f(X_t) = f(X_0) + \sum_{i=1}^{d} \int_0^t \partial_i f(X_s)\,\mathrm{d}X_s^i + \frac{1}{2} \sum_{i,j=1}^{d} \int_0^t \partial_{i,j} f(X_s)\,\mathrm{d}\langle X^i, X^j\rangle_s\,.$$
If we interpret ⟨𝑋, 𝑋⟩ as the matrix whose (𝑖, 𝑗)-entry is ⟨𝑋^𝑖, 𝑋^𝑗⟩, then this can be
written in matrix notation as
$$f(X_t) = f(X_0) + \int_0^t \langle \nabla f(X_s), \mathrm{d}X_s\rangle + \frac{1}{2} \int_0^t \langle \nabla^2 f(X_s), \mathrm{d}\langle X, X\rangle_s\rangle\,.$$
For 𝑑-dimensional standard Brownian motion (𝐵_𝑡)_{𝑡∈[0,𝑇]}, we have ⟨𝐵, 𝐵⟩_𝑡 = 𝑡𝐼_𝑑, so that
d⟨𝐵, 𝐵⟩_𝑡 = 𝐼_𝑑 d𝑡 and we recover the original statement of Itô’s formula in Theorem 1.1.18.
The point is that the quadratic variation is a convenient way of streamlining Itô calculations,
as it formalizes the idea that only the Brownian motion part of a process contributes to
the second-order term in Itô’s formula.
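As a standard one-dimensional illustration (a textbook example, not taken from this text): applying the formula above with 𝑓(𝑥) = 𝑥² and 𝑋 = 𝐵, using ⟨𝐵, 𝐵⟩_𝑡 = 𝑡, gives

```latex
B_t^2 = B_0^2 + \int_0^t 2 B_s \,\mathrm{d}B_s
      + \frac{1}{2} \int_0^t 2 \,\mathrm{d}\langle B, B\rangle_s
      = \int_0^t 2 B_s \,\mathrm{d}B_s + t\,,
```

so 𝐵_𝑡² − 𝑡 is a local martingale, matching the characterization of ⟨𝐵, 𝐵⟩ in Theorem 3.1.1.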
Bibliographical Notes
The introduction to quadratic variation in Section 3.1 is heavily inspired by the treatment
in [Le 16]. The study of functions of bounded variation and the relationship with absolute
continuity and total variation can be found in any standard graduate text on real analysis,
e.g., [Fol99]. See [Ste01, Proposition 8.6] for a simple case of Theorem 3.1.5.
Exercises
Part II
Complexity of Sampling
CHAPTER 4
Analysis of Langevin Monte Carlo
In this chapter, we will provide several analyses of the Langevin Monte Carlo (LMC)
algorithm, i.e., the iteration
$$X_{(k+1)h} \coloneqq X_{kh} - h\,\nabla V(X_{kh}) + \sqrt{2}\,(B_{(k+1)h} - B_{kh})\,. \tag{LMC}$$
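A minimal implementation of this iteration (a sketch under our own naming; the function `lmc` and the Gaussian example target are illustrative assumptions, not from the text) uses the fact that 𝐵_{(𝑘+1)ℎ} − 𝐵_{𝑘ℎ} ∼ normal(0, ℎ𝐼_𝑑):

```python
import numpy as np

def lmc(grad_V, x0, h, n_steps, rng):
    """Run the LMC recursion X_{(k+1)h} = X_{kh} - h grad_V(X_{kh}) + sqrt(2h) xi_k,
    where xi_k ~ N(0, I) stands in for the Brownian increment B_{(k+1)h} - B_{kh}."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_V(x) + np.sqrt(2 * h) * rng.standard_normal(x.shape)
    return x

# Example target: standard Gaussian, V(x) = ||x||^2 / 2, so grad_V(x) = x.
rng = np.random.default_rng(0)
samples = np.stack([lmc(lambda x: x, np.zeros(2), h=0.05, n_steps=400, rng=rng)
                    for _ in range(500)])
```

For small ℎ the empirical law of `samples` is close to the target, up to the 𝑂(ℎ) bias quantified below.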
144 CHAPTER 4. ANALYSIS OF LANGEVIN MONTE CARLO
Theorem 4.1.1. For 𝑘 ∈ N, let 𝜇_{𝑘ℎ} denote the law of the 𝑘-th iterate of LMC with step size
ℎ > 0. Assume that the target 𝜋 ∝ exp(−𝑉) satisfies ∇𝑉(0) = 0 and 𝛼𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝛽𝐼_𝑑.
Then, provided ℎ ≤ 1/(3𝛽), for all 𝑁 ∈ N,
$$W_2^2(\mu_{Nh}, \pi) \le \exp(-\alpha N h)\, W_2^2(\mu_0, \pi) + O\Bigl(\frac{\beta^4 d h^2}{\alpha^3} + \frac{\beta^2 d h}{\alpha^2}\Bigr)\,. \tag{4.1.2}$$
In particular, if we initialize at 𝜇₀ = 𝛿₀ and take ℎ ≍ 𝛼𝜀²/(𝛽²𝑑), then for any 𝜀 ∈ [0, √𝑑] we
obtain the guarantee √𝛼 𝑊₂(𝜇_{𝑁ℎ}, 𝜋) ≤ 𝜀 after
$$N = O\Bigl(\frac{\kappa^2 d}{\varepsilon^2} \log \frac{d}{\varepsilon}\Bigr) \quad \text{iterations}\,.$$
Remark 4.1.3. We pause to make a few comments about the assumptions and result.
1. Typically we assume ∇𝑉 (0) = 0 without loss of generality, i.e., that the potential 𝑉
is minimized at 0. This is because the complexity of finding the minimizer of 𝑉 via
convex optimization is typically less than the complexity of sampling.
√
2. It is convenient to use the metric 𝛼 𝑊2 instead of 𝑊2 because it is scale-invariant.
Namely, for 𝜆 > 0, if we define the scaling map 𝑠𝜆 : R𝑑 → R𝑑 via 𝑥 ↦→ 𝜆𝑥, then
information divergences such as KL satisfy KL((𝑠_𝜆)#𝜇 ‖ (𝑠_𝜆)#𝜋) = KL(𝜇 ‖ 𝜋). On
the other hand, 𝑊₂ is not invariant, 𝑊₂((𝑠_𝜆)#𝜇, (𝑠_𝜆)#𝜋) = 𝜆 𝑊₂(𝜇, 𝜋), but √𝛼 𝑊₂ is
(because the distribution (𝑠_𝜆)#𝜋 is 𝛼/𝜆²-strongly log-concave).
Recall also that the T₂ transport inequality, implied by 𝛼-strong log-concavity,
asserts that √𝛼 𝑊₂(·, 𝜋) ≤ √(2 KL(· ‖ 𝜋)). Therefore, √𝛼 𝑊₂ is a more natural metric.
3. The inequality (4.1.2) is not sharp; in Section 4.3, via a more sophisticated analysis,
we will improve the iteration complexity to Õ(𝑑𝜅/𝜀²) iterations.
4. The inequality (4.1.2) has the following interpretation: for fixed ℎ > 0, the first term
tends to zero exponentially fast, which reflects the fact that LMC converges to its
stationary distribution 𝜇∞ . However, the stationary distribution is biased, 𝜇 ∞ ≠ 𝜋,
and the second term provides an upper bound on the bias 𝑊2 (𝜇∞, 𝜋).
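The biased stationary distribution can be computed exactly in a toy case (our own illustration, assuming the one-dimensional target 𝜋 = normal(0, 1)): LMC becomes an AR(1) recursion whose stationary variance exceeds the target variance by 𝑂(ℎ).

```python
# For pi = N(0, 1) (V(x) = x^2/2), LMC reads X_{k+1} = (1 - h) X_k + sqrt(2h) xi_k,
# an AR(1) recursion. Its stationary law is a centered Gaussian whose variance v
# solves v = (1 - h)^2 v + 2h, i.e. v = 1 / (1 - h/2) > 1 = var(pi).
def lmc_stationary_variance(h):
    return 2 * h / (1 - (1 - h) ** 2)

for h in [0.2, 0.1, 0.01, 0.001]:
    v = lmc_stationary_variance(h)
    assert abs(v - 1 / (1 - h / 2)) < 1e-12  # closed form
    assert v > 1                             # the bias inflates the variance
    assert abs(v - 1) <= h                   # bias of order h, vanishing as h -> 0
```

This makes concrete the interpretation above: for fixed ℎ, the chain converges to 𝜇∞ ≠ 𝜋, with bias controlled by the step size.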
First, let us see why the first statement of Theorem 4.1.1 implies the second.
Lemma 4.1.4. Let 𝜋 ∝ exp(−𝑉) satisfy ∇²𝑉 ⪰ 𝛼𝐼_𝑑 and ∇𝑉(0) = 0. Then, we have the
second moment bound ∫ ‖·‖² d𝜋 ≤ 𝑑/𝛼.
4.1. PROOF VIA WASSERSTEIN COUPLING 145
Moreover, if ℎ ≤ 1/𝛽, then the LMC iterates initialized at 𝜇₀ = 𝛿₀ with step size ℎ > 0
have uniformly bounded second moment: sup_{𝑘∈N} E[‖𝑋_{𝑘ℎ}‖²] ≲ 𝑑/𝛼.
Then,
$$W_2^2(\mu_h, \pi_h) \le \mathrm{E}[\|X_h - Z_h\|^2] \le \mathrm{E}\Bigl[\Bigl\|\int_0^h \nabla V(Z_t)\,\mathrm{d}t - h\,\nabla V(Z_0)\Bigr\|^2\Bigr] \le h \int_0^h \mathrm{E}[\|\nabla V(Z_t) - \nabla V(Z_0)\|^2]\,\mathrm{d}t\,.$$
Therefore, we just have to bound the movement $\|Z_t - Z_0\| = \|-\int_0^t \nabla V(Z_s)\,\mathrm{d}s + \sqrt{2}\,B_t\|$
of the Langevin diffusion in time 𝑡. Roughly, we expect $\|\int_0^t \nabla V(Z_s)\,\mathrm{d}s\| = O(\sqrt{d}\,t)$ if the
size of the gradient is $O(\sqrt{d})$, and $\|B_t\| = O(\sqrt{dt})$. For small 𝑡, it is the Brownian motion
term which is dominant, which is a common intuition for discretization proofs.
To rigorously bound this term, we appeal to stochastic calculus, see Lemma 4.1.5. For
𝑡 ≤ 1/(3𝛽), it yields the bound
$$\mathrm{E}[\|Z_t - Z_0\|^2] \le 9\beta^2 t^2\, \mathrm{E}[\|Z_0\|^2] + 8dt\,,$$
and hence
$$W_2^2(\mu_h, \pi_h) \le \beta^2 h \int_0^h \mathrm{E}[\|Z_t - Z_0\|^2]\,\mathrm{d}t \le 3\beta^4 h^4\, \mathrm{E}[\|Z_0\|^2] + 4\beta^2 d h^3\,.$$
Clearly 𝑋_{(𝑘+1)ℎ} ∼ 𝜇_{(𝑘+1)ℎ}; also, since 𝜋 is stationary for the Langevin diffusion, then
𝑍_{(𝑘+1)ℎ} ∼ 𝜋. We also introduce an auxiliary process: let
$$\bar X_t \coloneqq X_{kh} - \int_{kh}^{t} \nabla V(\bar X_s)\,\mathrm{d}s + \sqrt{2}\,(B_t - B_{kh})\,, \qquad \text{for } t \in [kh, (k+1)h]\,.$$
For the second term, 𝑋 is the LMC algorithm and 𝑋¯ is the continuous-time Langevin
diffusion, both initialized at the same distribution 𝜇𝑘ℎ . Hence, we can apply our one-step
discretization bound from before and deduce that
$$\mathrm{E}[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] \le 3\beta^4 h^4\, \mathrm{E}[\|X_{kh}\|^2] + 4\beta^2 d h^3 \lesssim \frac{\beta^4 d h^4}{\alpha} + \beta^2 d h^3\,,$$
where we used the bound from Lemma 4.1.4 on the second moment.
Combining everything,
$$W_2^2(\mu_{(k+1)h}, \pi) \le (1 + \lambda) \exp(-2\alpha h)\, W_2^2(\mu_{kh}, \pi) + O\Bigl(\Bigl(1 + \frac{1}{\lambda}\Bigr)\Bigl(\frac{\beta^4 d h^4}{\alpha} + \beta^2 d h^3\Bigr)\Bigr)\,.$$
Next, if we take 𝜆 = 𝛼ℎ, then (1 + 𝜆) exp(−2𝛼ℎ) ≤ exp(−𝛼ℎ). Since ℎ ≤ 1/(3𝛽) ≤ 1/(3𝛼),
$$W_2^2(\mu_{(k+1)h}, \pi) \le \exp(-\alpha h)\, W_2^2(\mu_{kh}, \pi) + O\Bigl(\frac{\beta^4 d h^3}{\alpha^2} + \frac{\beta^2 d h^2}{\alpha}\Bigr)\,.$$
After iterating this recursion, it implies
$$W_2^2(\mu_{Nh}, \pi) \le \exp(-\alpha N h)\, W_2^2(\mu_0, \pi) + O\Bigl(\frac{\beta^4 d h^2}{\alpha^3} + \frac{\beta^2 d h}{\alpha^2}\Bigr)\,.$$
We finish by presenting the lemma we used in the proof of Theorem 4.1.1. The following
proof is very typical of stochastic calculus arguments, so it is worth internalizing.
Lemma 4.1.5. Let (𝑍_𝑡)_{𝑡≥0} denote the Langevin diffusion and let (𝜋_𝑡)_{𝑡≥0} denote its law.
Assume that ∇𝑉(0) = 0 and ‖∇²𝑉‖op ≤ 𝛽. Then, provided that 𝑡 ≤ 1/(3𝛽),
$$\mathrm{E}[\|Z_t - Z_0\|^2] \le 9\beta^2 t^2\, \mathrm{E}[\|Z_0\|^2] + 8dt\,.$$
Proof. By definition,
$$\mathrm{E}[\|Z_t - Z_0\|^2] = \mathrm{E}\Bigl[\Bigl\|-\int_0^t \nabla V(Z_s)\,\mathrm{d}s + \sqrt{2}\,B_t\Bigr\|^2\Bigr] \le 2t \int_0^t \mathrm{E}[\|\nabla V(Z_s)\|^2]\,\mathrm{d}s + 4\,\mathrm{E}[\|B_t\|^2]\,.$$
Using the assumption that ∇𝑉(0) = 0 and that ∇𝑉 is 𝛽-Lipschitz, we have the bound
‖∇𝑉(𝑍_𝑠)‖ ≤ 𝛽 ‖𝑍_𝑠‖. Thus,
$$\mathrm{E}[\|Z_t - Z_0\|^2] \le 2\beta^2 t \int_0^t \mathrm{E}[\|Z_s\|^2]\,\mathrm{d}s + 4dt \le 4\beta^2 t \int_0^t \mathrm{E}[\|Z_s - Z_0\|^2]\,\mathrm{d}s + 4\beta^2 t^2\, \mathrm{E}[\|Z_0\|^2] + 4dt\,.$$
Proposition 4.2.2. Let (𝜇_𝑡)_{𝑡≥0} be the law of the interpolated process (4.2.1). Then,
$$\partial_t \mu_t = \operatorname{div}\Bigl(\mu_t\,\Bigl\{\nabla \ln \frac{\mu_t}{\pi} + \mathrm{E}[\nabla V(X_{kh}) - \nabla V(X_t) \mid X_t = \cdot]\Bigr\}\Bigr)\,.$$
Corollary 4.2.3. Along the law (𝜇_𝑡)_{𝑡≥0} of the interpolated process (4.2.1),
$$\partial_t \operatorname{KL}(\mu_t \,\|\, \pi) \le -\frac{3}{4}\operatorname{FI}(\mu_t \,\|\, \pi) + \mathrm{E}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2]\,.$$
Recall that the Fisher information is FI(𝜇 ‖ 𝜋) ≔ E_𝜇[‖∇ ln(𝜇/𝜋)‖²] if 𝜇 has a smooth
density with respect to 𝜋.
Proof. For the generator ℒ of the Langevin diffusion (with potential 𝑉), we can calculate
ℒ𝑉 = Δ𝑉 − ‖∇𝑉‖². Also, since ∇²𝑉 ⪯ 𝛽𝐼_𝑑, then Δ𝑉 ≤ 𝛽𝑑. Thus, using the fundamental
integration by parts identity (Theorem 1.2.14),
$$\int \|\nabla V\|^2\, \frac{\mathrm{d}\mu}{\mathrm{d}\pi}\,\mathrm{d}\pi \le \beta d + \int \Bigl\langle \nabla V, \nabla \frac{\mathrm{d}\mu}{\mathrm{d}\pi}\Bigr\rangle \mathrm{d}\pi = \beta d + \int \Bigl\langle \nabla V, \nabla \ln \frac{\mathrm{d}\mu}{\mathrm{d}\pi}\Bigr\rangle \mathrm{d}\mu \le \beta d + \frac{1}{2}\, \mathrm{E}_\mu[\|\nabla V\|^2] + \frac{1}{2}\operatorname{FI}(\mu \,\|\, \pi)\,.$$
Rearranging the inequality yields the result.
Also, recall that an LSI implies KL(· ‖ 𝜋) ≤ (𝐶_LSI/2) FI(· ‖ 𝜋).
Theorem 4.2.5 ([VW19]). For 𝑘 ∈ N, let 𝜇_{𝑘ℎ} denote the law of the 𝑘-th iterate of LMC
with step size ℎ > 0. Assume that the target 𝜋 ∝ exp(−𝑉) satisfies LSI and that ∇𝑉 is
𝛽-Lipschitz. Then, for all ℎ ≤ 1/(4𝛽), for all 𝑁 ∈ N,
$$\operatorname{KL}(\mu_{Nh} \,\|\, \pi) \le \exp\Bigl(-\frac{Nh}{C_{\mathrm{LSI}}}\Bigr) \operatorname{KL}(\mu_0 \,\|\, \pi) + O(C_{\mathrm{LSI}}\, \beta^2 d h + \beta^2 d h^2)\,.$$
In particular, for all 𝜀 ∈ [0, √(𝐶_LSI 𝛽𝑑)], if we take ℎ ≍ 𝜀²/(𝐶_LSI 𝛽²𝑑), then we obtain the guarantee
√KL(𝜇_{𝑁ℎ} ‖ 𝜋) ≤ 𝜀 after
$$N = O\Bigl(\frac{C_{\mathrm{LSI}}^2\, \beta^2 d}{\varepsilon^2} \log \frac{\sqrt{\operatorname{KL}(\mu_0 \,\|\, \pi)}}{\varepsilon}\Bigr) \quad \text{iterations}\,.$$
Proof. In light of Corollary 4.2.3, we focus our attention on the discretization error term
E[‖∇𝑉 (𝑋_𝑡) − ∇𝑉 (𝑋_{𝑘ℎ})‖²]. In order to apply Lemma 4.2.4, it is more convenient to have
E[‖∇𝑉 (𝑋_𝑡)‖²] instead of E[‖∇𝑉 (𝑋_{𝑘ℎ})‖²]. So, we use
$$\le \frac{1}{4}\operatorname{FI}(\mu_t \,\|\, \pi) + 2\beta^2 d\,(t - kh)\,.$$
Combining with our differential inequality from Corollary 4.2.3 and LSI,
$$\partial_t \operatorname{KL}(\mu_t \,\|\, \pi) \le -\frac{1}{2}\operatorname{FI}(\mu_t \,\|\, \pi) + 6\beta^2 d\,(t - kh) \le -\frac{1}{C_{\mathrm{LSI}}} \operatorname{KL}(\mu_t \,\|\, \pi) + 6\beta^2 d\,(t - kh)\,.$$
This implies that
$$\partial_t \Bigl[\exp\Bigl(\frac{t - kh}{C_{\mathrm{LSI}}}\Bigr) \operatorname{KL}(\mu_t \,\|\, \pi)\Bigr] \le 6\beta^2 d\,(t - kh) \exp\Bigl(\frac{t - kh}{C_{\mathrm{LSI}}}\Bigr)\,,$$
and upon integration,
$$\operatorname{KL}(\mu_{(k+1)h} \,\|\, \pi) \le \exp\Bigl(-\frac{h}{C_{\mathrm{LSI}}}\Bigr) \operatorname{KL}(\mu_{kh} \,\|\, \pi) + 3\beta^2 d h^2\,.$$
Iterating and splitting into cases based on whether or not ℎ ≤ 𝐶_LSI,
$$\operatorname{KL}(\mu_{Nh} \,\|\, \pi) \le \exp\Bigl(-\frac{Nh}{C_{\mathrm{LSI}}}\Bigr) \operatorname{KL}(\mu_0 \,\|\, \pi) + O(\max\{C_{\mathrm{LSI}}\, \beta^2 d h,\; \beta^2 d h^2\})\,.$$
Recall from Theorem 2.2.15 that an LSI implies exponential decay in every Rényi
divergence, not just the KL divergence. Working with Rényi divergences of order 𝑞 > 1
introduces substantial new difficulties for the discretization analysis, which is why it is
remarkable that the proof above can be adapted to the Rényi case with the introduction
of some additional tricks; see Chapter 5.
where the two terms are the energy and the (negative) entropy. Accordingly, we break up
the iterates of LMC into the steps
$$X_{kh}^{+} \coloneqq X_{kh} - h\,\nabla V(X_{kh})\,,$$
$$X_{(k+1)h} \coloneqq X_{kh}^{+} + \sqrt{2}\,(B_{(k+1)h} - B_{kh})\,.$$
The first step is simply a deterministic gradient descent update on the function 𝑉. If
we write 𝜇⁺_{𝑘ℎ} for the law of 𝑋⁺_{𝑘ℎ}, then in the space of measures one can show that 𝜇⁺_{𝑘ℎ}
is obtained from 𝜇_{𝑘ℎ} by taking a gradient step for the energy functional E w.r.t. the
Wasserstein geometry.¹ On the other hand, the second step applies the heat flow; in
the space of measures, this is a Wasserstein gradient flow for the entropy functional H.
Since the gradient descent algorithm is sometimes known as the “forwards” method in
optimization (as opposed to a proximal step which is the “backwards” method), this has
led to LMC being dubbed the “forward-flow” algorithm (see [Wib18] for more on this
perspective).
The strategy of the proof is to show that the forward step of LMC dissipates the energy
while not increasing the entropy too much, and that the flow step of LMC dissipates the
entropy while not increasing the energy too much.
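In code, the split reads as follows (a sketch under our own naming; for a smooth potential the composition of the two steps reproduces plain LMC):

```python
import numpy as np

def forward_flow_step(x, grad_V, h, rng):
    """One LMC step as the 'forward-flow' composition: a gradient (forward)
    step on the potential V, followed by a heat-flow step, i.e. convolution
    of the law with normal(0, 2h I)."""
    x_plus = x - h * grad_V(x)                                     # forward step on V
    return x_plus + np.sqrt(2 * h) * rng.standard_normal(x.shape)  # heat flow

# Example: target N(0, I) in dimension 3, so grad_V(x) = x.
rng = np.random.default_rng(1)
x = np.zeros(3)
for _ in range(200):
    x = forward_flow_step(x, lambda y: y, h=0.1, rng=rng)
```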
The key lemma is the following one-step inequality.
Lemma 4.3.1. Let 𝜋 = exp(−𝑉) be the target and assume that 0 ⪯ 𝛼𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝛽𝐼_𝑑.
Let (𝜇_{𝑘ℎ})_{𝑘∈N} denote the iterates of LMC with step size ℎ ∈ [0, 1/𝛽]. Then,
Theorem 4.3.3 ([DMM19]). Suppose that 𝜋 = exp(−𝑉) is the target distribution and
that 𝑉 satisfies 0 ⪯ 𝛼𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝛽𝐼_𝑑. Let (𝜇_{𝑘ℎ})_{𝑘∈N} denote the law of LMC.
1. (weakly convex case) Suppose that 𝛼 = 0. For any 𝜀 ∈ [0, √𝑑], if we take step
size ℎ ≍ 𝜀²/(𝛽𝑑), then for the mixture distribution 𝜇̄_{𝑁ℎ} ≔ 𝑁⁻¹ Σ_{𝑘=1}^{𝑁} 𝜇_{𝑘ℎ} it holds that
√KL(𝜇̄_{𝑁ℎ} ‖ 𝜋) ≤ 𝜀 after
¹ Technically this is only true if the step size ℎ is chosen so that ℎ ‖∇²𝑉‖op ≤ 1. This is because on any
Riemannian manifold in which geodesics cannot be extended indefinitely, the gradient descent steps must
be short enough to ensure that the iterates are still travelling along geodesics.
4.3. PROOF VIA CONVEX OPTIMIZATION 153
$$N = O\Bigl(\frac{\beta d\, W_2^2(\mu_0, \pi)}{\varepsilon^4}\Bigr) \quad \text{iterations}\,.$$
2. (strongly convex case) Suppose that 𝛼 > 0 and let 𝜅 ≔ 𝛽/𝛼 denote the condition
number. Then, for any 𝜀 ∈ [0, √𝑑], with step size ℎ ≍ 𝜀²/(𝛽𝑑) we obtain
√𝛼 𝑊₂(𝜇_{𝑁ℎ}, 𝜋) ≤ 𝜀 and √KL(𝜇̄_{𝑁ℎ,2𝑁ℎ} ‖ 𝜋) ≤ 𝜀 after
$$N = O\Bigl(\frac{\kappa d}{\varepsilon^2} \log \frac{\sqrt{\alpha}\, W_2(\mu_0, \pi)}{\varepsilon}\Bigr) \quad \text{iterations}\,,$$
where 𝜇̄_{𝑁ℎ,2𝑁ℎ} ≔ 𝑁⁻¹ Σ_{𝑘=𝑁+1}^{2𝑁} 𝜇_{𝑘ℎ}.
1. By summing the inequality (4.3.2) and using the convexity of the KL divergence,
$$\operatorname{KL}(\bar\mu_{Nh} \,\|\, \pi) \le \frac{1}{N} \sum_{k=1}^{N} \operatorname{KL}(\mu_{kh} \,\|\, \pi) \le \frac{W_2^2(\mu_0, \pi)}{2Nh} + \beta d h\,.$$
2. First, we prove the 𝑊₂ guarantee. Using the fact that KL(𝜇_{(𝑘+1)ℎ} ‖ 𝜋) ≥ 0 and
iterating the inequality (4.3.2), we obtain
$$W_2^2(\mu_{Nh}, \pi) \le (1 - \alpha h)^N\, W_2^2(\mu_0, \pi) + 2\beta d h^2 \sum_{k=0}^{N-1} (1 - \alpha h)^k \le \exp(-\alpha N h)\, W_2^2(\mu_0, \pi) + O(\kappa d h)\,.$$
With our choice of ℎ and 𝑁, we obtain √𝛼 𝑊₂(𝜇_{𝑁ℎ}, 𝜋) ≤ 𝜀.
Next, forget about the previous 𝑁 iterations of LMC and consider 𝜇_{𝑁ℎ} to be the
new initialization to LMC. Applying the weakly convex result now yields the KL
guarantee √KL(𝜇̄_{𝑁ℎ,2𝑁ℎ} ‖ 𝜋) ≤ 𝜀.
1. The forward step dissipates the energy. Let 𝑍 ∼ 𝜋 be optimally coupled to 𝑋_{𝑘ℎ}.
Then, applying the strong convexity and smoothness of 𝑉,
$$\mathcal{E}(\mu_{kh}^{+}) - \mathcal{E}(\pi) = \mathrm{E}[V(X_{kh}^{+}) - V(X_{kh}) + V(X_{kh}) - V(Z)]$$
$$\le \mathrm{E}\Bigl[\langle \nabla V(X_{kh}), X_{kh}^{+} - X_{kh}\rangle + \frac{\beta}{2}\,\|X_{kh}^{+} - X_{kh}\|^2 + \langle \nabla V(X_{kh}), X_{kh} - Z\rangle - \frac{\alpha}{2}\,\|X_{kh} - Z\|^2\Bigr]$$
$$= \mathrm{E}\Bigl[\langle \nabla V(X_{kh}), X_{kh}^{+} - Z\rangle + \frac{\beta}{2}\,\|X_{kh}^{+} - X_{kh}\|^2 - \frac{\alpha}{2}\,\|X_{kh} - Z\|^2\Bigr]\,. \tag{4.3.4}$$
Next, we have the expansion
$$\|X_{kh} - Z\|^2 = \|X_{kh}^{+} - X_{kh}\|^2 + \|X_{kh}^{+} - Z\|^2 - 2\,\langle X_{kh}^{+} - X_{kh}, X_{kh}^{+} - Z\rangle = \|X_{kh}^{+} - X_{kh}\|^2 + \|X_{kh}^{+} - Z\|^2 + 2h\,\langle \nabla V(X_{kh}), X_{kh}^{+} - Z\rangle\,.$$
= 𝛽𝑑ℎ . (4.3.6)
3. The flow step dissipates the entropy. Let (𝑄_𝑡)_{𝑡≥0} denote the heat semigroup,
i.e., 𝑄_𝑡 𝑓(𝑥) ≔ E 𝑓(𝑥 + √2 𝐵_𝑡), so that 𝜇_{(𝑘+1)ℎ} = 𝜇⁺_{𝑘ℎ}𝑄_ℎ. Then, since the heat flow is the
Wasserstein gradient flow of H, and the Wasserstein gradient of H is ∇_{𝑊₂}H(𝜇) = ∇ ln 𝜇,
one can show that
$$\partial_t W_2^2(\mu_{kh}^{+} Q_t, \pi) \le 2\,\mathrm{E}\bigl\langle \nabla \ln(\mu_{kh}^{+} Q_t)(X_{kh+t}^{+}),\, Z - X_{kh+t}^{+}\bigr\rangle\,,$$
where 𝑋⁺_{𝑘ℎ+𝑡} ∼ 𝜇⁺_{𝑘ℎ}𝑄_𝑡 and 𝑍 ∼ 𝜋 are optimally coupled. This follows from the formula
for the gradient of the squared Wasserstein distance (Theorem 1.4.11); it may be justified
more rigorously using, e.g., [AGS08, Theorem 10.2.2].
On the other hand, we showed that H is geodesically convex (see (1.4.3)), so
$$\mathrm{H}(\pi) - \mathrm{H}(\mu_{kh}^{+} Q_t) \ge \mathrm{E}\bigl\langle \nabla \ln(\mu_{kh}^{+} Q_t)(X_{kh+t}^{+}),\, Z - X_{kh+t}^{+}\bigr\rangle\,.$$
Using the fact that 𝑡 ↦→ H(𝜇⁺_{𝑘ℎ}𝑄_𝑡) is decreasing (which also follows because 𝑡 ↦→ 𝜇⁺_{𝑘ℎ}𝑄_𝑡 is
the gradient flow of H), we then have
$$W_2^2(\mu_{(k+1)h}, \pi) - W_2^2(\mu_{kh}^{+}, \pi) \le 2h\,\{\mathrm{H}(\pi) - \mathrm{H}(\mu_{(k+1)h})\}\,. \tag{4.3.7}$$
Concluding the proof. Combine (4.3.5), (4.3.6), and (4.3.7) to obtain (4.3.2).
Non-smooth case. The proof via convex optimization can also handle the non-smooth
case in which we only assume that 𝑉 is convex and Lipschitz. As before, we deduce the
convergence result from a key one-step inequality.
Lemma 4.3.8. Let 𝜋 = exp(−𝑉) be the target and assume that 𝑉 is convex and
𝐿-Lipschitz. Let (𝜇_{𝑘ℎ})_{𝑘∈N} denote the iterates of LMC with step size ℎ > 0. Then,
Theorem 4.3.9 ([DMM19]). Suppose that 𝜋 = exp(−𝑉) is the target distribution and
that 𝑉 is convex and 𝐿-Lipschitz. Let (𝜇_{𝑘ℎ})_{𝑘∈N} denote the law of LMC. For any 𝜀 > 0, if
we take step size ℎ ≍ 𝜀²/𝐿², then for the mixture distribution 𝜇̄_{𝑁ℎ} ≔ 𝑁⁻¹ Σ_{𝑘=1}^{𝑁} 𝜇_{𝑘ℎ} it holds
that √KL(𝜇̄_{𝑁ℎ} ‖ 𝜋) ≤ 𝜀 after
$$N = O\Bigl(\frac{L^2\, W_2^2(\mu_0^{+}, \pi)}{\varepsilon^4}\Bigr) \quad \text{iterations}\,.$$
Proof. This follows from Lemma 4.3.8 in exactly the same way that the weakly convex
case of Theorem 4.3.3 follows from Lemma 4.3.1.
Proof of Lemma 4.3.8. The main task here is to obtain dissipation of the energy functional
E under our new assumptions. Let 𝑍 ∼ 𝜋 be optimally coupled to 𝑋 (𝑘+1)ℎ . Then, similarly
to before, we have
$$\|X_{(k+1)h} - Z\|^2 = \|X_{(k+1)h}^{+} - X_{(k+1)h}\|^2 + \|X_{(k+1)h}^{+} - Z\|^2 + 2h\,\langle \nabla V(X_{(k+1)h}), X_{(k+1)h}^{+} - Z\rangle\,.$$
Taking expectations,
too large; our guarantee will only imply that the error of LMC is small if 𝑁 lies in some
range. Conceptually, this is unsatisfying because “running the Markov chain too long”
should not be a problem, and it only arises as an artefact of the proof. Nonetheless, it is
worthwhile learning the proof because it is broadly applicable.
Historically, the Girsanov method was one of the first discretization techniques utilized
in the modern quantitative study of sampling (see [DT12]). The argument we present here
is similar in spirit to [DT12], although we have made a few refinements.
Background on Girsanov’s theorem. The law 𝜇 (𝑘+1)ℎ of the LMC iterate is obtained
from 𝜇𝑘ℎ by first applying the mapping id − ℎ ∇𝑉 and then convolving with a Gaussian.
This is a somewhat complicated recursive process, and it is not tractable to write down
and explicitly work with the density 𝜇𝑘ℎ . On the other hand, conditioned on 𝑋𝑘ℎ , the
conditional law of 𝑋 (𝑘+1)ℎ is simply a Gaussian, with an explicit mean (depending on 𝑋𝑘ℎ )
and covariance matrix. Consequently, it is easy to write down an explicit and simple
expression for the joint law of 𝑋 0, 𝑋ℎ , 𝑋 2ℎ , . . . , 𝑋𝑘ℎ .
The Langevin diffusion 𝑡 ↦→ 𝑍𝑡 is not as straightforward because it is the solution to a
stochastic differential equation, but infinitesimally we can think of it in the same way as
the LMC iteration: namely, conditioned on the past, the conditional law of the Langevin
diffusion in the next instant is a Gaussian with some mean and covariance. So, it should
be easier to consider the “joint law” of the Langevin diffusion. In fact, the content of
Girsanov’s theorem is that there is an explicit and simple formula for the “ratio of the
joint laws” of any two processes driven by the same Brownian motion.
To make this idea rigorous, consider the SDEs
$$\mathrm{d}X_t = -\nabla V(X_{kh})\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \qquad \text{for } t \in [kh, (k+1)h]\,,$$
$$\mathrm{d}Z_t = -\nabla V(Z_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,.$$
Then, (𝑋𝑡 )𝑡 ≥0 is the interpolation of LMC that we encountered in (4.2.1) and (𝑍𝑡 )𝑡 ≥0 is
the Langevin diffusion. Since (𝑋𝑡 )𝑡 ∈[0,𝑇 ] is a random continuous path [0,𝑇 ] → R𝑑 , its law
is a probability measure P𝑇 on the space C([0,𝑇 ]; R𝑑 ). Similarly, the law of (𝑍𝑡 )𝑡 ∈[0,𝑇 ]
is another probability measure Q𝑇 on C([0,𝑇 ]; R𝑑 ). We will interpret P𝑇 and Q𝑇 as the
“joint laws” of the processes.
The precise way to interpret Girsanov’s theorem can be confusing, so we will carefully
outline the logic. Start with the Langevin diffusion (𝑍𝑡 )𝑡 ∈[0,𝑇 ] defined over a probability
space, so that its law is Q𝑇 . Girsanov’s theorem states that it is possible to define a new
measure P̃_𝑇, defined explicitly through the Radon–Nikodym derivative dP̃_𝑇/dQ_𝑇, such that
under P̃_𝑇 the process (𝐵̃_𝑡)_{𝑡∈[0,𝑇]} given by d𝐵̃_𝑡 ≔ d𝐵_𝑡 + (1/√2) {∇𝑉(𝑍_{𝑘ℎ}) − ∇𝑉(𝑍_𝑡)} d𝑡 is a
standard Brownian motion (where 𝑘 is the largest integer with 𝑘ℎ ≤ 𝑡). But that means
that under P̃_𝑇,
$$\mathrm{d}Z_t = -\nabla V(Z_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t = -\nabla V(Z_{kh})\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}\tilde B_t\,.$$
Thus, under P̃_𝑇, the process (𝑍_𝑡)_{𝑡∈[0,𝑇]} solves the SDE corresponding to the interpolation of
LMC, and hence P̃_𝑇 = P_𝑇. So, this theorem indeed gives us a formula for dP_𝑇/dQ_𝑇.
Here is a statement of (a version of) Girsanov’s theorem.
Let P_𝑇 and Q_𝑇 be their respective laws on C([0,𝑇]; R^𝑑). Then, provided that Novikov’s
condition holds,
$$\mathrm{E}^{Q_T} \exp\Bigl(\frac{1}{4} \int_0^T \|b_t^X - b_t^Y\|^2\,\mathrm{d}t\Bigr) < \infty\,,$$
we have the formula
$$\frac{\mathrm{d}P_T}{\mathrm{d}Q_T} = \exp\Bigl(\frac{1}{\sqrt{2}} \int_0^T \langle b_t^X - b_t^Y, \mathrm{d}B_t\rangle - \frac{1}{4} \int_0^T \|b_t^X - b_t^Y\|^2\,\mathrm{d}t\Bigr)\,,$$
Theorem 4.4.2. Let 𝜋 ∝ exp(−𝑉) be the target and assume 0 ⪯ 𝛼𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝛽𝐼_𝑑 and
∇𝑉(0) = 0. Let (𝜇_{𝑘ℎ})_{𝑘∈N} denote the law of LMC, and let (𝜋_𝑡)_{𝑡≥0} denote the law of the
4.4. PROOF VIA GIRSANOV’S THEOREM 159
$$\operatorname{KL}(P_{Nh} \,\|\, Q_{Nh}) = \mathrm{E}^{P_{Nh}} \ln \frac{\mathrm{d}P_{Nh}}{\mathrm{d}Q_{Nh}} = \mathrm{E}^{P_{Nh}} \sum_{k=0}^{N-1} \Bigl[\frac{1}{\sqrt{2}} \int_{kh}^{(k+1)h} \langle \nabla V(X_t) - \nabla V(X_{kh}), \mathrm{d}B_t\rangle - \frac{1}{4} \int_{kh}^{(k+1)h} \|\nabla V(X_t) - \nabla V(X_{kh})\|^2\,\mathrm{d}t\Bigr]\,.$$
$$\operatorname{KL}(P_{Nh} \,\|\, Q_{Nh}) = \frac{1}{4} \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \mathrm{E}^{P_{Nh}}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2]\,\mathrm{d}t \le \frac{\beta^2}{4} \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \mathrm{E}^{P_{Nh}}[\|X_t - X_{kh}\|^2]\,\mathrm{d}t$$
$$= \frac{\beta^2}{4} \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \{(t - kh)^2\, \mathrm{E}^{P_{Nh}}[\|\nabla V(X_{kh})\|^2] + 2d\,(t - kh)\}\,\mathrm{d}t \le \frac{\beta^4 h^3}{12} \sum_{k=0}^{N-1} \mathrm{E}^{P_{Nh}}[\|X_{kh}\|^2] + \frac{\beta^2 d h^2 N}{4}\,.$$
We can use the bound on the second moment of the LMC iterates (Lemma 4.1.4) to get
$$\operatorname{KL}(P_{Nh} \,\|\, Q_{Nh}) \lesssim \beta^4 h^3 N\, \Bigl(\mathrm{E}^{P_{Nh}}[\|X_0\|^2] + \frac{d}{\alpha}\Bigr) + \beta^2 d h^2 N\,.$$
The preceding argument is similar to the Wasserstein coupling proof (Theorem 4.1.1),
and indeed in both proofs we used strong log-concavity. However, in the Wasserstein cou-
pling proof, the strong log-concavity assumption is crucial because it implies contraction
in the Wasserstein metric, whereas the preceding discretization argument only requires a
bound on the second moment of the LMC iterates which can be obtained in other ways.
What sampling guarantee does Theorem 4.4.2 imply? Unfortunately, neither KL nor
√KL satisfies the triangle inequality, which poses a difficulty for bounding the distance
of 𝜇_{𝑁ℎ} from the target 𝜋. One way to skirt this difficulty is to simply use the fact that
√KL ≳ ‖·‖TV (Pinsker’s inequality, Exercise 2.10) and the fact that the total variation
distance satisfies the triangle inequality.
Corollary 4.4.3. Let 𝜋 ∝ exp(−𝑉) be the target and assume 0 ⪯ 𝛼𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝛽𝐼_𝑑
and ∇𝑉(0) = 0. Let (𝜇_{𝑘ℎ})_{𝑘∈N} denote the law of LMC initialized at the distribution
𝜇₀ = normal(0, 𝛽⁻¹𝐼_𝑑). Then, for all 𝜀 ∈ (0, 1) and for ℎ = Θ̃(𝛼𝜀²/(𝛽²𝑑)), we obtain the
guarantee ‖𝜇_{𝑁ℎ} − 𝜋‖TV ≤ 𝜀 provided that
$$N = \widetilde{\Theta}\Bigl(\frac{\kappa^2 d}{\varepsilon^2}\Bigr) \quad \text{iterations}\,.$$
Proof. Since strong log-concavity implies an LSI, which in turn implies exponential con-
vergence of the Langevin diffusion to its target in KL divergence (Theorems 1.2.24 and
1.2.28), we obtain KL(𝜋_{𝑁ℎ} ‖ 𝜋) ≤ 𝜀²/2 provided 𝑁ℎ ≳ (1/𝛼) log(KL(𝜇₀ ‖ 𝜋)/𝜀²). With our choice
of initialization, KL(𝜇₀ ‖ 𝜋) ≲ 𝑑 log 𝜅 and ∫ ‖·‖² d𝜇₀ ≲ 𝑑/𝛽 ≤ 𝑑/𝛼.
We now take the number of iterations to satisfy 𝑁ℎ ≍ (1/𝛼) log(KL(𝜇₀ ‖ 𝜋)/𝜀²). Then,
$$\|\mu_{Nh} - \pi\|_{\mathrm{TV}} \le \|\mu_{Nh} - \pi_{Nh}\|_{\mathrm{TV}} + \|\pi_{Nh} - \pi\|_{\mathrm{TV}} \le \sqrt{\frac{1}{2}\operatorname{KL}(\mu_{Nh} \,\|\, \pi_{Nh})} + \sqrt{\frac{1}{2}\operatorname{KL}(\pi_{Nh} \,\|\, \pi)} \le \varepsilon\,.$$
Finally, plugging in the choice of ℎ into 𝑁ℎ ≍ (1/𝛼) log(KL(𝜇₀ ‖ 𝜋)/𝜀²) yields the result.
Although the quantitative dependence in Corollary 4.4.3 matches prior results (e.g.,
via the interpolation method in Theorem 4.2.5), the final result is unsatisfying because
we have moved to a weaker metric (TV rather than KL) for a seemingly silly reason (the
failure of the triangle inequality for the KL divergence). Indeed, we have a convergence
result for the Langevin diffusion in KL, and our discretization bound is in KL, yet our final
result is in TV. Can we remedy this?
To address this, we can introduce the Rényi divergences (defined in (2.2.14)); recall
that KL = R1 . We have a continuous-time result for the Langevin diffusion in Rényi
divergence (Theorem 2.2.15), and it turns out that with some additional tricks it is possible
to extend the Girsanov discretization argument to any Rényi divergence. Moreover, the
Rényi divergences satisfy a weak triangle inequality: for any 𝑞 > 1, any 𝜆 ∈ (0, 1), and
any probability measures 𝜇, 𝜈, 𝜋:
$$\mathrm{R}_q(\mu \,\|\, \pi) \le \mathrm{R}_{q/\lambda}(\mu \,\|\, \nu) + \frac{q-\lambda}{q-1}\, \mathrm{R}_{(q-\lambda)/(1-\lambda)}(\nu \,\|\, \pi)\,. \tag{4.4.4}$$
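The weak triangle inequality (4.4.4) can be sanity-checked numerically on unit-variance Gaussians (our own illustration), for which the Rényi divergence has the closed form R_𝑞(normal(𝑚₁, 1) ‖ normal(𝑚₂, 1)) = 𝑞 (𝑚₁ − 𝑚₂)²/2:

```python
def renyi_gauss(q, m1, m2):
    """Renyi divergence of order q between N(m1, 1) and N(m2, 1):
    R_q = q (m1 - m2)^2 / 2 (closed form for equal variances)."""
    return q * (m1 - m2) ** 2 / 2

# Check the weak triangle inequality on unit-variance Gaussians.
q, lam = 2.0, 0.5
mu, nu, pi = 0.0, 1.0, 2.5   # means of the three Gaussians
lhs = renyi_gauss(q, mu, pi)
rhs = (renyi_gauss(q / lam, mu, nu)
       + (q - lam) / (q - 1) * renyi_gauss((q - lam) / (1 - lam), nu, pi))
assert lhs <= rhs
```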
This allows us to combine a continuous-time Rényi result with a Rényi discretization
argument to yield a Rényi sampling guarantee. We provide the details for this approach
in Chapter 5.
Bibliographical Notes
Historically, the LMC algorithm, which is called unadjusted because of the lack of a
Metropolis–Hastings filter, was only studied relatively recently in non-asymptotic settings.
Before the work of [DT12], it was more common to study MALA (which we introduce and
study in Chapter 7). The ideas which go into the basic 𝑊2 coupling proof for Theorem 4.1.1
were developed in a series of works on strongly log-concave sampling: [DT12; Dal17a;
Dal17b; DM17; DM19]. The Girsanov argument of Theorem 4.4.2 is also due to [DT12].
There are two other notable proof techniques that we have omitted from this chapter:
reflection coupling [Ebe11; Ebe16] and mean squared analysis [Li+19; Li+22; LZT22].
Reflection coupling uses a carefully chosen coupling of the Brownian motions rather than
just taking the two Brownian motions to be the same as we have done (the latter coupling
is called the synchronous coupling). Mean squared analysis is a general framework
which combines local errors (one-step discretization bounds) into global error bounds.
These two methods are useful for performing discretization analysis under more general
sets of assumptions, but they are limited to providing guarantees in 𝑊1 or 𝑊2 .
Exercises
Proof via Wasserstein Coupling
In this chapter, we study convergence guarantees for the LMC algorithm in Rényi di-
vergences. Recall that the Rényi divergences are a family of information divergences,
indexed by a parameter 𝑞, such that the Rényi divergence of order 𝑞 = 1 is the KL diver-
gence, and the Rényi divergence of order 2 is related to the chi-squared divergence via
R2 = ln(1 + 𝜒 2 ). We studied the continuous-time convergence of the Langevin diffusion
in Rényi divergence under either a Poincaré inequality or a log-Sobolev inequality in
Section 2.2.4. Here, we will build upon these results in order to study the discretized
algorithm. Rényi divergence guarantees are stronger than the 𝑊2 or KL guarantees from
Chapter 4, and they have recently found use in areas such as differential privacy [Mir17].
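The identity R₂ = ln(1 + 𝜒²) is easy to confirm on a discrete example (our own check), since 𝜒²(𝜇 ‖ 𝜋) = Σ_𝑥 𝜇(𝑥)²/𝜋(𝑥) − 1:

```python
import math

# Check R_2 = ln(1 + chi^2) on a pair of discrete distributions.
mu = [0.2, 0.3, 0.5]
pi = [0.4, 0.4, 0.2]
chi2 = sum((m - p) ** 2 / p for m, p in zip(mu, pi))    # chi-squared divergence
r2 = math.log(sum(m ** 2 / p for m, p in zip(mu, pi)))  # Renyi divergence of order 2
assert abs(r2 - math.log(1 + chi2)) < 1e-12
```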
164 CHAPTER 5. CONVERGENCE IN RÉNYI DIVERGENCE
Proposition 5.1.1. Along the law (𝜇_𝑡)_{𝑡≥0} of the interpolated process (4.2.1),
$$\partial_t \mathrm{R}_q(\mu_t \,\|\, \pi) \le -\frac{3}{q}\, \frac{\mathrm{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathrm{E}_\pi(\rho_t^q)} + q\, \mathrm{E}[\psi_t(X_t)\, \|\nabla V(X_t) - \nabla V(X_{kh})\|^2]\,,$$
where 𝜌_𝑡 ≔ d𝜇_𝑡/d𝜋 and 𝜓_𝑡 ≔ 𝜌_𝑡^{𝑞−1}/E_𝜋(𝜌_𝑡^𝑞).
$$\operatorname{FI}_q(\mu \,\|\, \pi) \coloneqq \frac{4}{q}\, \frac{\mathrm{E}_\pi[\|\nabla(\rho^{q/2})\|^2]}{\mathrm{E}_\pi(\rho^q)}\,, \qquad \rho \coloneqq \frac{\mathrm{d}\mu}{\mathrm{d}\pi}\,,$$
may be considered the “Rényi Fisher information”. As in the proof of Theorem 2.2.15,
under a log-Sobolev inequality, we have
$$\operatorname{FI}_q(\mu \,\|\, \pi) \ge \frac{2}{q\,C_{\mathrm{LSI}}}\, \mathrm{R}_q(\mu \,\|\, \pi)\,,$$
so the first term in Proposition 5.1.1 provides a decay in the Rényi divergence. In the
discretization analysis, our task is to control the second term.
Note that
$$\mathrm{E}\,\psi_t(X_t) = \mathrm{E}_{\mu_t} \psi_t = \mathrm{E}_\pi\Bigl[\rho_t\, \frac{\rho_t^{q-1}}{\mathrm{E}_\pi(\rho_t^q)}\Bigr] = 1\,,$$
so 𝜓_𝑡(𝑋_𝑡) acts as a change of measure. The main difficulty of the proof is that whereas
we know how to control the term ‖∇𝑉(𝑋_𝑡) − ∇𝑉(𝑋_{𝑘ℎ})‖² under the original probability
measure P (indeed, this is precisely what we accomplished in Theorem 4.2.5), it is not
straightforward to control this term under the measure P̄ defined by dP̄/dP = 𝜓_𝑡(𝑋_𝑡). To-
wards this end, we shall employ change of measure inequalities that allow us to relate
expectations under P̄ to expectations under P. Note that 𝜓_𝑡 = 1 when 𝑞 = 1, which is why
these difficulties can be avoided when working with the KL divergence.
The main theorem that we wish to prove is as follows.
Theorem 5.1.2 ([Che+21a]). For 𝑘 ∈ N, let 𝜇_{𝑘ℎ} denote the law of the 𝑘-th iterate of LMC
with step size ℎ > 0. Assume that the target 𝜋 ∝ exp(−𝑉) satisfies LSI and that ∇𝑉 is
𝛽-Lipschitz. Also, for simplicity, assume that 𝐶_LSI, 𝛽 ≥ 1. Then, for all ℎ ≤ 1/(192 𝐶_LSI 𝛽²𝑞²),
5.1. PROOF UNDER LSI VIA INTERPOLATION ARGUMENT 165
where 𝑁₀ ≔ ⌈(2𝐶_LSI/ℎ) ln(𝑞 − 1)⌉. In particular, for all 𝜀 ∈ [0, √(𝑑/𝑞)], if we choose the step
size appropriately, then the guarantee holds after
$$N = \widetilde{O}\Bigl(\frac{C_{\mathrm{LSI}}^2\, \beta^2 d q}{\varepsilon^2} \log \mathrm{R}_2(\mu_0 \,\|\, \pi)\Bigr) \quad \text{iterations}\,.$$
For clarity of exposition, we begin with a discretization analysis that incurs a worse
dependence on 𝑞. Afterwards, we show how to improve the dependence on 𝑞 via a
hypercontractivity argument.
Proof of Theorem 5.1.2 with suboptimal dependence on 𝑞. As in the proof of Theorem 4.2.5,
our aim is to control the error term E[𝜓_𝑡(𝑋_𝑡) ‖∇𝑉(𝑋_𝑡) − ∇𝑉(𝑋_{𝑘ℎ})‖²], where from the
𝛽-smoothness of 𝑉 and from ℎ ≤ 1/(3𝛽) we have a pointwise bound on ‖∇𝑉(𝑋_𝑡) − ∇𝑉(𝑋_{𝑘ℎ})‖².
We also use the identity
$$\mathrm{E}_{\mu_t}\Bigl[\psi_t\, \Bigl\|\nabla \ln\Bigl(\psi_t\, \frac{\mathrm{d}\mu_t}{\mathrm{d}\pi}\Bigr)\Bigr\|^2\Bigr] = \frac{4\, \mathrm{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathrm{E}_\pi(\rho_t^q)}\,, \tag{5.1.3}$$
which follows from the chain rule.
For the second error term, we must control the term ‖𝐵_𝑡 − 𝐵_{𝑘ℎ}‖² under the measure
P̄, where dP̄/dP = 𝜓_𝑡(𝑋_𝑡). The difficulty is that under P̄, 𝐵 is no longer a standard Brownian
motion, so it is difficult to control this term directly. Instead, we apply the Donsker–
Varadhan variational principle (Theorem 1.5.4) to relate the expectation under P̄ (denoted
Ē) with the expectation under P. For any random variable 𝜁, it yields
$$\bar{\mathrm{E}}\,\zeta \le \operatorname{KL}(\bar P \,\|\, P) + \ln \mathrm{E} \exp \zeta\,.$$
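On a finite space the variational principle is easy to verify directly (our own sketch; here a tilted measure Q plays the role of the reweighted measure in the text):

```python
import math

# Donsker-Varadhan on a finite space: for any Q absolutely continuous w.r.t. P,
# E_Q[zeta] <= KL(Q || P) + ln E_P[exp(zeta)].
P = [0.5, 0.3, 0.2]
zeta = [1.0, -2.0, 0.5]
tilt = [math.exp(0.3 * z) for z in zeta]      # an arbitrary positive tilt
Z = sum(p * w for p, w in zip(P, tilt))
Q = [p * w / Z for p, w in zip(P, tilt)]      # change of measure dQ/dP = tilt/Z
lhs = sum(q * z for q, z in zip(Q, zeta))
kl = sum(q * math.log(q / p) for q, p in zip(Q, P))
rhs = kl + math.log(sum(p * math.exp(z) for p, z in zip(P, zeta)))
assert lhs <= rhs + 1e-12
```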
Applying this to 𝜁 ≔ 𝑐 (‖𝐵_𝑡 − 𝐵_{𝑘ℎ}‖ − E‖𝐵_𝑡 − 𝐵_{𝑘ℎ}‖)² for a constant 𝑐 > 0 to be chosen
later, we obtain
$$\bar{\mathrm{E}}[\|B_t - B_{kh}\|^2] \le 2\, \mathrm{E}[\|B_t - B_{kh}\|^2] + \frac{2}{c}\, \bar{\mathrm{E}}\,\zeta \le 2d\,(t - kh) + \frac{2}{c}\, \bigl\{\operatorname{KL}(\bar P \,\|\, P) + \ln \mathrm{E} \exp\bigl(c\,(\|B_t - B_{kh}\| - \mathrm{E}\|B_t - B_{kh}\|)^2\bigr)\bigr\}\,.$$
Under P, 𝐵_𝑡 − 𝐵_{𝑘ℎ} ∼ normal(0, (𝑡 − 𝑘ℎ) 𝐼_𝑑). Applying concentration of measure for the
Gaussian distribution (see, e.g., Theorem 2.4.8), if 𝑐 ≲ 1/(𝑡 − 𝑘ℎ), then the last logarithmic
term above is 𝑂(1). In fact, it suffices to take 𝑐 = 1/(8 (𝑡 − 𝑘ℎ)). Next, using the LSI for 𝜋,
$$\operatorname{KL}(\bar P \,\|\, P) = \mathrm{E}_{\psi_t\mu_t} \ln \psi_t = \mathrm{E}_{\psi_t\mu_t} \ln \frac{\rho_t^{q-1}}{\mathrm{E}_{\mu_t}(\rho_t^{q-1})} = \frac{q-1}{q}\, \mathrm{E}_{\psi_t\mu_t} \ln \frac{\rho_t^{q}}{\mathrm{E}_{\mu_t}(\rho_t^{q-1})^{q/(q-1)}}$$
$$= \frac{q-1}{q}\, \Bigl\{\mathrm{E}_{\psi_t\mu_t} \ln \frac{\rho_t^{q}}{\mathrm{E}_{\mu_t}(\rho_t^{q-1})} - \underbrace{\frac{1}{q-1} \ln \mathrm{E}_{\mu_t}(\rho_t^{q-1})}_{\ge 0}\Bigr\} \le \frac{q-1}{q} \operatorname{KL}(\psi_t \mu_t \,\|\, \pi)$$
$$\le \frac{(q-1)\, C_{\mathrm{LSI}}}{2q}\, \mathrm{E}_{\psi_t\mu_t}\Bigl[\Bigl\|\nabla \ln\Bigl(\psi_t\, \frac{\mathrm{d}\mu_t}{\mathrm{d}\pi}\Bigr)\Bigr\|^2\Bigr] = \frac{2\,(q-1)\, C_{\mathrm{LSI}}}{q}\, \frac{\mathrm{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathrm{E}_\pi(\rho_t^q)}\,.$$
All in all, applying Proposition 5.1.1 and collecting the error terms,
$$\partial_t \mathrm{R}_q(\mu_t \,\|\, \pi) \le -\frac{3}{q}\, \frac{\mathrm{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathrm{E}_\pi(\rho_t^q)} + 9\beta^2 q\,(t - kh)\, \Bigl\{\frac{4\, \mathrm{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathrm{E}_\pi(\rho_t^q)} + 2\beta d\Bigr\}$$
$$+ 6\beta^2 q\, \Bigl\{14 d\,(t - kh) + 32\, C_{\mathrm{LSI}} h\, \frac{\mathrm{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathrm{E}_\pi(\rho_t^q)}\Bigr\}\,.$$
From 𝐶_LSI, 𝛽 ≥ 1 and ℎ ≤ 1/(192 𝐶_LSI 𝛽²𝑞²), we can absorb some of the error terms into the decay
term and apply the LSI for 𝜋 (see Theorem 2.2.15), yielding
$$\partial_t \mathrm{R}_q(\mu_t \,\|\, \pi) \le -\frac{1}{q}\, \frac{\mathrm{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathrm{E}_\pi(\rho_t^q)} + O(\beta^3 d h^2 q + \beta^2 d h q) \le -\frac{1}{2q\, C_{\mathrm{LSI}}}\, \mathrm{R}_q(\mu_t \,\|\, \pi) + O(\beta^2 d h q)\,.$$
Hence,
$$\partial_t \Bigl[\exp\Bigl(\frac{t - kh}{2q\, C_{\mathrm{LSI}}}\Bigr)\, \mathrm{R}_q(\mu_t \,\|\, \pi)\Bigr] \lesssim \exp\Bigl(\frac{t - kh}{2q\, C_{\mathrm{LSI}}}\Bigr)\, \beta^2 d h q \lesssim \beta^2 d h q\,,$$
and upon integration,
$$\mathrm{R}_q(\mu_{(k+1)h} \,\|\, \pi) \le \exp\Bigl(-\frac{h}{2q\, C_{\mathrm{LSI}}}\Bigr)\, \mathrm{R}_q(\mu_{kh} \,\|\, \pi) + O(\beta^2 d h^2 q)\,.$$
We pause to reflect upon the proof. As discussed above, the key steps are to use
change of measure inequalities in order to relate expectations under P to expectations
under P. This is accomplished via the Fisher information duality lemma (Lemma 4.2.4)
and the Donsker–Varadhan variational principle (Theorem 1.5.4). These inequalities yield
an additional error term of the form FI(𝜓𝑡 𝜇𝑡 k 𝜋) (for the latter, this error term appears
after an application of the LSI). The magical part of the calculation is that FI(𝜓𝑡 𝜇𝑡 k 𝜋) is
precisely equal to the Rényi Fisher information (up to constants), and when the step size
ℎ is sufficiently small it can be absorbed into the decay term of the differential inequality
in Proposition 5.1.1.
The proof above implies an iteration complexity whose dependence on $q$ scales as $N = O(q^3)$. In order to improve the dependence on $q$, we modify the differential inequality of Proposition 5.1.1 by making the parameter $q$ time-dependent, similarly to the hypercontractivity principle (Exercise 2.6). The proof is left as Exercise 5.1.
The resulting differential inequality reads
$$\partial_t\Bigl\{\frac{1}{q(t)}\, \ln \int \rho_t^{q(t)}\, \mathrm d\pi\Bigr\} \le -\frac{2\,(q(t)-1)}{q(t)^2}\, \frac{\mathbb E_\pi[\|\nabla(\rho_t^{q(t)/2})\|^2]}{\mathbb E_\pi(\rho_t^{q(t)})} + (q(t)-1)\, \mathbb E[\psi_t(X_t)\, \|\nabla V(X_t) - \nabla V(X_{kh})\|^2]\,,$$
where $\rho_t := \frac{\mathrm d\mu_t}{\mathrm d\pi}$ and $\psi_t := \rho_t^{q(t)-1}/\mathbb E_\pi(\rho_t^{q(t)})$.
This leads to the one-step bound
$$\frac{1}{q((k+1)h)}\, \ln \int \rho_{(k+1)h}^{q((k+1)h)}\, \mathrm d\pi - \frac{1}{q(kh)}\, \ln \int \rho_{kh}^{q(kh)}\, \mathrm d\pi \lesssim \beta^2 d h^2 \bar q\,.$$
Summing over the first $N_0$ iterations,
$$\frac{1}{q(N_0 h)}\, \ln \int \rho_{N_0 h}^{q(N_0 h)}\, \mathrm d\pi - \frac{1}{2}\, \ln \int \rho_0^2\, \mathrm d\pi \lesssim \beta^2 d h^2 \bar q N_0 \le \widetilde O(C_{\mathrm{LSI}}\, \beta^2 d h \bar q)\,.$$
Finishing the convergence analysis. Next, after shifting time indices and applying
the previous proof of Theorem 5.1.2 with 𝑞 = 2,
$$R_{\bar q}(\mu_{(N+N_0)h} \parallel \pi) \le \frac{1}{q(N_0 h) - 1}\, \ln \int \rho_{(N+N_0)h}^{q(N_0 h)}\, \mathrm d\pi \le \frac34\, \ln \int \rho_{Nh}^2\, \mathrm d\pi + \widetilde O(C_{\mathrm{LSI}}\, \beta^2 d h \bar q)$$
$$\le \frac34\, \exp\Bigl(-\frac{Nh}{4\, C_{\mathrm{LSI}}}\Bigr)\, R_2(\mu_0 \parallel \pi) + \widetilde O(C_{\mathrm{LSI}}\, \beta^2 d h \bar q)\,.$$
The proof of Theorem 5.1.2 is rather specific to the LSI case because we use the LSI to
bound the KL term KL(𝜓𝑡 𝜇𝑡 k 𝜋) via the Rényi Fisher information, which is then absorbed
into the differential inequality of Proposition 5.1.1. However, it turns out that rather than
assuming an LSI for 𝜋, it suffices to have an LSI for 𝜇𝑡 for all 𝑡 ≥ 0 (possibly with an LSI
constant that grows with 𝑡). One situation in which this holds is when we initialize LMC
with a measure 𝜇0 that satisfies an LSI, and the potential 𝑉 is convex. Note that this
situation is not included in the case when 𝜋 satisfies an LSI, because 𝑉 may only have
linear growth at infinity (whereas from Theorem 2.4.8, if 𝜋 ∝ exp(−𝑉 ) satisfies an LSI,
then 𝑉 necessarily has quadratic growth at infinity). We explore this in Exercise 5.2.
Bibliographical Notes
Discretization of LMC in Rényi divergence was first considered in [VW19], which proved
convergence of LMC to its biased stationary distribution. However, this does not lead to a
sampling guarantee unless the size of the “Rényi bias” (the Rényi divergence between the
biased stationary distribution and the true target distribution) can be estimated.
Motivated by applications to differential privacy, [GT20] provided the first Rényi
sampling guarantees for LMC under strong log-concavity by using a technique based
on the adaptive composition lemma for Rényi divergences. Then, [EHZ22] improved
the analysis of [GT20] via a two-phase analysis that weakens the assumption of strong
log-concavity to a dissipativity assumption and obtains a sharper bound, but which
still relies on the adaptive composition lemma. Subsequently, building off the earlier
work of [Che+21b] (which essentially does a one-step discretization argument in Rényi
divergence), in [Che+21a] it was realized that the earlier arguments of [GT20; EHZ22]
can be streamlined by replacing the adaptive composition lemma entirely with Girsanov’s
theorem. The proofs of Theorem 5.1.2 and Exercise 5.2 are also from [Che+21a].
Exercises
Proof under LSI via Interpolation Argument
1. First, show that 𝜇𝑡 satisfies an LSI for all 𝑡 ≥ 0, and write down a bound for 𝐶 LSI (𝜇𝑡 )
(the bound should grow linearly with 𝑡).
Hint: See Section 2.3.
2. Follow the proof of Theorem 5.1.2 (the first proof, which incurs a suboptimal dependence on $q$). Note that in Theorem 5.1.2, we bounded $\operatorname{KL}(P \parallel \bar P) \le \frac{q-1}{q}\, \operatorname{KL}(\psi_t \mu_t \parallel \pi)$ and we applied the LSI for $\pi$. This time, use $\operatorname{KL}(P \parallel \bar P) = \operatorname{KL}(\psi_t \mu_t \parallel \mu_t)$ and apply the LSI for $\mu_t$ instead.
Also, instead of using the decay of the Rényi divergence under an LSI, use the decay of the Rényi divergence under a PI (Theorem 2.2.15). (Since $\pi$ is log-concave, it necessarily satisfies a Poincaré inequality with some constant $C_{\mathrm{PI}}$; see the Bibliographical Notes to Chapter 2.)
Prove that if $\varepsilon \le 1/q \wedge C_{\mathrm{PI}}\, d/\beta$, then with an appropriate choice of step size $h$,
We now move beyond the basic LMC algorithm and consider samplers with better depen-
dence on the dimension and inverse accuracy. There are two main sources of improvement
that we explore in this chapter. The first is to use a more sophisticated discretization
method than the basic Euler–Maruyama discretization. The second is to consider a differ-
ent stochastic process, called the underdamped Langevin diffusion. By combining these
two ideas, we arrive at the state-of-the-art complexity bounds for low-accuracy samplers.
In the Euler discretization, we approximate the second term via $-h\, \nabla V(Z_{kh})$. However, if we want an unbiased estimator of the integral, then we can introduce an auxiliary random variable $u_k \sim \mathrm{uniform}[0,1]$ and use
$$Z_{(k+1)h} \approx Z_{kh} - h\, \nabla V(Z_{(k+u_k)h}) + \sqrt 2\, (B_{(k+1)h} - B_{kh})\,.$$
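As a concrete illustration, one iteration of RM-LMC can be sketched in NumPy as follows. This is a minimal sketch under the conventions above (the function name `rm_lmc_step` and the argument names are ours): the midpoint is obtained by an Euler step to the random time $(k+u_k)h$, reusing the same Brownian increment.

```python
import numpy as np

def rm_lmc_step(z, grad_V, h, rng):
    """One iteration of randomized-midpoint LMC (RM-LMC).

    The midpoint approximates the iterate at the random time (k + u) h
    via a plain Euler step; the full step then uses the gradient
    evaluated at that midpoint, an unbiased estimate of the drift integral.
    """
    d = z.shape[0]
    u = rng.uniform()                       # u_k ~ uniform[0, 1]
    # Brownian increments over [kh, (k+u)h] and [(k+u)h, (k+1)h]
    b_mid = np.sqrt(u * h) * rng.standard_normal(d)
    b_rest = np.sqrt((1 - u) * h) * rng.standard_normal(d)
    # Euler approximation of the iterate at time (k + u) h
    z_mid = z - u * h * grad_V(z) + np.sqrt(2.0) * b_mid
    # full step using the midpoint gradient and the full Brownian increment
    return z - h * grad_V(z_mid) + np.sqrt(2.0) * (b_mid + b_rest)
```

For instance, taking `grad_V = lambda z: z` targets the standard Gaussian $\pi = \mathrm{normal}(0, I_d)$.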
172 CHAPTER 6. FASTER LOW-ACCURACY SAMPLERS
Theorem 6.1.1. For $k \in \mathbb N$, let $\mu_{kh}$ denote the law of the $k$-th iterate of RM-LMC with step size $h > 0$. Assume that the target $\pi \propto \exp(-V)$ satisfies $\nabla V(0) = 0$ and $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Then, provided $h \lesssim \frac{1}{\beta \kappa^{1/2}}$, for all $N \in \mathbb N$,
Proof. Recall from the proof of Theorem 4.1.1 that we started with a one-step discretization
bound, and then we derived a multi-step discretization bound. In particular, for the one-
step bound, we showed that if 𝜋ℎ is the law of the continuous-time Langevin diffusion
started at 𝜇0 , then
$$W_2^2(\mu_h, \pi_h) \lesssim \beta^4 h^4\, \mathbb E[\|Z_0\|^2] + \beta^2 d h^3\,. \qquad (6.1.2)$$
It turns out that (6.1.2) still holds for RM-LMC. Since the proof is almost the same as
before, we leave it as an exercise (Exercise 6.1).
The benefits of the randomized midpoint discretization enter once we consider the
multi-step discretization. As in Theorem 4.1.1, we let 𝑋𝑘ℎ ∼ 𝜇𝑘ℎ and 𝑍𝑘ℎ ∼ 𝜋 be optimally
coupled, and we let (𝑋¯𝑡 )𝑡 ∈[𝑘ℎ,(𝑘+1)ℎ] and (𝑍𝑡 )𝑡 ∈[𝑘ℎ,(𝑘+1)ℎ] denote continuous-time Langevin
diffusions initialized at 𝑋𝑘ℎ and 𝑍𝑘ℎ respectively; all of these processes are coupled by
using the same Brownian motion to drive them. We bound
$$W_2^2(\mu_{(k+1)h}, \pi) \le \mathbb E[\|X_{(k+1)h} - Z_{(k+1)h}\|^2] = \mathbb E[\|\bar X_{(k+1)h} - Z_{(k+1)h}\|^2] + \mathbb E[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] + 2\, \mathbb E\langle \bar X_{(k+1)h} - Z_{(k+1)h},\, X_{(k+1)h} - \bar X_{(k+1)h}\rangle\,.$$
Now observe that in the cross term, the only quantity that depends on the uniform random
variable 𝑢𝑘 is 𝑋 (𝑘+1)ℎ . In particular, if we let ℱ(𝑘+1)ℎ denote the 𝜎-algebra generated by
𝑋𝑘ℎ and (𝐵𝑡 )𝑡 ∈[𝑘ℎ,(𝑘+1)ℎ] , then for any 𝜆 > 0,
the error term was $O(\frac1h)\, \mathbb E[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] = O(h^2)$. In this calculation, we have split the error term into $\mathbb E[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] = O(h^3)$ as well as another error term, which has a factor of $O(\frac1h)$ but also has the expectation over the uniform random variable inside the norm. If we can show that the expectation over the uniform random variable makes the error of smaller order than before (the interpretation being that the randomized midpoint reduces the bias), then we obtain a smaller discretization error.
The one-step discretization bound (6.1.2) yields
Next,
$$\mathbb E[X_{(k+1)h} \mid \mathcal F_{(k+1)h}] = X_{kh} - h\, \mathbb E[\nabla V(X_{(k+u_k)h}) \mid \mathcal F_{(k+1)h}] + \sqrt 2\, (B_{(k+1)h} - B_{kh}) = X_{kh} - \int_{kh}^{(k+1)h} \nabla V(X_t)\, \mathrm dt + \sqrt 2\, (B_{(k+1)h} - B_{kh})\,.$$
Hence,
$$\mathbb E[\|\mathbb E[X_{(k+1)h} \mid \mathcal F_{(k+1)h}] - \bar X_{(k+1)h}\|^2] = \mathbb E\Bigl[\Bigl\|\int_{kh}^{(k+1)h} \{\nabla V(X_t) - \nabla V(\bar X_t)\}\, \mathrm dt\Bigr\|^2\Bigr] \le h \int_{kh}^{(k+1)h} \mathbb E[\|\nabla V(X_t) - \nabla V(\bar X_t)\|^2]\, \mathrm dt \le \beta^2 h \int_{kh}^{(k+1)h} \mathbb E[\|X_t - \bar X_t\|^2]\, \mathrm dt\,.$$
By definition, $X_t = X_{kh} - (t - kh)\, \nabla V(X_{kh}) + \sqrt 2\, (B_t - B_{kh})$, so
where the last inequality uses the movement bound for the Langevin process that we
proved in Lemma 4.1.5.
Next, recall that by contraction of the Langevin diffusion under strong log-concavity
(Theorem 1.4.10), $\mathbb E[\|\bar X_{(k+1)h} - Z_{(k+1)h}\|^2] \le \exp(-2\alpha h)\, W_2^2(\mu_{kh}, \pi)$. Choosing $\lambda = \frac{\alpha h}{2}$,
$$W_2^2(\mu_{(k+1)h}, \pi) \le \exp\Bigl(-\frac{3\alpha h}{2}\Bigr)\, W_2^2(\mu_{kh}, \pi) + O\Bigl(\beta^4 h^4\, \mathbb E[\|X_{kh}\|^2] + \beta^2 d h^3 + \frac{\beta^6 h^5}{\alpha}\, \mathbb E[\|X_{kh}\|^2] + \frac{\beta^4 d h^4}{\alpha}\Bigr)\,.$$
At this point, we could bound E[k𝑋𝑘ℎ k 2 ] recursively, similarly to Lemma 4.1.4, but instead
we will use a trick:
$$\mathbb E[\|X_{kh}\|^2] = W_2^2(\mu_{kh}, \delta_0) \lesssim W_2^2(\mu_{kh}, \pi) + W_2^2(\pi, \delta_0) \lesssim W_2^2(\mu_{kh}, \pi) + \frac{d}{\alpha}\,.$$
It implies that if we take $h \lesssim \frac{1}{\beta \kappa^{1/2}}$, then
$$W_2^2(\mu_{(k+1)h}, \pi) \le \exp(-\alpha h)\, W_2^2(\mu_{kh}, \pi) + O\Bigl(\beta^2 d h^3 + \frac{\beta^4 d h^4}{\alpha} + \frac{\beta^6 d h^5}{\alpha^2}\Bigr)\,.$$
From the analysis, it can be seen that the bottleneck term (at least for dimension de-
pendence) is not from the term involving E[𝑋 (𝑘+1)ℎ | ℱ(𝑘+1)ℎ ]. This justifies our earlier
comment that one step of the randomized midpoint procedure already suffices.
The complexity guarantee for RM-LMC is considerably better than that for LMC.
In fact, it is known that the randomized midpoint method is essentially an optimal
discretization method (which is not the same as saying that RM-LMC is an optimal
sampling algorithm); see [CLW21].
One notable downside of the randomized midpoint discretization is that the analysis
seems specific to the Wasserstein coupling approach. In particular, it is currently not
known how to obtain matching guarantees in KL divergence.
In the above result, we proved a slightly weaker complexity bound in order to streamline the proof. It is possible to improve the second term of $\kappa^{3/2} d^{1/4}/\varepsilon^{1/2}$ in the guarantee of Theorem 6.1.1; see Exercise 6.2.
Since both steps of each iteration preserve 𝝅, the entire algorithm preserves 𝝅. At
this stage, though, this algorithm is still idealized because it assumes the ability to exactly
integrate Hamilton’s equations. This may be possible for very special cases, but it is not
in general, and certainly not within our oracle model. Nevertheless, it is instructive to
first analyze the ideal algorithm.
Theorem 6.2.1 (ideal HMC, [CV19]). Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. For $k \in \mathbb N$, let $\mu_{kT}$ denote the law of the $k$-th iterate $X_{kT}$ of ideal HMC with integration time $T > 0$. Then, if we set $T = \frac{1}{2\sqrt\beta}$, we obtain
$$W_2^2(\mu_{NT}, \pi) \le \exp\Bigl(-\frac{N}{16\kappa}\Bigr)\, W_2^2(\mu_0, \pi)\,.$$
It is known that the convergence rate in this theorem is optimal, see [CV19]. We now
follow the proof, which is a purely deterministic analysis of Hamilton’s equations. First,
we need two lemmas.
Lemma 6.2.2 (a priori bound). Let $(x_t, p_t)_{t\ge0}$ and $(x_t', p_t')_{t\ge0}$ denote two solutions to Hamilton's equations of motion with $p_0 = p_0'$ and a potential $V$ satisfying $\nabla^2 V \preceq \beta I_d$. Then, for all $t \in [0, \frac{1}{2\sqrt\beta}]$, it holds that
$$\frac12\, \|x_0 - x_0'\|^2 \le \|x_t - x_t'\|^2 \le 2\, \|x_0 - x_0'\|^2\,.$$
Proof. First, note that $\partial_t \|p_t - p_t'\| \le \|\nabla V(x_t) - \nabla V(x_t')\| \le \beta\, \|x_t - x_t'\|$. It follows that $|\partial_t \|x_t - x_t'\|| \le \|p_t - p_t'\| \le \beta \int_0^t \|x_s - x_s'\|\, \mathrm ds$ and hence
$$\|x_t - x_t'\| \le \|x_0 - x_0'\| + \beta \int_0^t \int_0^s \|x_r - x_r'\|\, \mathrm dr\, \mathrm ds\,.$$
Applying an ODE comparison lemma, one may deduce that $\|x_t - x_t'\| \le \sqrt 2\, \|x_0 - x_0'\|$. The lower bound is similar.
Lemma 6.2.4 (contraction). In the setting of Lemma 6.2.2, assume additionally that $\nabla^2 V \succeq \alpha I_d$. Then, for all $t \in [0, \frac{1}{2\sqrt\beta}]$,
$$\|x_t - x_t'\|^2 \le \exp\Bigl(-\frac{\alpha t^2}{4}\Bigr)\, \|x_0 - x_0'\|^2\,.$$
Proof. We compute
$$\frac12\, \partial_t \|x_t - x_t'\|^2 = \langle x_t - x_t',\, p_t - p_t'\rangle\,,$$
$$\frac12\, \partial_t^2 \|x_t - x_t'\|^2 = \|p_t - p_t'\|^2 - \langle x_t - x_t',\, \nabla V(x_t) - \nabla V(x_t')\rangle = -\rho_t\, \|x_t - x_t'\|^2 + \|p_t - p_t'\|^2\,,$$
where we define
$$\rho_t := \frac{\langle \nabla V(x_t) - \nabla V(x_t'),\, x_t - x_t'\rangle}{\|x_t - x_t'\|^2}\,.$$
To bound $\|p_t - p_t'\|^2$, we use $|\partial_t \|p_t - p_t'\|| \le \|\nabla V(x_t) - \nabla V(x_t')\|$. Also, by the coercivity lemma (Lemma 6.2.3),
$$\|\nabla V(x_t) - \nabla V(x_t')\|^2 \le \beta\, \langle \nabla V(x_t) - \nabla V(x_t'),\, x_t - x_t'\rangle = \beta \rho_t\, \|x_t - x_t'\|^2 \le 2\beta \rho_t\, \|x_0 - x_0'\|^2\,,$$
where we used Lemma 6.2.2. Hence, by the Cauchy–Schwarz inequality,
$$\|p_t - p_t'\|^2 \le \Bigl(\int_0^t \partial_s \|p_s - p_s'\|\, \mathrm ds\Bigr)^2 \le \Bigl(\int_0^t \sqrt{2\beta \rho_s}\, \|x_0 - x_0'\|\, \mathrm ds\Bigr)^2 \le 2\beta t\, \|x_0 - x_0'\|^2 \int_0^t \rho_s\, \mathrm ds\,.$$
From this and Lemma 6.2.2, we deduce
$$\partial_t^2 \|x_t - x_t'\|^2 \le -\Bigl(\rho_t - 4\beta t \int_0^t \rho_s\, \mathrm ds\Bigr)\, \|x_0 - x_0'\|^2\,.$$
Integrating and using 𝜌 ≥ 0,
$$\partial_t \|x_t - x_t'\|^2 \le -\Bigl(\int_0^t \rho_s\, \mathrm ds - 4\beta \int_0^t s \int_0^s \rho_r\, \mathrm dr\, \mathrm ds\Bigr)\, \|x_0 - x_0'\|^2 \le -\Bigl(\int_0^t \rho_s\, \mathrm ds - 2\beta t^2 \int_0^t \rho_s\, \mathrm ds\Bigr)\, \|x_0 - x_0'\|^2$$
$$= -(1 - 2\beta t^2) \int_0^t \rho_s\, \mathrm ds\; \|x_0 - x_0'\|^2 \le -\frac{\alpha t}{2}\, \|x_0 - x_0'\|^2\,.$$
Integrating again then yields
$$\|x_t - x_t'\|^2 \le \|x_0 - x_0'\|^2 - \frac{\alpha t^2}{4}\, \|x_0 - x_0'\|^2 \le \exp\Bigl(-\frac{\alpha t^2}{4}\Bigr)\, \|x_0 - x_0'\|^2\,.$$
If we choose $T = \frac{1}{2\sqrt\beta}$, then the contraction factor is $\exp(-\frac{1}{16\kappa}) \le 1 - \frac{1}{32\kappa}$. In particular, let $P$ denote the transition kernel of one step of ideal HMC with integration time $T$. We have the $W_1$ contraction
$$W_1\bigl(P((x,p), \cdot\,),\, P((x',p'), \cdot\,)\bigr) \le \Bigl(1 - \frac{1}{32\kappa}\Bigr)\, \|(x,p) - (x',p')\|\,,$$
which, in the language of Section 2.7, says that the coarse Ricci curvature of $P$ is bounded below by $\frac{1}{32\kappa}$. Hence, by Theorem 2.7.5, we immediately obtain the following corollary.
Corollary 6.2.5. Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Let $P$ be the Markov kernel for ideal HMC with integration time $T = \frac{1}{2\sqrt\beta}$. Then, $P$ satisfies a Poincaré inequality with constant at most $32\kappa$, where $\kappa := \beta/\alpha$.
We conclude this section by observing that, since Hamilton’s equations are a deter-
ministic system of ODEs, we can approximately integrate them using any ODE solver;
unlike for the Langevin diffusion, there is no need to consider any SDE discretization here.
By following this approach, [CV19] also provide the following sampling guarantee.
Theorem 6.2.6 (unadjusted HMC, [CV19]). Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Then, there is a sampling algorithm based on a discretization of ideal HMC which outputs $\mu$ satisfying $\sqrt\alpha\, W_2(\mu, \pi) \le \varepsilon$ using
$$\widetilde O\Bigl(\frac{\kappa^{3/2} d^{1/2}}{\varepsilon}\Bigr) \text{ gradient queries}\,.$$
$$\mathrm dX_t = P_t\, \mathrm dt\,, \qquad \mathrm dP_t = -\nabla V(X_t)\, \mathrm dt - \gamma P_t\, \mathrm dt + \sqrt{2\gamma}\, \mathrm dB_t\,.$$
Here, $\gamma > 0$ is a parameter known as the friction parameter; as the name suggests, the physical interpretation is that the Hamiltonian system is damped by friction.
Unlike the Langevin diffusion, the underdamped Langevin diffusion is not a reversible
Markov process. Moreover, it is an example of hypocoercive dynamics, which means that
the Markov semigroup approach based on Poincaré and log-Sobolev inequalities fails,
necessitating the use of more sophisticated PDE analysis.
where $\boldsymbol\pi(x,p) \propto \exp(-H(x,p)) = \exp(-V(x) - \frac12\, \|p\|^2)$. However, taking advantage of the fact that
$$\operatorname{div}\Bigl(\boldsymbol\pi_t \begin{bmatrix} -\nabla_p \ln \boldsymbol\pi_t \\ \nabla_x \ln \boldsymbol\pi_t \end{bmatrix}\Bigr) = 0\,,$$
we have the more interpretable expression
$$\partial_t \boldsymbol\pi_t = \operatorname{div}\Bigl(\boldsymbol\pi_t\, \boldsymbol J_\gamma \begin{bmatrix} \nabla_x \ln(\boldsymbol\pi_t/\boldsymbol\pi) \\ \nabla_p \ln(\boldsymbol\pi_t/\boldsymbol\pi) \end{bmatrix}\Bigr)\,, \qquad \boldsymbol J_\gamma := \begin{bmatrix} 0 & -1 \\ 1 & \gamma \end{bmatrix}\,.$$
This shows that the underdamped Langevin diffusion should not be interpreted as a gradient flow of the KL divergence, but rather as a "damped Hamiltonian flow" for the KL divergence.
We begin with a contraction result for the continuous-time process based on [Che+18].
Theorem 6.3.2. Let $(X_t^0, P_t^0)_{t\ge0}$ and $(X_t^1, P_t^1)_{t\ge0}$ be two copies of the underdamped Langevin diffusion with friction parameter $\gamma = \sqrt{2\beta}$, driven by the same Brownian motion. Assume that the potential satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Then, defining the modified norm
$$|||(x,p)|||^2 := \Bigl\|x + \sqrt{\frac2\beta}\, p\Bigr\|^2 + \|x\|^2\,,$$
we have
$$|||(X_t^1, P_t^1) - (X_t^0, P_t^0)||| \le \exp\Bigl(-\frac{\alpha t}{\sqrt{2\beta}}\Bigr)\, |||(X_0^1, P_0^1) - (X_0^0, P_0^0)|||\,.$$
Proof. Write $\delta X_t := X_t^1 - X_t^0$ and $\delta P_t := P_t^1 - P_t^0$. Then, by Itô's formula (Theorem 1.1.18),
$$\mathrm d(\delta X_t + \eta\, \delta P_t) = \bigl[\delta P_t - \eta\, \{\nabla V(X_t^1) - \nabla V(X_t^0)\} - \gamma\eta\, \delta P_t\bigr]\, \mathrm dt = \Bigl[-(\gamma\eta - 1)\, \delta P_t - \eta \underbrace{\int_0^1 \nabla^2 V\bigl((1-s)\, X_t^0 + s\, X_t^1\bigr)\, \mathrm ds}_{=:\, H_t}\, \delta X_t\Bigr]\, \mathrm dt$$
$$= \Bigl[-\Bigl(\gamma - \frac1\eta\Bigr)\, (\delta X_t + \eta\, \delta P_t) + \Bigl(\gamma - \frac1\eta - \eta H_t\Bigr)\, \delta X_t\Bigr]\, \mathrm dt\,,$$
as well as
$$\mathrm d(\delta X_t) = \delta P_t\, \mathrm dt = \Bigl[\frac1\eta\, (\delta X_t + \eta\, \delta P_t) - \frac1\eta\, \delta X_t\Bigr]\, \mathrm dt\,,$$
so that
$$\frac12\, \partial_t\, \{\|\delta X_t + \eta\, \delta P_t\|^2 + \|\delta X_t\|^2\} = -\Bigl\langle \begin{bmatrix} \delta X_t + \eta\, \delta P_t \\ \delta X_t \end{bmatrix},\, \begin{bmatrix} \gamma - \frac1\eta & \frac12\, (\eta H_t - \gamma) \\ \frac12\, (\eta H_t - \gamma) & \frac1\eta \end{bmatrix} \begin{bmatrix} \delta X_t + \eta\, \delta P_t \\ \delta X_t \end{bmatrix}\Bigr\rangle\,.$$
We now check that if $\gamma = \frac2\eta$ and $\eta = \sqrt{\frac2\beta}$, then the eigenvalues of the matrix above are lower bounded by $\frac{\alpha\eta}{2} = \frac{\alpha}{\sqrt{2\beta}}$.
We check that the new norm we defined is equivalent to the Euclidean norm, which together with Theorem 6.3.2 yields
$$\|X_t^1 - X_t^0\|^2 + \frac2\beta\, \|P_t^1 - P_t^0\|^2 \le 9 \exp\Bigl(-\frac{\sqrt 2\, \alpha t}{\sqrt\beta}\Bigr)\, \Bigl\{\|X_0^1 - X_0^0\|^2 + \frac1\beta\, \|P_0^1 - P_0^0\|^2\Bigr\}\,.$$
$$\mathrm dX_t = P_t\, \mathrm dt\,, \qquad \mathrm dP_t = -\nabla V(X_{kh})\, \mathrm dt - \gamma P_t\, \mathrm dt + \sqrt{2\gamma}\, \mathrm dB_t\,, \qquad \text{for } t \in [kh, (k+1)h]\,. \tag{ULMC}$$
Then, the solution to the SDE is given explicitly in the following lemma.
Lemma 6.3.4. Conditioned on $(X_{kh}, P_{kh})$, the law of $(X_{(k+1)h}, P_{(k+1)h})$ is explicitly given as $\mathrm{normal}(M_{(k+1)h}, \Sigma)$, where
$$M_{(k+1)h} = \begin{bmatrix} X_{kh} + \frac{1 - e^{-\gamma h}}{\gamma}\, P_{kh} - \frac1\gamma\, \bigl(h - \frac{1 - e^{-\gamma h}}{\gamma}\bigr)\, \nabla V(X_{kh}) \\ e^{-\gamma h}\, P_{kh} - \frac{1 - e^{-\gamma h}}{\gamma}\, \nabla V(X_{kh}) \end{bmatrix}$$
and
$$\Sigma = \begin{bmatrix} \frac2\gamma\, \{h - \frac2\gamma\, (1 - e^{-\gamma h}) + \frac{1}{2\gamma}\, (1 - e^{-2\gamma h})\} & * \\ \frac1\gamma\, \{1 - 2 e^{-\gamma h} + e^{-2\gamma h}\} & 1 - e^{-2\gamma h} \end{bmatrix} \otimes I_d\,,$$
where $*$ denotes the symmetric entry.
The lemma is an exercise in stochastic calculus (Exercise 6.7). The point is that the
discretization given as ULMC is implementable.
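To make this concrete, one implementable ULMC step may be sketched as follows. This is a minimal NumPy sketch (the function name is ours); the mean and covariance formulas below are our own integration of the linear SDE and should be checked against Lemma 6.3.4 and Exercise 6.7.

```python
import numpy as np

def ulmc_step(x, p, grad_V, gamma, h, rng):
    """One exact-in-law ULMC update: sample the Gaussian transition
    of the piecewise-linear SDE with frozen gradient grad_V(x)."""
    g = grad_V(x)
    e1, e2 = np.exp(-gamma * h), np.exp(-2 * gamma * h)
    # conditional mean of (X, P)
    mx = x + (1 - e1) / gamma * p - (h - (1 - e1) / gamma) / gamma * g
    mp = e1 * p - (1 - e1) / gamma * g
    # per-coordinate 2x2 covariance (the matrix Sigma without the I_d factor)
    sxx = (2 / gamma) * (h - 2 / gamma * (1 - e1) + (1 - e2) / (2 * gamma))
    sxp = (1 - 2 * e1 + e2) / gamma
    spp = 1 - e2
    # correlated Gaussian noise via a 2x2 Cholesky factor
    a = np.sqrt(sxx)
    b = sxp / a
    c = np.sqrt(max(spp - b ** 2, 0.0))
    z1 = rng.standard_normal(x.shape[0])
    z2 = rng.standard_normal(x.shape[0])
    return mx + a * z1, mp + b * z1 + c * z2
```

Since the covariance factorizes as a $2 \times 2$ matrix tensorized with $I_d$, the noise can be generated coordinatewise from a single small Cholesky factor.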
We now proceed to a discretization analysis based on [Che+18].
Theorem 6.3.5. For 𝑘 ∈ N, let 𝝁 𝑘ℎ denote the law of the 𝑘-th iterate of ULMC with
appropriately tuned step size ℎ > 0 and friction parameter 𝛾 > 0. Also, let 𝜇𝑘ℎ denote
the law of $X_{kh}$. Assume that the target $\pi \propto \exp(-V)$ satisfies $\nabla V(0) = 0$ and $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Then, we obtain the guarantee $\sqrt\alpha\, W_2(\mu_{Nh}, \pi) \le \varepsilon$ after
$$N = \widetilde O\Bigl(\frac{\kappa^2 d^{1/2}}{\varepsilon}\Bigr) \text{ iterations}\,.$$
Remark 6.3.6. This result is not the best possible. Indeed, a refined analysis by [DR20] obtains the iteration complexity $\widetilde O(\kappa^{3/2} d^{1/2}/\varepsilon)$ by improving the continuous-time contraction result of Theorem 6.3.2.
Proof. One-step discretization bound. As in Theorem 4.1.1, we start with a one-step
bound. Let (𝑋𝑡 , 𝑃𝑡 )𝑡 ≥0 denote ULMC and let (𝑋¯𝑡 , 𝑃¯𝑡 )𝑡 ≥0 denote the continuous-time under-
damped Langevin diffusion, both driven by the same Brownian motion and started at the
same random pair. We want to bound the distance E[|||(𝑋ℎ , 𝑃ℎ ) − (𝑋¯ℎ , 𝑃¯ℎ )||| 2 ]. According
to Lemma 6.3.3, it suffices to bound E[k𝑋ℎ − 𝑋¯ℎ k 2 ] and E[k𝑃ℎ − 𝑃¯ℎ k 2 ] separately.
First,
$$\mathbb E[\|X_h - \bar X_h\|^2] = \mathbb E\Bigl[\Bigl\|\int_0^h \{P_t - \bar P_t\}\, \mathrm dt\Bigr\|^2\Bigr] \le h \int_0^h \mathbb E[\|P_t - \bar P_t\|^2]\, \mathrm dt\,.$$
Next,
$$\mathbb E[\|P_t - \bar P_t\|^2] = \mathbb E\Bigl[\Bigl\|\int_0^t \{\nabla V(\bar X_s) - \nabla V(\bar X_0) - \gamma\, (P_s - \bar P_s)\}\, \mathrm ds\Bigr\|^2\Bigr] \lesssim h \int_0^t \bigl(\mathbb E[\|\nabla V(\bar X_s) - \nabla V(\bar X_0)\|^2] + \gamma^2\, \mathbb E[\|P_s - \bar P_s\|^2]\bigr)\, \mathrm ds\,.$$
By Grönwall's inequality, if $h \le \frac1\gamma = \frac{1}{\sqrt{2\beta}}$, then
$$\mathbb E[\|P_t - \bar P_t\|^2] \lesssim h \int_0^t \mathbb E[\|\nabla V(\bar X_s) - \nabla V(\bar X_0)\|^2]\, \mathrm ds \le \beta^2 h \int_0^t \mathbb E[\|\bar X_s - \bar X_0\|^2]\, \mathrm ds\,.$$
Again, we need a movement bound for the underdamped Langevin diffusion, which is done in Lemma 6.3.7. Substituting this in and assuming $h \lesssim \frac{1}{\sqrt\beta}$,
$$\mathbb E[|||(X_h, P_h) - (\bar X_h, \bar P_h)|||^2] \lesssim \mathbb E[\|X_h - \bar X_h\|^2] + \frac1\beta\, \mathbb E[\|P_h - \bar P_h\|^2] \lesssim \beta^2 h^4\, \mathbb E[|||(X_0, P_0)|||^2] + \beta^{3/2} d h^5\,.$$
Multi-step discretization bound. Let $\mathcal W^2$ denote the coupling cost for the norm $|||{\cdot}|||^2$. Let $\hat{\boldsymbol\mu}_{(k+1)h}$ denote the law of the continuous-time underdamped Langevin diffusion started at $\boldsymbol\mu_{kh}$. Then, from Theorem 6.3.2 and the one-step discretization bound,
$$\mathcal W(\boldsymbol\mu_{(k+1)h}, \boldsymbol\pi) \le \mathcal W(\hat{\boldsymbol\mu}_{(k+1)h}, \boldsymbol\pi) + \mathcal W(\boldsymbol\mu_{(k+1)h}, \hat{\boldsymbol\mu}_{(k+1)h}) \le \exp\Bigl(-\frac{\alpha h}{\sqrt{2\beta}}\Bigr)\, \mathcal W(\boldsymbol\mu_{kh}, \boldsymbol\pi) + O\bigl(\beta h^2\, \mathcal W(\boldsymbol\mu_{kh}, \boldsymbol\delta_0) + \beta^{3/4} d^{1/2} h^{5/2}\bigr)\,.$$
Also, $\mathcal W(\boldsymbol\mu_{kh}, \boldsymbol\delta_0) \le \mathcal W(\boldsymbol\mu_{kh}, \boldsymbol\pi) + \mathcal W(\boldsymbol\pi, \boldsymbol\delta_0) \lesssim \mathcal W(\boldsymbol\mu_{kh}, \boldsymbol\pi) + \sqrt{d/\alpha}$, where we used the moment bound in Lemma 4.1.4. If $h \lesssim \frac{1}{\beta^{1/2} \kappa}$, then we can absorb the $\mathcal W(\boldsymbol\mu_{kh}, \boldsymbol\pi)$ term into the contraction rate and deduce
$$\mathcal W(\boldsymbol\mu_{(k+1)h}, \boldsymbol\pi) \le \exp\Bigl(-\frac{\alpha h}{2\sqrt\beta}\Bigr)\, \mathcal W(\boldsymbol\mu_{kh}, \boldsymbol\pi) + O\Bigl(\frac{\beta d^{1/2} h^2}{\alpha^{1/2}} + \beta^{3/4} d^{1/2} h^{5/2}\Bigr)\,.$$
Iterating,
$$\mathcal W(\boldsymbol\mu_{Nh}, \boldsymbol\pi) \le \exp\Bigl(-\frac{\alpha N h}{2\sqrt\beta}\Bigr)\, \mathcal W(\boldsymbol\mu_0, \boldsymbol\pi) + O\Bigl(\frac{\beta^{3/2} d^{1/2} h}{\alpha^{3/2}} + \frac{\beta^{5/4} d^{1/2} h^{3/2}}{\alpha}\Bigr)\,.$$
Choosing the step size appropriately yields the result.
The next lemma provides the movement bound (Exercise 6.8).
Lemma 6.3.7. Let $(\bar X_t, \bar P_t)_{t\ge0}$ denote the underdamped Langevin diffusion with potential $V$ satisfying $\nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. If $t \le \frac1\gamma \wedge \frac{1}{\sqrt\beta}$, then
Theorem 6.3.8. Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Then, the randomized midpoint discretization of the underdamped Langevin diffusion outputs $\mu$ such that $\sqrt\alpha\, W_2(\mu, \pi) \le \varepsilon$ using
$$\widetilde O\Bigl(\frac{\kappa d^{1/3}}{\varepsilon^{2/3}}\, \Bigl(1 \vee \frac{\varepsilon^{1/3} \kappa^{1/6}}{d^{1/6}}\Bigr)\Bigr) \text{ gradient queries}\,.$$
With respect to the dimension dependence, this is the current state-of-the-art guaran-
tee for sampling from strongly log-concave distributions.
TODO: Provide a proof.
Bibliographical Notes
In the analysis of the randomized midpoint discretization of the Langevin diffusion
(Theorem 6.1.1), we have simplified the original proof of [HBE20] at the cost of proving a
slightly weaker result. The sharper argument of [HBE20] is outlined in Exercise 6.2.
Besides the randomized midpoint method, the shifted ODE discretization [FLO21] also achieves a state-of-the-art iteration complexity of $\widetilde O(d^{1/3}/\varepsilon^{2/3})$ when applied to the underdamped Langevin diffusion.
HMC and its variants are some of the most popular algorithms employed in practice,
especially the no-U-turn sampler (NUTS) [HG14] which adaptively sets the integration
time. In terms of complexity analysis, the paper [CV19] (whose proof we followed
in Theorem 6.2.1) provided the tight analysis of ideal HMC. Other complexity results
obtained for HMC under various assumptions include [MV18; MS19; BEZ20].
The underdamped Langevin diffusion has been studied quantitatively in [Che+18;
EGZ19; DR20; Ma+21].
TODO: Literature on hypocoercivity.
Exercises
Randomized Midpoint Discretization
The main idea behind the improved rate is that throughout the proof of Theorem 6.1.1, we
used the inequality
$$\mathbb E[\|\nabla V(X_{kh})\|^2] \le \beta^2\, \mathbb E[\|X_{kh}\|^2]\,, \qquad (6.\mathrm E.3)$$
which is wasteful. Instead, we will show that
$$\frac1N \sum_{k=0}^{N-1} \mathbb E[\|\nabla V(X_{kh})\|^2] \lesssim \beta d\,. \qquad (6.\mathrm E.4)$$
Note that since we expect $\mathbb E[\|X_{kh}\|^2] \asymp d/\alpha$, the new bound (6.E.4) is an improvement by a factor of $\kappa$.
2. Rewrite the proof of Theorem 6.1.1, avoiding the use of the inequality (6.E.3), and leaving the error bound in terms of $\sum_{k=0}^{N-1} \mathbb E[\|\nabla V(X_{kh})\|^2]$. As a sanity check, the result should imply the rate (6.E.2) once the key inequality (6.E.4) is proven.
3. Applying Itô’s formula (Theorem 1.1.18) to 𝑉 (𝑋¯𝑡 ), write down an expression for
E[𝑉 (𝑋¯ (𝑘+1)ℎ ) − 𝑉 (𝑋𝑘ℎ )]. By bounding the error terms carefully and assuming that
ℎ . 𝛽1 , prove that
$$\{f, H\} := \langle \nabla_x f, \nabla_p H\rangle - \langle \nabla_p f, \nabla_x H\rangle = 0\,,$$
2. (conservation of volume) By differentiating $t \mapsto \det \nabla F_t(x,p)$ and using the flow map equation $\partial_t F_t(x,p) = \boldsymbol J\, \nabla H(F_t(x,p))$, prove that $\det \nabla F_t(x,p) = 1$ for all $t \ge 0$ and $x, p \in \mathbb R^d$. This shows that $F_t : \mathbb R^{2d} \to \mathbb R^{2d}$ is a volume-preserving map.
3. (time reversibility) Suppose that $(x_t, p_t)_{t\in[0,T]}$ solves Hamilton's equations. Show that $(x_{T-t}, -p_{T-t})_{t\in[0,T]}$ also solves Hamilton's equations. In other words, if $R$ is the momentum reversal operator, i.e.,
$$R = \begin{bmatrix} I_d & 0 \\ 0 & -I_d \end{bmatrix}\,,$$
then $F_T^{-1} = R \circ F_T \circ R$.
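These three properties can be checked concretely for the harmonic potential $V(x) = \frac12\, x^2$, whose Hamiltonian flow is an explicit rotation of phase space (a small numerical verification of our own, not part of the exercises):

```python
import numpy as np

def flow_harmonic(x, p, t):
    # Exact Hamiltonian flow for H(x, p) = (x^2 + p^2)/2:
    # a rotation of phase space.
    return x * np.cos(t) + p * np.sin(t), p * np.cos(t) - x * np.sin(t)

rng = np.random.default_rng(0)
x0, p0 = rng.standard_normal(5), rng.standard_normal(5)
T = 0.7

# time reversibility: R o F_T o R o F_T = identity
x1, p1 = flow_harmonic(x0, p0, T)      # F_T
x2, p2 = flow_harmonic(x1, -p1, T)     # R, then F_T
x3, p3 = x2, -p2                       # final R
assert np.allclose(x3, x0) and np.allclose(p3, p0)

# conservation of volume: the Jacobian of F_T is a rotation, det = 1
J = np.array([[np.cos(T), np.sin(T)], [-np.sin(T), np.cos(T)]])
assert np.isclose(np.linalg.det(J), 1.0)

# conservation of energy along the flow
assert np.allclose(x1**2 + p1**2, x0**2 + p0**2)
```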
$$\mathscr L_{\mathrm{OU}} f := \Delta_p f - \langle p, \nabla_p f\rangle\,.$$
Then, check the various calculations leading up to the Fokker–Planck equation (6.3.1).
High-Accuracy Samplers
192 CHAPTER 7. HIGH-ACCURACY SAMPLERS
Unlike the other sampling algorithms we have considered thus far, rejection sampling
always terminates with an exact sample from 𝜋.
Theorem 7.1.1. The output of rejection sampling is a sample drawn exactly from 𝜋.
Also, the number of samples drawn from $\mu$ until a sample is accepted follows a geometric distribution with mean $Z_\mu/Z_\pi$, where $Z_\mu := \int \tilde\mu$ and $Z_\pi := \int \tilde\pi$.
Proof. The probability that any given proposal is accepted is $\int (\tilde\pi/\tilde\mu)\, \mathrm d\mu = Z_\pi/Z_\mu$, and clearly the number of samples drawn until acceptance is geometrically distributed.
To show that the output $X$ of rejection sampling is drawn exactly according to $\pi$, let $(U_i)_{i=1}^\infty \overset{\mathrm{i.i.d.}}{\sim} \mathrm{uniform}[0,1]$ and $(X_i)_{i=1}^\infty \overset{\mathrm{i.i.d.}}{\sim} \mu$ be independent. Then, for any event $A$,
$$\mathbb P(X \in A) = \sum_{n=0}^\infty \mathbb P\Bigl(X_{n+1} \in A,\; U_i > \frac{\tilde\pi(X_i)}{\tilde\mu(X_i)} \;\forall i \in [n],\; U_{n+1} \le \frac{\tilde\pi(X_{n+1})}{\tilde\mu(X_{n+1})}\Bigr) = \sum_{n=0}^\infty \mathbb P\Bigl(X_{n+1} \in A,\; U_{n+1} \le \frac{\tilde\pi(X_{n+1})}{\tilde\mu(X_{n+1})}\Bigr)\, \mathbb P\Bigl(U_1 > \frac{\tilde\pi(X_1)}{\tilde\mu(X_1)}\Bigr)^n$$
$$= \sum_{n=0}^\infty \int_A \frac{\tilde\pi}{\tilde\mu}\, \mathrm d\mu\, \Bigl(1 - \int \frac{\tilde\pi}{\tilde\mu}\, \mathrm d\mu\Bigr)^n = \frac{Z_\pi}{Z_\mu}\, \pi(A) \sum_{n=0}^\infty \Bigl(1 - \frac{Z_\pi}{Z_\mu}\Bigr)^n = \pi(A)\,.$$
The rejection sampling algorithm requires the construction of the upper envelope $\tilde\mu$. We now demonstrate how to construct this envelope for our usual class of distributions. Namely, suppose that $\pi \propto \exp(-V)$ satisfies $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$, and that $\nabla V(0) = 0$. We can assume that our unnormalized version $\tilde\pi = \exp(-V)$ of $\pi$ satisfies $V(0) = 0$ (if not, replace $V$ by $V - V(0)$). Then, by strong convexity of $V$, we see that $\tilde\pi \le \exp(-\frac\alpha2\, \|{\cdot}\|^2)$, and we can take $\tilde\mu := \exp(-\frac\alpha2\, \|{\cdot}\|^2)$, which means that the normalized distribution is $\mu = \mathrm{normal}(0, \alpha^{-1} I_d)$. To understand the efficiency of rejection sampling, we need to bound the ratio $Z_\mu/Z_\pi$ of normalizing constants. By smoothness of $V$,
Proposition 7.1.2. Let the target $\pi \propto \exp(-V)$ on $\mathbb R^d$ satisfy $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$, $V(0) = 0$, and $\nabla V(0) = 0$. Then, rejection sampling with the envelope $\tilde\mu := \exp(-\frac\alpha2\, \|{\cdot}\|^2)$ returns an exact sample from $\pi$ with a number of iterations that is a geometric random variable with mean at most $\kappa^{d/2}$, where $\kappa := \beta/\alpha$.
The rejection sampling guarantee can be formulated in one of two ways. We can think
of the algorithm as returning an exact sample from 𝜋, with a random number of iterations
(the number of iterations is geometrically distributed). Alternatively, if we place an upper bound $N$ on the number of iterations of the algorithm and output "FAIL" if we have not terminated by iteration $N$, then the probability of "FAIL" is at most $\varepsilon := (1 - 1/\kappa^{d/2})^N$, and if $\mu_N$ denotes the law of the output of the algorithm, then $\|\mu_N - \pi\|_{\mathrm{TV}} \le \varepsilon$. If we flip this around and fix the target accuracy $\varepsilon$, we see that the number of iterations required to achieve this guarantee is $N \ge \kappa^{d/2} \ln(1/\varepsilon)$.
Although this result is acceptable in low dimension, the complexity of this approach
quickly becomes intractable even for moderately high-dimensional problems. In the next
section, we will see that by combining the idea of rejection with local proposals, we can
obtain tractable sampling algorithms in high dimension.
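The procedure of Proposition 7.1.2 can be sketched as follows (a minimal sketch with our own function names; `V` is assumed to satisfy the normalizations $V(0) = 0$ and $\nabla V(0) = 0$, so that the acceptance ratio is at most one):

```python
import numpy as np

def rejection_sample(V, alpha, d, rng, max_iters=100000):
    """Rejection sampling for pi ~ exp(-V) with alpha I <= Hess V,
    using the Gaussian envelope mu_tilde = exp(-alpha |x|^2 / 2)."""
    for _ in range(max_iters):
        x = rng.standard_normal(d) / np.sqrt(alpha)  # proposal ~ N(0, I/alpha)
        u = rng.uniform()
        # accept with probability pi_tilde(x) / mu_tilde(x) <= 1,
        # since strong convexity gives V(x) >= alpha |x|^2 / 2
        if u <= np.exp(-V(x) + 0.5 * alpha * x @ x):
            return x
    raise RuntimeError("FAIL: no acceptance within max_iters")
```

In line with the discussion above, the expected number of loop iterations grows like $\kappa^{d/2}$, so this is only practical in low dimension.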
$$A(x,y) := 1 \wedge \frac{\pi(y)\, Q(y,x)}{\pi(x)\, Q(x,y)}\,. \qquad (7.2.1)$$
Proof. We want to check that $\pi(x)\, T(x,y) = \pi(y)\, T(y,x)$ for all $x, y \in \mathbb R^d$ with $x \ne y$. We can write
$$\pi(x)\, T(x,y) = \pi(x)\, Q(x,y)\, A(x,y) = \pi(x)\, Q(x,y) \min\Bigl\{1,\, \frac{\pi(y)\, Q(y,x)}{\pi(x)\, Q(x,y)}\Bigr\} = \min\{\pi(x)\, Q(x,y),\, \pi(y)\, Q(y,x)\}\,,$$
which is symmetric in $x$ and $y$.
Theorem 7.2.5 ([BD01]). Let ℛ(𝜋) denote the space of kernels 𝑇 which are reversible
with respect to 𝜋, and such that for each 𝑥 ∈ R𝑑 , 𝑇 (𝑥, ·) admits a density with respect to
Lebesgue measure (except possibly having an atom at 𝑥). Then,
Proof. Let $T \in \mathscr R(\pi)$, and let $S := \{(x,y) \in \mathbb R^d \times \mathbb R^d \mid \pi(x)\, Q(x,y) > \pi(y)\, Q(y,x)\}$. Then,
$$\mathrm d(Q,T) = \int_{(\mathbb R^d\times\mathbb R^d)\setminus\mathrm{diag}} |Q(x,y) - T(x,y)|\, \pi(\mathrm dx)\, \mathrm dy = \int_S |Q(x,y) - T(x,y)|\, \pi(\mathrm dx)\, \mathrm dy + \int_{(\mathbb R^d\times\mathbb R^d)\setminus(S\cup\mathrm{diag})} |Q(x,y) - T(x,y)|\, \pi(\mathrm dx)\, \mathrm dy$$
$$= \int_S |Q(x,y) - T(x,y)|\, \pi(\mathrm dx)\, \mathrm dy + \int_S |Q(y,x) - T(y,x)|\, \pi(\mathrm dy)\, \mathrm dx\,.$$
Putting this together, $\mathrm d(Q,T) \ge \int_S |\pi(x)\, Q(x,y) - \pi(y)\, Q(y,x)|\, \mathrm dx\, \mathrm dy$, so we have obtained a lower bound which does not depend on $T$. On the other hand, we can check that the only inequality (7.2.6) that we used is an equality for $T = T(Q)$.
which is simply one step of the LMC algorithm; this yields the Metropolis-adjusted
Langevin algorithm (MALA). We will carefully study the convergence guarantees for
MALA in this chapter.
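One MALA iteration can be sketched as follows (a minimal NumPy sketch with our own function names; the proposal is exactly one LMC step, and the filter uses the acceptance probability (7.2.1)):

```python
import numpy as np

def mala_step(x, V, grad_V, h, rng):
    """One MALA step: an LMC proposal corrected by a Metropolis-Hastings
    filter so that pi ~ exp(-V) is exactly invariant."""
    d = x.shape[0]
    # LMC proposal: y ~ N(x - h grad V(x), 2h I)
    y = x - h * grad_V(x) + np.sqrt(2 * h) * rng.standard_normal(d)

    def log_q(a, b):
        # log proposal density of moving from a to b (up to a constant)
        diff = b - a + h * grad_V(a)
        return -(diff @ diff) / (4 * h)

    log_accept = (V(x) - V(y)) + log_q(y, x) - log_q(x, y)
    if np.log(rng.uniform()) <= min(0.0, log_accept):
        return y
    return x
```

Note that only unnormalized log-densities are needed, consistent with the oracle model.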
Leapfrog Integrator: Pick a step size ℎ > 0 and a total number of iterations 𝐾,
corresponding to the total integration time via 𝑇 = 𝐾ℎ. Let (𝑥 0, 𝑝 0 ) be the initial point.
For 𝑘 = 0, 1, 2, . . . , 𝐾 − 1:
Remarkably, the acceptance probability can be computed in closed form, and this relies on
specific properties of the leapfrog integrator. The full algorithm is summarized as follows.
Metropolized Hamiltonian Monte Carlo (MHMC): Initialize at 𝑋 0 ∼ 𝜇0 . For
iterations 𝑘 = 0, 1, 2, . . . :
3. Accept the trajectory with probability $1 \wedge \exp\{H(X_k, P_k) - H(X_k', P_k')\}$. If the trajectory is accepted, set $X_{k+1} := X_k'$; otherwise, we set $X_{k+1} := X_k$.
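The steps above can be sketched as follows (a minimal NumPy sketch with our own function names; the `leapfrog` routine uses the standard half-kick/drift/kick scheme for $H(x,p) = V(x) + \frac12\|p\|^2$):

```python
import numpy as np

def leapfrog(x, p, grad_V, h, K):
    """K leapfrog steps of size h: reversible up to a momentum flip."""
    p = p - 0.5 * h * grad_V(x)
    x = x + h * p
    for _ in range(K - 1):
        p = p - h * grad_V(x)
        x = x + h * p
    p = p - 0.5 * h * grad_V(x)
    return x, p

def mhmc_step(x, V, grad_V, h, K, rng):
    """One MHMC iteration: refresh momentum, integrate with leapfrog,
    accept with probability 1 ^ exp(H_old - H_new)."""
    p = rng.standard_normal(x.shape[0])       # momentum refreshment
    x_new, p_new = leapfrog(x, p, grad_V, h, K)
    H_old = V(x) + 0.5 * p @ p
    H_new = V(x_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) <= min(0.0, H_old - H_new):
        return x_new
    return x
```

One can check numerically that running `leapfrog` forward, flipping the momentum, and running it forward again returns the initial point with flipped momentum, which is the reversibility property used below.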
It turns out that when 𝐾 = 1, the MHMC algorithm reduces to MALA (Exercise 7.1).
We next justify why the MHMC algorithm leaves 𝝅 invariant. Actually, although we
have written down the MHMC algorithm in the form which is easiest to implement, it
obscures the underlying structure of the algorithm. The proof of the next theorem will
clarify this point.
Theorem 7.3.1. The augmented target distribution 𝝅 ∝ exp(−𝐻 ) is invariant for the
MHMC algorithm.
Proof. First, we note that the step of refreshing the momentum leaves 𝝅 invariant, so it
suffices to study the proposal and acceptance steps.
For the moment, let us pretend that the proposal actually uses the true flow map
𝐹𝑇 (which exactly integrates Hamilton’s equations for time 𝑇 ) rather than the leapfrog
integrator 𝐹 leap . Then, the proposal kernel is deterministic, 𝑄 ((𝑥, 𝑝), ·) = 𝛿 𝐹𝑇 (𝑥,𝑝) . Up until
now, we have been assuming that the proposal kernel admits a density w.r.t. Lebesgue
measure, which certainly does not hold here, but we will brush over this technicality as it
is not the key point here.
If we naı̈vely apply the Metropolis–Hastings filter, the probability of accepting a
proposal (𝑥 0, 𝑝 0) starting from (𝑥, 𝑝) involves a ratio 𝑄 ((𝑥 0, 𝑝 0), (𝑥, 𝑝))/𝑄 ((𝑥, 𝑝), (𝑥 0, 𝑝 0)),
but this ratio is ill-defined in our setting. The problem is that if (𝑥 0, 𝑝 0) = 𝐹𝑇 (𝑥, 𝑝), then it
is not the case that (𝑥, 𝑝) = 𝐹𝑇 (𝑥 0, 𝑝 0); the proposal is not reversible. Hence, we would be
led to reject every single trajectory.
To fix this, recall from Exercise 6.4 that we have the following time reversibility
property: if $R$ denotes the momentum flip operator $(x,p) \mapsto (x,-p)$, then it holds that $F_T^{-1} = R \circ F_T \circ R$. It implies that $F_T \circ R = R \circ F_T^{-1} = (F_T \circ R)^{-1}$, so $F_T \circ R$ is an involution. In
other words, if we use the proposal 𝐹𝑇 ◦ 𝑅 (i.e., first flip the momentum before integrating
Hamilton’s equations), then the proposal would be reversible and the above issue does not
arise, as the ratio 𝑄 ((𝑥 0, 𝑝 0), (𝑥, 𝑝))/𝑄 ((𝑥, 𝑝), (𝑥 0, 𝑝 0)) would equal 1. Observe also that
using 𝐹𝑇 ◦ 𝑅 instead of 𝐹𝑇 does not change the algorithm since we refresh the momentum
at each step (and if 𝑃𝑘 ∼ normal(0, 𝐼𝑑 ), then −𝑃𝑘 ∼ normal(0, 𝐼𝑑 ) as well).
Once we use the proposal (𝑥 0, 𝑝 0) = (𝐹𝑇 ◦ 𝑅)(𝑥, 𝑝) = 𝐹𝑇 (𝑥, −𝑝), the Metropolis–
Hastings acceptance probability is calculated to be
$$1 \wedge \frac{\boldsymbol\pi(x', p')}{\boldsymbol\pi(x, p)} = 1 \wedge \exp\{H(x,p) - H(x',p')\}\,. \qquad (7.3.2)$$
When we use the exact flow map 𝐹𝑇 , then the Hamiltonian is conserved (Exercise 6.4) so
the above probability is one; every trajectory is accepted. However, the above expression
is indeed meaningful if we instead use the leapfrog integrator 𝐹 leap .
So far, we have motivated the expression (7.3.2) based on the exact flow map 𝐹𝑇 , but
clearly the above argument holds just as well for the leapfrog integrator $F_{\mathrm{leap}}$ as soon as we verify the property $F_{\mathrm{leap}}^{-1} = R \circ F_{\mathrm{leap}} \circ R$, and this is where we use the specific form of the leapfrog integrator. We leave the verification as Exercise 7.2.
Remark 7.3.3. The proof shows that the proposal of MHMC should really be thought of
as $F_{\mathrm{leap}} \circ R$, instead of $F_{\mathrm{leap}}$. In fact, if we did not refresh the momentum, then repeatedly applying the involution $F_{\mathrm{leap}} \circ R$ would just cause the algorithm to jump back and forth between two points $(x,p)$ and $(x',p')$, which is silly; hence one should also
apply another momentum flip after the filter. In symbols, if MH denotes the Metropolis–
Hastings filter step, and Refresh denotes the momentum refreshment step, we should
This is simplified to
Lazy chains. Technically, many of the convergence results actually hold for lazy versions of the Markov chain. Specifically, for $\ell \in [0,1]$, the $\ell$-lazy version of a Markov chain replaces its transition kernel $T$ with the modified kernel $T_\ell := \ell\, \mathrm{id} + (1-\ell)\, T$.
The laziness condition is familiar from the study of discrete-time Markov chains on discrete
state spaces, in which laziness is useful for avoiding periodic behavior. For the remainder
of this section, we will generally be considering $\frac12$-lazy versions of the Metropolis–Hastings chains without explicitly mentioning this. In any case, this modification only multiplies the mixing time by a factor of 2, so it does not significantly alter the results.
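As a toy illustration (an example of our own, assuming the standard form $T_\ell = \ell\, \mathrm{id} + (1-\ell)\, T$ of the lazy kernel), lazification shifts the spectrum of a transition matrix to be nonnegative, which rules out periodic behavior:

```python
import numpy as np

# Random walk on a triangle (3 states, uniform stationary distribution).
T = np.full((3, 3), 0.5) - 0.5 * np.eye(3)
ell = 0.5
T_lazy = ell * np.eye(3) + (1 - ell) * T

eigs = np.sort(np.linalg.eigvalsh(T))          # symmetric, so eigvalsh applies
eigs_lazy = np.sort(np.linalg.eigvalsh(T_lazy))
assert np.isclose(eigs[-1], 1.0) and eigs[0] < 0   # original has a negative eigenvalue
assert eigs_lazy[0] >= 0                            # laziness removes negative spectrum
gap = 1 - eigs_lazy[-2]                             # spectral gap of the lazy chain
assert np.isclose(gap, 0.75)
```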
convergence rates for the feasible and warm start cases, even under the assumption of
strong log-concavity. This is unlike the case of LMC; e.g., the guarantee of Theorem 4.2.5
is not significantly improved by assuming KL(𝜇 0 k 𝜋) = 𝑂 (1).
Theorem 7.3.4 (feasible start case, [Dwi+19; Che+20a; LST20]). Suppose that the target $\pi \propto \exp(-V)$ satisfies $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Consider the following Metropolis–Hastings algorithms initialized at $\mathrm{normal}(0, \beta^{-1} I_d)$ and with an appropriately tuned choice of parameters.
$$N = \widetilde O\Bigl(\kappa^2 d\, \operatorname{polylog}\frac1\varepsilon\Bigr) \text{ iterations}\,.$$
$$N = \widetilde O\Bigl(\kappa d\, \operatorname{polylog}\frac1\varepsilon\Bigr) \text{ iterations}\,.$$
√
3𝑉 is bounded and that 𝜅 𝑑. Then, MHMC outputs
3. Assume in addition that ∇
a measure 𝜇 𝑁 satisfying 𝜒 2 (𝜇 𝑁 k 𝜋) ≤ 𝜀 after
√︁
Note that the result for MHMC is not directly comparable because it makes a stronger
second-order smoothness assumption.
Next, we present the results under a warm start.
Theorem 7.3.5 (warm start case, [Che+21b; WSC21]). Suppose that 𝜋 ∝ exp(−𝑉) satisfies 0 ≺ 𝛼𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝛽𝐼_𝑑. Consider MALA initialized at a distribution satisfying 𝜒²(𝜇₀ ∥ 𝜋) = 𝑂(1). Then, MALA outputs a measure 𝜇_𝑁 satisfying KL(𝜇_𝑁 ∥ 𝜋) ≤ 𝜀 (or √(𝜒²(𝜇_𝑁 ∥ 𝜋)) ≤ 𝜀) after
𝑁 = Õ(𝜅𝑑^{1/2} polylog(1/𝜀)) iterations.
Moreover, it is known that the results for MALA in both the feasible and warm start
cases are sharp in a suitable sense. The goal for the rest of the chapter is to prove these
MALA convergence results (up to some technical details).
𝑃). Note that since the operator norm of 𝑃 is at most 1, 𝑃 − id is always a negative semi-definite operator (similarly to the infinitesimal generator ℒ from Section 1.2).
Spectral gap. The spectral gap of 𝑃 is the largest 𝜆 > 0 such that for all 𝑓 ∈ 𝐿²(𝜋) with E_𝜋 𝑓 = 0,
𝜆 ‖𝑓‖²_{𝐿²(𝜋)} ≤ ⟨𝑓, (id − 𝑃) 𝑓⟩_{𝐿²(𝜋)} . (7.4.1)
In analogy with Section 1.2, we also say that 𝑃 satisfies a Poincaré inequality with constant 1/𝜆. We already saw in Section 2.7 that a Poincaré inequality is implied by a lower bound on the coarse Ricci curvature.
The right-hand side of (7.4.1) can be interpreted as a Dirichlet energy,
ℰ(𝑓) B ⟨𝑓, (id − 𝑃) 𝑓⟩_{𝐿²(𝜋)} = ½ E[(𝑓(𝑋₁) − 𝑓(𝑋₀))²] ,
where (𝑋₀, 𝑋₁) are two successive iterates of the chain started at stationarity, and the Markov chain can be viewed as an 𝐿²(𝜋) gradient descent on the Dirichlet energy; see Exercise 7.3. We have the following convergence result.
Theorem 7.4.2. Suppose that the spectral gap of 𝑃 is 𝜆 > 0. Then, for the law (𝜇_𝑘)_{𝑘∈N} of the iterates of the ½-lazy version of 𝑃, we have
𝜒 2 (𝜇 𝑁 k 𝜋) ≤ exp(−𝜆𝑁 ) 𝜒 2 (𝜇0 k 𝜋) .
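As a hypothetical numerical check of Theorem 7.4.2 (not from the text), consider a two-state reversible chain, for which the spectral gap is available in closed form:

```python
import math

# For T = [[1-a, a], [b, 1-b]] the stationary law is pi = (b/(a+b), a/(a+b))
# and the spectral gap of id - T is lam = a + b (second eigenvalue of T is
# 1 - a - b).  We iterate the 1/2-lazy chain and compare the chi-squared
# divergence with the bound exp(-lam*N) * chi2(mu_0 || pi).

a, b = 0.3, 0.2
T = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]
lam = a + b

T_lazy = [[0.5 * (i == j) + 0.5 * T[i][j] for j in range(2)] for i in range(2)]

def chi2(mu, pi):
    """Chi-squared divergence of mu from pi."""
    return sum((m - p) ** 2 / p for m, p in zip(mu, pi))

mu = [1.0, 0.0]
chi2_0 = chi2(mu, pi)
N = 25
for _ in range(N):
    mu = [sum(mu[i] * T_lazy[i][j] for i in range(2)) for j in range(2)]

chi2_N = chi2(mu, pi)
bound = math.exp(-lam * N) * chi2_0   # the guarantee of Theorem 7.4.2
```

Here the lazy chain contracts by the factor (1 − 𝜆/2)² ≤ e^{−𝜆} per step, so the observed divergence falls below the stated bound.
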
Modified log-Sobolev inequality. We say that the chain satisfies a modified log-Sobolev inequality (MLSI) with constant 𝐶_MLSI if for all 𝑓 > 0,
ent_𝜋 𝑓 ≤ (𝐶_MLSI/2) ℰ(𝑓, ln 𝑓) .
We have already encountered this inequality as Definition 1.2.23, although there we simply called it the log-Sobolev inequality. In the context of discrete Markov processes, however, since the chain rule fails and the different variants of the log-Sobolev inequality are no longer equivalent, it is worth being careful about the terminology.
It is trickier to deduce entropy decay from the MLSI in discrete time, and to avoid
this issue we shall work in continuous time instead. The Markov kernel 𝑃 gives rise to
the generator ℒ B 𝑃 − id, which in turn generates a continuous-time semigroup (𝑃𝑡 )𝑡 ≥0
via 𝑃𝑡 B exp(𝑡ℒ). Note that the generator of (𝑃𝑡 )𝑡 ≥0 is also ℒ and hence the Dirichlet
energy for (𝑃𝑡 )𝑡 ≥0 coincides with the Dirichlet energy for (𝑃 𝑘 )𝑘∈N . Now, if we apply the
calculation (1.2.22) to the semigroup (𝑃_𝑡)_{𝑡≥0}, we find that under an MLSI,
KL(𝜇𝑃_𝑡 ∥ 𝜋) ≤ exp(−2𝑡/𝐶_MLSI) KL(𝜇 ∥ 𝜋) ,
7.4.2 Conductance
Unfortunately, it is usually quite challenging to prove either a Poincaré inequality or a
modified log-Sobolev inequality for discrete-time Markov chains, which motivates the
use of conductance.
The conductance of 𝑃 is the greatest number 𝔠 > 0 such that for all events 𝐴 ⊆ R^𝑑,
∫_𝐴 𝑃(𝑥, 𝐴ᶜ) 𝜋(d𝑥) ≥ 𝔠 𝜋(𝐴) 𝜋(𝐴ᶜ) .
A small conductance implies the presence of bottlenecks in the space: subsets 𝐴 of the
state space from which it is difficult for the Markov chain to exit. On the other hand, it is
a remarkable fact that once the presence of these bottlenecks is eliminated, then there is a
positive spectral gap. This is the content of a celebrated result of Cheeger.
Theorem 7.4.3 (Cheeger’s inequality, [LS88]). The conductance 𝔠 and the spectral gap
𝜆 satisfy the inequalities
(1/8) 𝔠² ≤ 𝜆 ≤ 𝔠 .
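As a hypothetical illustration (not from the text), both quantities can be computed in closed form for a two-state chain, so Cheeger's inequality can be checked directly:

```python
# For the two-state chain T = [[1-a, a], [b, 1-b]] with stationary law
# pi = (b/(a+b), a/(a+b)), the spectral gap is lam = a + b, and there are
# only two nontrivial events for the conductance, A = {0} and A = {1}.

a, b = 0.3, 0.2
pi = [b / (a + b), a / (a + b)]
lam = a + b                             # spectral gap of id - T

flow_0 = pi[0] * a                      # pi-flow out of the event {0}
flow_1 = pi[1] * b                      # pi-flow out of the event {1}
c = min(flow_0 / (pi[0] * pi[1]), flow_1 / (pi[1] * pi[0]))

# Cheeger's inequality: c**2 / 8 <= lam <= c.  For two states the upper
# bound is attained with equality.
```
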
Both inequalities are sharp up to constants. The upper bound on 𝜆 is fairly immediate
(see Exercise 7.3), so we focus on the lower bound. We begin by reformulating the
conductance as a functional inequality.
Lemma 7.4.4. Let the conductance of the chain be 𝔠 > 0. Then, for all 𝑓 ∈ 𝐿 1 (𝜋),
E_𝜋 |𝑓 − E_𝜋 𝑓| ≤ (1/𝔠) E|𝑓(𝑋₁) − 𝑓(𝑋₀)| , (7.4.5)
where (𝑋 0, 𝑋 1 ) are two successive iterates of the chain started at stationarity.
E_𝜋 |𝑓 − E_𝜋 𝑓| = E|𝑓(𝑋₀′) − E 𝑓(𝑋₀)| ≤ E|𝑓(𝑋₀′) − 𝑓(𝑋₀)| .
Compare this with the relationship between the Cheeger isoperimetric inequality and
the 𝐿 1 –𝐿 1 Poincaré inequality in Theorem 2.5.14. Indeed, the trick above of passing to the
level sets of 𝑓 is the discrete version of the coarea inequality (Theorem 2.5.12).
Recall also that an 𝐿¹–𝐿¹ Poincaré inequality implies an 𝐿²–𝐿² Poincaré inequality with 𝐶_{2,2} ≲ 𝐶_{1,1}, see Proposition 2.5.17. On the other hand, 𝐶_{2,2} is the square root of the usual Poincaré constant, 𝐶_{2,2} = 1/√𝜆, where 𝜆 is the spectral gap. To prove Cheeger's inequality, we are going to follow the same principle in discrete time. This is exactly the source of the square in the lower bound 𝜆 ≳ 𝔠² of Cheeger's inequality.
Proof of Cheeger's inequality (Theorem 7.4.3). We will prove the lower bound on the spectral gap with a worse constant than 1/8 in order to make the proof more straightforward; see [LS88] for a proof with the constant 1/8.
Let 𝑓 : R^𝑑 → R have 0 as a median and let 𝑔 B 𝑓² sgn 𝑓, so that 0 is a median of 𝑔 as well. Assume that the chain has conductance 𝔠 > 0. Then, recalling the equivalence between the mean and the median (Lemma 2.4.4) and using Lemma 7.4.4 on 𝑔,
A lower bound on the conductance via overlaps. At this stage, it may not seem that
we have gained anything by moving from the spectral gap to the conductance. We now
introduce a key lemma, which provides a tractable lower bound on the conductance in
terms of two geometric quantities: a Cheeger isoperimetric inequality for target 𝜋 (we
introduced this inequality in Section 2.5.2), and overlap bounds on the Markov chain.
Recall that an 𝛼-strongly log-concave measure 𝜋 satisfies the Cheeger isoperimetric inequality with Ch ≲ 1/√𝛼 (Corollary 2.5.19).
2. There exists 𝑟 ∈ [0, Ch] such that for any points 𝑥, 𝑦 ∈ R^𝑑 with ‖𝑥 − 𝑦‖ ≤ 𝑟, it holds that ‖𝑃(𝑥, ·) − 𝑃(𝑦, ·)‖_TV ≤ ½.
∫_{𝐴₀} 𝑃(𝑥, 𝐴₁) 𝜋(d𝑥) = ∫_{𝐴₁} 𝑃(𝑦, 𝐴₀) 𝜋(d𝑦) = ½ [∫_{𝐴₀} 𝑃(𝑥, 𝐴₁) 𝜋(d𝑥) + ∫_{𝐴₁} 𝑃(𝑦, 𝐴₀) 𝜋(d𝑦)] .
𝐵₀ B {𝑥 ∈ 𝐴₀ : 𝑃(𝑥, 𝐴₁) < ¼} ,
𝐵₁ B {𝑦 ∈ 𝐴₁ : 𝑃(𝑦, 𝐴₀) < ¼} ,
𝐺 B R^𝑑 \ (𝐵₀ ∪ 𝐵₁) .
We can assume that 𝜋 (𝐵 0 ) ≥ 𝜋 (𝐴0 )/2 and 𝜋 (𝐵 1 ) ≥ 𝜋 (𝐴1 )/2. Indeed, if we have, e.g.,
𝜋 (𝐵 0 ) ≤ 𝜋 (𝐴0 )/2, then 𝜋 (𝐴0 \ 𝐵 0 ) ≥ 𝜋 (𝐴0 )/2, and
∫_{𝐴₀} 𝑃(𝑥, 𝐴₁) 𝜋(d𝑥) ≥ ∫_{𝐴₀\𝐵₀} 𝑃(𝑥, 𝐴₁) 𝜋(d𝑥) ≥ ¼ 𝜋(𝐴₀ \ 𝐵₀) ≥ ⅛ 𝜋(𝐴₀) ,
which already yields the desired conductance bound. In the remaining case, note that if 𝑥 ∈ 𝐵₀ and 𝑦 ∈ 𝐵₁, then ‖𝑃(𝑥, ·) − 𝑃(𝑦, ·)‖_TV ≥ 1 − 𝑃(𝑥, 𝐴₁) − 𝑃(𝑦, 𝐴₀) > ½, so the overlap assumption forces dist(𝐵₀, 𝐵₁) > 𝑟; combined with the Cheeger isoperimetric inequality, this yields
𝜋(𝐺) ≥ 𝜋(𝐵₀^𝑟) − 𝜋(𝐵₀) ≥ (𝑟/(2 Ch)) 𝜋(𝐵₀) 𝜋(𝐵₁) ≥ (𝑟/(8 Ch)) 𝜋(𝐴₀) 𝜋(𝐴₁) .
Hence,
½ [∫_{𝐴₀} 𝑃(𝑥, 𝐴₁) 𝜋(d𝑥) + ∫_{𝐴₁} 𝑃(𝑦, 𝐴₀) 𝜋(d𝑦)]
≥ ½ [∫_{𝐴₀∩𝐺} 𝑃(𝑥, 𝐴₁) 𝜋(d𝑥) + ∫_{𝐴₁∩𝐺} 𝑃(𝑦, 𝐴₀) 𝜋(d𝑦)]
≥ ⅛ 𝜋(𝐴₀ ∩ 𝐺) + ⅛ 𝜋(𝐴₁ ∩ 𝐺) = ⅛ 𝜋(𝐺) ≥ (𝑟/(64 Ch)) 𝜋(𝐴₀) 𝜋(𝐴₁) .
𝑠-Conductance. For 𝑠 ∈ (0, ½), the 𝑠-conductance of 𝑃 is the greatest number 𝔠_𝑠 > 0 such that for all events 𝐴 ⊆ R^𝑑,
∫_𝐴 𝑃(𝑥, 𝐴ᶜ) 𝜋(d𝑥) ≥ 𝔠_𝑠 (𝜋(𝐴) − 𝑠) 𝜋(𝐴ᶜ) .
Observe that if 𝜋(𝐴) ≤ 𝑠, then the above inequality holds trivially. Hence, this definition allows us to restrict attention to events which are reasonably probable under 𝜋.
For the conductance, we had Cheeger’s inequality which relates conductance to the
spectral gap and ultimately to convergence. For the 𝑠-conductance, the following theorem
is an appropriate substitute.
Theorem 7.4.7 ([LS93]). Let Δ_𝑠 B sup{|𝜇₀(𝐴) − 𝜋(𝐴)| : 𝐴 ⊆ R^𝑑, 𝜋(𝐴) ≤ 𝑠}. Then, the law 𝜇_𝑁 of the 𝑁-th iterate of a Markov chain with 𝑠-conductance 𝔠_𝑠 and initialized at 𝜇₀ satisfies
‖𝜇_𝑁 − 𝜋‖_TV ≤ Δ_𝑠 + (Δ_𝑠/𝑠) exp(−𝔠_𝑠² 𝑁/2) .
In particular,
‖𝜇_𝑁 − 𝜋‖_TV ≤ √(𝑠 𝜒²(𝜇₀ ∥ 𝜋)) + √(𝜒²(𝜇₀ ∥ 𝜋)/𝑠) exp(−𝔠_𝑠² 𝑁/2) .
Proof. The first statement is from [LS93, Corollary 1.6]; its proof is omitted, as it is rather involved.
The second statement follows from the first: indeed, for 𝐴 ⊆ R^𝑑 with 𝜋(𝐴) ≤ 𝑠,
|𝜇₀(𝐴) − 𝜋(𝐴)| = |∫ 1_𝐴 d(𝜇₀ − 𝜋)| = |∫ 1_𝐴 (d𝜇₀/d𝜋 − 1) d𝜋| ≤ √(𝜋(𝐴) 𝜒²(𝜇₀ ∥ 𝜋))
by the Cauchy–Schwarz inequality, so that Δ_𝑠 ≤ √(𝑠 𝜒²(𝜇₀ ∥ 𝜋)).
This result says that if 𝑠 = 𝜀²/(4 𝜒²(𝜇₀ ∥ 𝜋)), then we obtain ‖𝜇_𝑁 − 𝜋‖_TV ≤ 𝜀 after
𝑁 = 𝑂((1/𝔠_𝑠²) log(𝜒²(𝜇₀ ∥ 𝜋)/𝜀²)) iterations .
Unfortunately, if 𝜒 2 (𝜇0 k 𝜋) = exp 𝑂 (𝑑), then the logarithmic term incurs additional
dimension dependence, which is why we can expect better mixing time bounds under the
warm start condition 𝜒 2 (𝜇0 k 𝜋) = 𝑂 (1).
The key advantage of the 𝑠-conductance is that it allows for a version of the key lemma
with weaker assumptions; the proof is left as Exercise 7.5.
Basic decomposition. The overall plan is to lower bound the 𝑠-conductance using the
key lemma (Lemma 7.4.8), which then upper bounds the mixing time via Theorem 7.4.7.
By strong log-concavity of 𝜋, the first hypothesis of Lemma 7.4.8 is verified, so it remains
to bound the overlaps. For a kernel 𝑇 , we use the shorthand 𝑇𝑥 B 𝑇 (𝑥, ·).
By the triangle inequality, we have the decomposition
‖𝑇_𝑥 − 𝑇_𝑦‖_TV ≤ ‖𝑄_𝑥 − 𝑇_𝑥‖_TV + ‖𝑄_𝑥 − 𝑄_𝑦‖_TV + ‖𝑄_𝑦 − 𝑇_𝑦‖_TV .
The middle term k𝑄 𝑥 − 𝑄𝑦 k TV measures the overlap for the proposal kernel, and we will
shortly see that this term is easy to bound. Then, controlling the first and third terms
essentially amounts to lower bounding the acceptance probability of MALA, since the
only difference between 𝑄 and 𝑇 is the Metropolis–Hastings filter.
‖𝑄_𝑥 − 𝑄_𝑦‖_TV ≤ ‖𝑥 − 𝑦‖/√(2ℎ) .
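As a hypothetical one-dimensional sanity check (not from the text): if the drift difference is ignored, two MALA proposals are Gaussians with common variance 2ℎ whose means differ by |𝑥 − 𝑦|, and their total variation distance has a closed form that sits below the bound ‖𝑥 − 𝑦‖/√(2ℎ):

```python
import math

# TV between two Gaussians with equal variance sigma^2 and mean gap delta is
# 2*Phi(delta/(2*sigma)) - 1 = erf(delta / (2*sqrt(2)*sigma)); with
# sigma^2 = 2h this is erf(delta / (4*sqrt(h))).

def tv_gaussians_same_var(delta, h):
    """TV distance between normal(0, 2h) and normal(delta, 2h)."""
    return math.erf(abs(delta) / (4 * math.sqrt(h)))

h = 0.01
checks = [(tv_gaussians_same_var(d, h), d / math.sqrt(2 * h))
          for d in [0.001, 0.01, 0.05, 0.2]]
# Each exact TV distance is below the claimed bound |x - y| / sqrt(2h).
```
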
This has a very clear interpretation: it is the probability that the proposed move starting at 𝑥 is rejected. If we let 𝜉 ∼ normal(0, 𝐼_𝑑) and 𝑌 B 𝑥 − ℎ ∇𝑉(𝑥) + √(2ℎ) 𝜉, we want a lower bound on the quantity E 𝐴(𝑥, 𝑌), which comes from Markov's inequality:
E 𝐴(𝑥, 𝑌) = E min{1, (𝜋(𝑌) 𝑄(𝑌, 𝑥))/(𝜋(𝑥) 𝑄(𝑥, 𝑌))} ≥ 𝜆 P[(𝜋(𝑌) 𝑄(𝑌, 𝑥))/(𝜋(𝑥) 𝑄(𝑥, 𝑌)) ≥ 𝜆] for all 0 < 𝜆 < 1 .
The approach now is to write out the ratio more explicitly, and then carefully group
together and bound the terms. (Unfortunately, this is not the most enlightening.)
Explicitly, we have
(𝜋(𝑌) 𝑄(𝑌, 𝑥))/(𝜋(𝑥) 𝑄(𝑥, 𝑌)) = exp(−𝑉(𝑌) − ‖𝑥 − 𝑌 + ℎ ∇𝑉(𝑌)‖²/(4ℎ) + 𝑉(𝑥) + ‖𝑌 − 𝑥 + ℎ ∇𝑉(𝑥)‖²/(4ℎ)) .
After some careful algebra,
4 ln[(𝜋(𝑌) 𝑄(𝑌, 𝑥))/(𝜋(𝑥) 𝑄(𝑥, 𝑌))] = ℎ {‖∇𝑉(𝑥)‖² − ‖∇𝑉(𝑌)‖²} (7.5.4)
− 2 {𝑉(𝑌) − 𝑉(𝑥) − ⟨∇𝑉(𝑥), 𝑌 − 𝑥⟩} (7.5.5)
+ 2 {𝑉(𝑥) − 𝑉(𝑌) − ⟨∇𝑉(𝑌), 𝑥 − 𝑌⟩} . (7.5.6)
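As a hypothetical sanity check (not from the text), the regrouping in (7.5.4)–(7.5.6) can be verified numerically in one dimension, where both sides are exact algebraic identities:

```python
import random

# The direct log Metropolis-Hastings ratio (times 4) for the MALA proposal
# Q(x, .) = normal(x - h*V'(x), 2h), versus the grouped form (7.5.4)-(7.5.6).

def log_ratio_direct(V, dV, x, y, h):
    """4 * ln [pi(y) Q(y, x) / (pi(x) Q(x, y))], written out directly."""
    return 4 * (-V(y) - (x - y + h * dV(y)) ** 2 / (4 * h)
                + V(x) + (y - x + h * dV(x)) ** 2 / (4 * h))

def log_ratio_grouped(V, dV, x, y, h):
    """The same quantity, regrouped as in (7.5.4)-(7.5.6)."""
    return (h * (dV(x) ** 2 - dV(y) ** 2)
            - 2 * (V(y) - V(x) - dV(x) * (y - x))
            + 2 * (V(x) - V(y) - dV(y) * (x - y)))

V = lambda x: x ** 4 / 4 + x ** 2 / 2    # a non-quadratic test potential
dV = lambda x: x ** 3 + x

random.seed(0)
samples = [(random.gauss(0, 1), random.gauss(0, 1), random.uniform(0.01, 1.0))
           for _ in range(100)]
max_err = max(abs(log_ratio_direct(V, dV, x, y, h)
                  - log_ratio_grouped(V, dV, x, y, h))
              for x, y, h in samples)
```
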
Note that the terms are grouped to more easily apply the strong convexity and smoothness
of 𝑉 . It yields
Also, for ℎ ≤ 1/𝛽,
Therefore,
ln[(𝜋(𝑌) 𝑄(𝑌, 𝑥))/(𝜋(𝑥) 𝑄(𝑥, 𝑌))] ≳ −𝛽ℎ² ‖∇𝑉(𝑥)‖² − 𝛽 ‖𝑥 − 𝑌‖² ≳ −𝛽ℎ² ‖∇𝑉(𝑥)‖² − 𝛽ℎ ‖𝜉‖² .
At this stage, observe that we cannot lower bound this quantity (with high probability)
uniformly over 𝑥, since k∇𝑉 (𝑥)k → ∞ as k𝑥 k → ∞. This is why it is helpful to restrict to
𝑥 belonging to some high-probability event 𝐸, which is ultimately achieved by working
with 𝑠-conductance rather than conductance.
By standard concentration bounds, ‖𝜉‖² ≤ 2𝑑 with probability at least 1 − exp(−𝑑/2). Also, let 𝐸_𝑅 B {𝑥 ∈ R^𝑑 : ‖∇𝑉(𝑥)‖ ≤ √(𝛽𝑅)}. It follows that for all 𝑥 ∈ 𝐸_𝑅, if we take ℎ ≲ 1/(𝛽 (𝑑 ∨ 𝑅)) with a sufficiently small constant, then
E 𝐴(𝑥, 𝑌) ≥ (11/12) P[(𝜋(𝑌) 𝑄(𝑌, 𝑥))/(𝜋(𝑥) 𝑄(𝑥, 𝑌)) ≥ 11/12] ≥ (11/12) (1 − exp(−𝑑/2)) ≥ 5/6 ,
and hence ‖𝑄_𝑥 − 𝑇_𝑥‖_TV ≤ 1 − E 𝐴(𝑥, 𝑌) ≤ 1/6.
Completing the analysis. We have shown: if the step size is ℎ ≲ 1/(𝛽 (𝑑 ∨ 𝑅)), then for all 𝑥, 𝑦 ∈ 𝐸_𝑅 with ‖𝑥 − 𝑦‖ ≤ 𝑟,
‖𝑇_𝑥 − 𝑇_𝑦‖_TV ≤ ‖𝑄_𝑥 − 𝑇_𝑥‖_TV + ‖𝑄_𝑥 − 𝑄_𝑦‖_TV + ‖𝑄_𝑦 − 𝑇_𝑦‖_TV ≤ 1/6 + 𝑟/√(2ℎ) + 1/6 ≤ 1/2
provided we take 𝑟 = √(2ℎ)/6. Applying Lemma 7.4.8 (assuming ℎ ≲ 1/𝛼), we deduce that
𝔠_𝑠 ≳ √(𝛼ℎ) provided 𝜋(𝐸_𝑅) ≥ 1 − 𝑐′𝑠 √(𝛼ℎ), where 𝑐′ > 0 is a universal constant. Since we want the step size ℎ to be as large as possible, we take ℎ ≍ 1/(𝛽 (𝑑 ∨ 𝑅)), where 𝑅 is chosen to satisfy 𝜋(𝐸_𝑅ᶜ) ≲ 𝑠 √(1/(𝜅 (𝑑 ∨ 𝑅))) and 𝑠 ≍ 𝜀²/𝜒²(𝜇₀ ∥ 𝜋). The final mixing time bound implied by Theorem 7.4.7 is then 𝑂(𝜅 (𝑑 ∨ 𝑅) log(𝜒²(𝜇₀ ∥ 𝜋)/𝜀²)) iterations.
Up until this point, the analysis is largely similar to [Dwi+19].
Lemma 7.5.7 ([LST20]). Suppose that 𝜋 ∝ exp(−𝑉) and that 0 ⪯ ∇²𝑉 ⪯ 𝛽𝐼_𝑑. Then, for all 𝑡 ≥ 0,
𝜋{‖∇𝑉‖ ≥ E_𝜋 ‖∇𝑉‖ + 𝑡} ≤ 3 exp(−𝑡/√𝛽) .
This shows that the fluctuations of ‖∇𝑉‖ around its expectation are only of size √𝛽.
From warm start to feasible start. The factor of log 𝜒²(𝜇₀ ∥ 𝜋) in the bound incurs additional dimension dependence under a feasible start, since with a Gaussian initialization we can only show 𝜒²(𝜇₀ ∥ 𝜋) ≤ 𝜅^{𝑑/2}. The problem is that the conductance-based analysis
relies upon Poincaré-type inequalities, instead of log-Sobolev inequalities. To address
this issue, we can replace the assumption of a Cheeger isoperimetric inequality with a
Gaussian isoperimetric inequality (see Section 2.5.4). The essential difference is that under a Cheeger isoperimetric inequality, as 𝑝 B 𝜋(𝐴) ↘ 0 we have 𝜋⁺(𝐴) ≳ 𝑝, whereas under a Gaussian isoperimetric inequality we have 𝜋⁺(𝐴) ≳ 𝑝 √(log(1/𝑝)). Using this stronger
assumption, [Che+20a] show that the dependence on the initialization can be improved to
log log 𝜒 2 (𝜇 0 k 𝜋). A similar effect can be achieved via the blocking conductance [KLM06],
which was used in [LST20]. We omit the details.
Lower bound. Finally, the analysis of MALA in Theorem 7.3.4 is tight, as shown in the
following lower bound.
Theorem 7.5.8 ([LST21a]). For every choice of step size ℎ > 0, there exists a target distribution 𝜋 ∝ exp(−𝑉) on R^𝑑 with 𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝜅𝐼_𝑑, as well as an initialization 𝜇₀ with 𝜒²(𝜇₀ ∥ 𝜋) ≤ exp(𝑑), such that the number of iterations required for MALA to reach total variation distance at most ¼ from 𝜋 is at least Ω̃(𝜅𝑑).
This theorem is a lower bound in the sense that all of the known proofs for MALA do
not use any property of the initialization 𝜇0 except through 𝜒 2 (𝜇 0 k 𝜋). Thus, in order to
improve the analysis of MALA under a feasible start, one must use more specific properties
of the initialization, or use some other modification that bypasses the lower bound (e.g.,
random step sizes).
Using the projection property. The key insight is to use projection characterization
of the Metropolis–Hastings filter (Theorem 7.2.5): the MALA kernel 𝑇 is the closest kernel
to the proposal 𝑄 (in an appropriate 𝐿 1 distance) among all reversible Markov chains with
stationary distribution 𝜋. Concretely, for any other kernel 𝑄¯ which is reversible w.r.t. 𝜋,
∬_{(R^𝑑×R^𝑑)\diag} |𝑄(𝑥, 𝑦) − 𝑇(𝑥, 𝑦)| 𝜋(d𝑥) d𝑦 ≤ ∬_{(R^𝑑×R^𝑑)\diag} |𝑄(𝑥, 𝑦) − 𝑄̄(𝑥, 𝑦)| 𝜋(d𝑥) d𝑦 . (7.6.1)
Pointwise projection property. The projection property is not enough for our pur-
poses, however, since it only bounds k𝑄 𝑥 − 𝑇𝑥 k TV in average, whereas we really need
high-probability bounds. Thankfully, we can extend the projection property.
∫ Φ(‖𝑄_𝑥 − 𝑇_𝑥‖_TV) 𝜋(d𝑥) ≤ ½ ∫ Φ(4 ‖𝑄_𝑥 − 𝑄̄_𝑥‖_TV) 𝜋(d𝑥) + ½ ∬ Φ(2 |𝑄(𝑥, 𝑦)/𝑄̄(𝑥, 𝑦) − 1|) 𝑄̄(𝑥, d𝑦) 𝜋(d𝑥) . (7.6.3)
We will not need the inequality (7.6.3), so the proof is left as Exercise 7.8. The reason why (7.6.3) is included in the theorem is that it makes clear why we can expect the pointwise projection property to imply high-probability bounds for ‖𝑄_𝑥 − 𝑇_𝑥‖_TV. Note that when we integrate the projection property w.r.t. 𝜋(d𝑥), we recover (7.6.1) with a factor of 4 on the right-hand side instead of 2.
∫ |1 − (𝜋(𝑦) 𝑄̄(𝑦, 𝑥))/(𝜋(𝑥) 𝑄(𝑥, 𝑦))| 𝑄(𝑥, d𝑦) = ∫ |𝑄(𝑥, 𝑦) − 𝑄̄(𝑥, 𝑦)| d𝑦 = 2 ‖𝑄_𝑥 − 𝑄̄_𝑥‖_TV ,
Let 𝑸¯ 𝑥 denote the measure on path space C([0, ℎ]; R𝑑 ) of the Langevin diffusion started at
𝑥. Similarly, let 𝑸 𝑥 denote the same for the interpolation of LMC. By the data-processing
inequality, we obtain
∬ |𝑄(𝑦, 𝑥)/𝑄̄(𝑦, 𝑥) − 1|^𝑝 𝑄̄(𝑦, d𝑥) 𝜋(d𝑦) ≤ ∫ E_{𝑸_𝑥} [|d𝑸̄_𝑥/d𝑸_𝑥 − 1|^𝑝] 𝜋(d𝑥) .
We have a formula for the Radon–Nikodym derivative d𝑸̄_𝑥/d𝑸_𝑥 thanks to Girsanov's theorem, so again we can approach this via stochastic calculus. Controlling this term is slightly more involved than controlling the KL divergence (essentially we are controlling a Rényi divergence instead) but nevertheless we can bound the term when ℎ ≲ 1/√𝑑.
With these high probability bounds, we can then return to the 𝑠-conductance analysis, which implies that the mixing time is of order 1/ℎ. Hence, under a warm start, the mixing time improves from 𝑑 to √𝑑.
TODO: Flesh out the calculations in this section.
Lower bound. Under a warm start, [Che+21b] showed a lower bound of roughly Ω̃(√𝑑), which was improved in [WSC21] to Ω̃(𝜅√𝑑). We state the result here.
Theorem 7.6.4 ([WSC21]). For every choice of step size ℎ > 0, there exists a target distribution 𝜋 ∝ exp(−𝑉) on R^𝑑 with 𝐼_𝑑 ⪯ ∇²𝑉 ⪯ 𝜅𝐼_𝑑, as well as an initialization 𝜇₀ with 𝜒²(𝜇₀ ∥ 𝜋) ≲ 1, such that the number of iterations required for MALA to reach total variation distance 𝜀 from 𝜋 is at least Ω̃(𝜅√𝑑 log(1/𝜀)).
Bibliographical Notes
TODO: Fill in.
Exercises
An Overview of High-Accuracy Samplers
D Exercise 7.3 reversible Markov chains as gradient descent on the Dirichlet energy
Consider the setting of Section 7.4.
2. Show that if (𝜇_𝑘)_{𝑘∈N} are the laws of the iterates of the ℓ-lazy version of 𝑃, then the relative densities (𝜇_𝑘/𝜋)_{𝑘∈N} are the iterates of gradient descent on the Dirichlet energy ℰ in 𝐿²(𝜋). How does the laziness parameter ℓ relate to the step size of the gradient descent?
4. Next, prove a generalization of Theorem 7.4.2 for any value of the laziness parameter ℓ ∈ [½, 1] by showing that the spectral gap condition is equivalent to strong convexity of ℰ. Why do we want ℓ ≥ ½ here?
5. Show that the conductance of the chain can also be described as the largest 𝔠 > 0 such that for all events 𝐴 ⊆ R^𝑑, it holds that ℰ(1_𝐴) ≥ 𝔠 ‖1_𝐴 − 𝜋(𝐴)‖²_{𝐿²(𝜋)}. Hence, conductance can be viewed as a restricted strong convexity condition (restricting the space of functions to indicators of events). In particular, show the bound 𝜆 ≤ 𝔠 in Cheeger's inequality (Theorem 7.4.3).
KL(𝜇𝑃 k 𝜋) ≤ (1 − 𝑐) KL(𝜇 k 𝜋) .
In this chapter, we discuss the proximal sampler, which was introduced in [LST21c]. The
applications of the proximal sampler include improving the condition number dependence
of high-accuracy samplers and providing new state-of-the-art sampling guarantees for
various classes of target distributions. Besides these applications, the proximal sampler is
interesting in its own right due to its remarkable convergence analysis and its connections
with the proximal point method in optimization.
1. Draw 𝑌_𝑘 ∼ 𝜋^{𝑌|𝑋}(· | 𝑋_𝑘) = normal(𝑋_𝑘, ℎ𝐼_𝑑).
2. Draw 𝑋_{𝑘+1} ∼ 𝜋^{𝑋|𝑌}(· | 𝑌_𝑘).
Since Gibbs sampling always forms a reversible Markov chain with respect to the target distribution, we conclude that the proximal sampler is unbiased: its stationary distribution is 𝝅. As written, however, the proximal sampler is
an idealized algorithm because it is not yet clear how to implement the second step of
sampling from 𝜋 𝑋 |𝑌 . Note that
𝜋^{𝑋|𝑌}(𝑥 | 𝑦) ∝_𝑥 exp(−𝑉(𝑥) − ‖𝑦 − 𝑥‖²/(2ℎ)) .
Also, recall that in optimization, we wish to minimize the function 𝑉 , whereas in sam-
pling we want to sample from 𝜋 ∝ exp(−𝑉 ). Via this correspondence, we see that the
optimization analogue of sampling from 𝜋 𝑋 |𝑌 is computing the proximal map
prox_{ℎ𝑉}(𝑦) B arg min_{𝑥∈R^𝑑} {𝑉(𝑥) + ‖𝑦 − 𝑥‖²/(2ℎ)} .
The distribution 𝜋 𝑋 |𝑌 is known as the restricted Gaussian oracle (RGO), and the
proximal sampler can be viewed as the analogue of the proximal point method for sampling.
See Exercise 8.1 for another connection between the proximal sampler and the proximal
point method from optimization.
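As a hypothetical illustration (not from the text): for the quadratic potential 𝑉(𝑥) = 𝛼𝑥²/2 in one dimension, the RGO is available in closed form, 𝜋^{𝑋|𝑌}(· | 𝑦) = normal(𝑦/(1 + 𝛼ℎ), ℎ/(1 + 𝛼ℎ)), so the proximal sampler can be run exactly:

```python
import math
import random

def proximal_sampler_step(x, alpha, h, rng):
    """One iteration of the proximal sampler for V(x) = alpha*x^2/2."""
    y = x + math.sqrt(h) * rng.gauss(0, 1)     # step 1: Y ~ normal(x, h)
    m = y / (1 + alpha * h)                    # exact RGO mean
    v = h / (1 + alpha * h)                    # exact RGO variance
    return m + math.sqrt(v) * rng.gauss(0, 1)  # step 2: X' ~ pi^{X|Y=y}

# Instead of estimating moments by simulation, track the variance recursion
# exactly: Var X_{k+1} = (Var X_k + h)/(1 + alpha*h)^2 + h/(1 + alpha*h),
# whose fixed point is the target variance 1/alpha.
alpha, h = 2.0, 0.5
var_x = 25.0                                   # a far-off initialization
for _ in range(50):
    var_x = (var_x + h) / (1 + alpha * h) ** 2 + h / (1 + alpha * h)

x = proximal_sampler_step(0.0, alpha, h, random.Random(0))  # one sampled step
```

The variance recursion converges to 1/𝛼, i.e., the chain's marginal law approaches the target normal(0, 1/𝛼), consistent with unbiasedness of the proximal sampler.
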
Implementability of the RGO. In order to obtain an actual algorithm from the proxi-
mal sampler, an implementation of the RGO must be provided. As we will see in Section 8.6,
the RGO can be implemented by using an auxiliary high-accuracy sampler such as MALA.
Although this may seem circular (if we need to use an auxiliary sampler to implement the
RGO, then why not use the auxiliary sampler in the first place without bothering with
the proximal sampler?), we will see there are benefits to the overall scheme. Namely, the
proximal sampler can boost the condition number dependence of the auxiliary sampler,
and it can be used to sample from a larger class of distributions.
For now, we will consider a simple implementation of the RGO based on rejection
sampling, which we studied in Section 7.1. Suppose that the potential 𝑉 is 𝛽-smooth.
Then, for 𝑉_𝑦(𝑥) B 𝑉(𝑥) + ‖𝑦 − 𝑥‖²/(2ℎ) we have (1/ℎ − 𝛽) 𝐼_𝑑 ⪯ ∇²𝑉_𝑦 ⪯ (1/ℎ + 𝛽) 𝐼_𝑑. In particular, if ℎ < 1/𝛽, then the RGO is strongly log-concave. Note that the condition number of 𝑉_𝑦 is 𝜅 = (1/ℎ + 𝛽)/(1/ℎ − 𝛽). If we now choose ℎ = 1/(𝛽𝑑), we can check that 𝜅 ≤ exp(4/𝑑) for 𝑑 ≥ 2. By Proposition 7.1.2, if we have access to the minimizer of 𝑉_𝑦 (which is equivalent to being able to compute the proximal operator for ℎ𝑉), we can construct an upper envelope for which the average number of iterations of rejection sampling is bounded by 𝜅^{𝑑/2} ≤ exp(2) ≤ 8. We summarize this discussion as follows.
Each iteration of rejection sampling requires one call to an evaluation oracle for 𝑉 (in
order to compute the acceptance probability). We summarize the guarantees for this
implementation of the RGO in the following theorem.
Theorem 8.1.1. Assume that 𝑉 is 𝛽-smooth. Then, if ℎ ≤ 1/(𝛽𝑑), rejection sampling implements the RGO for 𝜋^𝑋 exactly using one computation of the proximal map for ℎ𝑉 and, in expectation, at most 8 evaluations of 𝑉.
Notation. We write 𝜇𝑘𝑋 for the law of 𝑋𝑘 and 𝜇𝑘𝑌 for the law of 𝑌𝑘 for the iterates of
the proximal sampler. Observe that if (𝑄𝑡 )𝑡 ≥0 denotes the standard heat semigroup, i.e.
𝜇𝑄𝑡 = 𝜇 ∗ normal(0, 𝑡𝐼𝑑 ), then 𝜇𝑘𝑌 = 𝜇𝑘𝑋 𝑄ℎ and 𝜋 𝑌 = 𝜋 𝑋 𝑄ℎ .
We also abbreviate 𝜋 𝑋 |𝑌 (· | 𝑦) as 𝜋 𝑋 |𝑌 =𝑦 .
Theorem 8.2.1. Assume that the target 𝜋^𝑋 is 𝛼-strongly log-concave. Also, let (𝜇_𝑘^𝑋)_{𝑘∈N} and (𝜇̄_𝑘^𝑋)_{𝑘∈N} denote two runs of the proximal sampler with target 𝜋^𝑋. Then,
𝑊₂(𝜇_𝑘^𝑋, 𝜇̄_𝑘^𝑋) ≤ 𝑊₂(𝜇₀^𝑋, 𝜇̄₀^𝑋) / (1 + 𝛼ℎ)^𝑘 .
The contraction factor matches the contraction for the proximal point method in
optimization, see Exercise 8.2. Since 𝜋 𝑋 is left invariant by the proximal sampler, the
contraction result also implies a convergence result in 𝑊2 .
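As a hypothetical check of the rate (not from the text): for the Gaussian model 𝑉(𝑥) = 𝛼𝑥²/2 with the exact RGO, a synchronous coupling (the same Gaussian noise driving both runs) makes the shared noise cancel, so the difference of the two copies contracts deterministically by exactly 1/(1 + 𝛼ℎ) per iteration:

```python
# The difference of two synchronously coupled copies of the proximal sampler
# for V(x) = alpha*x^2/2 obeys x' - xbar' = (x - xbar) / (1 + alpha*h),
# matching the contraction factor in Theorem 8.2.1.

def coupled_gap_step(gap, alpha, h):
    """One coupled step applied to the difference of two copies."""
    return gap / (1 + alpha * h)

alpha, h, k = 1.0, 0.1, 20
gap = 3.0
for _ in range(k):
    gap = coupled_gap_step(gap, alpha, h)
# gap is now 3 / (1 + alpha*h)**k
```
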
We will give two proofs of this theorem. First, note that it suffices to consider one iteration and to prove 𝑊₂(𝜇₁^𝑋, 𝜇̄₁^𝑋) ≤ (1/(1 + 𝛼ℎ)) 𝑊₂(𝜇₀^𝑋, 𝜇̄₀^𝑋). Next, since the heat flow is a Wasserstein contraction (which follows from (1.4.9) but can also be proven by a straightforward coupling), it holds that 𝑊₂(𝜇₀^𝑌, 𝜇̄₀^𝑌) ≤ 𝑊₂(𝜇₀^𝑋, 𝜇̄₀^𝑋), so it suffices to show 𝑊₂(𝜇₁^𝑋, 𝜇̄₁^𝑋) ≤ (1/(1 + 𝛼ℎ)) 𝑊₂(𝜇₀^𝑌, 𝜇̄₀^𝑌).
We will use the following coupling lemma.
𝑊2 (𝜋 𝑋 |𝑌 =𝑦 , 𝜋 𝑋 |𝑌 =𝑦¯ ) ≤ 𝐶 k𝑦 − 𝑦¯ k . (8.2.3)
The intuition is that since 𝜇 1𝑋 and 𝜇¯1𝑋 are obtained from 𝜇𝑌0 and 𝜇¯𝑌0 by sampling from
the RGO 𝜋 𝑋 |𝑌 , the contraction statement in (8.2.3) can be used to bound 𝑊2 (𝜇1𝑋 , 𝜇¯1𝑋 ). The
proof of the lemma is relatively straightforward and good practice for working with
couplings, so it is left as Exercise 8.3.
The first proof we present is from [LST21b].
Proof of Theorem 8.2.1 via functional inequalities. To prove (8.2.3), we note that 𝜋^{𝑋|𝑌}(· | 𝑦̄) is (𝛼 + 1/ℎ)-strongly log-concave. Recall that by the Bakry–Émery theorem (Theorem 1.2.28) and the Otto–Villani theorem (Exercise 1.16) this implies the log-Sobolev inequality (1.4.7) and Talagrand's T2 inequality (1.4.8). Applying these inequalities,
𝑊₂²(𝜋^{𝑋|𝑌=𝑦}, 𝜋^{𝑋|𝑌=𝑦̄}) ≤ (2/(𝛼 + 1/ℎ)) KL(𝜋^{𝑋|𝑌=𝑦} ∥ 𝜋^{𝑋|𝑌=𝑦̄}) ≤ (1/(𝛼 + 1/ℎ)²) FI(𝜋^{𝑋|𝑌=𝑦} ∥ 𝜋^{𝑋|𝑌=𝑦̄}) .
Moreover,
∇ ln(𝜋^{𝑋|𝑌=𝑦}/𝜋^{𝑋|𝑌=𝑦̄}) = ∇(‖𝑦̄ − ·‖²/(2ℎ) − ‖𝑦 − ·‖²/(2ℎ)) = (𝑦 − 𝑦̄)/ℎ ,
so that
FI(𝜋^{𝑋|𝑌=𝑦} ∥ 𝜋^{𝑋|𝑌=𝑦̄}) = E_{𝜋^{𝑋|𝑌=𝑦}} [‖∇ ln(𝜋^{𝑋|𝑌=𝑦}/𝜋^{𝑋|𝑌=𝑦̄})‖²] = ‖𝑦 − 𝑦̄‖²/ℎ² .
Hence,
𝑊₂²(𝜋^{𝑋|𝑌=𝑦}, 𝜋^{𝑋|𝑌=𝑦̄}) ≤ (1/(𝛼 + 1/ℎ)²) · (‖𝑦 − 𝑦̄‖²/ℎ²) = (1/(1 + 𝛼ℎ)²) ‖𝑦 − 𝑦̄‖² .
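As a hypothetical aside (not from the text), the bound just derived is attained with equality in the Gaussian model 𝑉(𝑥) = 𝛼𝑥²/2: the conditionals 𝜋^{𝑋|𝑌=𝑦} are Gaussians with common variance ℎ/(1 + 𝛼ℎ), so 𝑊₂ between two of them is just the distance between their means:

```python
# For the quadratic potential, pi^{X|Y=y} = normal(y/(1+alpha*h), h/(1+alpha*h)).
# Two Gaussians with the same variance have W2 equal to their mean gap, which
# here equals |y - ybar| / (1 + alpha*h), matching (8.2.3) exactly.

alpha, h = 2.0, 0.3

def rgo_mean(y):
    """Mean of pi^{X|Y=y} for V(x) = alpha*x^2/2."""
    return y / (1 + alpha * h)

y, ybar = 1.7, -0.4
w2 = abs(rgo_mean(y) - rgo_mean(ybar))   # same variance => W2 = mean gap
bound = abs(y - ybar) / (1 + alpha * h)
```
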
The next proof, from [Che+22a], directly uses strong convexity in Wasserstein space.
Proof of Theorem 8.2.1 via Wasserstein calculus. This proof rests on the following interpre-
tation of the RGO. Let F(𝜇) B KL(𝜇 k 𝜋 𝑋 ). Then, by Exercise 8.1,
𝜋^{𝑋|𝑌=𝑦} = arg min_{𝜇∈P₂(R^𝑑)} {F(𝜇) + (1/(2ℎ)) 𝑊₂²(𝜇, 𝛿_𝑦)} C prox_{ℎF}(𝛿_𝑦) .
The first-order optimality condition on Wasserstein space [AGS08, Lemma 10.1.2] reads
0 ∈ 𝜕F(𝜋^{𝑋|𝑌=𝑦}) + (1/ℎ) (id − 𝑦) , 𝜋^{𝑋|𝑌=𝑦}-a.s.,
where 𝜕F is the subdifferential of F on Wasserstein space.
Using this, we obtain
id ∈ 𝑦 − ℎ 𝜕F(𝜋^{𝑋|𝑌=𝑦}) , 𝜋^{𝑋|𝑌=𝑦}-a.s.,
id ∈ 𝑦̄ − ℎ 𝜕F(𝜋^{𝑋|𝑌=𝑦̄}) , 𝜋^{𝑋|𝑌=𝑦̄}-a.s.
Let 𝑇 be the optimal transport map from 𝜋 𝑋 |𝑌 =𝑦 to 𝜋 𝑋 |𝑌 =𝑦¯ . The second condition above
can then be rewritten as
𝑇 ∈ 𝑦¯ − ℎ 𝜕F(𝜋 𝑋 |𝑌 =𝑦¯ ) ◦ 𝑇 , 𝜋 𝑋 |𝑌 =𝑦 -a.s.
We now abuse notation and write 𝜕F(𝜋 𝑋 |𝑌 =𝑦 ) for a particular element of the subdif-
ferential and similarly for 𝜕F(𝜋 𝑋 |𝑌 =𝑦¯ ). Then, 𝜋 𝑋 |𝑌 =𝑦 -a.s.,
‖𝑇 − id‖² = ‖𝑦̄ − 𝑦‖² − 2ℎ ⟨𝜕F(𝜋^{𝑋|𝑌=𝑦̄}) ◦ 𝑇 − 𝜕F(𝜋^{𝑋|𝑌=𝑦}), 𝑇 − id⟩ − ℎ² ‖𝜕F(𝜋^{𝑋|𝑌=𝑦̄}) ◦ 𝑇 − 𝜕F(𝜋^{𝑋|𝑌=𝑦})‖² .
We now integrate w.r.t. 𝜋 𝑋 |𝑌 =𝑦 and apply strong convexity of F in Wasserstein space:
𝑊₂²(𝜋^{𝑋|𝑌=𝑦}, 𝜋^{𝑋|𝑌=𝑦̄}) ≤ ‖𝑦 − 𝑦̄‖² − 2𝛼ℎ 𝑊₂²(𝜋^{𝑋|𝑌=𝑦}, 𝜋^{𝑋|𝑌=𝑦̄}) − 𝛼²ℎ² 𝑊₂²(𝜋^{𝑋|𝑌=𝑦}, 𝜋^{𝑋|𝑌=𝑦̄})
and hence
𝑊₂²(𝜋^{𝑋|𝑌=𝑦}, 𝜋^{𝑋|𝑌=𝑦̄}) ≤ (1/(1 + 𝛼ℎ)²) ‖𝑦 − 𝑦̄‖² .
The point of the second proof is that, although it uses some heavy machinery, it is just
a translation of a Euclidean optimization proof into the language of Wasserstein space
(see Exercise 8.2).
Simultaneous heat flow. The first technique is based on the observation that in going
from 𝜇𝑘𝑋 to 𝜇𝑘𝑌 , and from 𝜋 𝑋 to 𝜋 𝑌 , we are applying the heat flow. Given any 𝑓 -divergence
D 𝑓 (· k ·), we will compute its time derivative when both arguments undergo simultaneous
heat flow. Remarkably, the result will be almost the same as the time derivative of the
𝑓 -divergence to the target along the continuous-time Langevin diffusion, in a sense to be
made precise. The upshot is that the analysis of the proximal sampler closely resembles
the analysis of the continuous-time Langevin diffusion.
The simultaneous heat flow calculation is inspired by [VW19], and was carried out at
this level of generality in [Che+22a].
Let 𝑓 : R+ → R+ be a convex function with 𝑓 (1) = 0, and let D 𝑓 be the associated
𝑓 -divergence (see Section 1.5). We begin with a quick computation of the time derivative
of the 𝑓 -divergence along the Langevin diffusion.
Theorem 8.3.1. Let (𝜋_𝑡)_{𝑡≥0} denote the law of the continuous-time Langevin diffusion with target 𝜋. Then, for any 𝑓-divergence D_𝑓, it holds that
𝜕_𝑡 D_𝑓(𝜋_𝑡 ∥ 𝜋) = −J_𝑓(𝜋_𝑡 ∥ 𝜋) ,
where
J_𝑓(𝜇 ∥ 𝜋) B E_𝜇 ⟨∇(𝑓′ ∘ (𝜇/𝜋)), ∇ ln(𝜇/𝜋)⟩ . (8.3.2)
Theorem 8.3.3. Let (𝑄𝑡 )𝑡 ≥0 denote the standard heat semigroup. Then,
𝜕_𝑡 D_𝑓(𝜇𝑄_𝑡 ∥ 𝜋𝑄_𝑡) = −½ J_𝑓(𝜇𝑄_𝑡 ∥ 𝜋𝑄_𝑡) ,
where J𝑓 is defined in (8.3.2).
Proof. For brevity, write 𝜇_𝑡 B 𝜇𝑄_𝑡 and 𝜋_𝑡 B 𝜋𝑄_𝑡. Since 𝜕_𝑡 𝜇_𝑡 = ½ Δ𝜇_𝑡 = ½ div(𝜇_𝑡 ∇ ln 𝜇_𝑡) and similarly for 𝜕_𝑡 𝜋_𝑡, we compute
2 𝜕_𝑡 D_𝑓(𝜇_𝑡 ∥ 𝜋_𝑡) = 2 𝜕_𝑡 ∫ 𝑓(𝜇_𝑡/𝜋_𝑡) d𝜋_𝑡 = 2 ∫ 𝑓′(𝜇_𝑡/𝜋_𝑡) (𝜕_𝑡 𝜇_𝑡 − (𝜇_𝑡/𝜋_𝑡) 𝜕_𝑡 𝜋_𝑡) + 2 ∫ 𝑓(𝜇_𝑡/𝜋_𝑡) 𝜕_𝑡 𝜋_𝑡
= ∫ 𝑓′(𝜇_𝑡/𝜋_𝑡) (div(𝜇_𝑡 ∇ ln 𝜇_𝑡) − (𝜇_𝑡/𝜋_𝑡) div(𝜋_𝑡 ∇ ln 𝜋_𝑡)) + ∫ 𝑓(𝜇_𝑡/𝜋_𝑡) div(𝜋_𝑡 ∇ ln 𝜋_𝑡)
= −∫ ⟨∇(𝑓′ ∘ (𝜇_𝑡/𝜋_𝑡)), ∇ ln 𝜇_𝑡⟩ d𝜇_𝑡 + ∫ ⟨∇(𝑓′(𝜇_𝑡/𝜋_𝑡) · (𝜇_𝑡/𝜋_𝑡)), ∇ ln 𝜋_𝑡⟩ d𝜋_𝑡 − ∫ ⟨∇(𝑓 ∘ (𝜇_𝑡/𝜋_𝑡)), ∇ ln 𝜋_𝑡⟩ d𝜋_𝑡
= −∫ ⟨∇(𝑓′ ∘ (𝜇_𝑡/𝜋_𝑡)), ∇ ln(𝜇_𝑡/𝜋_𝑡)⟩ d𝜇_𝑡 + ∫ ⟨∇(𝜇_𝑡/𝜋_𝑡), ∇ ln 𝜋_𝑡⟩ 𝑓′(𝜇_𝑡/𝜋_𝑡) d𝜋_𝑡 − ∫ ⟨∇(𝜇_𝑡/𝜋_𝑡), ∇ ln 𝜋_𝑡⟩ 𝑓′(𝜇_𝑡/𝜋_𝑡) d𝜋_𝑡
= −J_𝑓(𝜇_𝑡 ∥ 𝜋_𝑡) .
Although this theorem is already enough to prove new convergence results for the
proximal sampler, the rates will be slightly suboptimal. The reason for this is because we
have only considered one step of the proximal sampler, in which the algorithm goes from
𝜇𝑘𝑋 to 𝜇𝑘𝑌 (and the target goes from 𝜋 𝑋 to 𝜋 𝑌 ). In order to obtain the sharp convergence
rates, we also need to consider the second step, in which we go from 𝜇𝑘𝑌 to 𝜇𝑘+1 𝑋 (and the
target returns from 𝜋 𝑌 to 𝜋 𝑋 ). For reasons that will become clear shortly, we refer to these
steps as the “forwards step” and the “backwards step” respectively.
First, consider the evolution of the target along the heat semigroup 𝑡 ↦→ 𝜋 𝑋 𝑄𝑡 , so
that at time ℎ we arrive at 𝜋 𝑌 . The stochastic process representation of this evolution is
d𝑍𝑡 = d𝐵𝑡 , with 𝑍 0 ∼ 𝜋 𝑋 and 𝑍ℎ ∼ 𝜋 𝑌 , thus describing the forward step. So far so good,
but how should we think about the backwards step? By definition, 𝜋^𝑋 is obtained from 𝜋^𝑌 by the relation 𝜋^𝑋 = ∫ 𝜋^{𝑋|𝑌=𝑦} d𝜋^𝑌(𝑦), but this is not as helpful because we lose the
stochastic process view which allows us to apply calculus. Instead, we will think of 𝜋 𝑋 as
being obtained from 𝜋 𝑌 by the time reversal of the diffusion (𝑍𝑡 )𝑡 ∈[0,ℎ] .
Time reversal. To emphasize the underlying principles, we will consider a more general
diffusion d𝑍𝑡 = 𝑏𝑡 (𝑍𝑡 ) d𝑡 + d𝐵𝑡 on the time interval [0,𝑇 ]. The discussion can also be
extended to include non-isotropic diffusion matrices, but we do not consider this for
clarity of exposition. Also, let 𝜋𝑡 denote the law of 𝑍𝑡 .
The time reversal of the diffusion is (𝑍𝑡← )𝑡 ∈[0,𝑇 ] B (𝑍𝑇 −𝑡 )𝑡 ∈[0,𝑇 ] . Then, (𝑍𝑡← )𝑡 ∈[0,𝑇 ] is
also a Markov process, and our aim is to describe its evolution via a stochastic differential
equation.1 First, note that since (𝑍𝑡 )𝑡 ∈[0,𝑇 ] is not time-homogeneous, we need a family of
generators (ℒ_𝑡)_{𝑡∈[0,𝑇]}, where ℒ_𝑡 𝑓 B ½ Δ𝑓 + ⟨𝑏_𝑡, ∇𝑓⟩. It suffices to compute the generators
(ℒ𝑡← )𝑡 ∈[0,𝑇 ] of the reversed process.
1 Why can the reversed process be described by a stochastic differential equation? An elegant approach
to this question was proposed by Föllmer in [Föl85]: via Girsanov’s theorem it suffices to show that the law
of the time reversal on path space has finite KL divergence w.r.t. the Wiener measure. This well-known
paper of Föllmer is often cited by works on the stochastic process (the time reversal of Brownian motion)
which now bears his name.
= −∫ 𝑓 ℒ_𝑠 𝑃_{𝑠,𝑡} 𝑔 𝜋_𝑠 + ∫ 𝑓 𝑃_{𝑠,𝑡} 𝑔 ℒ_𝑠^* 𝜋_𝑠 ,
where ℒ_𝑠^* denotes the Lebesgue adjoint of ℒ_𝑠. A direct computation using the definition of ℒ_𝑠 shows that ℒ_𝑠^*(𝑓 𝜋_𝑠) = 𝑓 ℒ_𝑠^* 𝜋_𝑠 + {½ Δ𝑓 + ⟨−𝑏_𝑠 + ∇ ln 𝜋_𝑠, ∇𝑓⟩} 𝜋_𝑠, so
𝜕_𝑠 E[𝑓(𝑍_𝑠) 𝑔(𝑍_𝑡)] = −∫ {ℒ_𝑠^*(𝑓 𝜋_𝑠) − 𝑓 ℒ_𝑠^* 𝜋_𝑠} 𝑃_{𝑠,𝑡} 𝑔
= −∫ {½ Δ𝑓 + ⟨−𝑏_𝑠 + ∇ ln 𝜋_𝑠, ∇𝑓⟩} 𝑃_{𝑠,𝑡} 𝑔 d𝜋_𝑠
= −E[{½ Δ𝑓(𝑍_𝑠) + ⟨−𝑏_𝑠(𝑍_𝑠) + ∇ ln 𝜋_𝑠(𝑍_𝑠), ∇𝑓(𝑍_𝑠)⟩} 𝑔(𝑍_𝑡)] .
Comparing this with (8.3.6), we obtain the result.
Applying the time reversal to the RGO. Now let (𝑍𝑡 )𝑡 ∈[0,ℎ] be standard Brownian
motion with 𝑍 0 ∼ 𝜋 𝑋 , so that 𝑍ℎ ∼ 𝜋 𝑌 . Then, (𝑍 0, 𝑍ℎ ) ∼ 𝝅, and in particular the law of
𝑍 0 conditioned on 𝑍ℎ = 𝑦 is 𝜋 𝑋 |𝑌 =𝑦 . On the other hand, this is the same as the law of 𝑍ℎ←
conditioned on 𝑍 0← = 𝑦, where (𝑍𝑡← )𝑡 ∈[0,ℎ] is the time reversal of (𝑍𝑡 )𝑡 ∈[0,ℎ] .
Using the explicit form of the reversed process in (8.3.5), we know that if we instead initialize the process at 𝑍₀← ∼ 𝜇_𝑘^𝑌, then the law of 𝑍_ℎ← is ∫ 𝜋^{𝑋|𝑌=𝑦} d𝜇_𝑘^𝑌(𝑦) = 𝜇_{𝑘+1}^𝑋. Thus, we have successfully exhibited a stochastic process representation which takes us from 𝜇_𝑘^𝑌 to 𝜇_{𝑘+1}^𝑋.
For any measure 𝜇, write 𝜇𝑄𝑡← for the law of 𝑍𝑡← initialized at 𝑍 0← ∼ 𝜇.
Simultaneous backwards heat flow. Next, we will show that the time derivative of
the 𝑓 -divergence along the simultaneous backwards heat flow also behaves the same way
as the simultaneous forwards heat flow. This leads to a pleasing symmetry between the
forwards and backwards steps of the proximal sampler.
Theorem 8.3.8. Let (𝑄𝑡← )𝑡 ∈[0,ℎ] denote the construction described above by reversing
the heat flow started at 𝜋 𝑋 . Then,
𝜕_𝑡 D_𝑓(𝜇𝑄_𝑡← ∥ 𝜋^𝑌𝑄_𝑡←) = −½ J_𝑓(𝜇𝑄_𝑡← ∥ 𝜋^𝑌𝑄_𝑡←) ,
Proof. For brevity, write 𝜇_𝑡← B 𝜇𝑄_𝑡← and 𝜋_𝑡← B 𝜋^𝑌𝑄_𝑡←. By construction of the reversed process, 𝜋_𝑡← = 𝜋^𝑋𝑄_{ℎ−𝑡}. Then, by the Fokker–Planck equation,
𝜕_𝑡 𝜋_𝑡← = −div(𝜋_𝑡← ∇ ln 𝜋_𝑡←) + ½ Δ𝜋_𝑡← = −½ Δ𝜋_𝑡← ,
𝜕_𝑡 𝜇_𝑡← = −div(𝜇_𝑡← ∇ ln 𝜋_𝑡←) + ½ Δ𝜇_𝑡← = div(𝜇_𝑡← ∇ ln(𝜇_𝑡←/𝜋_𝑡←)) − ½ Δ𝜇_𝑡← .
Note that the fact that (𝜋_𝑡←)_{𝑡∈[0,ℎ]} satisfies the backwards heat equation is completely natural in light of our construction via the reversed process.
Hence, we compute
2 𝜕_𝑡 D_𝑓(𝜇_𝑡← ∥ 𝜋_𝑡←) = 2 ∫ 𝑓′(𝜇_𝑡←/𝜋_𝑡←) (𝜕_𝑡 𝜇_𝑡← − (𝜇_𝑡←/𝜋_𝑡←) 𝜕_𝑡 𝜋_𝑡←) + 2 ∫ 𝑓(𝜇_𝑡←/𝜋_𝑡←) 𝜕_𝑡 𝜋_𝑡←
= ∫ 𝑓′(𝜇_𝑡←/𝜋_𝑡←) (2 div(𝜇_𝑡← ∇ ln(𝜇_𝑡←/𝜋_𝑡←)) − Δ𝜇_𝑡← + (𝜇_𝑡←/𝜋_𝑡←) Δ𝜋_𝑡←) − ∫ 𝑓(𝜇_𝑡←/𝜋_𝑡←) Δ𝜋_𝑡←
= 2 ∫ 𝑓′(𝜇_𝑡←/𝜋_𝑡←) div(𝜇_𝑡← ∇ ln(𝜇_𝑡←/𝜋_𝑡←))
− [∫ 𝑓′(𝜇_𝑡←/𝜋_𝑡←) (Δ𝜇_𝑡← − (𝜇_𝑡←/𝜋_𝑡←) Δ𝜋_𝑡←) + ∫ 𝑓(𝜇_𝑡←/𝜋_𝑡←) Δ𝜋_𝑡←] ,
where the bracketed term is denoted (★). The term (★) is exactly the same kind of term we encountered in the proof of Theorem 8.3.3, and by the same calculations it equals −J_𝑓(𝜇_𝑡← ∥ 𝜋_𝑡←). Therefore,
2 𝜕_𝑡 D_𝑓(𝜇_𝑡← ∥ 𝜋_𝑡←) = −2 ∫ ⟨∇(𝑓′ ∘ (𝜇_𝑡←/𝜋_𝑡←)), ∇ ln(𝜇_𝑡←/𝜋_𝑡←)⟩ d𝜇_𝑡← + J_𝑓(𝜇_𝑡← ∥ 𝜋_𝑡←)
= −J_𝑓(𝜇_𝑡← ∥ 𝜋_𝑡←) .
Theorem 8.4.1. Assume that the target 𝜋 𝑋 is log-concave. Then, for the law 𝜇𝑘𝑋 of the
𝑘-th iterate of the proximal sampler,
\[
\operatorname{KL}(\mu_k^X \,\|\, \pi^X) \le \frac{W_2^2(\mu_0^X, \pi^X)}{kh}\,.
\]
Proof. Forwards step. Along the simultaneous heat flow, Theorem 8.3.3 shows that
\[
\partial_t \operatorname{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t) = -\frac{1}{2}\operatorname{FI}(\mu_0^X Q_t \,\|\, \pi^X Q_t)\,,
\]
so we need to lower bound the Fisher information. Also, log-concavity is preserved by
convolution (a consequence of the Prékopa–Leindler inequality), so 𝜋 𝑋 𝑄𝑡 is log-concave. Hence, by convexity of
the KL divergence to a log-concave target along Wasserstein geodesics (Theorem 1.4.5),
\[
0 = \operatorname{KL}(\pi^X Q_t \,\|\, \pi^X Q_t) \ge \operatorname{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t) + \operatorname{E}_{\mu_0^X Q_t}\Bigl\langle \nabla\ln\frac{\mu_0^X Q_t}{\pi^X Q_t},\, T_{\mu_0^X Q_t \to \pi^X Q_t} - \operatorname{id}\Bigr\rangle\,.
\]
Rearranging this and using the Cauchy–Schwarz inequality,
\[
\underbrace{\operatorname{E}_{\mu_0^X Q_t}\Bigl[\Bigl\|\nabla\ln\frac{\mu_0^X Q_t}{\pi^X Q_t}\Bigr\|^2\Bigr]}_{\operatorname{FI}(\mu_0^X Q_t \,\|\, \pi^X Q_t)}\,W_2^2(\mu_0^X Q_t, \pi^X Q_t) \ge \operatorname{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t)^2\,.
\]
Combining this with the fact that the Wasserstein distance is decreasing along the simul-
taneous heat flow,
\[
\partial_t \operatorname{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t) \le -\frac{\operatorname{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t)^2}{2\,W_2^2(\mu_0^X, \pi^X)}\,.
\]
Integrating this differential inequality over $t \in [0, h]$ yields
\[
\frac{1}{\operatorname{KL}(\mu_0^Y \,\|\, \pi^Y)} = \frac{1}{\operatorname{KL}(\mu_0^X Q_h \,\|\, \pi^X Q_h)} \ge \frac{1}{\operatorname{KL}(\mu_0^X \,\|\, \pi^X)} + \frac{h}{2\,W_2^2(\mu_0^X, \pi^X)}\,.
\]
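As a quick sanity check on this inequality (a toy example of ours, not from the text), take a one-dimensional Gaussian mean shift: $\mu_0^X = \operatorname{normal}(m, \sigma^2)$ and $\pi^X = \operatorname{normal}(0, \sigma^2)$. Along the simultaneous heat flow both laws gain variance $t$, so $\operatorname{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t) = m^2/(2(\sigma^2+t))$, while $W_2^2(\mu_0^X, \pi^X) = m^2$:

```python
import math

# Toy 1-D check (our example): mu_0 = N(m, s2), pi = N(0, s2). Along the
# simultaneous heat flow, KL(mu_0 Q_t || pi Q_t) = m**2 / (2 * (s2 + t)),
# and W_2^2(mu_0, pi) = m**2 for a pure mean shift.
m, s2, h = 1.5, 0.7, 2.0

def kl(t):
    return m**2 / (2 * (s2 + t))

W2_sq = m**2
lhs = 1 / kl(h)                      # 1 / KL(mu_0 Q_h || pi Q_h)
rhs = 1 / kl(0) + h / (2 * W2_sq)    # 1 / KL(mu_0 || pi) + h / (2 W_2^2)
assert lhs >= rhs
print(lhs, rhs)                      # 2.4 and about 1.067
```

For these parameters the left-hand side equals $2.4$ and the right-hand side is about $1.07$, so the inequality holds with room to spare.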
Backwards step. Along the simultaneous backwards heat flow, Theorem 8.3.8 gives
\[
\partial_t \operatorname{KL}(\mu_0^Y Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow) = -\frac{1}{2}\operatorname{FI}(\mu_0^Y Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow)\,.
\]
Since $\pi^Y Q_t^\leftarrow = \pi^X Q_{h-t}$ is log-concave and $t \mapsto W_2^2(\mu_0^Y Q_t^\leftarrow, \pi^Y Q_t^\leftarrow)$ is decreasing (which
is checked via a coupling argument using the diffusion (8.3.7)), a similar calculation as the
forwards step leads to the inequality
\[
\frac{1}{\operatorname{KL}(\mu_1^X \,\|\, \pi^X)} = \frac{1}{\operatorname{KL}(\mu_0^Y Q_h^\leftarrow \,\|\, \pi^Y Q_h^\leftarrow)} \ge \frac{1}{\operatorname{KL}(\mu_0^Y \,\|\, \pi^Y)} + \frac{h}{2\,W_2^2(\mu_0^X, \pi^X)}\,.
\]
We iterate these inequalities, using the fact that 𝑊2 (𝜇𝑘𝑋 , 𝜋 𝑋 ) ≤ 𝑊2 (𝜇0𝑋 , 𝜋 𝑋 ) for all
𝑘 ∈ N (which follows from Theorem 8.2.1) to obtain
\[
\frac{1}{\operatorname{KL}(\mu_k^X \,\|\, \pi^X)} \ge \frac{1}{\operatorname{KL}(\mu_0^X \,\|\, \pi^X)} + \frac{kh}{W_2^2(\mu_0^X, \pi^X)}\,,
\]
or
\[
\operatorname{KL}(\mu_k^X \,\|\, \pi^X) \le \frac{\operatorname{KL}(\mu_0^X \,\|\, \pi^X)}{1 + kh \operatorname{KL}(\mu_0^X \,\|\, \pi^X)/W_2^2(\mu_0^X, \pi^X)} \le \frac{W_2^2(\mu_0^X, \pi^X)}{kh}\,. \qquad\square
\]
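Here is a minimal simulation of the proximal sampler in a case where the RGO is available in closed form (a sketch; the one-dimensional Gaussian target is our illustrative choice, not an example from the text):

```python
import math
import random
import statistics

# Toy instance (ours): pi^X = N(0, sigma2) in one dimension, so the RGO is
# explicit: pi^{X|Y=y} = N(y * sigma2 / (sigma2 + h), h * sigma2 / (sigma2 + h)).
random.seed(0)
sigma2, h = 2.0, 0.5

def proximal_sampler_step(x):
    y = x + math.sqrt(h) * random.gauss(0, 1)   # forward step: heat flow Q_h
    mean = y * sigma2 / (sigma2 + h)            # backward step: exact RGO draw
    std = math.sqrt(h * sigma2 / (sigma2 + h))
    return mean + std * random.gauss(0, 1)

xs = [10.0] * 2000                              # mu_0 = delta_{10}, far away
for _ in range(200):
    xs = [proximal_sampler_step(x) for x in xs]
print(statistics.mean(xs), statistics.variance(xs))
```

With these parameters the empirical mean and variance settle near $0$ and $\sigma^2 = 2$, consistent with convergence to $\pi^X$.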
Theorem 8.5.1. Suppose that the target 𝜋 𝑋 satisfies a Poincaré inequality with constant
𝐶 PI . Then, for the law 𝜇𝑘𝑋 of the 𝑘-th iterate of the proximal sampler,
\[
\chi^2(\mu_k^X \,\|\, \pi^X) \le \frac{\chi^2(\mu_0^X \,\|\, \pi^X)}{(1 + h/C_{\mathrm{PI}})^{2k}}\,.
\]
Proof. Forwards step. For the chi-squared divergence, we can check that the dissipation
functional is given by $J(\mu \,\|\, \pi) = 2\operatorname{E}_\pi[\|\nabla(\mu/\pi)\|^2]$. Along the simultaneous heat flow,
by Theorem 8.3.3,
\[
\partial_t \chi^2(\mu_0^X Q_t \,\|\, \pi^X Q_t) = -\frac{1}{2}\,J(\mu_0^X Q_t \,\|\, \pi^X Q_t) = -\operatorname{E}_{\pi^X Q_t}\Bigl[\Bigl\|\nabla\frac{\mu_0^X Q_t}{\pi^X Q_t}\Bigr\|^2\Bigr]\,.
\]
Since 𝜋 𝑋 satisfies a Poincaré inequality with constant 𝐶 PI , by subadditivity of the Poincaré
constant under convolution (Proposition 2.3.7), 𝜋 𝑋 𝑄𝑡 satisfies a Poincaré inequality with
constant at most $C_{\mathrm{PI}} + t$. This yields
\[
\partial_t \chi^2(\mu_0^X Q_t \,\|\, \pi^X Q_t) \le -\frac{1}{C_{\mathrm{PI}} + t}\operatorname{var}_{\pi^X Q_t}\frac{\mu_0^X Q_t}{\pi^X Q_t} = -\frac{1}{C_{\mathrm{PI}} + t}\,\chi^2(\mu_0^X Q_t \,\|\, \pi^X Q_t)\,,
\]
8.5. CONVERGENCE UNDER FUNCTIONAL INEQUALITIES 231
and hence
\[
\chi^2(\mu_0^Y \,\|\, \pi^Y) = \chi^2(\mu_0^X Q_h \,\|\, \pi^X Q_h) \le \exp\Bigl(-\int_0^h \frac{\mathrm{d}t}{C_{\mathrm{PI}} + t}\Bigr)\,\chi^2(\mu_0^X \,\|\, \pi^X) = \frac{\chi^2(\mu_0^X \,\|\, \pi^X)}{1 + h/C_{\mathrm{PI}}}\,.
\]
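The forwards-step bound can be checked in closed form for a Gaussian mean shift (our example, not from the text): for $\mu_0^X = \operatorname{normal}(m, s^2)$ and $\pi^X = \operatorname{normal}(0, s^2)$, one has $\chi^2(\operatorname{normal}(m, v) \,\|\, \operatorname{normal}(0, v)) = \exp(m^2/v) - 1$, and $C_{\mathrm{PI}} = s^2$ for $\operatorname{normal}(0, s^2)$:

```python
import math

# Gaussian mean-shift example (ours): mu_0 = N(m, s2), pi = N(0, s2), so that
# chi^2(N(m, v) || N(0, v)) = exp(m**2 / v) - 1, and C_PI = s2 for N(0, s2).
m, s2 = 1.0, 1.0
chi2_0 = math.exp(m**2 / s2) - 1                # chi^2(mu_0 || pi)
for h in [0.1, 1.0, 10.0]:
    chi2_h = math.exp(m**2 / (s2 + h)) - 1      # chi^2(mu_0 Q_h || pi Q_h)
    assert chi2_h <= chi2_0 / (1 + h / s2)      # forwards-step contraction
```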
Backwards step. Along the simultaneous backwards heat flow, Theorem 8.3.8 yields
\[
\partial_t \chi^2(\mu_0^Y Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow) = -\operatorname{E}_{\pi^Y Q_t^\leftarrow}\Bigl[\Bigl\|\nabla\frac{\mu_0^Y Q_t^\leftarrow}{\pi^Y Q_t^\leftarrow}\Bigr\|^2\Bigr]\,.
\]
Using the fact that 𝜋 𝑌 𝑄𝑡← = 𝜋 𝑋 𝑄ℎ−𝑡 satisfies the Poincaré inequality with constant at
most 𝐶 PI + ℎ − 𝑡, we deduce similarly that
\[
\chi^2(\mu_1^X \,\|\, \pi^X) = \chi^2(\mu_0^Y Q_h^\leftarrow \,\|\, \pi^Y Q_h^\leftarrow) \le \frac{\chi^2(\mu_0^Y \,\|\, \pi^Y)}{1 + h/C_{\mathrm{PI}}}\,.
\]
Iterating the forwards and backwards steps completes the proof. $\square$
A similar result holds for the log-Sobolev inequality; since the proof is entirely analo-
gous, we leave it as Exercise 8.5.
Theorem 8.5.2. Suppose that the target 𝜋 𝑋 satisfies a log-Sobolev inequality with
constant 𝐶 LSI . Then, for the law 𝜇𝑘𝑋 of the 𝑘-th iterate of the proximal sampler,
\[
\operatorname{KL}(\mu_k^X \,\|\, \pi^X) \le \frac{\operatorname{KL}(\mu_0^X \,\|\, \pi^X)}{(1 + h/C_{\mathrm{LSI}})^{2k}}\,.
\]
In the $\alpha$-strongly log-concave case, where $C_{\mathrm{LSI}} \le 1/\alpha$, this contraction factor in KL divergence matches the contraction factor in $W_2^2$ distance (Theorem 8.2.1). To get this sharp result, it is necessary to utilize the backwards step.
Similarly to Theorem 2.2.15, it is also possible to obtain guarantees for Rényi diver-
gences, see Exercise 8.6.
Remark 8.5.3. It is a curious observation that in the $W_2$ guarantee of Theorem 8.2.1, the contraction factor of $\frac{1}{(1+\alpha h)^2}$ occurs solely in the backwards step, whereas in Theorem 8.5.2 the forwards and backwards steps each contribute a contraction factor of $\frac{1}{1+\alpha h}$.
8.6 Applications
The original application of the proximal sampler was for sampling from certain families of
structured log-concave distributions [LST21c]. Since then, the proximal sampler has been
used to provide new guarantees for non-smooth and weakly smooth potentials [GLL22;
LC22a; LC22b]. We will restrict ourselves to applications which are more or less immediate
corollaries of our present analysis.
New guarantees for sampling from smooth potentials. When the potential 𝑉 is
𝛽-smooth, as discussed in Section 8.1, the RGO can be implemented via rejection sampling.
We obtain the following corollaries.
Corollary 8.6.1. Let $\pi^X \propto \exp(-V)$, where $V$ is $\beta$-smooth. Take $h = \frac{1}{\beta d}$ and assume we have an oracle which evaluates $V$ and the proximal operator for $V$. Let $\mu_N^X$ denote the law of the $N$-th iterate of the proximal sampler, in which the RGO is implemented via rejection sampling.
3. (Theorem 8.5.1) If $\pi^X$ satisfies a Poincaré inequality with constant $C_{\mathrm{PI}}$, we obtain the guarantee $\chi^2(\mu_N^X \,\|\, \pi^X) \le \varepsilon$ using $O\bigl(C_{\mathrm{PI}}\,\beta d \log\sqrt{\chi^2(\mu_0^X \,\|\, \pi)/\varepsilon^2}\bigr)$ queries to the oracle in expectation.

4. (Theorem 8.5.2) If $\pi^X$ satisfies a log-Sobolev inequality with constant $C_{\mathrm{LSI}}$, we obtain the guarantee $\operatorname{KL}(\mu_N^X \,\|\, \pi^X) \le \varepsilon$ using $O\bigl(C_{\mathrm{LSI}}\,\beta d \log\sqrt{\operatorname{KL}(\mu_0^X \,\|\, \pi)/\varepsilon^2}\bigr)$ queries to the oracle in expectation.
Note that for the strongly log-concave case, these results are competitive with the
state-of-the-art results for MALA under a feasible start (Theorem 7.3.4)!
suppose that $\pi^X \propto \exp(-V)$ is such that $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$, with condition number
𝜅 B 𝛽/𝛼. Suppose we have a high-accuracy sampler which, given any target satisfying
these assumptions, outputs a sample from a probability measure 𝜇 with k𝜇 − 𝜋 𝑋 k TV ≤ 𝜀
using 𝑂e(𝑓 (𝜅) 𝑑 𝑐 polylog(1/𝜀)) queries, where 𝑓 : R+ → R+ is some increasing function.
Then, by combining this high-accuracy sampler with the proximal sampler, we can obtain
a new sampler whose complexity is only 𝑂 e(𝜅𝑑 𝑐 polylog(𝜅/𝜀)), i.e., we have improved the
dependence on the condition number to near linear.
To see how this works, observe that if we choose the step size $h = \frac{1}{\beta}$ for the proximal sampler, then the RGO $\pi^{X|Y=y}$ has condition number $O(1)$. Thus, the high-accuracy sampler can obtain a $\delta$-approximate sample from the RGO using $\widetilde{O}(d^c \operatorname{polylog}(1/\delta))$ queries. On the other hand, with this choice of step size, we know from Theorem 8.5.2 that with a perfect implementation of the RGO, the number of iterations required for the proximal sampler to output $\mu_N^X$ with $\|\mu_N^X - \pi\|_{\mathrm{TV}} \le \varepsilon$ is $\widetilde{O}(\kappa \log(d/\varepsilon^2))$. To complete the analysis, we need to analyze how the error propagates due to the imperfect implementation of the RGO. This is handled via a coupling argument (Exercise 8.8).
Lemma 8.6.2. Let 𝜇 𝑋𝑁 denote the law of the 𝑁 -th iterate of the proximal sampler with
perfect implementation of the RGO. Suppose that instead, in each step of the proximal
sampler, we use a sample from a distribution which is 𝛿-close to the RGO in total variation
distance; let 𝜇ˆ𝑋𝑁 denote the law of the 𝑁 -th iterate of the proximal sampler with imperfect
implementation of the RGO. Then,
\[
\|\hat\mu_N^X - \mu_N^X\|_{\mathrm{TV}} \le N\delta\,.
\]
Since $N = \widetilde{O}(\kappa \log(d/\varepsilon^2))$, we can take $\delta \asymp \varepsilon/N$. The total complexity of the proximal sampler (the number of iterations $N$ of the proximal sampler multiplied by the cost of approximately implementing the RGO with the high-accuracy sampler) is then $\widetilde{O}(\kappa d^c \operatorname{polylog}(1/\varepsilon))$ as claimed.
In particular, applying this to the Metropolized random walk (MRW) algorithm (Theorem 7.3.4) improves the complexity from $\widetilde{O}(\kappa^2 d \operatorname{polylog}(1/\varepsilon))$ to $\widetilde{O}(\kappa d \operatorname{polylog}(1/\varepsilon))$.
Zeroth-order algorithms for sampling. The example above shows that boosting
the MRW algorithm with the proximal sampler leads to an algorithm whose complexity
is competitive with that of MALA. Moreover, unlike MALA, the algorithm based on
MRW only uses zeroth-order information, which is crucial for certain applications such
as Bayesian inverse problems in which gradient information is prohibitively expensive.
Similarly, implementing the RGO using rejection sampling only uses zeroth-order information.
Lack of discretization analysis. Finally, we mention that the results in Corollary 8.6.1
are state-of-the-art under the various assumptions. A key reason why the proximal
sampler yields powerful complexity guarantees is because there is no “discretization
analysis”. For example, consider sampling from a target distribution satisfying a
Poincaré inequality. Since a Poincaré inequality implies convergence in chi-squared
divergence, it is natural to perform a 𝜒 2 analysis of LMC, but this leads to substantial
new technical hurdles (see Chapter 5). Moreover, under a Poincaré inequality it becomes
non-trivial even to prove moment bounds for the LMC iterates. All of this is handled via
a careful analysis in [Che+21a], but the results there have worse dependence on 𝑑, 𝐶 PI 𝛽,
and 𝜀 −1 . In contrast, Corollary 8.6.1 bypasses all of these difficulties because the proximal
sampler reduces the task of sampling from distributions satisfying a Poincaré inequality
to the task of sampling from strongly log-concave distributions for the implementation of
the RGO, and even this is made straightforward via rejection sampling provided that we
take a small enough step size for the proximal sampler.
Bibliographical Notes
The reader is encouraged to read the original paper [LST21b] on the proximal sampler,
which contains applications to sampling from composite densities 𝜋 ∝ exp(−(𝑓 + 𝑔)),
where 𝑓 is well-conditioned and 𝑔 admits an implementable RGO, as well as to sampling
from log-concave finite sums $\pi \propto \exp(-F)$, where $F \coloneqq n^{-1}\sum_{i=1}^{n} f_i$ is well-conditioned
and the complexity is measured via the number of oracle calls to the individual functions
(𝑓𝑖 )𝑖∈[𝑛] . The proximal sampler has also been used to sample from weakly smooth and non-
smooth potentials [LC22a; LC22b], and it has been applied to the problem of differentially
private convex optimization [GLL22].
The optimization results in Exercise 8.2 obtained in analogy with the proximal sampler
are given in [Che+22a].
Exercises
Introduction to the Proximal Sampler
2. Next, suppose that $\pi^X = \operatorname{normal}(0, \Sigma)$ and that $\mu_0^X = \operatorname{normal}(m_0, \Sigma_0)$. Show that the next iterate of the proximal sampler is $\mu_1^X = \operatorname{normal}(m_1, \Sigma_1)$, where the mean satisfies $m_1 = \operatorname{prox}_{hV}(m_0)$ with $V(x) \coloneqq \frac{1}{2}\langle x, \Sigma^{-1}x\rangle$. In other words, the mean of the iterate of the proximal sampler evolves according to the proximal point method.
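The identity in this exercise can be verified numerically (with a hypothetical covariance; this is only a check of the mean recursion, not a full solution): the forward step leaves the mean unchanged, and the backward step draws from $\pi^{X|Y=y}$, a Gaussian with mean $(\Sigma^{-1} + I/h)^{-1} y/h$.

```python
import numpy as np

# Numerical check (hypothetical covariance): for pi^X = N(0, Sigma), the mean
# of the proximal sampler iterates evolves by the proximal point method.
rng = np.random.default_rng(0)
d, h = 3, 0.7
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                    # a generic SPD covariance
m0 = rng.standard_normal(d)

# Forward step keeps the mean; backward step draws from pi^{X|Y=y}, whose
# mean is (Sigma^{-1} + I/h)^{-1} y / h, so in expectation:
Sinv = np.linalg.inv(Sigma)
m1_sampler = np.linalg.solve(Sinv + np.eye(d) / h, m0 / h)

# Proximal point step: prox_{hV}(m) = argmin_x { V(x) + ||x - m||^2 / (2h) }
#                                   = (I + h Sigma^{-1})^{-1} m.
m1_prox = np.linalg.solve(np.eye(d) + h * Sinv, m0)
assert np.allclose(m1_sampler, m1_prox)
```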
Applications
In order to determine if our sampling guarantees are optimal, we need to pair them with
lower bounds. However, the problem of establishing query complexity lower bounds
for sampling is challenging and the work on this topic is nascent. Here, we will give an
overview of the current progress in this direction.
Theorem 9.1.1 ([Che+22b]). The query complexity of outputting a sample which is $\frac{1}{64}$-close in total variation distance to the target $\pi$, uniformly over the choice of $\pi \in \Pi_\kappa$, is $\Theta(\log\log\kappa)$.
238 CHAPTER 9. LOWER BOUNDS FOR SAMPLING
In what follows, we will make this theorem more precise and give a proof.
Lower bound. The lower bound will hold for any local oracle. Loosely speaking, a local oracle accepts as input a point $x \in \mathbb{R}$ and outputs some information about the target $\pi$, such that if $\hat\pi$ is another possible target and $\pi \propto \hat\pi$ in some neighborhood of $x$, then the output of the oracle is the same for both $\pi$ and $\hat\pi$. This just formalizes the idea that the oracle only outputs information about $\pi$ “near the point $x$”. To simplify the discussion, however, we will suppose for concreteness that we have access to a second-order oracle: given $x \in \mathbb{R}$, it outputs the triple $(V(x), V'(x), V''(x))$, where we recall that $V$ is only specified up to an additive constant. (If this is confusing, you may instead suppose that the oracle outputs the triple $(V(x) - V(0), V'(x), V''(x))$, where $\pi = \exp(-V)$.)
The lower bound will proceed in two stages.

1. First, we will construct a family of distributions $\pi_1, \dots, \pi_m \in \Pi_\kappa$ and reduce the sampling problem to the statistical problem of testing which of them generated an observed sample: “sampling solves testing”.

2. Next, we will prove a lower bound on the number of queries required to solve the statistical testing problem: “testing is hard”. This relies on standard information-theoretic techniques for proving minimax lower bounds for statistical problems. The main difference between this problem and the usual statistical setting is that rather than having i.i.d. samples from some data distribution, we instead have query access and the algorithm is allowed to be adaptive.
Combining the two steps then yields our query complexity lower bound for sampling. We
begin with the construction of 𝜋1, . . . , 𝜋𝑚 , which is slightly tricky.
Let $m$ be the largest integer such that $\exp\bigl(-\frac{2^{2m-2}}{2\kappa}\bigr) \ge \frac{1}{2}$ (and note that $m = \Theta(\log\kappa)$). We define two auxiliary functions
\[
\phi(x) \coloneqq
\begin{cases}
\kappa\,, & \tfrac{1}{2} \le x < 1\,, \\
1\,, & 1 \le x < 2\,, \\
\kappa\,, & 2 \le x < \tfrac{5}{2}\,, \\
0\,, & \text{otherwise}\,,
\end{cases}
\qquad
\psi(x) \coloneqq
\begin{cases}
1\,, & \tfrac{5}{2} \le x < 4\,, \\
\kappa\,, & 4 \le x < 5\,, \\
0\,, & \text{otherwise}\,.
\end{cases}
\]
9.1. A QUERY COMPLEXITY RESULT IN ONE DIMENSION 239
We define a family $(V_i)_{i\in[m]}$ of 1-strongly convex and $\kappa$-smooth potentials as follows. We require that $V_i(0) = V_i'(0) = 0$ and that $V_i$ be an even function, so it suffices to specify $V_i''$ on $\mathbb{R}_+$. The second derivative is given by
\[
V_i''(x) \coloneqq \mathbb{1}\{x \le \kappa^{-1/2}2^{i-1}\} + \phi\Bigl(\frac{x}{\kappa^{-1/2}2^{i}}\Bigr) + \sum_{j=i}^{m-1} \psi\Bigl(\frac{x}{\kappa^{-1/2}2^{j}}\Bigr) + \mathbb{1}\{x \ge 5\kappa^{-1/2}2^{m-1}\}\,, \qquad x \ge 0\,.
\]
Observe that all of the terms in the above summation have disjoint supports. Although
the construction seems complicated, the basic idea is to make 𝑉𝑖00 oscillate between its
minimum and maximum allowable values 1 and 𝜅; see Figure 9.1 for a visual.
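The construction is easy to sketch in code (our sketch, following the displayed formulas; the parameters are arbitrary). The check below confirms that each $V_i''$ only takes the extreme values $1$ and $\kappa$, i.e. the potentials are $1$-strongly convex and $\kappa$-smooth:

```python
import math
import random

# Sketch of the one-dimensional lower-bound construction (our code, following
# the displayed formulas for phi, psi, and V_i'').
def phi(x, kappa):
    if 0.5 <= x < 1: return kappa
    if 1 <= x < 2: return 1.0
    if 2 <= x < 2.5: return kappa
    return 0.0

def psi(x, kappa):
    if 2.5 <= x < 4: return 1.0
    if 4 <= x < 5: return kappa
    return 0.0

def V_pp(x, i, m, kappa):
    s = math.sqrt(kappa)                        # kappa**(-1/2) = 1/s
    val = 1.0 if x <= 2**(i - 1) / s else 0.0
    val += phi(x * s / 2**i, kappa)
    val += sum(psi(x * s / 2**j, kappa) for j in range(i, m))
    val += 1.0 if x >= 5 * 2**(m - 1) / s else 0.0
    return val

random.seed(1)
kappa, m = 16.0, 4
for i in range(1, m + 1):
    pts = [random.uniform(0.0, 6 * 2**m / math.sqrt(kappa)) for _ in range(5000)]
    vals = {V_pp(x, i, m, kappa) for x in pts}
    assert vals <= {1.0, kappa} and 1.0 in vals and kappa in vals
```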
[Figure 9.1 here: plot of $x \mapsto V_i''(x/\sqrt{\kappa})$, alternating between the values $1$ and $\kappa$.]

Figure 9.1: The dashed lines correspond to $\phi$ and the dotted lines correspond to $\psi$. Here, the horizontal axis is distorted for clarity.
There are two key properties of this construction. First, we will show in Lemma 9.1.2 that each $\pi_i$ places a substantial amount of mass on the interval $(\kappa^{-1/2}2^{i-2}, \kappa^{-1/2}2^{i-1}]$. This implies that if we can sample from $\pi_i$, it is likely that the sample will land in this interval, which is used to reduce the sampling task to the statistical testing task. Then, we will show in Lemma 9.1.4 that $V_i$ and $V_{i+1}$ agree exactly outside of a small interval which is approximately located at $\kappa^{-1/2}2^{i}$. This implies that for any given value $x \in \mathbb{R}$, there are only $O(1)$ possible values of $(V_i(x), V_i'(x), V_i''(x))$ as $i$ ranges in $[m]$, which in turn will be used to show that the oracle is not very informative (and hence prove a lower bound for the statistical testing task).

The intuition behind the following lemma is that at $\kappa^{-1/2}2^{i-1}$, $V_i'' = \kappa$ for the first time, and so the density $\pi_i$ drops off rapidly after this point.
\[
\pi_i\bigl((\kappa^{-1/2}2^{i-2}, \kappa^{-1/2}2^{i-1}]\bigr) \ge \frac{2^{2i-3}}{2^{i}\,(2^{i}+1)} \ge \frac{1}{32}\,,
\]
which proves the result. $\square$
In the above proof, we used the following lemma (see Exercise 9.1).
The next lemma is the main reason why we used an oscillating construction for 𝑉𝑖00.
[Figure 9.2 here: plots of $V_i''$ and $V_{i+1}''$ near $\kappa^{-1/2}2^{i}$.]

Figure 9.2: As in Figure 9.1, we have distorted the horizontal axis lengths to make it easier to visually compare the relative lengths of intervals on which the second derivatives are constant.
Proof. Refer to Figure 9.2 for a visual aid for the proof.
Clearly the potentials and their derivatives match when $|x| \le \kappa^{-1/2}2^{i-1}$. Since the second derivatives match when $|x| \ge \frac{5}{4}\kappa^{-1/2}2^{i+1}$, it suffices to show that
\[
V_i'\bigl(\tfrac{5}{4}\kappa^{-1/2}2^{i+1}\bigr) = V_{i+1}'\bigl(\tfrac{5}{4}\kappa^{-1/2}2^{i+1}\bigr)
\qquad\text{and}\qquad
V_i\bigl(\tfrac{5}{4}\kappa^{-1/2}2^{i+1}\bigr) = V_{i+1}\bigl(\tfrac{5}{4}\kappa^{-1/2}2^{i+1}\bigr)\,.
\]
To that end, note that for $x \ge 0$,
\[
V_{i+1}''(x) - V_i''(x)
= \mathbb{1}\{\kappa^{-1/2}2^{i-1} < x \le \kappa^{-1/2}2^{i}\}
- \phi\Bigl(\frac{x}{\kappa^{-1/2}2^{i}}\Bigr)
+ \phi\Bigl(\frac{x}{\kappa^{-1/2}2^{i+1}}\Bigr)
- \psi\Bigl(\frac{x}{\kappa^{-1/2}2^{i}}\Bigr)
\]
\[
= \begin{cases}
-(\kappa - 1)\,, & \kappa^{-1/2}2^{i-1} \le x \le \kappa^{-1/2}2^{i}\,, \\
+(\kappa - 1)\,, & \kappa^{-1/2}2^{i} \le x \le \kappa^{-1/2}2^{i+1}\,, \\
-(\kappa - 1)\,, & \kappa^{-1/2}2^{i+1} \le x \le \tfrac{5}{4}\kappa^{-1/2}2^{i+1}\,, \\
0\,, & \text{otherwise}\,.
\end{cases}
\]
A little algebra shows that the above expression integrates to zero, hence we deduce the equality $V_i'(\frac{5}{4}\kappa^{-1/2}2^{i+1}) = V_{i+1}'(\frac{5}{4}\kappa^{-1/2}2^{i+1})$. Also, by integrating this expression twice,
\[
V_{i+1}\bigl(\tfrac{5}{4}\kappa^{-1/2}2^{i+1}\bigr) - V_i\bigl(\tfrac{5}{4}\kappa^{-1/2}2^{i+1}\bigr)
= \underbrace{-\frac{\kappa-1}{2}\,(\kappa^{-1/2}2^{i-1})^2}_{\text{integral on }[\kappa^{-1/2}2^{i-1},\,\kappa^{-1/2}2^{i}]}
\underbrace{{}-(\kappa-1)\,\kappa^{-1/2}2^{i-1}\,\kappa^{-1/2}2^{i} + \frac{\kappa-1}{2}\,(\kappa^{-1/2}2^{i})^2}_{\text{integral on }[\kappa^{-1/2}2^{i},\,\kappa^{-1/2}2^{i+1}]}
\]
\[
\underbrace{{}+\frac{\kappa-1}{4}\,\kappa^{-1/2}2^{i-1}\,\kappa^{-1/2}2^{i+1} - \frac{\kappa-1}{2}\,\Bigl(\frac{\kappa^{-1/2}2^{i+1}}{4}\Bigr)^2}_{\text{integral on }[\kappa^{-1/2}2^{i+1},\,\frac{5}{4}\kappa^{-1/2}2^{i+1}]}
= \frac{\kappa-1}{\kappa}\,\{-2^{2i-3} - 2^{2i-1} + 2^{2i-1} + 2^{2i-2} - 2^{2i-3}\}
= 0\,. \qquad\square
\]
We need one final ingredient: Fano’s inequality, which is the standard tool for
establishing information-theoretic lower bounds.
Theorem 9.1.5 (Fano's inequality). Let $\boldsymbol{i} \sim \operatorname{uniform}([m])$. Then, for any estimator $\hat{\boldsymbol{i}}$ of $\boldsymbol{i}$, where $\hat{\boldsymbol{i}}$ is measurable with respect to some data $Y$,
\[
\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \ge 1 - \frac{\mathrm{I}(\boldsymbol{i}; Y) + \ln 2}{\ln m}\,,
\]
where $\mathrm{I}$ is the mutual information $\mathrm{I}(\boldsymbol{i}; Y) \coloneqq \operatorname{KL}(\operatorname{law}(\boldsymbol{i}, Y) \,\|\, \operatorname{law}(\boldsymbol{i}) \otimes \operatorname{law}(Y))$.
Proof. Let $\mathrm{H}(\cdot)$ denote the entropy of a discrete random variable, i.e., if $X$ has law $p$ on a discrete alphabet $\mathcal{X}$, then $\mathrm{H}(X) = \sum_{x\in\mathcal{X}} p(x) \ln(1/p(x))$. We refer to [CT06, Chapter 2] for the basic properties of entropy (and related quantities).
Let $E \coloneqq \mathbb{1}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\}$ denote the indicator of an error. Using the chain rule for entropy in two different ways,
\[
\mathrm{H}(\boldsymbol{i}, E \mid \hat{\boldsymbol{i}})
= \mathrm{H}(\boldsymbol{i} \mid \hat{\boldsymbol{i}}) + \underbrace{\mathrm{H}(E \mid \boldsymbol{i}, \hat{\boldsymbol{i}})}_{=0}
= \mathrm{H}(E \mid \hat{\boldsymbol{i}}) + \mathrm{H}(\boldsymbol{i} \mid E, \hat{\boldsymbol{i}})\,.
\]
Since $\mathrm{H}(E \mid \hat{\boldsymbol{i}}) \le \ln 2$ and $\mathrm{H}(\boldsymbol{i} \mid E, \hat{\boldsymbol{i}}) \le \mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\}\ln m$, it follows that
\[
\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\}\ln m + \ln 2 \ge \mathrm{H}(\boldsymbol{i} \mid \hat{\boldsymbol{i}}) = \mathrm{H}(\boldsymbol{i}) - \mathrm{I}(\boldsymbol{i}; \hat{\boldsymbol{i}}) \ge \ln m - \mathrm{I}(\boldsymbol{i}; Y)\,,
\]
where the last inequality is the data-processing inequality. Rearranging the inequality completes the proof of Fano's inequality. $\square$
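A toy illustration (ours): if the data $Y$ is independent of $\boldsymbol{i}$, so that $\mathrm{I}(\boldsymbol{i}; Y) = 0$, Fano's inequality says that any estimator, in particular random guessing, errs with probability at least $1 - \ln 2/\ln m$:

```python
import math
import random

# If Y is independent of i, then I(i; Y) = 0 and Fano gives the bound below
# for every estimator; random guessing errs with probability (m - 1) / m.
random.seed(2)
m, trials = 16, 20000
fano_bound = 1 - math.log(2) / math.log(m)          # = 0.75 for m = 16
errors = sum(random.randrange(m) != random.randrange(m) for _ in range(trials))
assert errors / trials >= fano_bound
```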
Proof of Theorem 9.1.1, lower bound. We follow the general outline described above.
1. Reduction to statistical testing. Let $\boldsymbol{i} \sim \operatorname{uniform}([m])$ and suppose that for each $i \in [m]$, $\hat\pi_i$ is a distribution with $\|\hat\pi_i - \pi_i\|_{\mathrm{TV}} \le \frac{1}{64}$. Suppose that we have a sample $X \sim \hat\pi_{\boldsymbol{i}}$ (more precisely, this means that conditioned on $\boldsymbol{i} = i$, we have $X \sim \hat\pi_i$). In light
of Lemma 9.1.2, a good candidate estimator $\hat{\boldsymbol{i}}$ for $\boldsymbol{i}$ is
\[
\hat{\boldsymbol{i}} \coloneqq i \in \mathbb{N} \ \text{such that}\ X \in (\kappa^{-1/2}2^{i-2}, \kappa^{-1/2}2^{i-1}]\,,\ \text{if such an } i \text{ exists}\,.
\]
Then,
\[
\mathbb{P}\{\hat{\boldsymbol{i}} = \boldsymbol{i}\}
= \frac{1}{m}\sum_{i=1}^{m} \mathbb{P}\{\hat{\boldsymbol{i}} = i \mid \boldsymbol{i} = i\}
= \frac{1}{m}\sum_{i=1}^{m} \mathbb{P}\bigl\{X \in (\kappa^{-1/2}2^{i-2}, \kappa^{-1/2}2^{i-1}] \bigm| \boldsymbol{i} = i\bigr\}
\]
\[
= \frac{1}{m}\sum_{i=1}^{m} \hat\pi_i\bigl((\kappa^{-1/2}2^{i-2}, \kappa^{-1/2}2^{i-1}]\bigr)
\ge \frac{1}{m}\sum_{i=1}^{m} \Bigl[\pi_i\bigl((\kappa^{-1/2}2^{i-2}, \kappa^{-1/2}2^{i-1}]\bigr) - \frac{1}{64}\Bigr]
\ge \frac{1}{64}\,. \tag{9.1.6}
\]
First, suppose that the algorithm is deterministic, i.e., we assume that each query point $x_j$ of the algorithm is a deterministic function of the previous query points and query values. Let $\mathcal{O}_i(x) \coloneqq (V_i(x), V_i'(x), V_i''(x))$ denote the output of the oracle on input $x$ when the target is $\pi_i$. Since the estimator $\hat{\boldsymbol{i}}$ is a function of $\{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j\in[n]}$, Fano's inequality (Theorem 9.1.5) yields
\[
\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \ge 1 - \frac{\mathrm{I}(\boldsymbol{i}; \{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j\in[n]}) + \ln 2}{\ln m}\,.
\]
By the chain rule for mutual information,
\[
\mathrm{I}\bigl(\boldsymbol{i}; \{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j\in[n]}\bigr)
= \sum_{j=1}^{n} \mathrm{I}\bigl(\boldsymbol{i}; (x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)) \bigm| \{x_{j'}, \mathcal{O}_{\boldsymbol{i}}(x_{j'})\}_{j'\in[j-1]}\bigr)\,. \tag{9.1.7}
\]
\[
\mathrm{I}\bigl(\boldsymbol{i}; \{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j\in[n]}\bigr) \le n \ln 5\,.
\]
Upper bound. To show that the lower bound is tight, we exhibit an algorithm, based on rejection sampling, which achieves the lower complexity bound. As per our discussion in Section 7.1, to implement rejection sampling we must specify the construction of an upper envelope $\tilde\mu \ge \tilde\pi$, where $\pi \in \Pi_\kappa$.
Without loss of generality, we assume that 𝑉 (0) = 0 (if not, replace the output 𝑉 (𝑥) of
an oracle query with 𝑉 (𝑥)−𝑉 (0)). The upper bound algorithm only requires a zeroth-order
oracle, and it is as follows.
1. Find the first index $i_- \in \{0, 1, \dots, \lceil\frac{1}{2}\log_2\kappa\rceil\}$ such that $V(-2^{i_-}/\sqrt{\kappa}) \ge \frac{1}{2}$.

2. Find the first index $i_+ \in \{0, 1, \dots, \lceil\frac{1}{2}\log_2\kappa\rceil\}$ such that $V(+2^{i_+}/\sqrt{\kappa}) \ge \frac{1}{2}$.

3. Set $x_- \coloneqq -2^{i_-}/\sqrt{\kappa}$ and $x_+ \coloneqq +2^{i_+}/\sqrt{\kappa}$; then, set
\[
\tilde\mu(x) \coloneqq
\begin{cases}
\exp\bigl(-\frac{x - x_-}{2x_-} - \frac{(x - x_-)^2}{2}\bigr)\,, & x \le x_-\,, \\
1\,, & x_- \le x \le x_+\,, \\
\exp\bigl(-\frac{x - x_+}{2x_+} - \frac{(x - x_+)^2}{2}\bigr)\,, & x \ge x_+\,.
\end{cases}
\]
To see why $i_-$ and $i_+$ exist, from $V'' \ge 1$ and $V(0) = V'(0) = 0$ we have $V(x) \ge x^2/2$. Hence, if $|x| = 2^{i}/\sqrt{\kappa}$ where $i \ge \frac{1}{2}\log_2\kappa$, we have $V(x) \ge 1/2$.
Since $V$ is decreasing (resp. increasing) on $\mathbb{R}_-$ (resp. $\mathbb{R}_+$), the first two steps can be implemented by running binary search over arrays of size $O(\log\kappa)$, which therefore only requires $O(\log\log\kappa)$ queries. We will prove that $\tilde\mu$ is a valid upper envelope for the unnormalized target $\tilde\pi \coloneqq \exp(-V)$, and that $Z_{\tilde\mu}/Z_{\tilde\pi} \lesssim 1$. In turn, Theorem 7.1.1 shows that once $\tilde\mu$ is constructed, an exact sample can be drawn from $\pi$ using $O(1)$ additional queries in expectation.
Alternatively, if we require that the algorithm use a fixed (non-random) number of
iterations, then note that in order to make the failure probability (the probability that
rejection sampling fails to terminate within the allotted number of iterations) at most 𝜀,
it suffices to run rejection sampling for 𝑂 (log(1/𝜀)) steps. Combining this with the cost
of constructing $\tilde\mu$, we conclude that we can output a sample whose law is 𝜀-close to 𝜋 in
total variation distance using 𝑂 (log log 𝜅 + log(1/𝜀)) queries.
Proof of Theorem 9.1.1, upper bound. First, we prove that $\tilde\mu$ is a valid upper envelope. Since $\tilde\pi$ is decreasing on $\mathbb{R}_+$ with $\tilde\pi(0) = \exp(-V(0)) = 1$, then $\tilde\pi \le 1 \le \tilde\mu$ on $[0, x_+]$. Next, since $V(x_+) \ge 1/2$ (by the definition of $x_+$), convexity of $V$ yields
\[
V'(x_+) \ge \frac{V(x_+) - V(0)}{x_+} \ge \frac{1}{2x_+}\,.
\]
Thus, for $x \ge x_+$,
\[
V(x) \ge V(x_+) + V'(x_+)\,(x - x_+) + \frac{1}{2}\,(x - x_+)^2 \ge \frac{1}{2x_+}\,(x - x_+) + \frac{1}{2}\,(x - x_+)^2\,,
\]
which shows that $\tilde\pi \le \tilde\mu$ on $[x_+, \infty)$; the case of $\mathbb{R}_-$ is similar.
We claim that $\int_0^{x_+} \tilde\pi \ge x_+/3$; if $i_+ = 0$, this holds since $x_+ = 1/\sqrt{\kappa}$ and, by $\kappa$-smoothness,
\[
\int_0^{x_+} \tilde\pi = \int_0^{x_+} \exp(-V) \ge \int_0^{1/\sqrt{\kappa}} \exp\Bigl(-\frac{\kappa x^2}{2}\Bigr)\,\mathrm{d}x \ge \frac{1}{3\sqrt{\kappa}} = \frac{x_+}{3}\,.
\]
On the other hand,
\[
\int_{\mathbb{R}_+} \tilde\mu = \int_0^{x_+} \tilde\mu + \int_{x_+}^{\infty} \tilde\mu \le x_+ + \int_{x_+}^{\infty} \exp\Bigl(-\frac{x - x_+}{2x_+} - \frac{(x - x_+)^2}{2}\Bigr)\,\mathrm{d}x \le 3x_+\,.
\]
Hence, $\int_{\mathbb{R}_+} \tilde\mu \le 3x_+ \le 12\int_{\mathbb{R}_+} \tilde\pi$, and similarly $\int_{\mathbb{R}_-} \tilde\mu \le 12\int_{\mathbb{R}_-} \tilde\pi$. Therefore, $Z_{\tilde\mu}/Z_{\tilde\pi} \le 12$. $\square$
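The envelope construction can be checked numerically on a sample potential (our choice: $V'' = 1 + (\kappa - 1)\operatorname{sech}^2$, which is $1$-strongly convex and $\kappa$-smooth with $V(0) = V'(0) = 0$; a linear scan stands in for the binary search):

```python
import math

kappa = 25.0
s = math.sqrt(kappa)

def V(x):                                     # V'' in [1, kappa], V(0)=V'(0)=0
    return x * x / 2 + (kappa - 1) * math.log(math.cosh(x))

# Steps 1-2: first index with V(2^i / sqrt(kappa)) >= 1/2 (V is even, so the
# two searches coincide here).
i_plus = next(i for i in range(math.ceil(math.log2(s)) + 1)
              if V(2**i / s) >= 0.5)
x_plus = 2**i_plus / s
x_minus = -x_plus

def mu_tilde(x):                              # the upper envelope from step 3
    if x <= x_minus:
        return math.exp(-(x - x_minus) / (2 * x_minus) - (x - x_minus)**2 / 2)
    if x >= x_plus:
        return math.exp(-(x - x_plus) / (2 * x_plus) - (x - x_plus)**2 / 2)
    return 1.0

xs = [-10 + 20 * k / 4000 for k in range(4001)]
assert all(mu_tilde(x) >= math.exp(-V(x)) for x in xs)   # valid envelope
Z_mu = sum(mu_tilde(x) for x in xs) * 0.005              # Riemann sums
Z_pi = sum(math.exp(-V(x)) for x in xs) * 0.005
assert Z_mu / Z_pi <= 12
```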
Lower bounds for particular algorithms. As already discussed in Section 7.3, the
works [Che+21b; LST21a; WSC21] obtain lower bounds for the complexity of MALA,
9.2. OTHER APPROACHES 247
culminating in a precise understanding of the runtime of MALA both from feasible and
warm start initializations.
The paper [CLW21] provides an approach to proving lower bounds for discretization
schemes. In their setup, there is a stochastic process (𝑍𝑡 )𝑡 ≥0 , driven by some underlying
Brownian motion (𝐵𝑡 )𝑡 ≥0 ; for example, the process (𝑍𝑡 )𝑡 ≥0 could be the Langevin diffusion
or the underdamped Langevin diffusion (Section 6.3). The algorithm is allowed to make
queries to the potential $V$, as well as certain queries to the driving Brownian motion $(B_t)_{t\ge 0}$, and the goal of the algorithm is to output a point $\hat Z_T$ which is close to $Z_T$ in mean squared error: $\operatorname{E}[\|Z_T - \hat Z_T\|^2] \le \varepsilon^2$. Within this framework, they prove that the randomized midpoint discretization (introduced in Section 6.1) is optimal for simulating the underdamped Langevin dynamics (see Theorem 6.3.8 for the upper bound).
Estimating the normalizing constant. In [RV08], the authors consider the number
of membership queries needed to estimate the volume of a convex body 𝐾 ⊆ R𝑑 such that
$\mathrm{B}(0,1) \subseteq K \subseteq \mathrm{B}(0, O(d^8))$ to within a small multiplicative constant; their lower bound for this problem is $\widetilde{\Omega}(d^2)$. In comparison, the state-of-the-art upper bound for volume computation is $\widetilde{O}(d^3)$ (see [CV18; Jia+21]).
In [GLL20], the authors consider the problem of estimating the normalizing constant $Z_\pi \coloneqq \int \tilde\pi$ from queries to the unnormalized density $\tilde\pi$. Based on a multilevel Monte Carlo
scheme, they show that sampling algorithms can be turned into approximation algorithms
for the normalizing constant, with the cost of an extra 𝑂 (𝑑) dimension dependence in
the reduction. By combining this with the randomized midpoint discretization of the underdamped Langevin diffusion (Theorem 6.3.8), they show that a $1 \pm \varepsilon$ multiplicative approximation to $Z_\pi$ can be obtained using $\widetilde{O}((d^{4/3}\kappa + d^{7/6}\kappa^{7/6})/\varepsilon^2)$ queries (in the strongly log-concave case).
They then prove that Ω(𝑑 1−𝑜 (1) /𝜀 2−𝑜 (1) ) queries are necessary to obtain a 1 ± 𝜀 multi-
plicative approximation to 𝑍 𝜋 . Unfortunately, due to the 𝑂 (𝑑) loss in the reduction from
estimating the normalizing constant to sampling, this does not imply a non-trivial lower
bound for the task of sampling.
Lower bound for a stochastic oracle. In [CBL22], the authors obtain a lower bound
on the complexity of sampling using a stochastic oracle. Namely, in order to output an
𝜀-approximate sample (in TV distance) from an 𝛼-strongly log-concave and 𝛽-log-smooth
distribution whose mean lies in the ball B(0, 1/𝛼), with an oracle that given 𝑥 ∈ R𝑑
outputs ∇𝑉 (𝑥) + 𝜉 with 𝜉 ∼ normal(0, Σ) and tr Σ ≤ 𝜎 2𝑑, the number of queries required
is at least $\Omega(\sigma^2 d/\varepsilon^2)$. On the other hand, when $\alpha, \sigma \asymp 1$, this complexity is achieved via stochastic gradient Langevin Monte Carlo.
Bibliographical Notes
Since the theory of lower bounds for sampling is still early in its development, there are
not too many works yet in this direction. Section 9.2 contains a brief survey.
Recently, the paper [GLL22] obtains a query complexity bound for the following class
of target distributions:
\[
\Pi_{\alpha,L} \coloneqq \Bigl\{\pi \in \mathcal{P}_{\mathrm{ac}}\bigl(\mathrm{B}(0,1)\bigr) \;\Bigm|\; \pi \propto \exp\Bigl(-\sum_{i=1}^{\infty} f_i - \frac{\alpha}{2}\,\|\cdot\|^2\Bigr)\,,\ f_i : \mathrm{B}(0,1) \to \mathbb{R} \text{ is } L\text{-Lipschitz}\Bigr\}\,.
\]
They show that the minimum number of queries to the individual functions $(f_i)_{i=1}^{\infty}$ required to obtain a sample which is $\varepsilon$-close to a target $\pi \in \Pi_{\alpha,L}$ is, in the regime $d \gtrsim L^2/\alpha$, of the order $\widetilde{\Theta}(L^2/\alpha)$. The upper bound is based on the proximal sampler (Section 8.1), whereas the lower bound, which in this context reduces sampling to an optimization task, relies on information-theoretic arguments.
Exercises
A Query Complexity Result in One Dimension
Structured Sampling
So far, we have only considered sampling within the black-box model, in which we only
have access to oracle queries to the potential and its gradient. We will now consider
several new sampling algorithms which go beyond the black-box model.
Historically, mirror descent was introduced by Nemirovsky and Yudin [NY83] with the
following intuition. Suppose we are optimizing a function 𝑉 which is not defined over the
Euclidean space R𝑑 , but rather over a Banach space B. Then, the gradient of 𝑉 is not an
element of B but rather of the dual space B∗ , and so the gradient descent iteration (10.2.1)
does not even make sense. On the other hand, the mirror descent iteration (10.2.2) works
because the primal point 𝑥𝑘 ∈ B is first mapped to the dual space B∗ via the mapping ∇𝜙.
This reasoning is not so esoteric as it may seem, because even for a function 𝑉 defined
over R𝑑 its natural geometry may correspond to a different norm (e.g., the ℓ1 norm), in
which case R𝑑 is better viewed as a Banach space.
Our aim is to understand the sampling analogue of mirror descent, known as mirror
Langevin. We will keep in mind the key example of constrained sampling. Here, the
potential $V : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ has domain $\mathcal{X} \subsetneq \mathbb{R}^d$. In this case, the standard Langevin
algorithm leaves the constraint set X which is undesirable; in particular, it is not possible
to obtain guarantees in metrics such as KL divergence because the law of the iterate of the
algorithm is not absolutely continuous with respect to the target. Projecting the iterates
onto X does not solve this issue because the law of the iterate will then have positive mass
on the boundary 𝜕X. Besides, projection may not adapt well to the shape of the constraint
set X. Instead, the use of a mirror map 𝜙 which is a barrier for X can automatically enforce
the constraint.
Here, the diffusion term is no longer an isotropic Brownian motion but rather involves the
matrix $[\nabla^2\phi(Z_t)]^{1/2}$; this is necessary in order to ensure that the stationary distribution
is 𝜋. Also, we have given the SDE in the dual space. Using Itô’s formula (Theorem 1.1.18),
one can write down an SDE for (𝑍𝑡 )𝑡 ≥0 in the primal space, but it is more complicated,
involving the third derivative tensor of 𝜙 (see Exercise 10.1), and as such we prefer to
work with the representation (10.2.3).
Using (10.2.3), we can compute the generator ℒ, the carré du champ Γ, and the
Dirichlet energy ℰ of the mirror Langevin diffusion (see Section 1.2 and Exercise 10.2):
\[
\Gamma(f, g) = \langle \nabla f, [\nabla^2\phi]^{-1}\,\nabla g\rangle\,, \qquad
\mathcal{E}(f, g) = \int \langle \nabla f, [\nabla^2\phi]^{-1}\,\nabla g\rangle \,\mathrm{d}\pi\,. \tag{10.2.4}
\]
10.2. MIRROR LANGEVIN 251
The expression shows that the mirror Langevin diffusion is reversible with respect to 𝜋.
Also, if 𝜋𝑡 denotes the law of 𝑍𝑡 , then
\[
\partial_t \int f \,\mathrm{d}\pi_t = \int f\,\partial_t\pi_t = \int \mathcal{L}f\,\frac{\pi_t}{\pi}\,\mathrm{d}\pi
= -\int \Bigl\langle \nabla f, [\nabla^2\phi]^{-1}\,\nabla\frac{\pi_t}{\pi}\Bigr\rangle \,\mathrm{d}\pi \tag{10.2.5}
\]
\[
= -\int \Bigl\langle \nabla f, [\nabla^2\phi]^{-1}\,\nabla\ln\frac{\pi_t}{\pi}\Bigr\rangle \,\mathrm{d}\pi_t
= \int f \operatorname{div}\Bigl(\pi_t\,[\nabla^2\phi]^{-1}\,\nabla\ln\frac{\pi_t}{\pi}\Bigr)\,, \tag{10.2.6}
\]
from which we deduce the Fokker–Planck equation
\[
\partial_t\pi_t = \operatorname{div}\Bigl(\pi_t\,[\nabla^2\phi]^{-1}\,\nabla\ln\frac{\pi_t}{\pi}\Bigr)\,.
\]
From the interpretation as a continuity equation (see Theorem 1.3.17), we deduce that
(𝜋𝑡 )𝑡 ≥0 describes the evolution of a particle which travels according to the family of vector
fields $t \mapsto -[\nabla^2\phi]^{-1}\,\nabla\ln(\pi_t/\pi)$. Recalling that $\nabla\ln(\pi_t/\pi)$ is the Wasserstein gradient of
KL(· k 𝜋) at 𝜋𝑡 , we can interpret the mirror Langevin diffusion as a “mirror flow” of the
KL divergence in Wasserstein space.
Alternatively, we can equip X with the Riemannian metric induced by ∇2𝜙, i.e., we
set h𝑢, 𝑣i𝑥 B h𝑢, ∇2𝜙 (𝑥) 𝑣i. Then, the mirror Langevin diffusion becomes the Wasserstein
gradient flow of the KL divergence over the Riemannian manifold X (see Section 2.6.1).
The Newton Langevin diffusion. In the special case when the mirror map 𝜙 is chosen
to be the same as the potential 𝑉 , we arrive at a sampling analogue of Newton’s algorithm,
and hence we call it the Newton Langevin diffusion. The equation for the Newton
Langevin diffusion can be written (in the dual space) as
\[
\mathrm{d}Z_t^* = -Z_t^*\,\mathrm{d}t + \sqrt{2}\,[\nabla^2 V^*(Z_t^*)]^{-1/2}\,\mathrm{d}B_t\,, \tag{10.2.7}
\]
Can we expect similar properties to hold for the Newton Langevin diffusion? At least for the property of affine invariance, we have the Brascamp–Lieb inequality (Theorem 2.2.8): if $\pi \propto \exp(-V)$ is strictly log-concave, then for all $f : \mathbb{R}^d \to \mathbb{R}$,
\[
\operatorname{var}_\pi f \le \operatorname{E}_\pi\langle \nabla f, [\nabla^2 V]^{-1}\,\nabla f\rangle\,.
\]
Below, we will also give an alternative proof of the Bregman transport inequality (The-
orem 2.2.10) based on Wasserstein calculus, which implies the Brascamp–Lieb inequality.
Note that in the strongly convex case ∇2𝑉 𝛼𝐼𝑑 , it implies a Poincaré inequality (in the
sense of Example 1.2.21) for 𝜋 with constant 1/𝛼. However, in our present context with
Dirichlet energy given by (10.2.4), we instead interpret the Brascamp–Lieb inequality as a
Poincaré inequality (in the sense of Definition 1.2.18) for the Newton–Langevin diffusion.
Then, the Poincaré constant is 1, independent of the strong convexity of 𝑉 .
We also obtain a Poincaré inequality for the mirror Langevin diffusion under the
condition of relative strong convexity.
Observe that when $\phi = \frac{\|\cdot\|^2}{2}$, these definitions reduce to the usual definitions of strong convexity and smoothness. Recall from Definition 2.2.9 that the Bregman divergence $D_\phi$ associated with $\phi$ is the mapping $D_\phi(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by
\[
D_\phi(x, y) \coloneqq \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y\rangle\,.
\]
The Bregman divergence plays an important role in the analysis of mirror Langevin
because it is the correct substitute for the Euclidean distance (𝑥, 𝑦) ↦→ 12 k𝑥 − 𝑦 k 2 in this
context. Note the following observations: (1) 𝐷𝜙 is non-negative due to convexity of
𝜙, and if 𝜙 is strictly convex then it equals 0 if and only if its two arguments are equal;
(2) since $D_\phi(x, y)$ is defined by subtracting the first-order Taylor expansion of $\phi$ at $y$ from $\phi(x)$, it measures the gap between $\phi$ and its linearization at $y$.
We say that $V$ is $\alpha$-relatively convex with respect to $\phi$ if $D_V \ge \alpha\,D_\phi$, and $\beta$-relatively smooth with respect to $\phi$ if $D_V \le \beta\,D_\phi$.
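A quick numerical check of relative convexity on a toy pair (our choice of $V$ and $\phi$; the finite-difference derivative is only for illustration):

```python
import random

# Toy check of relative convexity (our pair): V(x) = x**2/2 + x**4/4 and
# phi(x) = x**4/4, so D_V = D_{x^2/2} + D_phi >= 1 * D_phi, i.e. V is
# 1-relatively convex w.r.t. phi. The derivative below is a finite difference.
def bregman(f, x, y, eps=1e-6):
    fp = (f(y + eps) - f(y - eps)) / (2 * eps)      # approximate f'(y)
    return f(x) - f(y) - fp * (x - y)

V = lambda x: x**2 / 2 + x**4 / 4
phi = lambda x: x**4 / 4

random.seed(4)
for _ in range(1000):
    x, y = random.uniform(-3, 3), random.uniform(-3, 3)
    assert bregman(V, x, y) >= bregman(phi, x, y) - 1e-6
```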
Corollary 10.2.10 (mirror Poincaré inequality, [Che+20b]). Suppose that the potential
𝑉 is 𝛼-relatively convex w.r.t. 𝜙. Then, the mirror Langevin diffusion satisfies the following
Poincaré inequality: for all $f : \mathbb{R}^d \to \mathbb{R}$,
\[
\operatorname{var}_\pi f \le \frac{1}{\alpha}\,\operatorname{E}_\pi\langle \nabla f, [\nabla^2\phi]^{-1}\,\nabla f\rangle = \frac{1}{\alpha}\,\mathcal{E}(f, f)\,.
\]
see Section 2.6. Such a calculation was performed in, e.g., [Kol14], which implies that if
$(\nabla\phi)_{\#}\pi$ is log-concave, then the Newton Langevin diffusion satisfies $\mathrm{CD}(\frac{1}{2}, \infty)$. However,
it is not clear under what conditions (∇𝜙) # 𝜋 is log-concave.
Here is another consequence of the fact that relative convexity is not a curvature-
dimension condition: relative convexity does not seem to imply contraction
properties for the mirror Langevin diffusion with respect to an appropriately defined
Wasserstein metric.
To summarize: either we can assume the curvature-dimension condition CD(𝛼, ∞),
which imposes complicated conditions on 𝜙 and 𝑉 , or we can adopt the more inter-
pretable relative convexity assumption, which in turn only implies a Poincaré inequality
(Corollary 10.2.10). We will follow the latter approach.
Upon reflection, the curvature-dimension approach for studying the mirror Langevin
diffusion is arguably the less natural one. Indeed, the curvature-dimension approach is
based on viewing the mirror Langevin diffusion from the lens of Riemannian geometry,
but the mirror descent algorithm in optimization is not typically studied via Riemannian
geometry. Instead, the study of mirror descent is based on ideas from convex analysis,
centered around the Bregman divergence. So it seems prudent at this stage to abandon
the Riemannian interpretation of mirror Langevin in favor of convex analysis tools, and
this is indeed how our discretization proof will go. In fact, in lieu of using the Poincaré
inequality in Corollary 10.2.10, we will directly use relative convexity.
where

𝑋𝑡* = ∇𝜙(𝑋⁺𝑘ℎ) + √2 ∫_{𝑘ℎ}^{𝑡} [∇²𝜙*(𝑋𝑠*)]^{−1/2} d𝐵𝑠 for 𝑡 ∈ [𝑘ℎ, (𝑘 + 1)ℎ] . (10.2.11)
Note that when 𝜙 = ½‖·‖², this reduces to the standard LMC algorithm. When generalizing LMC to different mirror maps, this discretization is chosen to preserve the “forward-flow” interpretation of [Wib18] (see Section 4.3). In particular, the update from 𝑋𝑘ℎ to 𝑋⁺𝑘ℎ is a mirror descent step, while the update from 𝑋⁺𝑘ℎ to 𝑋(𝑘+1)ℎ follows a “Wasserstein mirror flow”.
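As a sanity check on this discretization, here is a small numerical sketch (our illustration, not code from the text): it runs one MLMC step in one dimension, discretizing the dual-space diffusion with an inner Euler scheme, and with the quadratic mirror map 𝜙 = ½‖·‖² it reproduces the LMC update driven by the same Brownian increments.

```python
import numpy as np

def mlmc_step(x, grad_V, grad_phi, grad_phi_star, hess_phi_star, h, xi):
    """One MLMC step in one dimension.

    xi: standard normal increments used by the inner Euler discretization
    of the dual-space diffusion in (10.2.11).
    """
    n = len(xi)
    dt = h / n
    # forward step: mirror descent in the dual space
    y = grad_phi(x) - h * grad_V(x)
    # flow step: Euler scheme for dX* = sqrt(2) [phi*''(X*)]^{-1/2} dB
    for k in range(n):
        y = y + np.sqrt(2.0 * dt / hess_phi_star(y)) * xi[k]
    return grad_phi_star(y)

# With the quadratic mirror map, one MLMC step is exactly one LMC step.
rng = np.random.default_rng(0)
xi = rng.standard_normal(64)
x0, h = 1.3, 0.1
out = mlmc_step(x0,
                grad_V=lambda x: x,        # V(x) = x^2 / 2
                grad_phi=lambda x: x,      # phi(x) = x^2 / 2
                grad_phi_star=lambda y: y,
                hess_phi_star=lambda y: 1.0,
                h=h, xi=xi)
lmc = x0 - h * x0 + np.sqrt(2.0 * h / len(xi)) * xi.sum()
```

For a genuinely constrained mirror map the dual Hessian is non-constant and the inner Euler scheme only approximates the flow step; the quadratic case is exact.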
Key technical results. We now establish the analogues of the various facts that we
invoked in the study of LMC.
First, in the standard gradient descent analysis, if 𝑥⁺ ≔ 𝑥 − ℎ ∇𝑉(𝑥), then we have the key inequality

⟨∇𝑉(𝑥), 𝑥⁺ − 𝑧⟩ = (1/(2ℎ)) {‖𝑥 − 𝑧‖² − ‖𝑥⁺ − 𝑧‖² − ‖𝑥⁺ − 𝑥‖²} for all 𝑧 ∈ R𝑑 .
Remarkably, there is an analogue of this fact for mirror descent, which follows from the
following identity (which can be checked by simple algebra, see Exercise 10.5):

⟨∇𝜙(𝑥̃) − ∇𝜙(𝑥), 𝑥̃ − 𝑧⟩ = 𝐷𝜙(𝑥̃, 𝑥) + 𝐷𝜙(𝑧, 𝑥̃) − 𝐷𝜙(𝑧, 𝑥) for all 𝑥, 𝑥̃, 𝑧 ∈ X . (10.2.12)
Lemma 10.2.13 (Bregman proximal lemma, [CT93]). For 𝑥 ∈ X, let 𝑥 + be defined via
∇𝜙(𝑥⁺) = ∇𝜙(𝑥) − ℎ ∇𝑉(𝑥). Then, for all 𝑧 ∈ X,

⟨∇𝑉(𝑥), 𝑥⁺ − 𝑧⟩ = (1/ℎ) {𝐷𝜙(𝑧, 𝑥) − 𝐷𝜙(𝑧, 𝑥⁺) − 𝐷𝜙(𝑥⁺, 𝑥)} .
Proof. Substituting ∇𝜙(𝑥⁺) − ∇𝜙(𝑥) = −ℎ ∇𝑉(𝑥) into the identity (10.2.12), with 𝑥̃ = 𝑥⁺, proves the result.
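These identities are easy to verify numerically. The following sketch (our illustration, with an arbitrarily chosen mirror map 𝜙(𝑥) = ∑ᵢ exp(𝑥ᵢ) and potential 𝑉(𝑥) = ½‖𝑥‖²) checks the three-point identity (10.2.12), with the inner product taken against 𝑥̃ − 𝑧, together with the Bregman proximal identity of Lemma 10.2.13:

```python
import numpy as np

grad_phi = np.exp  # mirror map phi(x) = sum_i exp(x_i), so grad phi = exp

def bregman(a, b):
    # D_phi(a, b) = phi(a) - phi(b) - <grad phi(b), a - b>
    return float(np.sum(np.exp(a) - np.exp(b) - np.exp(b) * (a - b)))

x  = np.array([0.2, -0.1, 0.5])
xt = np.array([0.1, 0.1, 0.1])   # plays the role of x-tilde
z  = np.array([0.0, 0.3, -0.2])

# three-point identity (10.2.12)
lhs1 = float(np.dot(grad_phi(xt) - grad_phi(x), xt - z))
rhs1 = bregman(xt, x) + bregman(z, xt) - bregman(z, x)

# Bregman proximal identity (Lemma 10.2.13) with V(x) = ||x||^2 / 2
h = 0.05
grad_V = x                                  # gradient of V at x
x_plus = np.log(grad_phi(x) - h * grad_V)   # solves grad phi(x+) = grad phi(x) - h grad V(x)
lhs2 = float(np.dot(grad_V, x_plus - z))
rhs2 = (bregman(z, x) - bregman(z, x_plus) - bregman(x_plus, x)) / h
```

The mirror descent step `x_plus` is computed in closed form here only because ∇𝜙 = exp is explicitly invertible; in general one inverts ∇𝜙 numerically.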
Unlike the case of LMC, the presence of a non-constant diffusion matrix involving ∇2𝜙
introduces another source of discretization error. To address this, we introduce a condition
on the third derivative of 𝜙. Note however that a uniform bound on the operator norm
of ∇3𝜙 is not compatible with the assumption that 𝜙 tends to +∞ on 𝜕X. The solution to
this issue was discovered by Nesterov and Nemirovsky [NN94]: we can ask that ∇3𝜙 is
bounded with respect to the geometry induced by 𝜙. This approach also has the benefit of
being consistent with the affine invariance of Newton’s method. The precise definition of
the third derivative condition is as follows.
We say that 𝜙 is 𝑀𝜙-self-concordant if, for all 𝑥 ∈ X and all 𝑣 ∈ R𝑑,

∇³𝜙(𝑥)[𝑣, 𝑣, 𝑣] ≤ 2𝑀𝜙 ‖𝑣‖³∇²𝜙(𝑥) ≔ 2𝑀𝜙 ⟨𝑣, ∇²𝜙(𝑥) 𝑣⟩^{3/2} .

The norm ‖𝑣‖∇²𝜙(𝑥) ≔ √⟨𝑣, ∇²𝜙(𝑥) 𝑣⟩ is called the local norm; it is the norm on the tangent space at 𝑥 induced by the Hessian metric ∇²𝜙.
Self-concordant functions are well-studied due to their central role in the theory of
interior-point methods for optimization, see the monograph [NN94]. A key example of a
self-concordant mirror map arises when the constraint set is a polytope X = {𝑥 ∈ R𝑑 : ⟨𝑎𝑖, 𝑥⟩ < 𝑏𝑖 for 𝑖 = 1, …, 𝑁}, in which case the logarithmic barrier 𝜙(𝑥) = ∑_{𝑖=1}^{𝑁} ln(1/(𝑏𝑖 − ⟨𝑎𝑖, 𝑥⟩)) is self-concordant with 𝑀𝜙 = 1.
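To see where 𝑀𝜙 = 1 comes from, it suffices to check one-dimensional restrictions; the following computation for a single constraint is a worked example of ours, not from the text.

```latex
\phi(x) = \ln \frac{1}{b - ax} , \qquad
\phi''(x) = \frac{a^2}{(b - ax)^2} , \qquad
\phi'''(x) = \frac{2a^3}{(b - ax)^3} ,
```

so |𝜙'''(𝑥)| = 2 (𝜙''(𝑥))^{3/2} on the domain {𝑏 − 𝑎𝑥 > 0}, i.e., the defining inequality holds with equality for 𝑀𝜙 = 1. Summing over the 𝑁 constraints preserves the bound, since 𝑡 ↦ 𝑡^{3/2} is superadditive on [0, ∞).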
Finally, a key step in our analysis of LMC was to use the convexity of the entropy
functional along 𝑊2 geodesics. In our analysis of MLMC, we will replace the 𝑊2 distance
with the Bregman transport cost D𝑉 (recall Definition 2.2.9). To study these costs, we first
state an analogue of Brenier’s theorem (Theorem 1.3.8).
1 The proof is surprisingly difficult. The reader can try to prove the result with a worse constant factor.
Theorem 10.2.16 (Brenier’s theorem for the Bregman transport cost). Suppose that 𝜇, 𝜈 ∈ P(R𝑑). Then, the unique optimal Bregman transport coupling (𝑋, 𝑌) for 𝜇 and 𝜈 is of the form

∇𝜙(𝑌) = ∇𝜙(𝑋) − ∇ℎ(𝑋) ,

for some function ℎ : R𝑑 → R ∪ {−∞}.
Proof sketch. We need facts about optimal transport with general costs 𝑐 (Exercise 1.11). Namely, the optimal pair of dual potentials (𝑓★, 𝑔★) are 𝑐-conjugates, meaning that

𝑔★(𝑦) = inf_{𝑥∈R𝑑} {𝑐(𝑥, 𝑦) − 𝑓★(𝑥)} ,

and for 𝛾★-a.e. (𝑥, 𝑦), it holds that 𝑓★(𝑥) + 𝑔★(𝑦) = 𝑐(𝑥, 𝑦). If 𝑐 is smooth and such that ∇𝑥𝑐(𝑥, ·) is injective for all 𝑥 ∈ R𝑑, then the definition of 𝑔★ suggests that ∇𝑥𝑐(𝑥, 𝑦) = ∇𝑓★(𝑥) for 𝛾★-a.e. (𝑥, 𝑦). See [Vil09, Theorem 10.28] for a rigorous statement and proof of these results.
Applying this to our cost function 𝑐 = 𝐷𝜙, this yields the existence of 𝐷𝜙-conjugates ℎ, ℎ̃ : R𝑑 → R ∪ {−∞} such that ∇𝑥𝐷𝜙(𝑥, 𝑦) = ∇ℎ(𝑥) under the optimal plan 𝛾★. Since ∇𝑥𝐷𝜙(𝑥, 𝑦) = ∇𝜙(𝑥) − ∇𝜙(𝑦), this gives ∇𝜙(𝑌) = ∇𝜙(𝑋) − ∇ℎ(𝑋). Here, the conjugate ℎ̃ is given by

ℎ̃(𝑥) = inf_{𝑦∈R𝑑} {𝐷𝜙(𝑥, 𝑦) − ℎ(𝑦)} ,

or equivalently,

𝜙(𝑥) − ℎ̃(𝑥) = sup_{𝑦∈X} {⟨∇𝜙(𝑦), 𝑥 − 𝑦⟩ + ℎ(𝑦) + 𝜙(𝑦)} .
Given a base measure 𝜌 and 𝜇0 = law(𝑋0), 𝜇1 = law(𝑋1), a generalized geodesic (𝜇𝑡)𝑡∈[0,1] is given by

𝜇𝑡 = law(𝑋𝑡) , 𝑋𝑡 ≔ (1 − 𝑡) 𝑋0 + 𝑡 𝑋1 ,

where (𝑋0, 𝑊) and (𝑋1, 𝑊) are both optimally coupled for the 𝑊2 metric, with 𝑊 ∼ 𝜌.
It is left as Exercise 10.6 to check that the entropy functional is also convex along
generalized geodesics.
Theorem 10.2.18. Let H(𝜇) ≔ ∫ 𝜇 ln 𝜇. Then, for any generalized geodesic (𝜇𝑡)𝑡∈[0,1], the map 𝑡 ↦ H(𝜇𝑡) is convex.
In our context, it implies that if 𝜇, 𝜈 ∈ P (R𝑑 ) and (𝑋, 𝑌 ) is an optimal coupling for the
Bregman cost D𝜙 (𝜇, 𝜈), then
H(𝜈) ≥ H(𝜇) + E⟨∇ ln 𝜇(𝑋), 𝑌 − 𝑋⟩ . (10.2.19)
As an aside, we remark that this observation leads to another proof of the Bregman
transport inequality.
Proof of the Bregman transport inequality (Theorem 2.2.10). Let 𝜋 = exp(−𝑉 ), where 𝑉 is
strictly convex, and let 𝜇 ∈ P (R𝑑 ). Let 𝑋 ∼ 𝜇, 𝑍 ∼ 𝜋 be optimally coupled for the
Bregman transport cost. Then,
KL(𝜇 ‖ 𝜋) = E 𝑉(𝑋) + H(𝜇) .
For the first term, by the definition of 𝐷𝑉,

E 𝑉(𝑋) = E 𝑉(𝑍) + E 𝐷𝑉(𝑋, 𝑍) + E⟨∇𝑉(𝑍), 𝑋 − 𝑍⟩ .

For the second term, (10.2.19) implies

H(𝜇) ≥ H(𝜋) + E⟨∇ ln 𝜋(𝑍), 𝑋 − 𝑍⟩ .
Hence,

KL(𝜇 ‖ 𝜋) ≥ E 𝑉(𝑍) + H(𝜋) + E⟨∇𝑉(𝑍) + ∇ ln 𝜋(𝑍), 𝑋 − 𝑍⟩ + E 𝐷𝑉(𝑋, 𝑍) = D𝑉(𝜇 ‖ 𝜋) ,

where we used E 𝑉(𝑍) + H(𝜋) = KL(𝜋 ‖ 𝜋) = 0 and ∇𝑉(𝑍) + ∇ ln 𝜋(𝑍) = 0. This is what we wanted to show.
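As a sanity check on this inequality, consider the standard Gaussian target (a worked example of ours, not from the text). Taking 𝑉(𝑥) = ½‖𝑥‖² + (𝑑/2) ln(2𝜋), so that 𝜋 = exp(−𝑉) is the standard Gaussian, the Bregman divergence of 𝑉 is

```latex
D_V(x, z) = V(x) - V(z) - \langle \nabla V(z), x - z \rangle = \tfrac{1}{2} \|x - z\|^2 ,
```

so the Bregman transport cost is D𝑉(𝜇 ‖ 𝜋) = ½ 𝑊2²(𝜇, 𝜋), and the inequality KL(𝜇 ‖ 𝜋) ≥ D𝑉(𝜇 ‖ 𝜋) recovers Talagrand’s T2 inequality 𝑊2²(𝜇, 𝜋) ≤ 2 KL(𝜇 ‖ 𝜋) for the standard Gaussian.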
Theorem 10.2.20 ([AC21]). Suppose that 𝜋 = exp(−𝑉 ) is the target distribution and
that 𝜙 is the mirror map. Assume:
• 𝜙 is 𝑀𝜙 -self-concordant.
• 𝑉 is 𝛼-relatively convex and 𝛽-relatively smooth w.r.t. 𝜙.

• 𝑉 is 𝐿-relatively Lipschitz w.r.t. 𝜙, i.e., ‖∇𝑉(𝑥)‖[∇²𝜙(𝑥)]⁻¹ ≤ 𝐿 for all 𝑥 ∈ X.
Let (𝜇𝑘ℎ)𝑘∈N denote the law of MLMC and let 𝛽′ ≔ 𝛽 + 2𝐿𝑀𝜙.

1. (weakly convex case) Suppose that 𝛼 = 0. For any 𝜀 ∈ [0, √𝑑], if we take step size ℎ ≍ 𝜀²/(𝛽′𝑑), then for the mixture distribution 𝜇̄𝑁ℎ ≔ 𝑁⁻¹ ∑_{𝑘=1}^{𝑁} 𝜇𝑘ℎ it holds that KL(𝜇̄𝑁ℎ ‖ 𝜋) ≤ 𝜀² after

𝑁 = 𝑂(𝛽′𝑑 D𝜙(𝜋, 𝜇0)/𝜀⁴) iterations .

2. (strongly convex case) Suppose that 𝛼 > 0 and let 𝜅 ≔ 𝛽′/𝛼 denote the “condition number”. Then, for any 𝜀 ∈ [0, √𝑑], with step size ℎ ≍ 𝜀²/(𝛽′𝑑) we obtain √(𝛼 D𝜙(𝜋, 𝜇𝑁ℎ)) ≤ 𝜀 and √(KL(𝜇̄𝑁ℎ,2𝑁ℎ ‖ 𝜋)) ≤ 𝜀 after

𝑁 = 𝑂((𝜅𝑑/𝜀²) log(𝛼 D𝜙(𝜋, 𝜇0)/𝜀²)) iterations ,

where 𝜇̄𝑁ℎ,2𝑁ℎ ≔ 𝑁⁻¹ ∑_{𝑘=𝑁+1}^{2𝑁} 𝜇𝑘ℎ.
1. The forward step dissipates the energy. Let 𝑍 ∼ 𝜋 be optimally coupled to 𝑋𝑘ℎ. Then, applying the relative smoothness and relative convexity of 𝑉 (here ℰ(𝜇) ≔ ∫ 𝑉 d𝜇),

ℰ(𝜇⁺𝑘ℎ) − ℰ(𝜋) = E[𝑉(𝑋⁺𝑘ℎ) − 𝑉(𝑋𝑘ℎ) + 𝑉(𝑋𝑘ℎ) − 𝑉(𝑍)]
 ≤ E[⟨∇𝑉(𝑋𝑘ℎ), 𝑋⁺𝑘ℎ − 𝑋𝑘ℎ⟩ + 𝛽 𝐷𝜙(𝑋⁺𝑘ℎ, 𝑋𝑘ℎ) + ⟨∇𝑉(𝑋𝑘ℎ), 𝑋𝑘ℎ − 𝑍⟩ − 𝛼 𝐷𝜙(𝑍, 𝑋𝑘ℎ)]
 = E[⟨∇𝑉(𝑋𝑘ℎ), 𝑋⁺𝑘ℎ − 𝑍⟩ + 𝛽 𝐷𝜙(𝑋⁺𝑘ℎ, 𝑋𝑘ℎ) − 𝛼 𝐷𝜙(𝑍, 𝑋𝑘ℎ)] . (10.2.22)

Since ∇𝜙(𝑋⁺𝑘ℎ) = ∇𝜙(𝑋𝑘ℎ) − ℎ ∇𝑉(𝑋𝑘ℎ), the Bregman proximal lemma (Lemma 10.2.13) gives

⟨∇𝑉(𝑋𝑘ℎ), 𝑋⁺𝑘ℎ − 𝑍⟩ = (1/ℎ) {𝐷𝜙(𝑍, 𝑋𝑘ℎ) − 𝐷𝜙(𝑍, 𝑋⁺𝑘ℎ) − 𝐷𝜙(𝑋⁺𝑘ℎ, 𝑋𝑘ℎ)} .

Substituting this into (10.2.22) and using ℎ ≤ 1/𝛽, this yields

ℰ(𝜇⁺𝑘ℎ) − ℰ(𝜋) ≤ (1/ℎ) {(1 − 𝛼ℎ) D𝜙(𝜋, 𝜇𝑘ℎ) − D𝜙(𝜋, 𝜇⁺𝑘ℎ)} . (10.2.23)
2. The flow step does not substantially increase the energy. We write

ℰ(𝜇(𝑘+1)ℎ) − ℰ(𝜇⁺𝑘ℎ) = E[𝑉(𝑋(𝑘+1)ℎ) − 𝑉(𝑋⁺𝑘ℎ)] .

Let 𝑓(𝑥) ≔ 𝑉(∇𝜙*(𝑥)) and apply Itô’s formula. Note that

∇𝑓(𝑥) = ∇𝑉(∇𝜙*(𝑥))ᵀ ∇²𝜙*(𝑥) = ∇𝑉(∇𝜙*(𝑥))ᵀ [∇²𝜙(∇𝜙*(𝑥))]⁻¹ ,
∇²𝑓(𝑥) = [∇²𝑉(∇𝜙*(𝑥))] [∇²𝜙(∇𝜙*(𝑥))]⁻¹ [∇²𝜙*(𝑥)]
 + ∇𝑉(∇𝜙*(𝑥))ᵀ [∇²𝜙(∇𝜙*(𝑥))]⁻¹ [∇³𝜙(∇𝜙*(𝑥))] [∇²𝜙(∇𝜙*(𝑥))]⁻² .

Hence, Itô’s formula applied to the diffusion step yields

E[𝑉(𝑋(𝑘+1)ℎ) − 𝑉(𝑋⁺𝑘ℎ)] = E ∫_{𝑘ℎ}^{(𝑘+1)ℎ} tr(∇²𝑉(𝑋𝑡) [∇²𝜙(𝑋𝑡)]⁻¹) d𝑡 (10.2.24)
 + E ∫_{𝑘ℎ}^{(𝑘+1)ℎ} tr(∇𝑉(𝑋𝑡)ᵀ [∇²𝜙(𝑋𝑡)]⁻¹ [∇³𝜙(𝑋𝑡)] [∇²𝜙(𝑋𝑡)]⁻¹) d𝑡 . (10.2.25)

Since relative smoothness yields ∇²𝑉 ⪯ 𝛽 ∇²𝜙,

(10.2.24) ≤ 𝛽𝑑ℎ ,

and self-concordance together with the relative Lipschitz condition bounds (10.2.25) by 2𝐿𝑀𝜙𝑑ℎ; this is the origin of the modified smoothness parameter 𝛽′ = 𝛽 + 2𝐿𝑀𝜙.
3. The flow step dissipates the entropy. Let 𝜇𝑡 ≔ law(𝑋𝑡) = law(∇𝜙*(𝑋𝑡*)). The mirror diffusion (10.2.11) evolves according to the vector field −[∇²𝜙]⁻¹ ∇ ln 𝜇𝑡. Also, note that ∇𝑦𝐷𝜙(𝑥, 𝑦) = −∇²𝜙(𝑦) (𝑥 − 𝑦). Using these facts, one can differentiate 𝑡 ↦ D𝜙(𝜋, 𝜇𝑡), where (𝑍, 𝑋𝑡) is an optimal coupling for D𝜙(𝜋, 𝜇𝑡), and combine the result with the convexity of H along generalized geodesics (Theorem 10.2.18). Using the fact that 𝑡 ↦ H(𝜇𝑡) is decreasing (prove this from the Fokker–Planck equation for the mirror diffusion!), we then obtain the desired entropy dissipation along the flow step.
Result for the non-smooth case. Although the preceding result applies when 𝜙 is a logarithmic barrier, it does not apply to one of the most classical applications of mirror descent: namely, X is the probability simplex in R𝑑 and 𝜙(𝑥) ≔ ∑_{𝑖=1}^{𝑑} 𝑥𝑖 ln 𝑥𝑖 is the entropy. The next result adopts assumptions which precisely match the usual ones for mirror descent in this context.
Theorem 10.2.28 ([AC21]). Suppose that 𝜋 = exp(−𝑉) is the target distribution and that 𝜙 is the mirror map. Let |||·||| be a norm on R𝑑. Assume:

• 𝜙 is 1-strongly convex w.r.t. |||·|||.

• 𝑉 is convex and 𝐿-Lipschitz w.r.t. the dual norm |||·|||∗, in the sense that |||∇𝑉(𝑥)|||∗ ≤ 𝐿 for all 𝑥 ∈ X.

Then, for the mixture distribution 𝜇̄𝑁ℎ of the MLMC iterates, with an appropriate step size it holds that KL(𝜇̄𝑁ℎ ‖ 𝜋) ≤ 𝜀² after

𝑁 = 𝑂(𝐿² D𝜙(𝜋, 𝜇⁺0)/𝜀⁴) iterations .
For example, it is a classical fact that the entropy is strongly convex w.r.t. the ℓ1 norm.
We leave the proof of the non-smooth case as Exercise 10.7.
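Concretely, on the simplex the entropic Bregman divergence is the KL divergence, 𝐷𝜙(𝑥, 𝑦) = ∑ᵢ 𝑥ᵢ ln(𝑥ᵢ/𝑦ᵢ), and 1-strong convexity of the entropy w.r.t. ℓ1 is exactly Pinsker’s inequality 𝐷𝜙(𝑥, 𝑦) ≥ ½ ‖𝑥 − 𝑦‖₁². A quick numerical check (our illustration, with arbitrarily chosen points):

```python
import numpy as np

def entropic_bregman(x, y):
    # Bregman divergence of phi(x) = sum_i x_i ln x_i; on the simplex this
    # equals the KL divergence sum_i x_i ln(x_i / y_i)
    return float(np.sum(x * np.log(x / y)))

pairs = [
    (np.array([0.7, 0.2, 0.1]), np.array([1/3, 1/3, 1/3])),
    (np.array([0.5, 0.25, 0.25]), np.array([0.1, 0.45, 0.45])),
]
# strong convexity w.r.t. l1: D_phi(x, y) >= (1/2) ||x - y||_1^2
checks = [entropic_bregman(x, y) >= 0.5 * np.sum(np.abs(x - y)) ** 2
          for x, y in pairs]
```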
10.3. PROXIMAL LANGEVIN 263
Bibliographical Notes
In the context of optimization, self-concordant barriers play a key role in interior point
methods for constrained optimization [NN94; Bub15; Nes18]. Relative convexity and
relative smoothness were introduced in [BBT17; LFN18].
The first use of mirror maps with the Langevin diffusion was via the mirrored Langevin
algorithm (which is different from the mirror Langevin diffusion) in [Hsi+18]. The mirror
Langevin diffusion was introduced in an earlier draft of [Hsi+18], as well as in [Zha+20].
In [Zha+20], Zhang et al. also studied the Euler–Maruyama discretization of the mirror
Langevin diffusion (which differs from MLMC in that it discretizes the diffusion step
as well), but they were unable to prove convergence of the algorithm; they were only
able to prove convergence to a Wasserstein ball of non-vanishing radius around 𝜋, even
as the step size tends to zero. They also conjectured that the non-vanishing bias of the
algorithm is unavoidable. Subsequently, [Che+20b] studied the mirror Langevin diffusion
in continuous time, and [AC21] introduced and studied the MLMC discretization, which
does lead to vanishing bias (as ℎ ↘ 0).
Since then, there have been further works studying the non-vanishing bias issue: [Jia21] studied both the Euler–Maruyama and MLMC discretizations under a “mirror log-Sobolev inequality” and was only able to prove vanishing bias for the latter discretization; [Li+22]
showed that the Euler–Maruyama discretization has vanishing bias under stronger as-
sumptions; and [GV22] studied MLMC as a special case of more general Riemannian
Langevin algorithms. The bias issue is still not settled, and it is certainly of interest to
obtain guarantees for fully discretized algorithms. Nevertheless, in our presentation, we
have stuck with the analysis of [AC21] because it is the cleanest, and because it relies on
assumptions which are well-motivated from convex optimization.
Exercises
Mirror Langevin
264 CHAPTER 10. STRUCTURED SAMPLING
d𝑍𝑡 = {−[∇²𝜙(𝑍𝑡)]⁻¹ ∇𝑉(𝑍𝑡) − [∇²𝜙(𝑍𝑡)]⁻¹ [∇³𝜙(𝑍𝑡)] [∇²𝜙(𝑍𝑡)]⁻¹} d𝑡 + √2 [∇²𝜙(𝑍𝑡)]^{−1/2} d𝐵𝑡 .
D Exercise 10.2 Markov semigroup theory for the mirror Langevin diffusion
1. Compute the generator of the mirror Langevin diffusion. Use this to show that 𝜋 is
stationary for the diffusion, and verify the equations (10.2.4) for the carré du champ
and Dirichlet energy.
2. Let ℒdual denote the generator for (𝑍𝑡∗ )𝑡 ≥0 (we write ℒdual instead of ℒ ∗ to avoid
confusion with the adjoint of ℒ). By computing 𝜕𝑡 E 𝑓 (𝑍𝑡∗ ), show that
Then, via a similar calculation to (10.2.5) and (10.2.6), show that the Dirichlet energy for (𝑍𝑡*)𝑡≥0 can be expressed as

ℰdual(𝑓, 𝑔) = ∫ ⟨∇𝑓, [∇²𝜙*]⁻¹ ∇𝑔⟩ d𝜋* ,
Non-Log-Concave Sampling
In this chapter, we study the problem of sampling from a smooth but non-log-concave
target. Although some results from previous chapters also cover some non-log-concave
targets (such as targets satisfying a Poincaré or log-Sobolev inequality), these results do
not encompass the full breadth of the non-log-concave sampling problem.
In general, one cannot hope for polynomial-time guarantees for sampling from non-log-concave targets in the usual metrics such as the total variation distance. Instead, taking inspiration from the literature on non-convex optimization, we will develop a notion of approximate first-order stationarity for sampling, and show that this goal can be achieved via an averaged version of the LMC algorithm. This is based on the work [Bal+22].
268 CHAPTER 11. NON-LOG-CONCAVE SAMPLING
is often a useful first step towards more detailed analysis, and it has the advantage that we
can develop a general theory surrounding this notion. Note that in the convex case, finding
a global minimizer is equivalent to finding a first-order stationarity point, so stationary
point analysis can be viewed as a natural generalization of the convex optimization
analysis to non-convex settings.
To develop a sampling analogue of this concept, we recall that the Langevin diffusion
is the gradient flow of the KL divergence KL(· k 𝜋) w.r.t. the Wasserstein geometry
(Section 1.4). Moreover, the gradient of the KL divergence at 𝜇 is ∇ ln(𝜇/𝜋), and the
squared norm of the gradient is the Fisher information FI(𝜇 ‖ 𝜋) = E𝜇[‖∇ ln(𝜇/𝜋)‖²].
Hence, a reasonable definition of finding an approximate first-order stationary point in
sampling is to output a sample from 𝜇 satisfying √FI(𝜇 ‖ 𝜋) ≤ 𝜀. We will show shortly, however, that this notion admits spurious stationary points.

Proposition 11.1.1. Let 𝑚 > 0 and let 𝜋∓ ≔ normal(∓𝑚, 1). Let 𝜇 ≔ ¾ 𝜋− + ¼ 𝜋+ and 𝜋 ≔ ½ 𝜋− + ½ 𝜋+. Then, ‖𝜇 − 𝜋‖TV → ¼ as 𝑚 → ∞, whereas

FI(𝜇 ‖ 𝜋) ≲ 𝑚² exp(−𝑚²/2) → 0 as 𝑚 → ∞ .
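The Fisher information in this example can also be checked numerically. The sketch below (our illustration, with hypothetical function names) evaluates FI(𝜇 ‖ 𝜋) = E𝜇[‖∇ ln(𝜇/𝜋)‖²] on a one-dimensional grid and confirms that it collapses as 𝑚 grows:

```python
import numpy as np

def log_gauss(x, m):
    # log density of normal(m, 1)
    return -0.5 * (x - m) ** 2 - 0.5 * np.log(2.0 * np.pi)

def fisher_info(m, n=200001):
    x = np.linspace(-m - 10.0, m + 10.0, n)
    dx = x[1] - x[0]
    # mu = (3/4) pi_- + (1/4) pi_+ and pi = (1/2) pi_- + (1/2) pi_+
    log_mu = np.logaddexp(np.log(0.75) + log_gauss(x, -m),
                          np.log(0.25) + log_gauss(x, m))
    log_pi = np.logaddexp(np.log(0.5) + log_gauss(x, -m),
                          np.log(0.5) + log_gauss(x, m))
    score = np.gradient(log_mu - log_pi, x)      # d/dx ln(mu/pi)
    # Riemann sum for E_mu[score^2]; tails are negligible on this grid
    return float(np.sum(np.exp(log_mu) * score ** 2) * dx)
```

Despite 𝜇 and 𝜋 staying well separated in total variation, `fisher_info(m)` decays rapidly in 𝑚, illustrating the spurious stationary point.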
Technical remark. One has to be slightly careful with the definition of the Fisher
information. For example, suppose that 𝜋 is the standard Gaussian, and suppose that 𝜇 is
the Gaussian restricted to the unit ball. Then, it is tempting to argue that the density of 𝜇
is proportional to that of 𝜋 on the unit ball, and hence ∇ ln(𝜇/𝜋) = 0 (𝜇-a.e.); from the
expression FI(𝜇 ‖ 𝜋) = E𝜇[‖∇ ln(𝜇/𝜋)‖²], this suggests that FI(𝜇 ‖ 𝜋) = 0, and in particular
𝜇 is a spurious stationary point. However, this argument is not correct.
The reason is that 𝜇 does not have enough regularity w.r.t. 𝜋 in order to apply the
formula FI(𝜇 ‖ 𝜋) = E𝜇[‖∇ ln(𝜇/𝜋)‖²]. Indeed, in order to apply the formula, we must require that the density d𝜇/d𝜋 lie in an appropriate Sobolev space w.r.t. 𝜋 (more precisely, √(d𝜇/d𝜋) should lie in the domain of the Dirichlet energy functional). If this does not hold, then we define the Fisher information to be infinite: FI(𝜇 ‖ 𝜋) = ∞.

In our theorem below, the Fisher information bound should be interpreted as follows: √FI(𝜇 ‖ 𝜋) ≤ 𝜀 means that 𝜇 has enough regularity w.r.t. 𝜋 and E𝜇[‖∇ ln(𝜇/𝜋)‖²] ≤ 𝜀².
The Fisher information bound in the next theorem will be proven using the interpolation
technique (§4.2).
Theorem 11.2.1 ([Bal+22]). Let (𝜇𝑡)𝑡≥0 denote the law of the interpolation of LMC with step size ℎ > 0. Assume that 𝜋 ∝ exp(−𝑉) where ∇𝑉 is 𝛽-Lipschitz. Then, for any step size 0 < ℎ ≤ 1/(4𝛽), for all 𝑁 ∈ N,

(1/(𝑁ℎ)) ∫₀^{𝑁ℎ} FI(𝜇𝑡 ‖ 𝜋) d𝑡 ≤ 2 KL(𝜇0 ‖ 𝜋)/(𝑁ℎ) + 6𝛽²𝑑ℎ .

In particular, if KL(𝜇0 ‖ 𝜋) ≤ 𝐾0 and we choose ℎ = √𝐾0/(2𝛽 √(𝑑𝑁)), then provided that 𝑁 ≥ 9𝐾0/𝑑,

(1/(𝑁ℎ)) ∫₀^{𝑁ℎ} FI(𝜇𝑡 ‖ 𝜋) d𝑡 ≤ 8𝛽 √(𝑑𝐾0)/√𝑁 .
In order to translate the result into a more useful form, we recall that the Fisher
information is convex in its first argument.
Proof. Let 𝜇⁰, 𝜇¹ ∈ P(R𝑑) be such that FI(𝜇⁰ ‖ 𝜋) ∨ FI(𝜇¹ ‖ 𝜋) < ∞. For 𝑡 ∈ (0, 1), let
11.2. FISHER INFORMATION BOUND 271
𝜇𝑡 ≔ (1 − 𝑡) 𝜇⁰ + 𝑡 𝜇¹, and write 𝑓𝑡 ≔ d𝜇𝑡/d𝜋 = (1 − 𝑡) 𝑓0 + 𝑡 𝑓1. Then, since (𝑢, 𝑣) ↦ ‖𝑢‖²/𝑣 is jointly convex on R𝑑 × (0, ∞),

FI(𝜇𝑡 ‖ 𝜋) = ∫ (‖∇𝑓𝑡‖²/𝑓𝑡) d𝜋 ≤ (1 − 𝑡) FI(𝜇⁰ ‖ 𝜋) + 𝑡 FI(𝜇¹ ‖ 𝜋) .

In particular, applying convexity to the averaged measure 𝜇̄𝑁ℎ ≔ (1/(𝑁ℎ)) ∫₀^{𝑁ℎ} 𝜇𝑡 d𝑡 gives

FI(𝜇̄𝑁ℎ ‖ 𝜋) ≤ (1/(𝑁ℎ)) ∫₀^{𝑁ℎ} FI(𝜇𝑡 ‖ 𝜋) d𝑡 , (11.2.3)
and the guarantees of the theorem translate into guarantees for 𝜇¯𝑁ℎ . Moreover, we can
output a sample from 𝜇¯𝑁ℎ via the following procedure:
1. Pick a time 𝑡 ∈ [0, 𝑁ℎ] uniformly at random.
2. Let 𝑘 be the largest integer such that 𝑘ℎ ≤ 𝑡, and let 𝑋𝑘ℎ denote the 𝑘-th iterate of
the LMC algorithm.
3. Perform a partial LMC update

𝑋𝑡 = 𝑋𝑘ℎ − (𝑡 − 𝑘ℎ) ∇𝑉(𝑋𝑘ℎ) + √2 (𝐵𝑡 − 𝐵𝑘ℎ)

and output 𝑋𝑡.
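The procedure above can be sketched in a few lines; the following is our illustrative implementation for a generic gradient oracle (the function names and the Gaussian test target are our own assumptions, not code from the text):

```python
import numpy as np

def averaged_lmc_sample(grad_V, x0, h, N, rng):
    """Draw one sample from the averaged measure (1/(N h)) * integral of mu_t
    over [0, N h], by running (interpolated) LMC up to a uniformly random time."""
    t = rng.uniform(0.0, N * h)
    k = int(t // h)                     # number of full LMC steps before time t
    x = np.asarray(x0, dtype=float)
    for _ in range(k):
        x = x - h * grad_V(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    s = t - k * h                       # partial step of length t - kh
    return x - s * grad_V(x) + np.sqrt(2.0 * s) * rng.standard_normal(x.shape)

# e.g. standard Gaussian target V(x) = ||x||^2 / 2
rng = np.random.default_rng(1)
samples = np.array([averaged_lmc_sample(lambda x: x, np.zeros(1), 0.1, 50, rng)[0]
                    for _ in range(2000)])
```

For this target the samples should look roughly standard Gaussian; the point of the scheme, however, is that its Fisher information guarantee holds without any log-concavity.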
Combined with Theorem 11.2.1 and (11.2.3), and assuming that KL(𝜇0 ‖ 𝜋) = 𝑂(𝑑), we conclude that it is possible to algorithmically obtain a sample from a measure 𝜇 with √FI(𝜇 ‖ 𝜋) ≤ 𝜀 using 𝑂(𝛽²𝑑²/𝜀⁴) queries to ∇𝑉.
We now give the proof of Theorem 11.2.1, which combines the usual stationary point
analysis in non-convex optimization with the interpolation argument.
Proof of Theorem 11.2.1. Recall from the proof of Theorem 4.2.5 that

𝜕𝑡 KL(𝜇𝑡 ‖ 𝜋) ≤ −½ FI(𝜇𝑡 ‖ 𝜋) + 6𝛽²𝑑 (𝑡 − 𝑘ℎ) .
This inequality was obtained under the sole assumption that ∇𝑉 is 𝛽-Lipschitz. In The-
orem 4.2.5, we proceeded to apply a log-Sobolev inequality, but here we will instead
telescope this inequality. By integrating over 𝑡 ∈ [𝑘ℎ, (𝑘 + 1)ℎ],

KL(𝜇(𝑘+1)ℎ ‖ 𝜋) − KL(𝜇𝑘ℎ ‖ 𝜋) ≤ −½ ∫_{𝑘ℎ}^{(𝑘+1)ℎ} FI(𝜇𝑡 ‖ 𝜋) d𝑡 + 3𝛽²𝑑ℎ² .

Summing over 𝑘 = 0, 1, …, 𝑁 − 1 and using KL(𝜇𝑁ℎ ‖ 𝜋) ≥ 0 yields the first claim; the second claim follows by substituting the choice of step size ℎ.
Theorem 11.3.2. Let (𝜇𝑡 )𝑡 ≥0 denote the law of the interpolation (11.3.1) of LMC, and
suppose that the target is 𝜋 ∝ exp(−𝑉 ) where ∇𝑉 is 𝛽-Lipschitz. Suppose that LMC is
initialized at a measure 𝜇 0 with KL(𝜇0 k 𝜋) < ∞, and that the sequence of step sizes
satisfies 0 < ℎ𝑘 < 1/(4𝛽) for all 𝑘 ∈ N₊, together with the conditions

∑_{𝑘=1}^{∞} ℎ𝑘 = ∞ and ∑_{𝑘=1}^{∞} ℎ𝑘² < ∞ .

Write 𝜏𝑛 ≔ ∑_{𝑘=1}^{𝑛} ℎ𝑘 and 𝜇̄𝜏𝑛 ≔ (1/𝜏𝑛) ∫₀^{𝜏𝑛} 𝜇𝑡 d𝑡. Then, 𝜇̄𝜏𝑛 → 𝜋 weakly.
Proof. By repeating the proof of Theorem 11.2.1 but incorporating time-varying step sizes, we obtain for 𝑡 ∈ [𝜏𝑛, 𝜏𝑛+1]

𝜕𝑡 KL(𝜇𝑡 ‖ 𝜋) ≤ −½ FI(𝜇𝑡 ‖ 𝜋) + 6𝛽²𝑑 (𝑡 − 𝜏𝑛) . (11.3.3)

By integrating this inequality and summing,

KL(𝜇𝜏𝑛 ‖ 𝜋) ≤ KL(𝜇0 ‖ 𝜋) − ½ ∫₀^{𝜏𝑛} FI(𝜇𝑡 ‖ 𝜋) d𝑡 + 3𝛽²𝑑 ∑_{𝑘=1}^{𝑛} ℎ𝑘² . (11.3.4)
11.3. APPLICATIONS OF THE FISHER INFORMATION BOUND 273
Hence, by convexity of the Fisher information (recall (11.2.3)),

FI(𝜇̄𝜏𝑛 ‖ 𝜋) ≤ (1/𝜏𝑛) ∫₀^{𝜏𝑛} FI(𝜇𝑡 ‖ 𝜋) d𝑡 ≤ 2 KL(𝜇0 ‖ 𝜋)/𝜏𝑛 + (6𝛽²𝑑/𝜏𝑛) ∑_{𝑘=1}^{∞} ℎ𝑘² . (11.3.5)
On the other hand, if 𝑡 ∈ [𝜏𝑛, 𝜏𝑛+1], then integrating (11.3.3) and combining with (11.3.4) yields

KL(𝜇𝑡 ‖ 𝜋) ≤ KL(𝜇𝜏𝑛 ‖ 𝜋) + 3𝛽²𝑑 (𝑡 − 𝜏𝑛)² ≤ KL(𝜇0 ‖ 𝜋) + 6𝛽²𝑑 ∑_{𝑘=1}^{∞} ℎ𝑘² < ∞ .
If 𝜋 satisfies a log-Sobolev inequality, then KL(𝜇 ‖ 𝜋) ≤ (𝐶LSI/2) FI(𝜇 ‖ 𝜋), and in this case a Fisher information guarantee readily translates into a KL divergence guarantee; however, this is not very interesting because we already obtained a sharper KL divergence guarantee for targets 𝜋 satisfying a log-Sobolev inequality in Theorem 4.2.5. Instead, we will show that under the weaker assumption of a Poincaré inequality, a Fisher information guarantee implies a total variation guarantee.
The key observation is the following implication of a Poincaré inequality.

Lemma. Suppose that 𝜋 satisfies a Poincaré inequality with constant 𝐶PI. Then, for all 𝜇 ∈ P(R𝑑),

‖𝜇 − 𝜋‖²TV ≤ (𝐶PI/4) FI(𝜇 ‖ 𝜋) .

Proof. We can assume 𝜇 ≪ 𝜋; let 𝑓 ≔ d𝜇/d𝜋. The total variation distance has the expressions
‖𝜇 − 𝜋‖TV = ½ ∫ |𝑓 − 1| d𝜋 = ½ ∫ {(𝑓 ∨ 1) − (𝑓 ∧ 1)} d𝜋 ,

so that ∫ (𝑓 ∧ 1) d𝜋 = 1 − ‖𝜇 − 𝜋‖TV and ∫ (𝑓 ∨ 1) d𝜋 = 1 + ‖𝜇 − 𝜋‖TV. Hence, by the Cauchy–Schwarz inequality,

∫ √𝑓 d𝜋 = ∫ √((𝑓 ∧ 1) (𝑓 ∨ 1)) d𝜋 ≤ √(∫ (𝑓 ∧ 1) d𝜋 ∫ (𝑓 ∨ 1) d𝜋) = √((1 − ‖𝜇 − 𝜋‖TV) (1 + ‖𝜇 − 𝜋‖TV)) = √(1 − ‖𝜇 − 𝜋‖²TV) .
Therefore,

‖𝜇 − 𝜋‖²TV ≤ 1 − (∫ √𝑓 d𝜋)² = var𝜋 √𝑓 .
This is sometimes called Le Cam’s inequality; in statistics, the right-hand side is often
written as 𝐻²(𝜇, 𝜋) (1 − ¼ 𝐻²(𝜇, 𝜋)), where 𝐻² denotes the squared Hellinger distance.
Applying the Poincaré inequality to √𝑓,

‖𝜇 − 𝜋‖²TV ≤ var𝜋 √𝑓 ≤ 𝐶PI E𝜋[‖∇√𝑓‖²] = (𝐶PI/4) FI(𝜇 ‖ 𝜋) .
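For intuition, the lemma can be checked numerically for a Gaussian mean shift (our illustration, not from the text): the standard Gaussian satisfies a Poincaré inequality with 𝐶PI = 1, and for 𝜇 = normal(𝑎, 1), 𝜋 = normal(0, 1) the score ∇ ln(𝜇/𝜋) is identically 𝑎, so FI(𝜇 ‖ 𝜋) = 𝑎².

```python
import numpy as np

def tv_and_fi(a, n=400001):
    # mu = normal(a, 1), pi = normal(0, 1), computed on a grid
    x = np.linspace(-12.0, 12.0, n)
    dx = x[1] - x[0]
    log_pi = -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)
    log_mu = -0.5 * (x - a) ** 2 - 0.5 * np.log(2.0 * np.pi)
    tv = 0.5 * float(np.sum(np.abs(np.exp(log_mu) - np.exp(log_pi))) * dx)
    score = np.gradient(log_mu - log_pi, x)   # equals a identically
    fi = float(np.sum(np.exp(log_mu) * score ** 2) * dx)
    return tv, fi

tv, fi = tv_and_fi(0.5)
```

With 𝑎 = 0.5 one finds ‖𝜇 − 𝜋‖²TV ≈ 0.039 against (𝐶PI/4) FI = 0.0625, so the inequality holds with room to spare.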
Corollary 11.3.7. Let (𝜇𝑡)𝑡≥0 denote the law of the interpolation of LMC with step size ℎ > 0. Assume that 𝜋 ∝ exp(−𝑉), where ∇𝑉 is 𝛽-Lipschitz and 𝜋 satisfies the Poincaré inequality with constant 𝐶PI. If KL(𝜇0 ‖ 𝜋) ≤ 𝐾0 and we choose step size ℎ = √𝐾0/(2𝛽 √(𝑑𝑁)), then

‖𝜇̄𝑁ℎ − 𝜋‖²TV ≔ ‖(1/(𝑁ℎ)) ∫₀^{𝑁ℎ} 𝜇𝑡 d𝑡 − 𝜋‖²TV ≤ 2𝐶PI 𝛽 √(𝑑𝐾0)/√𝑁 .
Bibliographical Notes
Exercises
D Exercise 11.1 Fisher information for mixtures of Gaussians
Prove Proposition 11.1.1.
Bibliography
[Bar+18] Jean-Baptiste Bardet, Nathaël Gozlan, Florent Malrieu, and Pierre-André Zitt.
“Functional inequalities for Gaussian convolutions of compactly supported
measures: explicit bounds and dimension dependence”. In: Bernoulli 24.1
(2018), pp. 333–353.
[BB18] Adrien Blanchet and Jérôme Bolte. “A family of functional inequalities:
Łojasiewicz inequalities and displacement convex functions”. In: J. Funct.
Anal. 275.7 (2018), pp. 1650–1673.
[BB99] Jean-David Benamou and Yann Brenier. “A numerical method for the optimal
time-continuous mass transport problem and related problems”. In: Monge
Ampère equation: applications to geometry and optimization (Deerfield Beach,
FL, 1997). Vol. 226. Contemp. Math. Amer. Math. Soc., Providence, RI, 1999,
pp. 1–11.
[BBI01] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry.
Vol. 33. Graduate Studies in Mathematics. American Mathematical Society,
Providence, RI, 2001, pp. xiv+415.
[BBT17] Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. “A descent lemma
beyond Lipschitz gradient continuity: first-order methods revisited and ap-
plications”. In: Math. Oper. Res. 42.2 (2017), pp. 330–348.
[BC13] Franck Barthe and Dario Cordero-Erausquin. “Invariances in variance esti-
mates”. In: Proc. Lond. Math. Soc. (3) 106.1 (2013), pp. 33–64.
[BD01] Louis J. Billera and Persi Diaconis. “A geometric interpretation of the
Metropolis–Hastings algorithm”. In: Statist. Sci. 16.4 (2001), pp. 335–339.
[BEZ20] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. “Coupling and
convergence for Hamiltonian Monte Carlo”. In: Ann. Appl. Probab. 30.3 (2020),
pp. 1209–1250.
[BGG18] François Bolley, Ivan Gentil, and Arnaud Guillin. “Dimensional improvements
of the logarithmic Sobolev, Talagrand and Brascamp–Lieb inequalities”. In:
Ann. Probab. 46.1 (2018), pp. 261–301.
[BGL14] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of
Markov diffusion operators. Vol. 348. Grundlehren der Mathematischen Wis-
senschaften [Fundamental Principles of Mathematical Sciences]. Springer,
Cham, 2014, pp. xx+552.
[BH97] Serguei G. Bobkov and Christian Houdré. “Some connections between
isoperimetric and Sobolev-type inequalities”. In: Mem. Amer. Math. Soc.
129.616 (1997), pp. viii+111.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration in-
equalities. A nonasymptotic theory of independence, With a foreword by
Michel Ledoux. Oxford University Press, Oxford, 2013, pp. x+481.
[Bor75] Christer Borell. “The Brunn–Minkowski inequality in Gauss space”. In: Invent.
Math. 30.2 (1975), pp. 207–216.
[Bor85] Christer Borell. “Geometric bounds on the Ornstein–Uhlenbeck velocity
process”. In: Z. Wahrsch. Verw. Gebiete 70.1 (1985), pp. 1–13.
[Bub15] Sébastien Bubeck. “Convex optimization: algorithms and complexity”. In:
Foundations and Trends® in Machine Learning 8.3-4 (2015), pp. 231–357.
[BV05] François Bolley and Cédric Villani. “Weighted Csiszár–Kullback–Pinsker
inequalities and applications to transportation inequalities”. In: Ann. Fac. Sci.
Toulouse Math. (6) 14.3 (2005), pp. 331–352.
[Car92] Manfredo P. do Carmo. Riemannian geometry. Mathematics: Theory & Appli-
cations. Translated from the second Portuguese edition by Francis Flaherty.
Birkhäuser Boston, Inc., Boston, MA, 1992, pp. xiv+300.
[CBL22] Niladri S. Chatterji, Peter L. Bartlett, and Philip M. Long. “Oracle lower
bounds for stochastic gradient sampling algorithms”. In: Bernoulli 28.2 (2022),
pp. 1074–1092.
[CCN21] Hong-Bin Chen, Sinho Chewi, and Jonathan Niles-Weed. “Dimension-free
log-Sobolev inequalities for mixture distributions”. In: Journal of Functional
Analysis 281.11 (2021), p. 109236.
[CFM04] Dario Cordero-Erausquin, Matthieu Fradelizi, and Bernard Maurey. “The (B)
conjecture for the Gaussian measure of dilates of symmetric convex sets and
related problems”. In: J. Funct. Anal. 214.2 (2004), pp. 410–427.
[Cha04] Djalil Chafai. “Entropies, convexity, and functional inequalities: on Φ-
entropies and Φ-Sobolev inequalities”. In: J. Math. Kyoto Univ. 44.2 (2004),
pp. 325–363.
[Che+18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan.
“Underdamped Langevin MCMC: a non-asymptotic analysis”. In: Proceedings
of the 31st Conference on Learning Theory. Ed. by Sébastien Bubeck, Vianney
Perchet, and Philippe Rigollet. Vol. 75. Proceedings of Machine Learning
Research. PMLR, June 2018, pp. 300–323.
[Che+20a] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. “Fast mixing
of Metropolized Hamiltonian Monte Carlo: benefits of multi-step gradients”.
In: J. Mach. Learn. Res. 21 (2020), Paper No. 92, 71.
[Che+20b] Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, Philippe Rigollet, and
Austin J. Stromme. “Exponential ergodicity of mirror-Langevin diffusions”.
In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle,
M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin. Vol. 33. Curran Associates,
Inc., 2020, pp. 19573–19585.
[Che+21a] Sinho Chewi, Murat A. Erdogdu, Mufan B. Li, Ruoqi Shen, and Matthew
Zhang. “Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev”.
In: arXiv e-prints, arXiv:2112.12662 (2021).
[Che+21b] Sinho Chewi, Chen Lu, Kwangjun Ahn, Xiang Cheng, Thibaut Le Gouic, and
Philippe Rigollet. “Optimal dimension dependence of the Metropolis-adjusted
Langevin algorithm”. In: Proceedings of Thirty Fourth Conference on Learning
Theory. Ed. by Mikhail Belkin and Samory Kpotufe. Vol. 134. Proceedings of
Machine Learning Research. PMLR, 15–19 Aug 2021, pp. 1260–1300.
[Che+22a] Yongxin Chen, Sinho Chewi, Adil Salim, and Andre Wibisono. “Improved
analysis for a proximal algorithm for sampling”. In: Proceedings of Thirty
Fifth Conference on Learning Theory. Ed. by Po-Ling Loh and Maxim Ragin-
sky. Vol. 178. Proceedings of Machine Learning Research. PMLR, Feb. 2022,
pp. 2984–3014.
[Che+22b] Sinho Chewi, Patrik R. Gerber, Chen Lu, Thibaut Le Gouic, and Philippe
Rigollet. “The query complexity of sampling from strongly log-concave dis-
tributions in one dimension”. In: Proceedings of Thirty Fifth Conference on
Learning Theory. Ed. by Po-Ling Loh and Maxim Raginsky. Vol. 178. Proceed-
ings of Machine Learning Research. PMLR, Feb. 2022, pp. 2041–2059.
[Che21a] Yuansi Chen. “An almost constant lower bound of the isoperimetric coef-
ficient in the KLS conjecture”. In: Geom. Funct. Anal. 31.1 (2021), pp. 34–
61.
[Che21b] Sinho Chewi. “The entropic barrier is 𝑛-self-concordant”. In: arXiv e-prints,
arXiv:2112.10947 (2021).
[CLL19] Yu Cao, Jianfeng Lu, and Yulong Lu. “Exponential decay of Rényi divergence
under Fokker–Planck equations”. In: J. Stat. Phys. 176.5 (2019), pp. 1172–1184.
[CLW21] Yu Cao, Jianfeng Lu, and Lihan Wang. “Complexity of randomized algorithms
for underdamped Langevin dynamics”. In: Commun. Math. Sci. 19.7 (2021),
pp. 1827–1853.
[Cor02] Dario Cordero-Erausquin. “Some applications of mass transport to Gaussian-
type inequalities”. In: Arch. Ration. Mech. Anal. 161.3 (2002), pp. 257–269.
[Eva10] Lawrence C. Evans. Partial differential equations. Second. Vol. 19. Graduate
Studies in Mathematics. American Mathematical Society, Providence, RI,
2010, pp. xxii+749.
[FLO21] James Foster, Terry Lyons, and Harald Oberhauser. “The shifted ODE method
for underdamped Langevin MCMC”. In: arXiv e-prints, arXiv:2101.03446
(2021).
[Föl85] Hans Föllmer. “An entropy approach to the time reversal of diffusion pro-
cesses”. In: Stochastic differential systems (Marseille-Luminy, 1984). Vol. 69.
Lect. Notes Control Inf. Sci. Springer, Berlin, 1985, pp. 156–163.
[Fol99] Gerald B. Folland. Real analysis. Second. Pure and Applied Mathematics
(New York). Modern techniques and their applications, A Wiley-Interscience
Publication. John Wiley & Sons, Inc., New York, 1999, pp. xvi+386.
[FS18] Max Fathi and Yan Shu. “Curvature and transport inequalities for Markov
chains in discrete spaces”. In: Bernoulli 24.1 (2018), pp. 672–698.
[Gen+20] Ivan Gentil, Christian Léonard, Luigia Ripani, and Luca Tamanini. “An en-
tropic interpolation proof of the HWI inequality”. In: Stochastic Process. Appl.
130.2 (2020), pp. 907–923.
[GLL20] Rong Ge, Holden Lee, and Jianfeng Lu. “Estimating normalizing constants
for log-concave distributions: algorithms and lower bounds”. In: STOC ’20—
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Com-
puting. ACM, New York, [2020] ©2020, pp. 579–586.
[GLL22] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. “Private convex optimization
via exponential mechanism”. In: Proceedings of Thirty Fifth Conference on
Learning Theory. Ed. by Po-Ling Loh and Maxim Raginsky. Vol. 178. Proceed-
ings of Machine Learning Research. PMLR, Feb. 2022, pp. 1948–1989.
[GM11] Olivier Guédon and Emanuel Milman. “Interpolating thin-shell and sharp
large-deviation estimates for isotropic log-concave measures”. In: Geom.
Funct. Anal. 21.5 (2011), pp. 1043–1068.
[GM96] Wilfrid Gangbo and Robert J. McCann. “The geometry of optimal transporta-
tion”. In: Acta Math. 177.2 (1996), pp. 113–161.
[Goz+14] Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. “Dis-
placement convexity of entropy and related inequalities on graphs”. In:
Probab. Theory Related Fields 160.1-2 (2014), pp. 47–94.
[Gro07] Misha Gromov. Metric structures for Riemannian and non-Riemannian spaces.
English. Modern Birkhäuser Classics. Based on the 1981 French original,
With appendices by M. Katz, P. Pansu and S. Semmes, Translated from the
French by Sean Michael Bates. Birkhäuser Boston, Inc., Boston, MA, 2007,
pp. xx+585.
[GT20] Arun Ganesh and Kunal Talwar. “Faster differentially private samplers via
Rényi divergence analysis of discretized Langevin MCMC”. In: Advances
in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato,
R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020,
pp. 7222–7233.
[GV22] Khashayar Gatmiry and Santosh S. Vempala. “Convergence of the Rieman-
nian Langevin algorithm”. In: arXiv e-prints, arXiv:2204.10818 (2022).
[Han16] Ramon van Handel. Probability in high dimension. 2016.
[HBE20] Ye He, Krishnakumar Balasubramanian, and Murat A. Erdogdu. “On the
ergodicity, bias and asymptotic normality of randomized midpoint sampling
method”. In: Advances in Neural Information Processing Systems. Ed. by H.
Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin. Vol. 33. Curran
Associates, Inc., 2020, pp. 7366–7376.
[HG14] Matthew D. Hoffman and Andrew Gelman. “The no-U-turn sampler: adap-
tively setting path lengths in Hamiltonian Monte Carlo”. In: J. Mach. Learn.
Res. 15 (2014), pp. 1593–1623.
[Hsi+18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. “Mirrored
Langevin dynamics”. In: Advances in Neural Information Processing Systems.
Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett. Vol. 31. Curran Associates, Inc., 2018.
[Hsu02] Elton P. Hsu. Stochastic analysis on manifolds. Vol. 38. Graduate Stud-
ies in Mathematics. American Mathematical Society, Providence, RI, 2002,
pp. xiv+281.
[IM12] Marcus Isaksson and Elchanan Mossel. “Maximally stable Gaussian partitions
with discrete applications”. In: Israel J. Math. 189 (2012), pp. 347–396.
[Jia+21] He Jia, Aditi Laddha, Yin Tat Lee, and Santosh Vempala. “Reducing isotropy
and volume to KLS: an 𝑂 ∗ (𝑛 3𝜓 2 ) volume algorithm”. In: Proceedings of the
53rd Annual ACM SIGACT Symposium on Theory of Computing. New York,
NY, USA: Association for Computing Machinery, 2021, pp. 961–974.
[Jia21] Qijia Jiang. “Mirror Langevin Monte Carlo: the case under isoperimetry”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 715–725.
[JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto. “The variational formulation of the Fokker–Planck equation”. In: SIAM J. Math. Anal. 29.1 (1998), pp. 1–17.
[KKO18] Guy Kindler, Naomi Kirshner, and Ryan O’Donnell. “Gaussian noise sensitivity and Fourier tails”. In: Israel J. Math. 225.1 (2018), pp. 71–109.
[KL22] Bo’az Klartag and Joseph Lehec. “Bourgain’s slicing problem and KLS
isoperimetry up to polylog”. In: arXiv e-prints, arXiv:2203.15551 (2022).
[Kla+16] Bo’az Klartag, Gady Kozma, Peter Ralli, and Prasad Tetali. “Discrete curvature
and abelian groups”. In: Canad. J. Math. 68.3 (2016), pp. 655–674.
[KLM06] Ravi Kannan, László Lovász, and Ravi Montenegro. “Blocking conductance
and mixing in random walks”. In: Combin. Probab. Comput. 15.4 (2006),
pp. 541–570.
[KLS95] Ravi Kannan, László Lovász, and Miklós Simonovits. “Isoperimetric problems
for convex bodies and a localization lemma”. In: Discrete Comput. Geom.
13.3-4 (1995), pp. 541–559.
[Kol14] Alexander V. Kolesnikov. “Hessian metrics, 𝐶𝐷(𝐾, 𝑁)-spaces, and optimal transportation of log-concave measures”. In: Discrete Contin. Dyn. Syst. 34.4 (2014), pp. 1511–1532.
[LC22a] Jiaming Liang and Yongxin Chen. “A proximal algorithm for sampling”. In:
arXiv e-prints, arXiv:2202.13975 (2022).
[LC22b] Jiaming Liang and Yongxin Chen. “A proximal algorithm for sampling from
non-smooth potentials”. In: arXiv e-prints, arXiv:2110.04597 (2022).
[Le 16] Jean-François Le Gall. Brownian motion, martingales, and stochastic calculus. Vol. 274. Graduate Texts in Mathematics. Translated from the French. Springer, [Cham], 2016, pp. xiii+273.
[Led00] Michel Ledoux. “The geometry of Markov diffusion generators”. In: Ann. Fac. Sci. Toulouse Math. (6) 9.2 (2000). Probability theory, pp. 305–366.
[Led01] Michel Ledoux. The concentration of measure phenomenon. Vol. 89. Mathemat-
ical Surveys and Monographs. American Mathematical Society, Providence,
RI, 2001, pp. x+181.
[Léo17] Christian Léonard. “On the convexity of the entropy along entropic interpolations”. In: Measure theory in non-smooth spaces. Partial Differ. Equ. Meas. Theory. De Gruyter Open, Warsaw, 2017, pp. 194–242.
[LFN18] Haihao Lu, Robert M. Freund, and Yurii Nesterov. “Relatively smooth convex
optimization by first-order methods, and applications”. In: SIAM J. Optim.
28.1 (2018), pp. 333–354.
[Li+19] Xuechen Li, Yi Wu, Lester Mackey, and Murat A. Erdogdu. “Stochastic Runge–Kutta accelerates Langevin Monte Carlo and beyond”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc., 2019.
[Li+22] Ruilin Li, Molei Tao, Santosh S. Vempala, and Andre Wibisono. “The mirror
Langevin algorithm converges with vanishing bias”. In: Proceedings of The
33rd International Conference on Algorithmic Learning Theory. Ed. by Sanjoy
Dasgupta and Nika Haghtalab. Vol. 167. Proceedings of Machine Learning
Research. PMLR, 29 Mar–01 Apr 2022, pp. 718–742.
[LO00] Rafał Latała and Krzysztof Oleszkiewicz. “Between Sobolev and Poincaré”.
In: Geometric aspects of functional analysis. Vol. 1745. Lecture Notes in Math.
Springer, Berlin, 2000, pp. 147–168.
[LS88] Gregory F. Lawler and Alan D. Sokal. “Bounds on the 𝐿² spectrum for Markov chains and Markov processes: a generalization of Cheeger’s inequality”. In: Trans. Amer. Math. Soc. 309.2 (1988), pp. 557–580.
[LS93] László Lovász and Miklós Simonovits. “Random walks in a convex body and
an improved volume algorithm”. In: Random Structures Algorithms 4.4 (1993),
pp. 359–412.
[LST20] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Logsmooth gradient concentration and tighter runtimes for Metropolized Hamiltonian Monte Carlo”. In: Proceedings of Thirty Third Conference on Learning Theory. Ed. by Jacob Abernethy and Shivani Agarwal. Vol. 125. Proceedings of Machine Learning Research. PMLR, Sept. 2020, pp. 2565–2597.
[LST21a] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Lower bounds on Metropolized sampling methods for well-conditioned distributions”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 18812–18824.
[LST21b] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Structured logconcave sampling
with a restricted Gaussian oracle”. In: arXiv e-prints, arXiv:2010.03106 (2021).
[LST21c] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Structured logconcave sampling
with a restricted Gaussian oracle”. In: Proceedings of Thirty Fourth Conference
on Learning Theory. Ed. by Mikhail Belkin and Samory Kpotufe. Vol. 134.
Proceedings of Machine Learning Research. PMLR, 15–19 Aug 2021, pp. 2993–
3050.
[LV09] John Lott and Cédric Villani. “Ricci curvature for metric-measure spaces via
optimal transport”. In: Ann. of Math. (2) 169.3 (2009), pp. 903–991.
[LV17] Yin Tat Lee and Santosh S. Vempala. “Eldan’s stochastic localization and
the KLS hyperplane conjecture: an improved lower bound for expansion”.
In: 58th Annual IEEE Symposium on Foundations of Computer Science—FOCS
2017. IEEE Computer Soc., Los Alamitos, CA, 2017, pp. 998–1007.
[LY21] Yin Tat Lee and Man-Chung Yue. “Universal barrier is 𝑛-self-concordant”.
In: Math. Oper. Res. 46.3 (2021), pp. 1129–1148.
[LZT22] Ruilin Li, Hongyuan Zha, and Molei Tao. “Sqrt(𝑑) dimension dependence of Langevin Monte Carlo”. In: International Conference on Learning Representations. 2022.
[Ma+21] Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L.
Bartlett, and Michael I. Jordan. “Is there an analog of Nesterov acceleration
for gradient-based MCMC?” In: Bernoulli 27.3 (2021), pp. 1942–1992.
[Maa11] Jan Maas. “Gradient flows of the entropy for finite Markov chains”. In: J.
Funct. Anal. 261.8 (2011), pp. 2250–2292.
[Mar96] Katalin Marton. “A measure concentration inequality for contracting Markov
chains”. In: Geom. Funct. Anal. 6.3 (1996), pp. 556–571.
[Mie13] Alexander Mielke. “Geodesic convexity of the relative entropy in reversible
Markov chains”. In: Calc. Var. Partial Differential Equations 48.1-2 (2013),
pp. 1–31.
[Mil09] Emanuel Milman. “On the role of convexity in isoperimetry, spectral gap
and concentration”. In: Invent. Math. 177.1 (2009), pp. 1–43.
[Mir17] Ilya Mironov. “Rényi differential privacy”. In: 2017 IEEE 30th Computer Security Foundations Symposium (CSF). 2017, pp. 263–275.
[MN15] Elchanan Mossel and Joe Neeman. “Robust optimality of Gaussian noise
stability”. In: J. Eur. Math. Soc. (JEMS) 17.2 (2015), pp. 433–482.
[MS19] Oren Mangoubi and Aaron Smith. “Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions 2: numerical integrators”. In: Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics. Ed. by Kamalika Chaudhuri and Masashi Sugiyama. Vol. 89. Proceedings of Machine Learning Research. PMLR, 16–18 Apr 2019, pp. 586–595.
[MV18] Oren Mangoubi and Nisheeth Vishnoi. “Dimensionally tight bounds for
second-order Hamiltonian Monte Carlo”. In: Advances in Neural Information
Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc., 2018.
[Nea11] Radford M. Neal. “MCMC using Hamiltonian dynamics”. In: Handbook of
Markov chain Monte Carlo. Chapman & Hall/CRC Handb. Mod. Stat. Methods.
CRC Press, Boca Raton, FL, 2011, pp. 113–162.
[Nes18] Yurii Nesterov. Lectures on convex optimization. Vol. 137. Springer Optimization and Its Applications. Springer, Cham, 2018, pp. xxiii+589.
[NN94] Yurii Nesterov and Arkadii S. Nemirovskii. Interior-point polynomial algorithms in convex programming. Vol. 13. SIAM Studies in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1994, pp. x+405.
[NW20] Richard Nickl and Sven Wang. “On polynomial-time computation of high-dimensional posterior measures by Langevin-type algorithms”. In: arXiv e-prints, arXiv:2009.05298 (2020).
[NY83] Arkadii S. Nemirovsky and David B. Yudin. Problem complexity and method
efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics.
Translated from the Russian and with a preface by E. R. Dawson. John Wiley
& Sons, Inc., New York, 1983, pp. xv+388.
[Oll07] Yann Ollivier. “Ricci curvature of metric spaces”. In: C. R. Math. Acad. Sci.
Paris 345.11 (2007), pp. 643–646.
[Oll09] Yann Ollivier. “Ricci curvature of Markov chains on metric spaces”. In: J.
Funct. Anal. 256.3 (2009), pp. 810–864.
[Ott01] Felix Otto. “The geometry of dissipative evolution equations: the porous
medium equation”. In: Comm. Partial Differential Equations 26.1-2 (2001),
pp. 101–174.
[Tal96] Michel Talagrand. “A new look at independence”. In: Ann. Probab. 24.1 (1996),
pp. 1–34.
[Ver18] Roman Vershynin. High-dimensional probability: an introduction with applications in data science. Vol. 47. Cambridge Series in Statistical and Probabilistic Mathematics. With a foreword by Sara van de Geer. Cambridge University Press, Cambridge, 2018, pp. xiv+284.
[Vil03] Cédric Villani. Topics in optimal transportation. Vol. 58. Graduate Stud-
ies in Mathematics. American Mathematical Society, Providence, RI, 2003,
pp. xvi+370.
[Vil09] Cédric Villani. Optimal transport: old and new. Vol. 338. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2009, pp. xxii+973.
[VW19] Santosh Vempala and Andre Wibisono. “Rapid convergence of the unadjusted Langevin algorithm: isoperimetry suffices”. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., 2019, pp. 8094–8106.
[Wib18] Andre Wibisono. “Sampling as optimization in the space of measures: the
Langevin dynamics as a composite optimization problem”. In: Proceedings of
the 31st Conference on Learning Theory. Ed. by Sébastien Bubeck, Vianney
Perchet, and Philippe Rigollet. Vol. 75. Proceedings of Machine Learning
Research. PMLR, June 2018, pp. 2093–3027.
[WSC21] Keru Wu, Scott Schmidler, and Yuansi Chen. “Minimax mixing time of the
Metropolis-adjusted Langevin algorithm for log-concave sampling”. In: arXiv
e-prints, arXiv:2109.13055 (2021).
[Zha+20] Kelvin S. Zhang, Gabriel Peyré, Jalal Fadili, and Marcelo Pereyra. “Wasserstein
control of mirror Langevin Monte Carlo”. In: Proceedings of Thirty Third
Conference on Learning Theory. Ed. by Jacob Abernethy and Shivani Agarwal.
Vol. 125. Proceedings of Machine Learning Research. PMLR, Sept. 2020,
pp. 3814–3841.
Index