Log-Concave Sampling

(unfinished draft)

Sinho Chewi

September 25, 2022


Contents

Preface i

I Diffusions in Continuous Time 1


1 The Langevin Diffusion in Continuous Time 3
1.1 A Primer on Stochastic Calculus . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Markov Semigroup Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 The Geometry of Optimal Transport . . . . . . . . . . . . . . . . . . . . . 30
1.4 The Langevin SDE as a Wasserstein Gradient Flow . . . . . . . . . . . . . 45
1.5 Overview of the Convergence Results . . . . . . . . . . . . . . . . . . . . 51
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

2 Functional Inequalities 65
2.1 Overview of the Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.2 Proofs via Markov Semigroup Theory . . . . . . . . . . . . . . . . . . . . 68
2.3 Operations Preserving Functional Inequalities . . . . . . . . . . . . . . . 77
2.4 Concentration of Measure . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.5 Isoperimetric Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.6 Metric Measure Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.7 Discrete Space and Time . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

3 Additional Topics in Stochastic Calculus 135


3.1 Quadratic Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

II Complexity of Sampling 141


4 Analysis of Langevin Monte Carlo 143
4.1 Proof via Wasserstein Coupling . . . . . . . . . . . . . . . . . . . . . . . 143
4.2 Proof via Interpolation Argument . . . . . . . . . . . . . . . . . . . . . . 148
4.3 Proof via Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . 151
4.4 Proof via Girsanov’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . 156
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

5 Convergence in Rényi Divergence 163


5.1 Proof under LSI via Interpolation Argument . . . . . . . . . . . . . . . . 163
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

6 Faster Low-Accuracy Samplers 171


6.1 Randomized Midpoint Discretization . . . . . . . . . . . . . . . . . . . . 171
6.2 Hamiltonian Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.3 The Underdamped Langevin Diffusion . . . . . . . . . . . . . . . . . . . . 180
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

7 High-Accuracy Samplers 191


7.1 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.2 The Metropolis–Hastings Filter . . . . . . . . . . . . . . . . . . . . . . . . 193
7.3 An Overview of High-Accuracy Samplers . . . . . . . . . . . . . . . . . . 196
7.4 Markov Chains in Discrete Time . . . . . . . . . . . . . . . . . . . . . . . 201
7.5 Analysis of MALA for a Feasible Start . . . . . . . . . . . . . . . . . . . . 208
7.6 Analysis of MALA for a Warm Start . . . . . . . . . . . . . . . . . . . . . 212
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

8 The Proximal Sampler 219


8.1 Introduction to the Proximal Sampler . . . . . . . . . . . . . . . . . . . . 219
8.2 Convergence under Strong Log-Concavity . . . . . . . . . . . . . . . . . 221
8.3 Simultaneous Heat Flow and Time Reversal . . . . . . . . . . . . . . . . . 224
8.4 Convergence under Log-Concavity . . . . . . . . . . . . . . . . . . . . . 228
8.5 Convergence under Functional Inequalities . . . . . . . . . . . . . . . . . 230
8.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

9 Lower Bounds for Sampling 237


9.1 A Query Complexity Result in One Dimension . . . . . . . . . . . . . . . 237
9.2 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

10 Structured Sampling 249


10.1 Coordinate Langevin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
10.2 Mirror Langevin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
10.3 Proximal Langevin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
10.4 Stochastic Gradient Langevin . . . . . . . . . . . . . . . . . . . . . . . . . 263
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

11 Non-Log-Concave Sampling 267


11.1 Approximate First-Order Stationarity via Fisher Information . . . . . . . 267
11.2 Fisher Information Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
11.3 Applications of the Fisher Information Bound . . . . . . . . . . . . . . . 272
Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

References 277

Index 291
Preface

What This Book Contains


As discussed in the next section, a large portion of this book is dedicated to a systematic
and unified treatment of recent developments in the complexity theory for log-concave
sampling, with a particular emphasis on connections with the field of optimization. Many
of these developments appear here in textbook form for the first time. Although this is
still an active area of research, at this time there is enough beautiful mathematics and
canonical theory that it seemed a shame not to have available an exposition which is
accessible to, say, an ambitious graduate student.
From a broader view, however, it is not the specific applications to log-concave sam-
pling, but rather the general perspective and techniques used, that will have the largest
impact on the reader. With this in mind, the book includes several topics which are not di-
rectly related to sampling, but loosely illustrate the general theme of “modern applications
of stochastic analysis to probability and statistics”. The applications range from classical
mathematical questions, such as concentration of measure and geometry, to instances in
which the philosophy of diffusion processes has inspired recent algorithms for machine
learning tasks. Although tastes change, the overall importance of this perspective only
seems to grow with time.
The subject matter of this book touches upon many fields, such as geometry, PDE,
stochastic calculus, etc., and a primary goal of the exposition here is to make the material
accessible without extensive background knowledge in these topics. This means that at
several places we have sacrificed full mathematical rigor in favor of (hopefully) more
lucid explanations, referring to the original sources for details. As such, these subjects are
not prerequisites for this book, although more background knowledge on the reader’s
part naturally translates into a healthier understanding of the context of the material.
The main exceptions to this statement are: (1) we assume that the reader is familiar
with graduate-level analysis and probability; (2) since much of the theory of sampling
is inspired by ideas from optimization, we highly recommend that the reader be familiar
with the latter, as treated in, e.g., [Bub15; Nes18].

The Complexity of Sampling


In this book, we consider the following canonical sampling problem:

Given query access to a smooth function $V : \mathbb{R}^d \to \mathbb{R}$, what is the minimum
number of queries required to output an approximate sample from the
probability density $\pi \propto \exp(-V)$ on $\mathbb{R}^d$?

The problem formulation is chosen due to the following considerations. In many
applications (some described below), we wish to sample from a probability density $\pi$,
and we have an explicit function $V : \mathbb{R}^d \to \mathbb{R}$ such that $\pi \propto \exp(-V)$. In other words,
since $\pi$ is a probability density, $\pi = \frac{1}{Z} \exp(-V)$, where $Z := \int \exp(-V)$ is called the
normalizing factor (or the partition function in statistical physics). Although $Z$ (and thus
$\pi$) is explicitly given in terms of the known function $V$, a naïve evaluation of $Z$ as a
high-dimensional integral is intractable. Indeed, the usual approach of approximating an
integral by a sum requires discretizing space via a fine grid whose size scales exponentially
in the dimension $d$. Moreover, even if we had access to $Z$, it is still not clear how we could
use this to sample from $\pi$. Therefore, the focus here is to develop direct methods for the
sampling task which bypass the computation of $Z$.¹
Not only do we want to develop fast algorithms, we also want to understand the
inherent complexity of the sampling task, which in turn allows us to identify optimal
algorithms. By complexity, we do not mean computational complexity, since proving
lower bounds in that context is out of reach (besides, we would have to spend too much
time worrying about the bit representation of 𝑉 ). Instead, following the well-trodden
path of optimization, we adopt a model in which we only have access to 𝑉 through
queries made to an oracle, and our notion of complexity is the number of queries made.
This is known as oracle complexity or query complexity; see [NY83, §1] for a detailed
discussion. We will usually consider a first-order oracle, i.e. given a point 𝑥 ∈ R𝑑 , the
oracle returns (𝑉 (𝑥), ∇𝑉 (𝑥)). Since 𝑉 is only well-defined up to an additive constant, we
can equivalently imagine that the oracle returns (𝑉 (𝑥) − 𝑉 (0), ∇𝑉 (𝑥)).
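To make the oracle model concrete, here is a minimal Python sketch of a first-order oracle that counts its queries. The class name `FirstOrderOracle`, the attribute `num_queries`, and the quadratic example potential are illustrative assumptions and do not come from this book.

```python
from typing import Callable, List, Tuple

Vector = List[float]


class FirstOrderOracle:
    """Hypothetical first-order oracle: returns (V(x) - V(0), grad V(x)) and counts queries."""

    def __init__(self, V: Callable[[Vector], float],
                 grad_V: Callable[[Vector], Vector], dim: int):
        self._V = V
        self._grad_V = grad_V
        self._V0 = V([0.0] * dim)  # V is only defined up to an additive constant
        self.num_queries = 0

    def query(self, x: Vector) -> Tuple[float, Vector]:
        self.num_queries += 1  # the complexity measure is the number of calls
        return self._V(x) - self._V0, self._grad_V(x)


# Example potential (an assumption for illustration): V(x) = |x|^2 / 2 + 3.
def V(x: Vector) -> float:
    return 0.5 * sum(xi * xi for xi in x) + 3.0


def grad_V(x: Vector) -> Vector:
    return list(x)


oracle = FirstOrderOracle(V, grad_V, dim=2)
value, gradient = oracle.query([1.0, 2.0])
```

An algorithm written against `oracle.query` never sees the additive constant in the example potential, and its cost in this model is simply the final value of `oracle.num_queries`.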
Before considering the problem further, here are some important applications.
¹ In fact, it goes the other way around: the state-of-the-art methods for approximately computing $Z$ are
based on the sampling methods we develop here.

1. (Bayesian statistics) Suppose that we wish to make inferences about a parameter
$\vartheta$ of interest, which lies in a space $\Theta \subseteq \mathbb{R}^d$. As a Bayesian, we have a prior density
$p_\vartheta$ over $\Theta$ which encodes our subjective beliefs about the value of $\vartheta$ prior to seeing
any data. Next, we collect some data $X$ which, conditionally on the value of $\vartheta$, is
drawn from a density $p_{X \mid \vartheta}(\cdot \mid \vartheta)$. According to Bayesian statistics, we should then
compute the posterior distribution

\[ p_{\vartheta \mid X}(\theta \mid X) \propto \frac{p_\vartheta(\theta)\, p_{X \mid \vartheta}(X \mid \theta)}{\int_\Theta p_\vartheta(\mathrm{d}\theta')\, p_{X \mid \vartheta}(X \mid \theta')}\,, \]

which encodes our new beliefs about 𝜗 after seeing the data.
Typically, we have access to the functional forms of $p_\vartheta$ and $\{p_{X \mid \vartheta}(\cdot \mid \theta)\}_{\theta \in \Theta}$, so we
can evaluate these densities and compute gradients. However, the denominator
of $p_{\vartheta \mid X}$ is precisely the normalizing constant described previously and cannot be
naïvely evaluated. Moreover, even if we had the functional form of $p_{\vartheta \mid X}(\cdot \mid X)$, we
would still not be able to compute expectations $\mathbb{E}[\varphi(\vartheta) \mid X]$ of test functions w.r.t.
the posterior without evaluating another high-dimensional integral. Instead, the
sampling methods we discuss in this book can output random variables $\vartheta_1, \dots, \vartheta_n$
whose distributions are approximately $p_{\vartheta \mid X}(\cdot \mid X)$, and the expectation can be
approximated to arbitrary accuracy via the averages $n^{-1} \sum_{i=1}^n \varphi(\vartheta_i)$.

2. (high-dimensional integration) More generally, computing integrals of functions
against a known density 𝜋 is a fundamental task in scientific computing. In many
high-dimensional applications, the strategy of drawing samples from 𝜋 and then
approximating integrals via Monte Carlo averages is in fact the only known way to
efficiently tackle this problem.

3. (privacy) As machine learning algorithms are continually deployed in application
domains with personal and sensitive information, there is growing concern about
maintaining the privacy of the data on which the machine learning models are
trained. One way to address this issue is to require that the algorithm be differentially
private, which loosely speaking requires the output of the model to not depend
too much on the presence or absence of a single data point. The most common
method to achieve this goal is via the careful addition of noise to the algorithm.
Readers who are interested in the mathematics of privacy will benefit from a healthy
understanding of the analysis of sampling algorithms, and vice versa.

4. (statistical physics) In a physical system, 𝑉(𝑥) represents the energy of a state
𝑥. In this situation, thermodynamics predicts that the equilibrium distribution
over states is the Boltzmann (or Gibbs) distribution whose density is proportional
to exp(−𝑉 /𝑇 ) (where 𝑇 is the temperature of the system). Naturally, sampling
provides a method for probing properties of the equilibrium distribution. More
subtly, the mixing time of specific sampling algorithms also provides information
about the system such as metastability phenomena; we revisit this in Chapter 11.
Due to this physical interpretation, we will often refer to 𝑉 as the potential energy.

5. (uncertainty quantification) In order to better understand the risks inherent in
any given system, it is important to quantify how much uncertainty is present
in any given prediction. This application is closely related to the discussion on
Bayesian statistics, since a Bayesian framework is a natural approach for performing
uncertainty quantification. More generally, the choice to use sampling rather than
optimization reflects a desire to understand typical outcomes of a procedure rather
than choosing a single fitted model which may fall victim to model misspecification.
Besides these examples, it is no surprise that sampling arises in many other applications,
since sampling is a fundamental algorithmic primitive. As such, sampling methods are
employed daily in applied domains such as biology, climatology, and cosmology.
Next, we turn towards the how rather than the why. A key theme of this book is
the surprising and close connection between methods in optimization and methods in
sampling. To illustrate, we introduce our first sampling method, which is the sampling
analogue of the well-known gradient descent algorithm from optimization. The Langevin
diffusion is the solution (𝑍𝑡 )𝑡 ≥0 to the stochastic differential equation (SDE)

\[ \mathrm{d}Z_t = \underbrace{-\nabla V(Z_t)\,\mathrm{d}t}_{\text{gradient flow}} + \underbrace{\sqrt{2}\,\mathrm{d}B_t}_{\text{Brownian motion}}\,. \]

With a pure gradient flow d𝑍𝑡 = −∇𝑉 (𝑍𝑡 ) d𝑡, we would expect the dynamics to converge to
stationary points of 𝑉 . The Brownian motion ensures that we fully explore the distribution
𝜋, as is required in sampling. Under mild conditions, the unique stationary distribution
of the Langevin diffusion is indeed 𝜋 ∝ exp(−𝑉 ), which makes this diffusion a good
candidate upon which to base a sampling algorithm.
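The stationarity claim can be checked formally. The sketch below assumes the Fokker–Planck equation for the law $\pi_t$ of $Z_t$, which is not derived in this preface, and it ignores all regularity and decay conditions:

```latex
\begin{align*}
\partial_t \pi_t
  &= \operatorname{div}(\pi_t \nabla V) + \Delta \pi_t
   = \operatorname{div}(\pi_t \nabla V + \nabla \pi_t)\,, \\
\pi \propto \exp(-V)
  &\implies \nabla \pi = -\pi \nabla V
   \implies \operatorname{div}(\pi \nabla V + \nabla \pi) = 0\,.
\end{align*}
```

Hence $\pi$ is left unchanged by the dynamics. Uniqueness of the stationary distribution is the part that requires the mild conditions mentioned above.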
Since the Langevin diffusion is “a gradient flow + noise”, it is no wonder that researchers
have drawn parallels between this diffusion and the gradient flow from optimization.
However, the connection actually lies much deeper than this superficial observation would
suggest. There is a natural geometry on the space of probability measures with finite
second moment, $\mathcal{P}_2(\mathbb{R}^d)$, namely the 2-Wasserstein distance $W_2$ from the theory of optimal
transport. The space $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ turns out to be much richer than a metric space; in fact,
it is almost a Riemannian manifold. In turn, the Riemannian structure allows us to define
gradient flows on this space. The punchline here is that if $\pi_t$ denotes the law of $Z_t$, then
the curve of measures $t \mapsto \pi_t$ is the gradient flow of the Kullback–Leibler (KL) divergence
$\mathrm{KL}(\cdot \,\|\, \pi)$ with respect to the $W_2$ geometry. Hence, at the level of the trajectory $(Z_t)_{t \ge 0}$,
the Langevin diffusion is a noisy gradient flow, but at the level of measures (𝜋𝑡 )𝑡 ≥0 , it is
precisely a gradient flow! This remarkable connection was introduced in the seminal work
of Jordan, Kinderlehrer, and Otto [JKO98].
This perspective suggests that we can study the convergence of the Langevin diffusion
using tools from optimization. For example, a standard assumption in the optimization
literature which allows for fast rates of convergence is that of strong convexity of the
objective function. Hence, we can ask under what conditions the functional $\mathrm{KL}(\cdot \,\|\, \pi)$
on the space of measures is strongly convex along 𝑊2 geodesics. Quite pleasingly, this
is equivalent to the (Euclidean) strong convexity of the potential 𝑉 . Consequently, the
assumption of strong convexity of 𝑉 , which is natural in optimization for studying gradient
flows, turns out to be natural in the sampling context as well.
Much of this book is devoted to the case when 𝑉 is strongly convex; we refer to 𝜋
as being strongly log-concave. Besides its naturality and simplicity, it is also a practical
assumption. For example, in the application to Bayesian statistics, the Bernstein–von
Mises theorem states that as the number of data points tends to infinity, the posterior
distribution closely resembles a Gaussian distribution and is thus (almost) strongly log-
concave, a fact which has already been exploited to give sampling guarantees in the
context of Bayesian inverse problems (see, e.g., [NW20]). However, some of the results
also apply to restricted classes of non-log-concave measures, and in Chapter 11 we will
see what can be said about non-log-concave sampling in general.
Before using the Langevin diffusion for sampling, however, it is first necessary to
discretize the process in time. The simplest discretization, known as the Euler–Maruyama
discretization, proceeds by fixing a step size ℎ > 0 and following the iteration

\[ X_{(k+1)h} := X_{kh} - h\,\nabla V(X_{kh}) + \sqrt{2}\,(B_{(k+1)h} - B_{kh})\,. \]

Since the Brownian increment $B_{(k+1)h} - B_{kh}$ has the $\mathrm{normal}(0, h I_d)$ distribution, this
iteration can be easily implemented once we have access to a gradient oracle for 𝑉 and
the ability to draw standard Gaussian variables. This iteration is commonly known as the
Langevin Monte Carlo (LMC) algorithm, or the unadjusted Langevin algorithm (ULA); in
this book, we stick to the former acronym.
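The iteration above can be sketched in a few lines of Python using only the standard library. The function name `lmc`, the Gaussian test potential $V(x) = |x|^2/2$, the initial point, and the step size are assumptions chosen for illustration, not choices made in this book.

```python
import math
import random


def lmc(grad_V, x0, step_size, n_steps, rng):
    """Run Langevin Monte Carlo from x0 and return the final iterate."""
    x = list(x0)
    # sqrt(2) times a Brownian increment over time h is sqrt(2 h) times a standard Gaussian
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(n_steps):
        g = grad_V(x)  # one gradient query per iteration
        x = [xi - step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x


def grad_V_gaussian(x):
    # V(x) = |x|^2 / 2, so that pi is the standard Gaussian
    return list(x)


rng = random.Random(0)
h = 0.1
# Independent chains, each started far from the mode at x0 = 5.
samples = [lmc(grad_V_gaussian, [5.0], h, 200, rng)[0] for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

For this Gaussian example one can check by hand that the stationary variance of the chain is $1/(1 - h/2)$ rather than $1$, a reminder that the discretization introduces a bias which vanishes as $h \to 0$.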
The LMC algorithm is the starting point of our study. As a result of research in the
last decade, we now have the following guarantee. For any of the common divergences
$\mathsf{d}$ between probability measures, e.g., $\mathsf{d}(\mu, \pi) = W_2(\mu, \pi)$ or $\mathsf{d}(\mu, \pi) = \sqrt{\mathrm{KL}(\mu \,\|\, \pi)}$, and
with an appropriate choice of initialization and step size, the law $\mu_{Nh}$ of the $N$-th iterate
of LMC satisfies $\mathsf{d}(\mu_{Nh}, \pi) \le \varepsilon$ with a number of iterations $N$ which is polynomial
in the problem parameters (the dimension $d$, the condition number $\kappa$ of $V$, and the
inverse accuracy $1/\varepsilon$). For example, when $\mathsf{d} = \sqrt{\mathrm{KL}}$, the state-of-the-art guarantee reads
$N = \widetilde{O}(\kappa d/\varepsilon^2)$. Whereas the convergence of the continuous-time diffusion is classical and
typically proven via abstract calculus, the quantitative non-asymptotic convergence of the
discretized algorithm necessitates the development of a new toolbox of analysis techniques.
A primary goal of this book is to make this toolbox more accessible to researchers who are
not yet acquainted with the field.
Beyond the standard LMC algorithm, there is now a rich variety of algorithms in
the sampling literature. Some algorithms are directly inspired by other optimization
algorithms (e.g., mirror descent), whereas other algorithms have their root in the classical
theory of Markov processes (e.g., the use of a Metropolis–Hastings filter). We will also
explore some of these more sophisticated algorithms in detail, as in many cases they
represent (provably) substantial improvements over standard LMC.
Finally, although we began this introduction by discussing the goal of understanding
the complexity of sampling, in fact the complexity is not yet well-understood. The issue
here is that there are currently very few lower bounds on the complexity of sampling. This
is in contrast with the field of optimization, in which oracle complexity lower bounds have
in most situations identified nearly optimal algorithms for optimizing various function
classes. In Chapter 9, we will explain the current progress towards achieving this goal for
sampling, but much work remains to be done.
For example, here is the precise statement for a fundamental open question about the
complexity of sampling.

Let $\pi \propto \exp(-V)$ be a probability density on $\mathbb{R}^d$. Determine the minimum
number of queries to a first-order oracle for $V$ required to output a sample
whose law $\mu$ satisfies $\|\mu - \pi\|_{\mathrm{TV}} \le \varepsilon$, uniformly over the following class of
potentials: $V$ is twice continuously differentiable, satisfying the conditions
$0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$.

What This Book Does Not Contain


At the risk of offending researchers who are omitted even from the list of omissions, here
we point out a few egregious exclusions. First, as mentioned previously, the price we paid
for a succinct exposition of a variety of fields is a lack of rigorous development of the
fundamentals of said fields, which we leave to the reader to pursue more thoroughly.
The field of sampling has a rich literature spanning decades, and although we have
made an effort to cite the works most relevant to the modern perspective, it was not
possible to cite even a vanishing fraction of the applied and/or classical literature. This
extends to even recent theoretical works on log-concave sampling, for which we have
omitted any discussion of sampling from convex bodies or polytopes. Although these
works constitute fundamental developments in the field, here we chose to limit our focus
to the part of the literature which is more strongly inspired by optimization algorithms.
Naturally, the other topics we explore in the book are far from comprehensive.

Notational Conventions
The symbols ∧ and ∨ mean “minimum” and “maximum” respectively. We write $a \lesssim b$ or
$a = O(b)$ to mean that $a \le Cb$ for a universal constant $C > 0$. Similarly, $a \gtrsim b$ or $a = \Omega(b)$
mean that $a \ge cb$ for a universal constant $c > 0$, and $a \asymp b$ or $a = \Theta(b)$ mean that both
$a \lesssim b$ and $a \gtrsim b$.
For a function $f : \mathbb{R}^d \to \mathbb{R}$, we write $\partial_i f$ to denote the $i$-th partial derivative of $f$. The
gradient $\nabla f$ is the vector of partial derivatives $(\partial_1 f, \dots, \partial_d f)$, and the Hessian $\nabla^2 f$ is the
matrix $(\partial_i \partial_j f)_{i,j \in [d]}$. For a vector field $v : \mathbb{R}^d \to \mathbb{R}^d$, we also use the notation $\nabla v$ to denote
the Jacobian matrix of $v$. The divergence of a vector field $v$ is $\operatorname{div} v = \nabla \cdot v = \sum_{i=1}^d \partial_i v_i$, and
the Laplacian of $f$ is $\Delta f = \operatorname{tr} \nabla^2 f = \sum_{i=1}^d \partial_i^2 f$.

Acknowledgements
This book started off as a series of lecture notes during my visit to New York University
in Spring 2022, and owes much to the audience and hospitality there.
Before its inception, however, the idea of writing a book was first conceived during
the Simons Institute for the Theory of Computing program on Geometric Methods in
Sampling and Optimization (GMOS) in Fall 2021. There, I met a lot of my long-term
collaborators and learned much of what I currently know about sampling. Overall, I found
the program to be incredibly intellectually stimulating and I am grateful.
I am indebted to collaborators with whom I have had many fruitful conversations,
including (but not limited to): Kwangjun Ahn, Francis Bach, Krishnakumar Balasubrama-
nian, Silvère Bonnabel, Joan Bruna, Sébastien Bubeck, Sitan Chen, Yongxin Chen, Yuansi
Chen, Xiang Cheng, Jaume de Dios Pont, Alain Durmus, Ronen Eldan, Murat Erdogdu,
Patrik Gerber, Ramon van Handel, Ye He, Marc Lambert, Thibaut Le Gouic, Holden Lee,
Yin Tat Lee, Jerry Li, Mufan (Bill) Li, Yuanzhi Li, Chen Lu, Jianfeng Lu, Yian Ma, Tyler
Maunu, Shyam Narayanan, Philippe Rigollet, Lionel Riou-Durand, Adil Salim, Ruoqi Shen,
Austin Stromme, Kevin Tian, Jure Vogrinc, Lihan Wang, Andre Wibisono, Anru Zhang,
and Matthew Zhang.
I am also appreciative of the encouragement and suggestions from Sébastien Bubeck,
Jonathan Niles-Weed, Martin Wainwright, and my advisor Philippe Rigollet, as well as
the careful reading of Michael Diao and Aram-Alexandre Pooladian.
Thank you to all of the friends who kept me sane throughout.
Part I

Diffusions in Continuous Time

CHAPTER 1

The Langevin Diffusion in Continuous Time

In this chapter, we study the continuous-time Langevin diffusion with potential $V$,
which is the solution to the following stochastic differential equation (SDE):

\[ \mathrm{d}Z_t = -\nabla V(Z_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,. \tag{1.E.1} \]
We begin with a quick introduction to stochastic calculus in order to make sense of this
equation. Then, we introduce two powerful frameworks for analyzing the Langevin diffu-
sion: Markov semigroup theory, and the calculus of optimal transport. These frameworks
are two perspectives on the same diffusion, and the abstract calculus rules we develop
within each framework streamline important computations.
A rigorous mathematical treatment of the theory in this chapter requires addressing
substantial analytical technicalities, such as checking that the various partial differential
equations (PDEs) are well-posed and that the calculations are carefully justified. We will
not attempt to do so here and instead refer to the bibliography for detailed treatments. In
particular, the “proofs” in this section are more like “proof sketches” which are meant to
convey the main intuition.

1.1 A Primer on Stochastic Calculus


In this section, we introduce just enough stochastic calculus to understand the meaning
of the SDE (1.E.1). See [Ste01; Le 16] for thorough expositions. We treat further topics in
stochastic calculus in Chapter 3.


In Section 1.1.4, we discuss the rather technical construction of the Itô integral. For
the remainder of the book, the details of the construction are less important than the
calculation rules that follow. The reader who is unfamiliar with stochastic calculus is
encouraged to skip Section 1.1.4 upon first reading.

1.1.1 The Itô Integral

Definition 1.1.1. (Standard) Brownian motion is a stochastic process $(B_t)_{t \ge 0}$ in $\mathbb{R}^d$
satisfying the following properties:

1. $B_0 = 0$.

2. (independence of increments) For all $0 < t_1 < \cdots < t_k$, the random variables
$(B_{t_1}, B_{t_2} - B_{t_1}, \dots, B_{t_k} - B_{t_{k-1}})$ are mutually independent.

3. (law of the increments) For all $0 \le s < t < \infty$,

\[ B_t - B_s \sim \mathrm{normal}(0, (t - s) I_d)\,. \]

4. (continuity of the paths) Almost surely, $t \mapsto B_t$ is continuous.

Brownian motion was originally introduced over a century ago as a model for the
jittery path of a particle which is constantly colliding with surrounding molecules. Since
its inception, Brownian motion has been used to model the flow of heat, to price options
at the financial market, to solve partial differential equations, to tease out the geometry of
manifolds, and of course, to sample from probability distributions. It is perhaps not clear
at first sight that such a process even exists,¹ but it would take us too far afield to give a
construction here.
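Although a rigorous construction is omitted, the defining properties already suggest how to simulate Brownian motion on a time grid: start at zero and add independent Gaussian increments. The following Python sketch does this in dimension one. The function name `brownian_path`, the grid size, and the number of paths are illustrative assumptions, and a grid simulation says nothing about the path between grid points.

```python
import math
import random


def brownian_path(T, n_steps, rng):
    """Return [B_0, B_dt, ..., B_T] on a uniform grid for one-dimensional Brownian motion."""
    dt = T / n_steps
    path = [0.0]  # property 1: B_0 = 0
    for _ in range(n_steps):
        # properties 2 and 3: independent increments with law normal(0, dt)
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


rng = random.Random(0)
n_paths, n_steps = 4000, 20
paths = [brownian_path(1.0, n_steps, rng) for _ in range(n_paths)]

half = n_steps // 2
first = [p[half] for p in paths]           # B_{1/2}
second = [p[-1] - p[half] for p in paths]  # B_1 - B_{1/2}

# Empirical checks: Var(B_1) should be close to 1, and the two increments uncorrelated.
var_B1 = sum(p[-1] ** 2 for p in paths) / n_paths
cov_increments = sum(a * b for a, b in zip(first, second)) / n_paths
print(var_B1, cov_increments)
```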
Instead, our goal is to compute integrals involving Brownian motion: given a stochastic
process $(\eta_t)_{t \ge 0}$, how do we make sense of an expression such as $\int_0^T \eta_t \,\mathrm{d}B_t$? Once we have
stochastic integration in hand, we can then formulate and solve stochastic differential
equations. The solution to such an equation is a diffusion process, no less jittery than the
Brownian motion which drives it, and yet in the right hands it becomes an incredible tool
for solving a plethora of disparate problems.
¹ Of course, this did not stop Einstein from using it to probe the microscopic structure of matter.

The main technical difficulty in defining the stochastic integral is that Brownian motion
is an irregular process: for small $t > 0$, by definition $B_t \sim \mathrm{normal}(0, t I_d)$, which means that
$|B_t|$ is typically of size $\asymp \sqrt{t}$. In particular, this prevents Brownian motion from being
differentiable at 0, or indeed, anywhere. Nevertheless, such stochastic integrals can be
meaningfully defined and used to build a far-reaching calculus.
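To quantify the irregularity, here is a short worked computation in dimension one (an illustration added here, using only the Gaussian law of $B_t$). Since $B_t/t \sim \mathrm{normal}(0, 1/t)$,

```latex
\[
\mathbb{E}|B_t| = \sqrt{\frac{2t}{\pi}}\,,
\qquad\text{so}\qquad
\mathbb{E}\Bigl|\frac{B_t - B_0}{t}\Bigr| = \sqrt{\frac{2}{\pi t}} \to \infty
\quad\text{as } t \searrow 0\,.
\]
```

The difference quotient at $0$ therefore spreads out rather than settling down to a finite limit, which is consistent with the failure of differentiability.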
We give the high-level idea of the construction of the Itô integral here, deferring details
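to Section 1.1.4. We work on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ which is complete, filtered, and
right-continuous, meaning that there is an increasing family $(\mathcal{F}_t)_{t \ge 0}$ of $\sigma$-algebras with
$\bigcup_{t=0}^{\infty} \mathcal{F}_t \subseteq \mathcal{F}$, with $\bigcap_{t > s} \mathcal{F}_t = \mathcal{F}_s$ for all $s \ge 0$, and such that $\mathcal{F}_0$ contains all subsets of null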
sets. We assume that Brownian motion is adapted to the filtration: 𝐵𝑡 is ℱ𝑡 -measurable
for each 𝑡 ≥ 0.

Defining the Itô integral at a single time $T$. Suppose first that $(\eta_t)_{t \ge 0}$ is a process of
the form

\[ \eta_t = \sum_{i=0}^{k-1} H_i \,\mathbf{1}\{t \in (t_i, t_{i+1}]\} \tag{1.1.2} \]

for some $0 \le t_0 < t_1 < \cdots < t_k$, where $H_i$ is bounded and $\mathcal{F}_{t_i}$-measurable. We call $\eta$ an
elementary process. In this case, perhaps the only reasonable definition of the stochastic
integral is to take

\[ \int_0^T \eta_t \,\mathrm{d}B_t := \sum_{i=0}^{k-1} H_i \,(B_{t_{i+1} \wedge T} - B_{t_i \wedge T})\,. \tag{1.1.3} \]

This is indeed what we shall do, but for the moment we will refrain from using the integral
symbol and write this as $\mathcal{I}_{[0,T]}(\eta)$ to avoid confusion.
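The definition (1.1.3) is a finite sum and can be evaluated directly once the Brownian path is known at the times $t_i \wedge T$. The Python sketch below does this in dimension one. The helper names `sample_brownian` and `elementary_integral`, the partition, and the choice $H_i = \mathrm{clip}(B_{t_i}, -1, 1)$ are assumptions made for illustration.

```python
import math
import random


def sample_brownian(times, rng):
    """Sample one-dimensional Brownian motion at the given increasing times (B_0 = 0)."""
    values, t_prev, b_prev = {0.0: 0.0}, 0.0, 0.0
    for t in times:
        b_prev += rng.gauss(0.0, math.sqrt(t - t_prev))  # increment ~ normal(0, t - t_prev)
        t_prev = t
        values[t] = b_prev
    return values


def elementary_integral(H, t, B, T):
    """Evaluate (1.1.3): sum over i of H[i] * (B[min(t[i+1], T)] - B[min(t[i], T)])."""
    return sum(
        H[i] * (B[min(t[i + 1], T)] - B[min(t[i], T)])
        for i in range(len(H))
    )


rng = random.Random(0)
t = [0.0, 0.5, 1.0, 2.0]  # partition t_0 < t_1 < t_2 < t_3
T = 1.5
B = sample_brownian(sorted(set(t + [T])), rng)

# H_i may depend only on the path up to time t_i; here H_i = clip(B_{t_i}, -1, 1).
H = [max(-1.0, min(1.0, B[t[i]])) for i in range(len(t) - 1)]
integral = elementary_integral(H, t, B, T)

# Sanity check: the constant integrand H_i = 1 telescopes to B_T - B_{t_0} = B_T.
assert abs(elementary_integral([1.0, 1.0, 1.0], t, B, T) - B[T]) < 1e-12
```

Note how the truncation at $T$ is handled entirely by the minimum: intervals beyond $T$ contribute zero, and the interval containing $T$ contributes only up to time $T$.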
We would like to extend this definition to more general processes, but before doing so
we record two key properties of the stochastic integral. The first is that $t \mapsto \mathcal{I}_{[0,t]}(\eta)$ is a
continuous martingale, i.e., it is continuous and satisfies the following definition.

Definition 1.1.4. A process $(M_t)_{t \ge 0}$ is a martingale w.r.t. the filtration $(\mathcal{F}_t)_{t \ge 0}$ if
for all $t \ge 0$, $M_t$ is $\mathcal{F}_t$-measurable and integrable, and

\[ \mathbb{E}[M_t \mid \mathcal{F}_s] = M_s\,, \qquad \text{for all } 0 \le s < t\,. \]

Indeed, we deduce that $t \mapsto \mathcal{I}_{[0,t]}(\eta)$ is a martingale from the fact that $H_i$ is
$\mathcal{F}_{t_i}$-measurable for each $i$, and because $(B_t)_{t \ge 0}$ is a martingale.
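The second key property is that we can compute the variance:

\begin{align}
\mathbb{E}[\mathcal{I}_{[0,T]}(\eta)^2]
&= \mathbb{E}\Bigl[\Bigl(\sum_{i=0}^{k-1} H_i \,(B_{t_{i+1} \wedge T} - B_{t_i \wedge T})\Bigr)^2\Bigr]
 = \sum_{i=0}^{k-1} \mathbb{E}[|H_i \,(B_{t_{i+1} \wedge T} - B_{t_i \wedge T})|^2] \tag{1.1.5} \\
&= \sum_{i=0}^{k-1} \mathbb{E}[H_i^2] \,\bigl\{(t_{i+1} \wedge T) - (t_i \wedge T)\bigr\}
 = \mathbb{E} \int_0^T \eta_t^2 \,\mathrm{d}t\,. \tag{1.1.6}
\end{align}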

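Here, we used the basic properties listed in the definition of Brownian motion, such as
independence of increments. In particular, the cross terms in the expanded square vanish:
for $i < j$, conditioning on $\mathcal{F}_{t_j}$ and using the martingale property of $(B_t)_{t \ge 0}$ shows that the
product of the $i$-th and $j$-th summands has expectation zero. This equation shows that if
$\mathbb{P}_T := \mathbb{P} \otimes \mathfrak{m}|_{[0,T]}$, where $\mathfrak{m}|_{[0,T]}$
is the Lebesgue measure on $[0,T]$, then the mapping $\eta \mapsto \mathcal{I}_{[0,T]}(\eta)$ is an isometry from
$L^2(\mathbb{P}_T)$ to $L^2(\mathbb{P})$. We use this isometry to extend the definition of the stochastic integral
as follows.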
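Both key properties can be checked numerically for a concrete elementary process. The Python sketch below estimates the mean of the integral, its second moment, and the right-hand side of (1.1.6) by Monte Carlo in dimension one. The partition, the sample size, and the choice $H_i = \mathrm{clip}(B_{t_i}, -1, 1)$ are assumptions for illustration, and agreement up to sampling error is of course not a proof.

```python
import math
import random

rng = random.Random(1)
t = [0.0, 0.25, 0.5, 1.0]  # partition; here T = t_k = 1, so no truncation is needed
n_samples = 20000

sum_I, sum_I2, sum_rhs = 0.0, 0.0, 0.0
for _ in range(n_samples):
    b, integral, rhs_sample = 0.0, 0.0, 0.0
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        H = max(-1.0, min(1.0, b))          # H_i = clip(B_{t_i}, -1, 1): bounded, known at time t_i
        dB = rng.gauss(0.0, math.sqrt(dt))  # independent increment B_{t_{i+1}} - B_{t_i}
        integral += H * dB
        rhs_sample += H * H * dt            # integral of eta_t^2 over (t_i, t_{i+1}]
        b += dB
    sum_I += integral
    sum_I2 += integral ** 2
    sum_rhs += rhs_sample

mean_I = sum_I / n_samples  # martingale started at 0, so the expectation is 0
lhs = sum_I2 / n_samples    # estimate of E[I^2]
rhs = sum_rhs / n_samples   # estimate of E int_0^T eta_t^2 dt
print(mean_I, lhs, rhs)
```

The two estimates `lhs` and `rhs` should agree up to Monte Carlo error, while `mean_I` should be close to zero.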
For a more general process $(\eta_t)_{t \ge 0}$, assume that it is progressive² and satisfies the
integrability condition

\[ \|\eta\|_{L^2(\mathbb{P}_T)}^2 = \mathbb{E} \int_0^T \eta_t^2 \,\mathrm{d}t < \infty\,. \]

One shows that $(\eta_t)_{t \ge 0}$ can be approximated by elementary processes $\{(\eta_t^{(k)})_{t \ge 0} : k \in \mathbb{N}\}$
of the form (1.1.2) in the $L^2(\mathbb{P}_T)$ norm. For each $k$, the stochastic integral $\mathcal{I}_{[0,T]}(\eta^{(k)})$ is
defined via (1.1.3), and $\lim_{k \to \infty} \mathcal{I}_{[0,T]}(\eta^{(k)})$ exists in $L^2(\mathbb{P})$ thanks to the isometry. We can
then take the limit to be the definition of the stochastic integral $\mathcal{I}_{[0,T]}(\eta)$.

Defining the Itô integral as a stochastic process. Although the procedure above
successfully defines I[0,𝑡] (𝜂) for a fixed time 𝑡 > 0, there is no guarantee of coherence
between different times 𝑡. The trouble arises because I[0,𝑡] (𝜂) is defined as a limit, but
this limit is only well-specified up to an event of measure zero, and these measure zero
events for different times 𝑡 might conceivably accumulate into something more. This
is undesirable because the true power of stochastic calculus comes from viewing the
stochastic integral as a time-indexed stochastic process in its own right.
The key insight is to go back to the approximating sequence {(𝜂𝑡(𝑘) )𝑡 ≥0 : 𝑘 ∈ N}. For
each 𝑘, the Itô integral is defined as an entire process 𝑡 ↦→ I[0,𝑡] (𝜂 (𝑘) ) via (1.1.27), and
moreover this process is a continuous martingale. We can then apply powerful results
on martingale convergence, which are developed in Section 1.1.4, in order to prove the
following theorem.

Theorem 1.1.7. Suppose that $(\eta_t)_{t \ge 0}$ is progressive and satisfies $\mathbb{E} \int_0^T \eta_t^2 \, \mathrm{d}t < \infty$. Then,
there exists a continuous martingale, denoted $(\int_0^t \eta_s \, \mathrm{d}B_s)_{t \ge 0}$, which is adapted to $(\mathcal{F}_t)_{t \ge 0}$
and satisfies
\[
\mathbb{E}\Bigl[\Bigl(\int_0^t \eta_s \, \mathrm{d}B_s\Bigr)^2\Bigr] = \mathbb{E} \int_0^t \eta_s^2 \, \mathrm{d}s , \qquad \text{for all } t \in [0,T] . \tag{1.1.8}
\]
The formula (1.1.8) is called the Itô isometry.

Also, for each $t \in [0,T]$, it holds that $\int_0^t \eta_s \, \mathrm{d}B_s = I_{[0,t]}(\eta)$ a.s.

² The process $(\eta_t)_{t \ge 0}$ is progressive if for all $T \ge 0$, the mapping $(\omega, t) \mapsto \eta_t(\omega)$ is measurable w.r.t.
$\mathcal{F}_T \otimes \mathcal{B}_{[0,T]}$, where $\mathcal{B}_{[0,T]}$ is the Borel $\sigma$-algebra on $[0,T]$.

Extending the definition via localization. There is one final step which is traditionally
taken, namely to expand the class of allowable integrands to progressive processes 𝜂 with
\[
\int_0^T \eta_s^2 \, \mathrm{d}s < \infty \quad \text{almost surely} . \tag{1.1.9}
\]
Note that this condition is weaker than the condition $\mathbb{E} \int_0^T \eta_s^2 \, \mathrm{d}s < \infty$. Such an extension
is evidently mathematically interesting, as we would like our definitions to be as broad as
possible. However, equally important is that it introduces the device of localization. On
the whole, localization actually serves to reduce the number of technicalities in the subject:
once introduced, it allows us to always work with a stopping time up to which the process
is as nice as one desires (e.g., bounded). The flexibility and utility that localization thus
brings cements its place as the natural mathematical framework for stochastic calculus.
However, this is not the focus of the book, and in what follows we will usually brush over
such localization arguments. For now, we simply introduce the basic definitions in order
to show the reader that the idea is actually fairly straightforward.

Definition 1.1.10. A stopping time 𝜏 is a random variable such that for each 𝑡 ≥ 0,
the event {𝜏 ≤ 𝑡 } is ℱ𝑡 -measurable.

Definition 1.1.11. An increasing sequence of stopping times (𝜏𝑛 )𝑛∈N is called a


localizing sequence for 𝜂 on [0,𝑇 ] if:

1. for all $n \in \mathbb{N}$, $(\eta_t \, \mathbb{1}\{t \le \tau_n\})_{t \ge 0}$ has finite $\|\cdot\|_{L^2(\mathbb{P}_T)}$ norm, and

2. 𝜏𝑛 → 𝑇 almost surely.

The good news is that localizing sequences are easy to find, and the following proposi-
tion barely needs a proof (and so we omit it).

Proposition 1.1.12. If 𝜂 is a progressive process satisfying the condition (1.1.9), then


the sequence $(\tau_n)_{n \in \mathbb{N}}$ defined by
\[
\tau_n := \inf\Bigl\{ t \ge 0 \Bigm| \int_0^t \eta_s^2 \, \mathrm{d}s \ge n \Bigr\} \wedge T
\]

is a localizing sequence for 𝜂 on [0,𝑇 ].

The idea now is simple: for each progressive process 𝜂 satisfying (1.1.9), let (𝜏𝑛 )𝑛∈N
be a localizing sequence for 𝜂 on [0,𝑇 ]. For each 𝑛 ∈ N, by the definition of a localizing
sequence, we can apply our existing definition of the Itô integral, which gives us a
continuous martingale
\[
t \mapsto \int_0^t \eta_s \, \mathbb{1}\{s \le \tau_n\} \, \mathrm{d}B_s . \tag{1.1.13}
\]

Then, we can define the Itô integral of 𝜂 to be a limit of the processes (1.1.13). The details
are straightforward, and omitted.
We can also define an analogue of martingales using localizing sequences; these are
almost martingales, but lack the required integrability.

Definition 1.1.14. A process (𝑀𝑡 )𝑡 ≥0 is a local martingale if it is adapted to the


filtration (ℱ𝑡 )𝑡 ≥0 and there is an increasing sequence (𝜏𝑛 )𝑛∈N of stopping times such
that 𝜏𝑛 → ∞ and for each 𝑛, the process 𝑡 ↦→ 𝑀𝑡∧𝜏𝑛 − 𝑀0 is a martingale w.r.t. (ℱ𝑡 )𝑡 ≥0 .

Proposition 1.1.15. If 𝜂 is a progressive process satisfying (1.1.9), then the Itô integral
$t \mapsto \int_0^t \eta_s \, \mathrm{d}B_s$ is a continuous local martingale.

Looking forward. The construction of the Itô integral may seem quite abstract; indeed,
we are sorely lacking in examples. At this juncture, it is common to work out simple
exercises such as computing $\int_0^t B_s \, \mathrm{d}B_s$, and while this is pedagogically natural it is also

liable to mislead the reader into thinking that the main use of Itô integration is to solve
synthetic problems with no apparent purpose. As counterintuitive as it may seem, our
solution to the heavy amount of abstraction will be more abstraction. In the next section,
we will develop the single most important computation rule in stochastic calculus (along
with the Itô isometry (1.1.8)), called Itô’s formula, after which we will hardly need to

return to the definition of a stochastic integral ever again. And even Itô’s formula will be
abstracted out into the language of Markov semigroups in Section 1.2. The upshot is that
we introduced the Itô integral because it is the foundation of our field, but most of what
we have developed thus far is not necessary for the remainder of the book.

1.1.2 Itô’s Formula


With the Itô integral in hand, we consider the following class of processes.

Definition 1.1.16. A stochastic process $(X_t)_{t \ge 0}$ is an Itô process if it is of the form
\[
X_t = X_0 + \int_0^t b_s \, \mathrm{d}s + \int_0^t \sigma_s \, \mathrm{d}B_s , \qquad \text{for } t \ge 0 ,
\]

where (𝑏𝑡 )𝑡 ≥0 takes values in R𝑑 , (𝜎𝑡 )𝑡 ≥0 takes values in R𝑑×𝑁 , and (𝐵𝑡 )𝑡 ≥0 is a standard
Brownian motion in R𝑁 .

Implicit in the above definition is that the process should be well-defined: the coeffi-
cients (𝑏𝑡 )𝑡 ≥0 and (𝜎𝑡 )𝑡 ≥0 should be progressive processes for which the integrals exist.
Also, the random variable 𝑋 0 should be ℱ0 -measurable, in which case the process (𝑋𝑡 )𝑡 ≥0
is also progressive.
We refer to (𝑏𝑡 )𝑡 ≥0 as the drift coefficient and (𝜎𝑡 )𝑡 ≥0 as the diffusion coefficient.
When the drift coefficient is zero, then (𝑋𝑡 )𝑡 ≥0 is simply an Itô integral, and thus a
continuous local martingale (Proposition 1.1.15). Otherwise, for a non-zero drift coefficient,
the process (𝑋𝑡 )𝑡 ≥0 is no longer necessarily a local martingale. As a shorthand, we often
write the Itô process in differential form:

d𝑋𝑡 = 𝑏𝑡 d𝑡 + 𝜎𝑡 d𝐵𝑡 . (1.1.17)

Our goal is to understand how the Itô process transforms when we compose it with a
smooth function 𝑓 : R𝑑 → R. This leads to Itô’s formula, which is the bread and butter of
stochastic calculus computations.
Although the notation (1.1.17) is informal, it conveys the main intuition. For $h > 0$
small, we can approximate $X_{t+h} \approx X_t + h \, b_t + \sqrt{h} \, \sigma_t \xi$, where $\xi \sim \mathrm{normal}(0, I_N)$. Note
that the $\sqrt{h}$ scaling comes from the fact that the Brownian increment $B_{t+h} - B_t$ has
the $\mathrm{normal}(0, h I_N)$ distribution. Now suppose that $f : \mathbb{R}^d \to \mathbb{R}$ is twice continuously
differentiable. Normally, to compute $f(X_{t+h}) - f(X_t)$ up to order $o(h)$, a first-order Taylor
expansion of $f$ suffices, but in stochastic calculus this would miss important terms arising
from the Brownian motion: indeed, second-order terms in $B_{t+h} - B_t$ are of order $h$ and
hence not negligible.
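Therefore, we carry out the Taylor expansion to an extra term:
\begin{align*}
f(X_{t+h}) - f(X_t)
&\approx \langle \nabla f(X_t), h \, b_t + \sqrt{h} \, \sigma_t \xi \rangle
+ \frac{1}{2} \, \langle h \, b_t + \sqrt{h} \, \sigma_t \xi, \nabla^2 f(X_t) \, (h \, b_t + \sqrt{h} \, \sigma_t \xi) \rangle \\
&= h \, \Bigl\{ \langle \nabla f(X_t), b_t \rangle + \frac{1}{2} \, \langle \sigma_t \xi, \nabla^2 f(X_t) \, \sigma_t \xi \rangle \Bigr\}
+ \sqrt{h} \, \langle \sigma_t^{\mathsf{T}} \nabla f(X_t), \xi \rangle + o(h) .
\end{align*}
This expression suggests that $(f(X_t))_{t \ge 0}$ is also an Itô process. The third term, which is of
order $\sqrt{h}$, turns into an Itô integral once integrated. Perhaps the most interesting term
is the second term, $\frac{h}{2} \, \langle \sigma_t^{\mathsf{T}} \nabla^2 f(X_t) \, \sigma_t, \xi \xi^{\mathsf{T}} \rangle$, which is a genuinely new feature of stochastic
calculus. If we sum up many of these increments, we end up with an expression like
$\frac{1}{2} \sum_{k=0}^{K} \langle \sigma_{t+kh}^{\mathsf{T}} \nabla^2 f(X_{t+kh}) \, \sigma_{t+kh}, h \, \xi_k \xi_k^{\mathsf{T}} \rangle$. If we replace each $h \, \xi_k \xi_k^{\mathsf{T}}$ by its expectation $h I_N$
(which must be carefully justified; see the calculation of quadratic variation in Section 3.1),
then this resembles a Riemann sum, which converges to the integral of $\frac{1}{2} \, \langle \nabla^2 f(X_t), \sigma_t \sigma_t^{\mathsf{T}} \rangle$.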
This is formalized in the following theorem.
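The replacement of $h \, \xi_k \xi_k^{\mathsf{T}}$ by its expectation can be illustrated numerically in the scalar case. The sketch below is an illustration under simplifying assumptions (one-dimensional Brownian motion, a hypothetical helper name): it sums squared Brownian increments over $[0, t]$ and shows that the sum concentrates around $t$ as the mesh is refined.

```python
import math
import random


def realized_quadratic_variation(t, n_steps, rng):
    """Sum of squared increments of a scalar Brownian motion over [0, t]."""
    h = t / n_steps
    # Each increment is normal(0, h), i.e. sqrt(h) * xi_k with xi_k standard normal,
    # so each summand is h * xi_k^2 with expectation h.
    return sum(rng.gauss(0.0, math.sqrt(h)) ** 2 for _ in range(n_steps))


rng = random.Random(1)
t = 1.0
for n_steps in (10, 100, 10000):
    print(n_steps, realized_quadratic_variation(t, n_steps, rng))

qv_fine = realized_quadratic_variation(t, 10000, rng)
```

The fluctuations of the sum around $t$ have standard deviation $\sqrt{2 t h}$, which vanishes as $h \to 0$; this is the heuristic reason the second-order term becomes a deterministic-looking $\mathrm{d}s$ integral.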

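Theorem 1.1.18 (Itô’s formula). Let $(X_t)_{t \ge 0}$ be an Itô process, $\mathrm{d}X_t = b_t \, \mathrm{d}t + \sigma_t \, \mathrm{d}B_t$, and
let $f \in C^2(\mathbb{R}^d)$. Then, $(f(X_t))_{t \ge 0}$ is also an Itô process which satisfies, for $t \ge 0$:
\[
f(X_t) - f(X_0) = \int_0^t \Bigl\{ \langle \nabla f(X_s), b_s \rangle + \frac{1}{2} \, \langle \nabla^2 f(X_s), \sigma_s \sigma_s^{\mathsf{T}} \rangle \Bigr\} \, \mathrm{d}s
+ \int_0^t \langle \sigma_s^{\mathsf{T}} \nabla f(X_s), \mathrm{d}B_s \rangle .
\]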

We omit the proof, since the bulk of the intuition is carried in the informal Taylor
series argument described above. Observe that since Itô integrals are (under appropriate
integrability conditions) continuous martingales, the expectation of the last term in Itô’s
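formula is typically zero. Therefore,³
\[
\mathbb{E} f(X_t) - \mathbb{E} f(X_0) = \int_0^t \mathbb{E}\Bigl\{ \langle \nabla f(X_s), b_s \rangle + \frac{1}{2} \, \langle \nabla^2 f(X_s), \sigma_s \sigma_s^{\mathsf{T}} \rangle \Bigr\} \, \mathrm{d}s , \tag{1.1.19}
\]
or in differential form,
\[
\partial_t \mathbb{E} f(X_t) = \mathbb{E}\Bigl\{ \langle \nabla f(X_t), b_t \rangle + \frac{1}{2} \, \langle \nabla^2 f(X_t), \sigma_t \sigma_t^{\mathsf{T}} \rangle \Bigr\} .
\]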
Itô’s formula can also be extended to time-dependent functions via
\begin{align*}
f(t, X_t) - f(0, X_0)
&= \int_0^t \Bigl\{ \partial_s f(s, X_s) + \langle \nabla f(s, X_s), b_s \rangle + \frac{1}{2} \, \langle \nabla^2 f(s, X_s), \sigma_s \sigma_s^{\mathsf{T}} \rangle \Bigr\} \, \mathrm{d}s \\
&\qquad + \int_0^t \langle \sigma_s^{\mathsf{T}} \nabla f(s, X_s), \mathrm{d}B_s \rangle .
\end{align*}

³ Even when the Itô integral is only a local martingale, one can usually justify the formula (1.1.19) anyway
using the localizing sequence and the dominated convergence theorem.

We revisit and streamline Itô’s formula in Section 3.1.
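As a sanity check of the expectation identity (1.1.19), one can simulate a simple Itô process and compare with the ODE that (1.1.19) predicts. The sketch below rests on assumptions that are not from the text: the scalar process $\mathrm{d}X_t = -X_t \, \mathrm{d}t + \sqrt{2} \, \mathrm{d}B_t$ with $X_0 = 2$, the test function $f(x) = x^2$, and an Euler–Maruyama discretization with a hypothetical step size. For this choice, (1.1.19) in differential form reads $\partial_t \mathbb{E}[X_t^2] = -2 \, \mathbb{E}[X_t^2] + 2$, so $\mathbb{E}[X_t^2] = 1 + (X_0^2 - 1) \, e^{-2t}$.

```python
import math
import random


def euler_maruyama_second_moment(x0, t_final, n_steps, n_paths, rng):
    """Monte Carlo estimate of E[X_t^2] for dX = -X dt + sqrt(2) dB."""
    h = t_final / n_steps
    total = 0.0
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            # X_{t+h} ~ X_t + h * b_t + sqrt(h) * sigma_t * xi with b = -x, sigma = sqrt(2).
            x += -x * h + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
        total += x * x
    return total / n_paths


x0, t_final = 2.0, 1.0
estimate = euler_maruyama_second_moment(x0, t_final, 100, 5000, random.Random(2))
# Solution of d/dt m = -2 m + 2 with m(0) = x0^2, as predicted by (1.1.19).
exact = 1.0 + (x0 ** 2 - 1.0) * math.exp(-2.0 * t_final)
print(estimate, exact)
```

The two printed values should agree up to Monte Carlo and discretization error; the stochastic integral term in Itô’s formula contributes nothing on average.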

1.1.3 Existence and Uniqueness of SDEs


Let 𝑏 : R+ ×R𝑑 → R𝑑 and 𝜎 : R+ ×R𝑑 → R𝑑×𝑁 . We now consider the stochastic differential
equation (SDE)

d𝑋𝑡 = 𝑏 (𝑡, 𝑋𝑡 ) d𝑡 + 𝜎 (𝑡, 𝑋𝑡 ) d𝐵𝑡 .

Suppose we are given a complete filtered probability space (Ω, ℱ, (ℱ𝑡 )𝑡 ≥0, P) which
supports a standard 𝑁 -dimensional adapted Brownian motion 𝐵. A solution to the SDE
is an adapted $\mathbb{R}^d$-valued process $X$ such that
\[
X_t = X_0 + \int_0^t b(s, X_s) \, \mathrm{d}s + \int_0^t \sigma(s, X_s) \, \mathrm{d}B_s .
\]

The question we address in this section is under what conditions there exists a unique
solution to the SDE. This is a question of a technical nature, but the answer is instructive
because the proof introduces standard arguments that recur frequently in stochastic
calculus. The main result is that if the coefficients 𝑏, 𝜎 are Lipschitz in space uniformly in
time, then the SDE admits a unique solution.
Before proceeding, we need the following lemma, which is used throughout the book.

Lemma 1.1.20 (Grönwall’s lemma). Let 𝑇 > 0 and let 𝑔 : [0,𝑇 ] → [0, ∞) be bounded
and measurable. Assume there exist $C_1, C_2 \ge 0$ such that
\[
g(t) \le C_1 + C_2 \int_0^t g(s) \, \mathrm{d}s , \qquad \forall t \in [0,T] .
\]

Then,

𝑔(𝑡) ≤ 𝐶 1 exp(𝐶 2𝑡) , ∀𝑡 ∈ [0,𝑇 ] .

Proof. By iterating the assumption, for each $n \in \mathbb{N}$,
\begin{align*}
g(t)
&\le C_1 + C_2 \int_0^t g(s_1) \, \mathrm{d}s_1
\le C_1 + C_1 C_2 t + C_2^2 \int_0^t \int_0^{s_1} g(s_2) \, \mathrm{d}s_2 \, \mathrm{d}s_1 \le \cdots \\
&\le C_1 \sum_{k=0}^{n} \frac{(C_2 t)^k}{k!} + C_2^{n+1} \int_0^t \cdots \int_0^{s_n} g(s_{n+1}) \, \mathrm{d}s_{n+1} \cdots \mathrm{d}s_1 .
\end{align*}
The remainder term is bounded by
\[
C_2^{n+1} \int_0^t \cdots \int_0^{s_n} g(s_{n+1}) \, \mathrm{d}s_{n+1} \cdots \mathrm{d}s_1
\le \frac{\|g\|_{\sup} \, (C_2 t)^{n+1}}{(n+1)!} \xrightarrow{n \to \infty} 0 ,
\]
which gives the result. □
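The two ingredients of this proof, the partial sums of the exponential series and the factorially decaying remainder, are easy to tabulate. The sketch below is only an illustration; the constants and function names are hypothetical, and the bound on $\|g\|_{\sup}$ is taken from the extremal case $g(s) = C_1 \exp(C_2 s)$, which satisfies the hypothesis with equality.

```python
import math


def gronwall_partial_sum(c1, c2, t, n):
    """Main term after n iterations: C1 * sum_{k=0}^{n} (C2 t)^k / k!."""
    return c1 * sum((c2 * t) ** k / math.factorial(k) for k in range(n + 1))


def remainder_bound(g_sup, c2, t, n):
    """Bound on the remainder term: ||g||_sup * (C2 t)^{n+1} / (n+1)!."""
    return g_sup * (c2 * t) ** (n + 1) / math.factorial(n + 1)


c1, c2, t = 1.5, 2.0, 3.0
limit = c1 * math.exp(c2 * t)  # the bound claimed by the lemma
g_sup = limit                  # sup over [0, t] of the extremal g(s) = C1 exp(C2 s)
for n in (5, 10, 20, 40):
    print(n, gronwall_partial_sum(c1, c2, t, n), remainder_bound(g_sup, c2, t, n))
print("limit:", limit)
```

As $n$ grows, the first column converges to $C_1 \exp(C_2 t)$ while the second tends to zero, which is exactly how the proof concludes.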
In the following theorem, for a matrix $M$,
\[
\|M\|_{\mathrm{HS}}^2 := \operatorname{tr}(M M^{\mathsf{T}}) = \operatorname{tr}(M^{\mathsf{T}} M)
\]
denotes the squared Hilbert–Schmidt (or Frobenius) norm of $M$. The uniqueness result states that
if there are two solutions $X$, $\tilde{X}$ to the SDE on the same probability space, driven by the same
Brownian motion, then $X = \tilde{X}$.

Theorem 1.1.21 (existence and uniqueness of SDE solutions). Assume that $b$ and $\sigma$
are continuous, and there exists $C > 0$ such that for all $t \ge 0$ and $x, y \in \mathbb{R}^d$,
\[
\|b(t, x) - b(t, y)\| \vee \|\sigma(t, x) - \sigma(t, y)\|_{\mathrm{HS}} \le C \, \|x - y\| .
\]

Then, for any complete filtered probability space (Ω, ℱ, (ℱ𝑡 )𝑡 ≥0, P) and 𝑥 ∈ R𝑑 , there
exists a unique solution 𝑋 for the SDE with 𝑋 0 = 𝑥.

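Proof. Uniqueness. Fix a time $T > 0$ and suppose there exist two solutions $(X_t)_{t \in [0,T]}$
and $(\tilde{X}_t)_{t \in [0,T]}$ of the SDE on $[0,T]$ with $X_0 = \tilde{X}_0$. For $t \in [0,T]$, we compute the difference
between $X_t$ and $\tilde{X}_t$ using the Itô isometry (1.1.8), the Cauchy–Schwarz inequality, and the
Lipschitz assumption:
\begin{align*}
\mathbb{E}[\|X_t - \tilde{X}_t\|^2]
&\le 2 \, \mathbb{E}\Bigl[ \Bigl\| \int_0^t \{b(s, X_s) - b(s, \tilde{X}_s)\} \, \mathrm{d}s \Bigr\|^2
+ \Bigl\| \int_0^t \{\sigma(s, X_s) - \sigma(s, \tilde{X}_s)\} \, \mathrm{d}B_s \Bigr\|^2 \Bigr] \\
&\le 2 \, \mathbb{E}\Bigl[ T \int_0^t \|b(s, X_s) - b(s, \tilde{X}_s)\|^2 \, \mathrm{d}s
+ \int_0^t \|\sigma(s, X_s) - \sigma(s, \tilde{X}_s)\|_{\mathrm{HS}}^2 \, \mathrm{d}s \Bigr] \\
&\le 2 C^2 \, (1 + T) \, \mathbb{E} \int_0^t \|X_s - \tilde{X}_s\|^2 \, \mathrm{d}s .
\end{align*}
From Grönwall’s lemma (Lemma 1.1.20), applied to $g(t) = \mathbb{E}[\|X_t - \tilde{X}_t\|^2]$ with $C_1 = 0$, we obtain
$X_t = \tilde{X}_t$ almost surely. Actually, this proof has a gap: we have not fully justified why we can
apply the Itô isometry. To get around this, one may use the technique of localization, the
details of which are omitted.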

Existence. We use a method of establishing existence of solutions to ODEs, known as
Picard iteration. We start by defining the process $X^{(0)}$ to be the constant process with
value $x$, and for $n \in \mathbb{N}_+$ we let
\[
X_t^{(n)} := x + \int_0^t b(s, X_s^{(n-1)}) \, \mathrm{d}s + \int_0^t \sigma(s, X_s^{(n-1)}) \, \mathrm{d}B_s . \tag{1.1.22}
\]

In other words, we “freeze” the coefficients of the SDE using the process from the previous
stage of the iteration. The stochastic integrals make sense because inductively, each 𝑋 (𝑛)
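is adapted and has continuous sample paths. Thus, since
\[
X_t^{(n+1)} - X_t^{(n)} = \int_0^t \{b(s, X_s^{(n)}) - b(s, X_s^{(n-1)})\} \, \mathrm{d}s + \int_0^t \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \, \mathrm{d}B_s
\]
we bound
\begin{align*}
\mathbb{E} \sup_{[0,t]} \|X^{(n+1)} - X^{(n)}\|^2
&\le 2 \, \mathbb{E}\Bigl[ \sup_{u \in [0,t]} \Bigl\| \int_0^u \{b(s, X_s^{(n)}) - b(s, X_s^{(n-1)})\} \, \mathrm{d}s \Bigr\|^2 \\
&\qquad\quad + \sup_{u \in [0,t]} \Bigl\| \int_0^u \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \, \mathrm{d}B_s \Bigr\|^2 \Bigr] \\
&=: 2 \, (\mathrm{I} + \mathrm{II}) ,
\end{align*}
where I and II denote the expectations of the first and second suprema, respectively.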

The first term is handled via Cauchy–Schwarz:
\[
\mathrm{I} \le T \, \mathbb{E} \int_0^t \|b(s, X_s^{(n)}) - b(s, X_s^{(n-1)})\|^2 \, \mathrm{d}s
\le C^2 T \int_0^t \mathbb{E} \sup_{[0,s]} \|X^{(n)} - X^{(n-1)}\|^2 \, \mathrm{d}s .
\]

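For the second term, recall that $t \mapsto \int_0^t \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \, \mathrm{d}B_s$ is a martingale. By
Doob’s $L^2$ maximal inequality (Corollary 1.1.30 in Section 1.1.4) and the Itô isometry (1.1.8),
we can bound
\begin{align*}
\mathrm{II}
&\le 4 \, \mathbb{E}\Bigl[ \Bigl\| \int_0^t \{\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\} \, \mathrm{d}B_s \Bigr\|^2 \Bigr]
= 4 \, \mathbb{E} \int_0^t \|\sigma(s, X_s^{(n)}) - \sigma(s, X_s^{(n-1)})\|_{\mathrm{HS}}^2 \, \mathrm{d}s \\
&\le 4 C^2 \int_0^t \mathbb{E} \sup_{[0,s]} \|X^{(n)} - X^{(n-1)}\|^2 \, \mathrm{d}s .
\end{align*}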

Putting these two bounds together,
\[
\mathbb{E} \sup_{[0,t]} \|X^{(n+1)} - X^{(n)}\|^2 \le 2 C^2 \, (4 + T) \int_0^t \mathbb{E} \sup_{[0,s]} \|X^{(n)} - X^{(n-1)}\|^2 \, \mathrm{d}s .
\]
By induction (and using the fact that $\mathbb{E} \sup_{[0,T]} \|X^{(1)} - X^{(0)}\|^2 \le C_T$ for some constant
$C_T < \infty$, which follows from similar arguments), we get
\[
\mathbb{E} \sup_{[0,T]} \|X^{(n)} - X^{(n-1)}\|^2 \le C_T \, \frac{\{2 C^2 T \, (4 + T)\}^{n-1}}{(n-1)!} .
\]

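If we sum this, then we obtain
\[
\sum_{n=1}^{\infty} \mathbb{E} \sup_{[0,T]} \|X^{(n)} - X^{(n-1)}\|^2 < \infty
\quad \Longrightarrow \quad
\sum_{n=1}^{\infty} \sup_{[0,T]} \|X^{(n)} - X^{(n-1)}\|^2 < \infty \quad \text{almost surely} .
\]
Since the bounds decay factorially, their square roots are also summable, so Jensen’s inequality
shows that $\sum_{n=1}^{\infty} \sup_{[0,T]} \|X^{(n)} - X^{(n-1)}\| < \infty$ almost surely as well. By completeness of
$C([0,T])$, this implies that $X^{(n)}$ converges uniformly on $[0,T]$ to a
continuous process $X$ as $n \to \infty$.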
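Using again the Lipschitz assumption on the coefficients of the SDE, we have
\begin{align*}
\int_0^t \sigma(s, X_s^{(n)}) \, \mathrm{d}B_s - \int_0^t \sigma(s, X_s) \, \mathrm{d}B_s &\to 0 , \\
\int_0^t b(s, X_s^{(n)}) \, \mathrm{d}s - \int_0^t b(s, X_s) \, \mathrm{d}s &\to 0 ,
\end{align*}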

almost surely. Passing to the limit in (1.1.22) shows that 𝑋 solves the SDE on [0,𝑇 ].
There is one last argument to make, which is to find a solution for the SDE on the
entire time interval [0, ∞). For each 𝑡 > 0, let 𝑇 > 𝑡 and define 𝑋𝑡 to be the solution of the
SDE on [0,𝑇 ] at time 𝑡. The uniqueness assertion in the first half of the theorem shows
that this is well-defined (regardless of the choice of 𝑇 ), and it also shows that the resulting
process 𝑋 is adapted, has continuous sample paths, and solves the SDE on [0, ∞). □

Reflecting on the proof, the basic strategy is to couple together two diffusions (with
the same driving Brownian motion), use Lipschitz bounds on the coefficients, and apply
Grönwall’s lemma. The same strategy will also be used to obtain convergence bounds
for sampling algorithms, albeit with a more quantitative goal in mind.
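The coupling strategy can be seen in a few lines of simulation. The following sketch makes simplifying assumptions that are not from the text: a scalar SDE with the illustrative drift $b(x) = -\sin x$ (Lipschitz with constant $C = 1$), constant diffusion coefficient $\sqrt{2}$, two different starting points, and an Euler–Maruyama discretization. Because both copies share the same noise, the noise cancels in the difference, and the discrete analogue of Grönwall’s lemma gives the pathwise bound $|X_t - \tilde{X}_t| \le e^{Ct} \, |X_0 - \tilde{X}_0|$.

```python
import math
import random


def drift(x):
    """Illustrative drift; |sin x - sin y| <= |x - y|, so the Lipschitz constant is 1."""
    return -math.sin(x)


def coupled_euler_paths(x0, y0, t_final, n_steps, rng):
    """Run two Euler-Maruyama chains driven by the SAME Brownian increments."""
    h = t_final / n_steps
    x, y = x0, y0
    gaps = [abs(x - y)]
    for _ in range(n_steps):
        noise = math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)  # shared by both copies
        x, y = x + h * drift(x) + noise, y + h * drift(y) + noise
        gaps.append(abs(x - y))
    return gaps


t_final, n_steps = 1.0, 1000
gaps = coupled_euler_paths(0.0, 1.0, t_final, n_steps, random.Random(3))
h = t_final / n_steps
# Each step multiplies the gap by at most (1 + C h) <= exp(C h) with C = 1.
bound_holds = all(gap <= gaps[0] * math.exp(k * h) + 1e-12 for k, gap in enumerate(gaps))
print(gaps[0], gaps[-1], bound_holds)
```

With a state-dependent diffusion coefficient the noise no longer cancels, and the bound holds only in expectation via the Itô isometry, as in the uniqueness proof above.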
A theorem similar in spirit to Theorem 1.1.21 can be established under the assumption
that 𝑏 and 𝜎 are only locally Lipschitz, but in this case the solution to the SDE is not
guaranteed to last for all time. The issue is that when the coefficients grow faster than
linearly, there can be a finite (random) time 𝔢, called the explosion time, such that k𝑋𝑡 k → ∞
as 𝑡 → 𝔢. This phenomenon is already present for ODEs (see Exercise 1.4). For the purposes
of this book, the assumption of Lipschitz coefficients suffices.

1.1.4 Appendix: Construction of the Itô Integral


To appreciate the construction, it may be helpful to first discuss some technical subtleties.
First, suppose that $(\eta_t)_{t \ge 0}$ is deterministic (non-random). Since $(B_t)_{t \ge 0}$ has continuous
paths, one could try to define $\int_0^T \eta_t \, \mathrm{d}B_t$ as a Riemann–Stieltjes integral, but due to the
irregularity of Brownian motion this only works for integrands $(\eta_t)_{t \ge 0}$ which are “locally
of bounded variation”⁴; in particular not all continuous integrands are allowed. To get
around this, a standard idea is to approximate a continuous integrand $(\eta_t)_{t \ge 0}$ with a
sequence of integrands $\{(\eta_t^{(k)})_{t \ge 0} : k \in \mathbb{N}\}$ which have bounded variation. For each $k$, it
is no trouble to define $\int_0^T \eta_t^{(k)} \, \mathrm{d}B_t$, and then we can define $\int_0^T \eta_t \, \mathrm{d}B_t$ as a suitable limit.
This idea can be carried out, but the notion of limit that is used is $L^2$. In particular, the
limit may not exist in an almost sure sense.
Now suppose that $(\eta_t)_{t \ge 0}$ is a stochastic process. Then, it becomes technically challenging
to carry out the requisite approximations; another idea is required. The new insight
is that if we consider integrands which are adapted to a filtration, then the stochastic
integrals $t \mapsto \int_0^t \eta_s \, \mathrm{d}B_s$ are martingales, and we can leverage their powerful convergence
theory to streamline the construction. We now proceed to implement this plan.


Throughout, we work on a complete, filtered, and right-continuous probability space
(Ω, ℱ, (ℱ𝑡 )𝑡 ≥0, P). We also recall the various classes of processes that we introduced in
Section 1.1.1.

Definition 1.1.23. Let (𝜂𝑡 )𝑡 ≥0 be a stochastic process.

1. (𝜂𝑡 )𝑡 ≥0 is a progressive process if for all 𝑇 ≥ 0, the mapping (𝜔, 𝑡) ↦→ 𝜂𝑡 (𝜔)


is measurable w.r.t. ℱ𝑇 ⊗ ℬ[0,𝑇 ] , where ℬ[0,𝑇 ] is the Borel 𝜎-algebra on [0,𝑇 ].

2. $(\eta_t)_{t \ge 0}$ is an elementary process if it is of the form
\[
\eta_t = \sum_{i=0}^{k-1} H_i \, \mathbb{1}\{t \in (t_i, t_{i+1}]\} , \tag{1.1.24}
\]
where $0 \le t_0 < t_1 < \cdots < t_k$ and for each $i$, $H_i$ is bounded and $\mathcal{F}_{t_i}$-measurable.

3. $(\eta_t)_{t \in [0,T]}$ is a square integrable process if it is a progressive process and
moreover $\mathbb{E} \int_0^T \eta_t^2 \, \mathrm{d}t < \infty$.

The proof of the following technical result is omitted.


⁴ See Section 3.1 for further discussion.

Lemma 1.1.25 (approximation via elementary processes). For any square integrable
process $(\eta_t)_{t \in [0,T]}$, there is a sequence $\{(\eta_t^{(k)})_{t \ge 0} : k \in \mathbb{N}\}$ of elementary processes with
\[
\|\eta - \eta^{(k)}\|_{L^2(\mathbb{P}_T)}^2 = \mathbb{E} \int_0^T |\eta_t - \eta_t^{(k)}|^2 \, \mathrm{d}t \to 0 \qquad \text{as } k \to \infty .
\]

Definition 1.1.26. Let (𝜂𝑡 )𝑡 ≥0 be a progressive process.

1. If $(\eta_t)_{t \ge 0}$ is an elementary process of the form (1.1.24), we define the Itô integral
of $(\eta_t)_{t \ge 0}$ on $[0,T]$ via
\[
I_{[0,T]}(\eta) := \sum_{i=0}^{k-1} H_i \, (B_{t_{i+1} \wedge T} - B_{t_i \wedge T}) . \tag{1.1.27}
\]

2. If $(\eta_t)_{t \ge 0}$ is a square integrable process, then let $\{(\eta_t^{(k)})_{t \ge 0} : k \in \mathbb{N}\}$ be the
approximating sequence furnished by Lemma 1.1.25. In this case, we define the
Itô integral of $(\eta_t)_{t \ge 0}$ via
\[
I_{[0,T]}(\eta) := \lim_{k \to \infty} I_{[0,T]}(\eta^{(k)}) ,
\]
where the limit is taken in $L^2(\mathbb{P})$.

The second part of the definition requires some justification. We checked in (1.1.5)
and (1.1.6) that the Itô isometry holds for elementary processes: the mapping I[0,𝑇 ] is an
isometry from elementary processes equipped with the 𝐿 2 (P𝑇 ) norm, to the space 𝐿 2 (P)
of square integrable random variables. Through this isometry, we deduce from the fact
that {𝜂 (𝑘) : 𝑘 ∈ N} is Cauchy that {I[0,𝑇 ] (𝜂 (𝑘) ) : 𝑘 ∈ N} is also Cauchy. Since 𝐿 2 (P) is a
complete metric space, there exists a limit I[0,𝑇 ] (𝜂) of the latter sequence. We can then
deduce that the Itô isometry (1.1.8) holds for square integrable processes as well.
Upon trying to view 𝑡 ↦→ I[0,𝑡] (𝜂) as a stochastic process, we encounter the usual
measure-theoretic difficulty: for fixed 𝑡, I[0,𝑡] (𝜂) is well-defined outside of a measure
zero event, but we have to contend with uncountably many values of 𝑡 and the measure
zero events may accumulate. Overcoming this issue requires some principle that holds
uniformly over 𝑡 ∈ [0,𝑇 ]; in our case, this principle is Doob’s maximal inequality from
the theory of martingales.

To set the stage, we can verify from the explicit formula (1.1.27) that 𝑡 ↦→ I[0,𝑡] (𝜂)
is a continuous martingale when (𝜂𝑡 )𝑡 ≥0 is an elementary process. Before proceeding
onwards, we need to further develop the theory of martingales.

Definition 1.1.28. A process (𝑀𝑡 )𝑡 ≥0 is a submartingale w.r.t. the filtration (ℱ𝑡 )𝑡 ≥0


if for all 𝑡 ≥ 0, 𝑀𝑡 is ℱ𝑡 -measurable and integrable, and

E[𝑀𝑡 | ℱ𝑠 ] ≥ 𝑀𝑠 , for all 0 ≤ 𝑠 < 𝑡 .

The class of submartingales is far broader and more useful than the class of martingales.
For example, if (𝑀𝑡 )𝑡 ≥0 is a martingale and 𝜑 : R → R is any convex function with
E|𝜑 (𝑀𝑡 )| < ∞ for all 𝑡 ≥ 0, then Jensen’s inequality for conditional expectations implies
that (𝜑 (𝑀𝑡 ))𝑡 ≥0 is a submartingale.
One of the key facts about submartingales is that they easily converge, which is often
deduced from Doob’s maximal inequality.

Theorem 1.1.29 (Doob’s maximal inequality). Let $(M_t)_{t \ge 0}$ be a continuous and non-negative
submartingale. Then, for all $\lambda, T > 0$,
\[
\mathbb{P}\Bigl\{ \sup_{t \in [0,T]} M_t \ge \lambda \Bigr\} \le \frac{\mathbb{E} M_T}{\lambda} .
\]

Proof. We prove the theorem for discrete-time submartingales (𝑀𝑛 )𝑛∈N . The result for
continuous-time submartingales can then be obtained via approximation.
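Let $\tau := \min\{k \in \mathbb{N} : M_k \ge \lambda\}$. On the event $\{\tau \le N\}$, we have $M_\tau \ge \lambda$, so
\[
\lambda \, \mathbb{P}\Bigl\{ \max_{k=0,1,\dotsc,N} M_k \ge \lambda \Bigr\} = \lambda \, \mathbb{P}(\tau \le N) \le \mathbb{E}[M_\tau \, \mathbb{1}\{\tau \le N\}] = \sum_{k=0}^{N} \mathbb{E}[M_k \, \mathbb{1}\{\tau = k\}] .
\]
Next, since $\{\tau = k\}$ is $\mathcal{F}_k$-measurable, the submartingale property yields
\[
\mathbb{E}[M_k \, \mathbb{1}\{\tau = k\}] \le \mathbb{E}\bigl[ \mathbb{E}[M_N \mid \mathcal{F}_k] \, \mathbb{1}\{\tau = k\} \bigr] = \mathbb{E}[M_N \, \mathbb{1}\{\tau = k\}] .
\]
Hence,
\[
\lambda \, \mathbb{P}\Bigl\{ \max_{k=0,1,\dotsc,N} M_k \ge \lambda \Bigr\} \le \mathbb{E}\Bigl[ M_N \sum_{k=0}^{N} \mathbb{1}\{\tau = k\} \Bigr] = \mathbb{E}[M_N \, \mathbb{1}\{\tau \le N\}] \le \mathbb{E} M_N ,
\]
where we used the assumption that the submartingale is non-negative. □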
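The discrete-time inequality just proved can be checked empirically. The sketch below uses an illustrative example that is an assumption, not taken from the text: $M_k = |S_k|$, where $S$ is a simple symmetric random walk, which is a non-negative submartingale because it is a convex function of a martingale. The walk length, threshold, and sample size are hypothetical choices.

```python
import random


def max_and_final_abs_walk(n_steps, rng):
    """Return (max_{k <= N} |S_k|, |S_N|) for a simple symmetric random walk S."""
    position, running_max = 0, 0
    for _ in range(n_steps):
        position += rng.choice((-1, 1))
        running_max = max(running_max, abs(position))
    return running_max, abs(position)


rng = random.Random(4)
n_steps, n_paths, lam = 100, 5000, 15
results = [max_and_final_abs_walk(n_steps, rng) for _ in range(n_paths)]
# Left-hand side: P{max_k M_k >= lambda}; right-hand side: E[M_N] / lambda.
prob_max_exceeds = sum(m >= lam for m, _ in results) / n_paths
doob_bound = sum(f for _, f in results) / n_paths / lam
print(prob_max_exceeds, doob_bound)
```

The estimated probability should sit below the estimated bound; the gap reflects the slack in the final step of the proof, where the indicator of $\{\tau \le N\}$ is dropped.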
Doob’s maximal inequality yields the following corollary, which we leave as Exercise 1.1.

Corollary 1.1.30 (Doob’s $L^p$ maximal inequality). Let $(M_t)_{t \ge 0}$ be a continuous and
non-negative submartingale. Then, for all $p > 1$ and $T > 0$,
\[
\Bigl\| \sup_{t \in [0,T]} M_t \Bigr\|_{L^p(\mathbb{P})} \le \frac{p}{p-1} \, \|M_T\|_{L^p(\mathbb{P})} .
\]

We are now ready to finish the construction of the Itô integral.


Proof of Theorem 1.1.7. Let $(\eta_t)_{t \ge 0}$ be a square integrable process, with approximating
sequence $\{(\eta_t^{(k)})_{t \ge 0} : k \in \mathbb{N}\}$ obtained from Lemma 1.1.25. We define
\[
X_t^{(k)} := I_{[0,t]}(\eta^{(k)}) ,
\]
where $I_{[0,t]}(\eta^{(k)})$ is defined by the explicit formula (1.1.27). For any $k, \ell \in \mathbb{N}$, the process
$t \mapsto |X_t^{(k)} - X_t^{(\ell)}|^2$ is a continuous non-negative submartingale, so Doob’s maximal
inequality (Theorem 1.1.29) and the Itô isometry (1.1.8) yield
\begin{align*}
\mathbb{P}\Bigl\{ \sup_{t \in [0,T]} |X_t^{(k)} - X_t^{(\ell)}| \ge \lambda \Bigr\}
&= \mathbb{P}\Bigl\{ \sup_{t \in [0,T]} |X_t^{(k)} - X_t^{(\ell)}|^2 \ge \lambda^2 \Bigr\} \\
&\le \frac{\mathbb{E}[|X_T^{(k)} - X_T^{(\ell)}|^2]}{\lambda^2} = \frac{\|\eta^{(k)} - \eta^{(\ell)}\|_{L^2(\mathbb{P}_T)}^2}{\lambda^2} .
\end{align*}
As $(\eta^{(k)})_{k \in \mathbb{N}}$ is Cauchy, we can pick a sequence $n_0 < n_1 < n_2 < \cdots$ of integers such that
$k, \ell \ge n_j$ implies $\|\eta^{(k)} - \eta^{(\ell)}\|_{L^2(\mathbb{P}_T)}^2 \le 2^{-3j}$. Take $\lambda = 2^{-j}$ to obtain
\[
\mathbb{P}\Bigl\{ \sup_{t \in [0,T]} |X_t^{(n_j)} - X_t^{(n_{j+1})}| \ge 2^{-j} \Bigr\} \le 2^{-j} .
\]
These probabilities are summable, so the Borel–Cantelli lemma implies
\[
\text{almost surely,} \qquad \sup_{t \in [0,T]} |X_t^{(n_j)} - X_t^{(n_{j+1})}| \le 2^{-j} \quad \text{for all but finitely many } j .
\]
In particular, the paths $\{(X_t^{(n_j)})_{t \in [0,T]} : j \in \mathbb{N}\}$ form a Cauchy sequence in $C([0,T])$ (equipped
with the supremum norm). As $C([0,T])$ is complete, there is a limit $(X_t)_{t \in [0,T]}$ which
belongs to $C([0,T])$. A similar argument, this time using Doob’s $L^2$ maximal inequality
(Corollary 1.1.30), shows that $X_t^{(n_j)} \to X_t$ in $L^2(\mathbb{P})$ as well. Since each $t \mapsto X_t^{(n_j)}$ is a
continuous martingale, so is $t \mapsto X_t$.
Finally, for any fixed $t \in [0,T]$, on one hand we have $I_{[0,t]}(\eta^{(n_j)}) = X_t^{(n_j)} \to X_t$ in
$L^2(\mathbb{P})$ as noted above. On the other hand, the Itô isometry implies $I_{[0,t]}(\eta^{(n_j)}) \to I_{[0,t]}(\eta)$
in $L^2(\mathbb{P})$. Hence, almost surely, $X_t = I_{[0,t]}(\eta)$. □

1.2 Markov Semigroup Theory


More thorough treatments of Markov semigroup theory can be found in [BGL14; Han16].
We will revisit and expand upon many of the topics introduced here in Chapter 2.

1.2.1 Basic Definitions and Kolmogorov’s Equations


The core idea of Markov semigroup theory is to encode the behavior of a Markov process
(𝑋𝑡 )𝑡 ≥0 via operators which act on functions. We will then develop calculus rules for
working with these operators, and we can study the operators via functional analysis. This
is analogous to how the linear algebraic study of the transition matrix of a discrete-time
Markov chain reveals properties (e.g., ergodicity, convergence) of the chain.

Definition 1.2.1. For a time-homogeneous Markov process (𝑋𝑡 )𝑡 ≥0 , its associated


Markov semigroup (𝑃𝑡 )𝑡 ≥0 is the family of operators acting on functions via

𝑃𝑡 𝑓 (𝑥) B E[𝑓 (𝑋𝑡 ) | 𝑋 0 = 𝑥] .

The Markov property and iterated conditioning yield the following lemma (exercise).

Lemma 1.2.2. The Markov semigroup (𝑃𝑡 )𝑡 ≥0 satisfies 𝑃0 = id and 𝑃𝑠 𝑃𝑡 = 𝑃𝑡 𝑃𝑠 = 𝑃𝑠+𝑡


for all 𝑠, 𝑡 ≥ 0.

In order to do calculus, we want to differentiate the semigroup 𝑡 ↦→ 𝑃𝑡 , which is


accomplished via the following definition.

Definition 1.2.3. The infinitesimal generator $\mathcal{L}$ associated with a Markov semigroup
$(P_t)_{t \ge 0}$ is the operator defined by
\[
\mathcal{L} f := \lim_{t \searrow 0} \frac{P_t f - f}{t} ,
\]
for all functions $f$ for which the above limit exists.

Here we pause to warn the reader of some technical issues. The mathematical diffi-
culties of Markov semigroup theory arise in trying to answer the following questions:
on what space of functions is the generator defined, and in what sense is the above limit
taken? As we shall see, a natural space of functions to consider is 𝐿 2 (𝜋), with 𝜋 denoting

the stationary distribution of the diffusion. However, the generator is usually a differential
operator, and not all functions in 𝐿 2 (𝜋) have enough regularity to lie in the domain of the
generator. The theory of unbounded linear operators on a Hilbert space was developed to
handle this situation, but it is rife with subtle distinctions such as the difference between
symmetric and self-adjoint operators. We will brush over these issues and focus on the
calculation rules.

Example 1.2.4. Let us compute the generator of the Langevin diffusion (1.E.1). In fact,
the following computation is simply a consequence of Itô’s formula (Theorem 1.1.18),
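but it does not hurt to derive this result from scratch. We approximate
\[
Z_t = Z_0 - \int_0^t \nabla V(Z_s) \, \mathrm{d}s + \sqrt{2} \, B_t = Z_0 - t \, \nabla V(Z_0) + \sqrt{2} \, B_t + o(t) .
\]
Assuming that $f \in C^2(\mathbb{R}^d)$ with bounded derivatives, we perform a Taylor expansion
of $f$ to second order:
\[
\mathbb{E} f(Z_t) = \mathbb{E}\bigl[ f(Z_0) + \langle \nabla f(Z_0), -t \, \nabla V(Z_0) + \sqrt{2} \, B_t \rangle + \langle \nabla^2 f(Z_0) \, B_t, B_t \rangle \bigr] + o(t) .
\]
Since $B_t$ is mean zero and independent of $Z_0$, with $\mathbb{E}[B_t B_t^{\mathsf{T}}] = t I_d$,
\begin{align*}
\mathbb{E}[f(Z_t) \mid Z_0 = z]
&= f(z) - t \, \langle \nabla f(z), \nabla V(z) \rangle + t \operatorname{tr} \nabla^2 f(z) + o(t) \\
&= f(z) - t \, \langle \nabla f(z), \nabla V(z) \rangle + t \, \Delta f(z) + o(t) .
\end{align*}
Hence,
\[
\mathcal{L} f(z) = \lim_{t \searrow 0} \frac{\mathbb{E}[f(Z_t) \mid Z_0 = z] - f(z)}{t} = \Delta f(z) - \langle \nabla V(z), \nabla f(z) \rangle .
\]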
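The limit defining the generator can be evaluated explicitly in a case where the semigroup is known in closed form. The sketch below assumes the one-dimensional quadratic potential $V(x) = x^2/2$ (the Ornstein–Uhlenbeck process), for which $Z_t$ given $Z_0 = x$ is $\mathrm{normal}(x e^{-t}, 1 - e^{-2t})$, and the test function $f(x) = x^2$; these choices and the function names are illustrative and not from the text. It compares the difference quotient $(P_t f - f)/t$ with the formula $\mathcal{L} f = f'' - V' f'$.

```python
import math


def ou_semigroup_on_square(x, t):
    """P_t f(x) for f(x) = x^2 under dZ = -Z dt + sqrt(2) dB.

    Uses Z_t | Z_0 = x ~ normal(x e^{-t}, 1 - e^{-2t}), so that
    E[Z_t^2] = (mean)^2 + variance.
    """
    return (x * math.exp(-t)) ** 2 + (1.0 - math.exp(-2.0 * t))


def generator_on_square(x):
    """L f(x) = f''(x) - V'(x) f'(x) with V(x) = x^2 / 2 and f(x) = x^2."""
    return 2.0 - x * (2.0 * x)


x = 0.7
for t in (1e-1, 1e-2, 1e-3, 1e-4):
    quotient = (ou_semigroup_on_square(x, t) - x ** 2) / t
    print(t, quotient, generator_on_square(x))

small_t_quotient = (ou_semigroup_on_square(x, 1e-4) - x ** 2) / 1e-4
```

The difference quotient approaches $2 - 2x^2$ at rate $O(t)$, matching the $o(t)$ bookkeeping in the example.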

The Markov semigroup and dynamics. As promised, the Markov semigroup captures
the information that was contained in the original Markov process. One way to demon-
strate this is to prove theorems which show that the Markov process can be completely
recovered from its Markov semigroup. Another approach, which we now take up, is to
show that the dynamics of the Markov process are captured via calculation rules involving
the Markov semigroup.

Proposition 1.2.5 (Kolmogorov’s backward equation). For all 𝑡 ≥ 0, it holds that

𝜕𝑡 𝑃𝑡 𝑓 = ℒ𝑃𝑡 𝑓 = 𝑃𝑡 ℒ𝑓 .

In particular, ℒ commutes with the semigroup (𝑃𝑡 )𝑡 ≥0 .


Proof. Observe that
\[
\lim_{h \searrow 0} \frac{P_{t+h} f - P_t f}{h} = \lim_{h \searrow 0} \frac{P_h - \mathrm{id}}{h} \, P_t f = \mathcal{L} P_t f .
\]
Repeating the computation but factoring out $P_t$ on the left yields the second equality. □
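There is a dual to this equation: let $\pi_0$ denote the density of $X_0$. Formally, we can write
$\mathbb{E} f(X_t) = \int P_t f \, \mathrm{d}\pi_0 = \int f \, \mathrm{d}P_t^* \pi_0$, where $P_t^*$ is the adjoint of $P_t$. This says that the law $\pi_t$
of $X_t$ is formally given by $P_t^* \pi_0$. Moreover, by Kolmogorov’s backward equation,
\[
\partial_t \int f \, \mathrm{d}P_t^* \pi_0 = \partial_t \int P_t f \, \mathrm{d}\pi_0 = \int P_t \mathcal{L} f \, \mathrm{d}\pi_0 = \int \mathcal{L} f \, \mathrm{d}P_t^* \pi_0 = \int f \, \mathrm{d}\mathcal{L}^* P_t^* \pi_0 .
\]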

Since this has to hold for all functions 𝑓 , we conclude the following.

Proposition 1.2.6 (Kolmogorov’s forward equation). For all 𝑡 ≥ 0,

𝜕𝑡 𝑃𝑡∗ 𝜋0 = ℒ ∗𝑃𝑡∗ 𝜋0 = 𝑃𝑡∗ ℒ ∗ 𝜋 0 .

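Here is another illuminating way to express these equations. Let $u_t := P_t f$, and let
$\pi_t = P_t^* \pi_0$. Then:
\begin{align*}
\partial_t u_t &= \mathcal{L} u_t , && \text{(Kolmogorov’s backward equation)} \\
\partial_t \pi_t &= \mathcal{L}^* \pi_t . && \text{(Kolmogorov’s forward equation)}
\end{align*}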
The terms “backward” and “forward” are rather confusing, so we will not use them. Instead,
we will refer to the evolution equation for the density (Kolmogorov’s forward equation)
as the Fokker–Planck equation.
Consequently, we obtain characterizations of stationarity. Recall that 𝜋 is stationary
for the Markov process if, when 𝑋 0 ∼ 𝜋, then 𝑋𝑡 ∼ 𝜋 for all 𝑡 ≥ 0.

Proposition 1.2.7 (stationarity). The following are equivalent.

1. 𝜋 is a stationary distribution for the Markov process.

2. ℒ ∗ 𝜋 = 0.

3. E𝜋 ℒ𝑓 = 0 for all functions 𝑓 .

Proof. The equivalence between the first two statements is the Fokker–Planck equation.
The third statement is the dual of the second statement. 

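Example 1.2.8. Consider again the Langevin diffusion. For functions $f, g : \mathbb{R}^d \to \mathbb{R}$,
\[
\int \mathcal{L} f \, g = \int \{\Delta f - \langle \nabla V, \nabla f \rangle\} \, g = \int f \, \{\Delta g + \operatorname{div}(g \, \nabla V)\}
\]
where the second equality is integration by parts. This shows that
\[
\mathcal{L}^* g = \Delta g + \operatorname{div}(g \, \nabla V) .
\]
From here, we can solve for the stationary distribution. Write
\[
0 = \mathcal{L}^* \pi = \Delta \pi + \operatorname{div}(\pi \, \nabla V) = \operatorname{div}\bigl( \pi \, (\nabla \ln \pi + \nabla V) \bigr) .
\]
This can be solved by setting $\ln \pi = -V + \text{constant}$, i.e. $\pi \propto \exp(-V)$.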

Corollary 1.2.9. The stationary distribution of the Langevin diffusion (1.E.1) with
potential 𝑉 is 𝜋 ∝ exp(−𝑉 ).
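The third characterization in Proposition 1.2.7, $\mathbb{E}_\pi \mathcal{L} f = 0$, gives a direct numerical test of this corollary. The sketch below assumes the one-dimensional potential $V(x) = x^4/4$ and two illustrative test functions; the potential, the integration window $[-8, 8]$, and all function names are hypothetical choices. It integrates $\mathcal{L} f = f'' - V' f'$ against the density proportional to $\exp(-V)$ using the trapezoidal rule.

```python
import math


def potential(x):
    return x ** 4 / 4.0


def potential_prime(x):
    return x ** 3


def integrate(func, lo=-8.0, hi=8.0, n=4000):
    """Composite trapezoidal rule; exp(-V) is negligible outside [-8, 8]."""
    h = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    total += sum(func(lo + k * h) for k in range(1, n))
    return total * h


def expected_generator(f_prime, f_second):
    """E_pi[L f] with L f = f'' - V' f' and pi proportional to exp(-V)."""
    def weight(x):
        return math.exp(-potential(x))

    numerator = integrate(lambda x: (f_second(x) - potential_prime(x) * f_prime(x)) * weight(x))
    return numerator / integrate(weight)


# f(x) = x^2 and f(x) = cos(x), specified through their first two derivatives.
value_square = expected_generator(lambda x: 2.0 * x, lambda x: 2.0)
value_cosine = expected_generator(lambda x: -math.sin(x), lambda x: -math.cos(x))
print(value_square, value_cosine)
```

Both values should vanish up to quadrature error, consistent with $\pi \propto \exp(-V)$ being stationary.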

1.2.2 Reversibility and the Spectrum


Consider a Markov semigroup (𝑃𝑡 )𝑡 ≥0 with generator ℒ and stationary distribution 𝜋.
Then, the natural space of functions to study is the Hilbert space 𝐿 2 (𝜋). The analysis of
the Markov process is particularly simple if the following condition holds.

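Definition 1.2.10. The Markov semigroup $(P_t)_{t \ge 0}$ is reversible w.r.t. $\pi$ if for all
$f, g \in L^2(\pi)$ and all $t \ge 0$,
\[
\int P_t f \, g \, \mathrm{d}\pi = \int f \, P_t g \, \mathrm{d}\pi .
\]
Equivalently, for all $f$ and $g$ for which $\mathcal{L} f$ and $\mathcal{L} g$ are defined,
\[
\int \mathcal{L} f \, g \, \mathrm{d}\pi = \int f \, \mathcal{L} g \, \mathrm{d}\pi .
\]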

If 𝑋0 ∼ 𝜋 and we take 𝑓 = 1𝐴 and 𝑔 = 1𝐵 for events 𝐴 and 𝐵, then reversibility implies


P{𝑋𝑡 ∈ 𝐴, 𝑋 0 ∈ 𝐵} = P{𝑋 0 ∈ 𝐴, 𝑋𝑡 ∈ 𝐵} ,
i.e., (𝑋 0, 𝑋𝑡 ) has the same distribution as (𝑋𝑡 , 𝑋 0 ). This is the sense in which the associated
Markov process is time-reversible.
The definition says that $P_t$ and $\mathcal{L}$ are symmetric operators on $L^2(\pi)$, and thus we
expect that $P_t$ and $\mathcal{L}$ have real spectra. Also, since $\partial_t P_t = \mathcal{L} P_t$, we can formally write $P_t =
\exp(t \mathcal{L})$, and so we expect $P_t$ to be a positive operator, meaning that $\int f \, P_t f \, \mathrm{d}\pi \ge 0$ for
all $f \in L^2(\pi)$, which can be checked from reversibility (write $\int f \, P_t f \, \mathrm{d}\pi = \int (P_{t/2} f)^2 \, \mathrm{d}\pi$).
Moreover, from the definition of $P_t$ and Jensen’s inequality,
\[
\{P_t f(x)\}^2 = \mathbb{E}[f(X_t) \mid X_0 = x]^2 \le \mathbb{E}[f(X_t)^2 \mid X_0 = x] = P_t(f^2)(x) \tag{1.2.11}
\]
and integrating this yields $\int (P_t f)^2 \, \mathrm{d}\pi \le \int P_t(f^2) \, \mathrm{d}\pi = \int f^2 \, \mathrm{d}\pi$, where the equality
follows from stationarity of $\pi$. This shows that $P_t$ is a contraction on $L^2(\pi)$ (in fact, on
any $L^p(\pi)$, $p \in [1, \infty]$). Combining this with $P_t = \exp(t \mathcal{L})$ leads us to predict that $\mathcal{L}$ is a
negative operator. Below, we will give a direct proof of this fact; unsurprisingly, the proof
of negativity of $\mathcal{L}$ still relies on the crucial fact (1.2.11).

Definition 1.2.12. The carré du champ is the bilinear operator Γ defined via
$$\Gamma(f, g) := \frac{1}{2} \{\mathcal{L}(fg) - f \, \mathcal{L} g - g \, \mathcal{L} f\} .$$
The Dirichlet energy is the functional ℰ(𝑓, 𝑔) ≔ ∫ Γ(𝑓, 𝑔) d𝜋.

Lemma 1.2.13. For any function 𝑓 , Γ(𝑓 , 𝑓 ) ≥ 0.

Proof. Recall from (1.2.11) that 𝑃𝑡(𝑓²) ≥ (𝑃𝑡 𝑓)² for all 𝑡 > 0. In terms of ℒ,
$$f^2 + t \, \mathcal{L}(f^2) + o(t) \ge [f + t \, \mathcal{L} f + o(t)]^2 = f^2 + 2t \, f \mathcal{L} f + o(t) ,$$
and cancelling 𝑓², dividing by 𝑡, and sending 𝑡 ↘ 0 yields ℒ(𝑓²) ≥ 2𝑓 ℒ𝑓, which is the statement Γ(𝑓, 𝑓) ≥ 0. (This proof does not require reversibility.)

Theorem 1.2.14 (fundamental integration by parts identity). Suppose that the generator
ℒ and carré du champ Γ are associated with a Markov semigroup which is reversible
w.r.t. 𝜋. Then, for any functions 𝑓 and 𝑔,
$$\int f \, (-\mathcal{L}) g \, \mathrm{d}\pi = \int (-\mathcal{L}) f \, g \, \mathrm{d}\pi = \int \Gamma(f, g) \, \mathrm{d}\pi = \mathcal{E}(f, g) .$$

Since the identity implies that ℒ is symmetric, the integration by parts identity is in
fact equivalent to reversibility.

Corollary 1.2.15. For a reversible Markov semigroup, −ℒ ≥ 0.

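Proof of Theorem 1.2.14. Since ∫ ℒℎ d𝜋 = 0 for all functions ℎ (due to stationarity of 𝜋),
applying this with ℎ = 𝑓𝑔 in the definition of Γ yields
$$\int \Gamma(f, g) \, \mathrm{d}\pi = \frac{1}{2} \int f \, (-\mathcal{L}) g \, \mathrm{d}\pi + \frac{1}{2} \int g \, (-\mathcal{L}) f \, \mathrm{d}\pi .$$
The two terms are equal due to reversibility.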
It is usually convenient for our operators to be positive, so from now on we will instead
refer to the negative generator −ℒ.
When we introduced Kolmogorov’s equations, we ended up with two PDEs, one
involving the generator ℒ and one involving its 𝐿 2 (𝔪) adjoint ℒ ∗ , where 𝔪 is the
Lebesgue measure on R𝑑 . The issue is that we used the “wrong” inner product; ℒ is not
symmetric in 𝐿 2 (𝔪). If we now switch to 𝐿 2 (𝜋), then instead of considering the density
𝜋𝑡 with respect to 𝔪 we should consider the density 𝜌𝑡 ≔ 𝜋𝑡/𝜋 with respect to 𝜋. Then,
the Fokker–Planck equation becomes
𝜕𝑡 𝜌𝑡 = ℒ𝜌𝑡 . (1.2.16)
For the rest of the section, the Markov semigroup is assumed reversible.

Example 1.2.17. Returning to the fundamental example of the Langevin diffusion, a
computation shows that
$$\Gamma(f, f) = \frac{1}{2} \{\Delta(f^2) - \langle \nabla V, \nabla(f^2) \rangle - 2f \, (\Delta f - \langle \nabla V, \nabla f \rangle)\} = \|\nabla f\|^2 .$$
Incidentally, carré du champ means “square of the field” in French, and it is this
expression which gives it its name. More generally, Γ(𝑓, 𝑔) = ⟨∇𝑓, ∇𝑔⟩.
The identity in Theorem 1.2.14 reads
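$$\int f \, (-\Delta g + \langle \nabla V, \nabla g \rangle) \, \mathrm{d}\pi = \int g \, (-\Delta f + \langle \nabla V, \nabla f \rangle) \, \mathrm{d}\pi = \int \langle \nabla f, \nabla g \rangle \, \mathrm{d}\pi ,$$
which can be checked (using 𝜋 ∝ exp(−𝑉)) via integration by parts (naturally!),
showing that the Langevin diffusion is indeed reversible w.r.t. 𝜋.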
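
The identity can also be checked numerically in a simple case. The sketch below assumes dimension one and the potential 𝑉(𝑥) = 𝑥²/2, so that 𝜋 is the standard Gaussian and ℒ𝑓 = 𝑓″ − 𝑥𝑓′; the test polynomials 𝑓 and 𝑔 and the quadrature helper `gaussian_expectation` are hypothetical choices for illustration.

```python
import math

def gaussian_expectation(h, lo=-12.0, hi=12.0, n=24000):
    """Trapezoidal approximation of E[h(X)] for X ~ N(0, 1)."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * h(x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2.0 * math.pi)

# Test functions with hand-computed derivatives (assumed for this sketch).
f = lambda x: x**3
df = lambda x: 3.0 * x**2
d2f = lambda x: 6.0 * x
g = lambda x: x**2 + x
dg = lambda x: 2.0 * x + 1.0
d2g = lambda x: 2.0

# Generator of the Langevin diffusion with V(x) = x^2 / 2: L h = h'' - x h'.
Lf = lambda x: d2f(x) - x * df(x)
Lg = lambda x: d2g(x) - x * dg(x)

lhs = gaussian_expectation(lambda x: f(x) * (-Lg(x)))   # int f (-L) g dpi
mid = gaussian_expectation(lambda x: g(x) * (-Lf(x)))   # int g (-L) f dpi
rhs = gaussian_expectation(lambda x: df(x) * dg(x))     # int <f', g'> dpi
```

For these particular polynomials all three quantities equal 3, which can be confirmed by hand from the Gaussian moments E[𝑋²] = 1 and E[𝑋⁴] = 3.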

Gradient flow of the Dirichlet energy. It turns out that a reversible Markov process
follows the steepest descent of the Dirichlet energy with respect to 𝐿 2 (𝜋). To justify this,
for a curve 𝑡 ↦ 𝑢𝑡 in 𝐿²(𝜋), write 𝑢̇𝑡 ≔ 𝜕𝑡 𝑢𝑡 for the time derivative. The 𝐿²(𝜋) gradient of
the functional 𝑓 ↦ ℰ(𝑓) ≔ ℰ(𝑓, 𝑓) at 𝑓 is defined to be the element ∇_{𝐿²(𝜋)} ℰ(𝑓) ∈ 𝐿²(𝜋)
such that for all curves 𝑡 ↦ 𝑢𝑡 with 𝑢₀ = 𝑓, it holds that

$$\partial_t \big|_{t=0} \, \mathcal{E}(u_t, u_t) = \int \dot{u}_0 \, \nabla_{L^2(\pi)} \mathcal{E}(f) \, \mathrm{d}\pi .$$

From the integration by parts identity,

$$\partial_t \big|_{t=0} \, \mathcal{E}(u_t, u_t) = \partial_t \big|_{t=0} \int u_t \, (-\mathcal{L}) u_t \, \mathrm{d}\pi = 2 \int \dot{u}_0 \, (-\mathcal{L}) f \, \mathrm{d}\pi .$$

Therefore, ∇_{𝐿²(𝜋)} ℰ(𝑓) = −2ℒ𝑓.

The steepest descent of ℰ is the curve 𝑡 ↦ 𝑢𝑡 such that 𝑢̇𝑡 = −∇_{𝐿²(𝜋)} ℰ(𝑢𝑡) = 2ℒ𝑢𝑡.
This is, up to a rescaling of time, precisely the equation satisfied by the Markov process.

Spectral gap and convergence. Consider a reversible Markov semigroup (𝑃𝑡 )𝑡 ≥0 and
recall that 𝑃𝑡 𝑓 (𝑥) = E[𝑓 (𝑋𝑡 ) | 𝑋 0 = 𝑥]. We are interested in the long-term behavior of
𝑃𝑡 𝑓. If the process mixes, then by definition it forgets its initial condition, so that 𝑃𝑡 𝑓
converges to a constant; moreover, this constant should be the average value ∫ 𝑓 d𝜋 at
stationarity. How do we establish a rate of convergence for 𝑃𝑡 𝑓 → ∫ 𝑓 d𝜋?

We may assume that ∫ 𝑓 d𝜋 = 0, so we wish to prove 𝑃𝑡 𝑓 → 0. Also recall that,
formally, 𝑃𝑡 = exp(𝑡ℒ) with ℒ ≤ 0. If we have a spectral gap

−ℒ ≥ 𝜆min > 0 ,

then we would expect that

$$\|P_t f\|_{L^2(\pi)}^2 \le \exp(-2 \lambda_{\min} t) \, \|f\|_{L^2(\pi)}^2 .$$



This is indeed the case. However, observe that since 𝑃𝑡 1 = 1 for all 𝑡 ≥ 0 and hence
ℒ1 = 0, the spectral gap condition is only supposed to hold on the subspace of 𝐿 2 (𝜋)
which is orthogonal to constants.

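Definition 1.2.18. The Markov process is said to satisfy a Poincaré inequality (PI)
with constant 𝐶_PI if for all functions 𝑓 ∈ 𝐿²(𝜋),

$$\int f \, (-\mathcal{L}) f \, \mathrm{d}\pi = \mathcal{E}(f, f) \ge \frac{1}{C_{\mathrm{PI}}} \, \|\operatorname{proj}_{\mathbf{1}^\perp} f\|_{L^2(\pi)}^2 = \frac{1}{C_{\mathrm{PI}}} \operatorname{var}_\pi f ,$$

where proj_{𝟏^⊥} denotes the orthogonal projection in 𝐿²(𝜋) onto the subspace orthogonal to constants.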

The Poincaré constant 𝐶 PI corresponds to the inverse of the spectral gap. Based on
the calculus we have developed so far, it is not too difficult to prove the following result
(differentiate 𝑡 ↦ ‖𝑃𝑡 𝑓‖²_{𝐿²(𝜋)}), so we leave it as Exercise 1.7.

Theorem 1.2.19. The following are equivalent.

1. The Markov process satisfies a Poincaré inequality with constant 𝐶 PI .

2. For all 𝑓 ∈ 𝐿²(𝜋) with ∫ 𝑓 d𝜋 = 0 and all 𝑡 ≥ 0,

$$\|P_t f\|_{L^2(\pi)}^2 \le \exp\Bigl(-\frac{2t}{C_{\mathrm{PI}}}\Bigr) \, \|f\|_{L^2(\pi)}^2 .$$

In particular, we can apply this result to the semigroup corresponding to the Langevin
diffusion (1.E.1) to obtain a spectral gap criterion for quantitative convergence. However,
this result is mainly of use when we are interested in a specific test function 𝑓 . More
generally, it is useful to obtain bounds on the rate of convergence of the law 𝜋𝑡 of 𝑋𝑡 to
the stationary distribution 𝜋. Recall (from (1.2.16)) that the relative density 𝜌𝑡 ≔ 𝜋𝑡/𝜋
solves the equation 𝜕𝑡 𝜌𝑡 = ℒ𝜌𝑡, i.e., 𝜌𝑡 is given by 𝜌𝑡 = 𝑃𝑡 𝜌₀. We can therefore apply the
preceding result to 𝑓 ≔ 𝜌₀ − 1.
For a probability measure 𝜇, we define the chi-squared divergence
$$\chi^2(\mu \,\|\, \pi) := \Bigl\| \frac{\mathrm{d}\mu}{\mathrm{d}\pi} - 1 \Bigr\|_{L^2(\pi)}^2 = \operatorname{var}_\pi \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \quad \text{if } \mu \ll \pi ,$$
with 𝜒²(𝜇 ‖ 𝜋) ≔ ∞ otherwise. The result can be formulated as follows.

Theorem 1.2.20. The following are equivalent.



1. The Markov process satisfies a Poincaré inequality with constant 𝐶 PI .

2. For any initial distribution 𝜋₀ and all 𝑡 ≥ 0,

$$\chi^2(\pi_t \,\|\, \pi) \le \exp\Bigl(-\frac{2t}{C_{\mathrm{PI}}}\Bigr) \, \chi^2(\pi_0 \,\|\, \pi) .$$

Example 1.2.21. For the Langevin diffusion, the Poincaré inequality reads

$$\operatorname{var}_\pi f \le C_{\mathrm{PI}} \, \mathbb{E}_\pi[\|\nabla f\|^2]$$

for all functions 𝑓 : R𝑑 → R.
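
To make the decay estimate concrete, here is a small numerical sketch for the one-dimensional Langevin diffusion with 𝑉(𝑥) = 𝑥²/2 (the Ornstein–Uhlenbeck process), whose stationary law is the standard Gaussian. It relies on two standard facts that are not derived in this section: the explicit representation 𝑃𝑡 𝑓(𝑥) = E 𝑓(e^{−𝑡}𝑥 + √(1 − e^{−2𝑡}) 𝑍) with 𝑍 standard Gaussian, and the Gaussian Poincaré inequality with 𝐶_PI = 1. The test function and helper names are hypothetical.

```python
import math

def gauss_mean(h, lo=-8.0, hi=8.0, n=400):
    """Trapezoidal approximation of E[h(Z)] for Z ~ N(0, 1)."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * dx
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * h(z) * math.exp(-0.5 * z * z)
    return total * dx / math.sqrt(2.0 * math.pi)

def f(x):
    # Hypothetical test function with mean zero under N(0, 1).
    return x + x * x - 1.0

def df(x):
    return 1.0 + 2.0 * x

def semigroup(t, func):
    """Ornstein-Uhlenbeck semigroup P_t applied to func (explicit Gaussian formula)."""
    a = math.exp(-t)
    b = math.sqrt(1.0 - a * a)
    return lambda x: gauss_mean(lambda z: func(a * x + b * z))

def sq_norm(func):
    """Squared L^2(pi) norm with pi = N(0, 1)."""
    return gauss_mean(lambda x: func(x) ** 2)

norm0 = sq_norm(f)                              # equals var_pi f since f has mean zero
dirichlet = gauss_mean(lambda x: df(x) ** 2)    # E_pi[|f'|^2]
decay = {t: sq_norm(semigroup(t, f)) for t in (0.25, 0.5, 1.0)}
```

Here var𝜋 𝑓 = 3 and E𝜋[‖∇𝑓‖²] = 5, consistent with the Poincaré inequality, and each value `decay[t]` equals e^{−2𝑡} + 2e^{−4𝑡}, which is at most 3e^{−2𝑡} as the theorem predicts.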

1.2.3 The Log-Sobolev Inequality and Bakry–Émery Theory


For sampling applications, the convergence result under a Poincaré inequality is not
fully satisfactory because the chi-squared divergence at initialization is typically large,
scaling exponentially in the dimension. The approach we explore next is to use the
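Kullback–Leibler (KL) divergence KL(· ‖ 𝜋) as our objective functional, defined via

$$\mathrm{KL}(\mu \,\|\, \pi) := \int \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \ln \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \, \mathrm{d}\pi = \int \ln \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \, \mathrm{d}\mu \quad \text{if } \mu \ll \pi ,$$
and KL(𝜇 ‖ 𝜋) ≔ ∞ otherwise.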
Recall the notation 𝜌𝑡 ≔ 𝜋𝑡/𝜋 for the relative density of the Markov process w.r.t. 𝜋.
Since 𝜕𝑡 𝜌𝑡 = ℒ𝜌𝑡, we can calculate via the integration by parts identity that
$$\partial_t \mathrm{KL}(\pi_t \,\|\, \pi) = \partial_t \int \rho_t \ln \rho_t \, \mathrm{d}\pi = \int (\ln \rho_t + 1) \, \mathcal{L} \rho_t \, \mathrm{d}\pi = -\mathcal{E}(\rho_t, \ln \rho_t) . \tag{1.2.22}$$

Hence, if ℰ(𝜌𝑡, ln 𝜌𝑡) ≳ KL(𝜋𝑡 ‖ 𝜋), then we obtain exponential ergodicity of the diffusion
in KL divergence.

Definition 1.2.23. The Markov process is said to satisfy a log-Sobolev inequality
(LSI) with constant 𝐶_LSI if for all densities 𝜌 w.r.t. 𝜋,
$$\mathrm{KL}(\rho \pi \,\|\, \pi) \le \frac{C_{\mathrm{LSI}}}{2} \, \mathcal{E}(\rho, \ln \rho) .$$

Theorem 1.2.24. The following are equivalent.

1. The Markov process satisfies a log-Sobolev inequality with constant 𝐶 LSI .

2. For any initial distribution 𝜋₀ and all 𝑡 ≥ 0,

$$\mathrm{KL}(\pi_t \,\|\, \pi) \le \exp\Bigl(-\frac{2t}{C_{\mathrm{LSI}}}\Bigr) \, \mathrm{KL}(\pi_0 \,\|\, \pi) .$$

By linearizing the LSI, i.e., by taking 𝜌 = 1 + 𝜀 𝑓 for small 𝜀 > 0 and expanding both
sides of the LSI in powers of 𝜀, one can prove that the LSI implies a Poincaré inequality
with constant 𝐶 PI ≤ 𝐶 LSI (Exercise 1.8).

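Example 1.2.25. For the Langevin diffusion, the LSI reads

$$\frac{2}{C_{\mathrm{LSI}}} \, \mathrm{KL}(\mu \,\|\, \pi) \le \mathbb{E}_\pi \Bigl\langle \nabla \frac{\mu}{\pi}, \nabla \ln \frac{\mu}{\pi} \Bigr\rangle = \mathbb{E}_\mu \Bigl[ \Bigl\| \nabla \ln \frac{\mu}{\pi} \Bigr\|^2 \Bigr] = 4 \, \mathbb{E}_\pi \Bigl[ \Bigl\| \nabla \sqrt{\frac{\mu}{\pi}} \Bigr\|^2 \Bigr] .$$
The right-hand side of the above expression is important; it is known as the (relative)
Fisher information FI(𝜇 ‖ 𝜋) ≔ E𝜇[‖∇ ln(𝜇/𝜋)‖²]. In particular, the Fisher information
plays a central role in the study of non-log-concave sampling in Chapter 11.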
The LSI often appears in many equivalent forms. For example, another formulation
is that for all functions 𝑓 : R𝑑 → R, it holds that

$$\operatorname{ent}_\pi(f^2) \le 2 C_{\mathrm{LSI}} \, \mathbb{E}_\pi[\|\nabla f\|^2] ,$$

where for a function 𝑔 : R𝑑 → R+ we define ent𝜋(𝑔) ≔ E𝜋(𝑔 ln 𝑔) − E𝜋 𝑔 ln E𝜋 𝑔. To
verify the equivalence, consider 𝑓 = √(𝜇/𝜋).
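
As an illustration of the quantities in this example, the sketch below computes KL(𝜇 ‖ 𝜋) and FI(𝜇 ‖ 𝜋) by quadrature when 𝜋 is the standard Gaussian on R and 𝜇 is a Gaussian with mean 𝑚 and standard deviation 𝑠. The choice of Gaussians, the grid parameters, and the helper names are assumptions made for this sketch. It uses the standard closed forms KL = ½(𝑠² + 𝑚² − 1) − ln 𝑠 and FI = 𝑚² + (𝑠 − 1/𝑠)², and the fact that the standard Gaussian satisfies the LSI with 𝐶_LSI = 1.

```python
import math

def log_normal_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def kl_and_fisher(mean, std, lo=-15.0, hi=15.0, n=30000, h=1e-4):
    """Quadrature for KL(mu || pi) and FI(mu || pi), mu = N(mean, std^2), pi = N(0, 1)."""
    log_ratio = lambda x: log_normal_pdf(x, mean, std) - log_normal_pdf(x, 0.0, 1.0)
    dx = (hi - lo) / n
    kl = 0.0
    fisher = 0.0
    for i in range(n + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, n) else 1.0
        mu_density = math.exp(log_normal_pdf(x, mean, std))
        # Central difference for the derivative of ln(mu / pi).
        score = (log_ratio(x + h) - log_ratio(x - h)) / (2.0 * h)
        kl += weight * mu_density * log_ratio(x)
        fisher += weight * mu_density * score ** 2
    return kl * dx, fisher * dx

cases = [(1.0, 1.0), (0.5, 2.0)]
results = {case: kl_and_fisher(*case) for case in cases}
```

For a pure shift (𝑠 = 1) the inequality KL ≤ ½ FI holds with equality, while for the second case it is strict.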

Bakry–Émery condition. Although we have derived two criteria for convergence of


the Markov process, namely, the Poincaré inequality and the log-Sobolev inequality, we
have not yet addressed when these criteria hold. Introduce the following definition.

Definition 1.2.26. The iterated carré du champ is the operator Γ₂ defined via
$$\Gamma_2(f, g) := \frac{1}{2} \{\mathcal{L} \Gamma(f, g) - \Gamma(f, \mathcal{L} g) - \Gamma(g, \mathcal{L} f)\} .$$

Recalling that Γ(𝑓, 𝑔) = ½ {ℒ(𝑓𝑔) − 𝑓 ℒ𝑔 − 𝑔 ℒ𝑓}, we see that Γ₂ is defined analogously
to Γ, except we replace the bilinear operation of multiplication, (𝑓, 𝑔) ↦ 𝑓𝑔, by the carré
du champ (𝑓, 𝑔) ↦ Γ(𝑓, 𝑔). Also, similarly to how the carré du champ appears when
computing the time derivative of functionals such as the chi-squared divergence and the
KL divergence, the iterated carré du champ appears when computing the second time
derivative. After some calculations, one arrives at the following criterion.

Definition 1.2.27. The Markov semigroup is said to satisfy the Bakry–Émery


criterion with constant 𝛼 > 0 if for all functions 𝑓 ,

Γ2 (𝑓 , 𝑓 ) ≥ 𝛼 Γ(𝑓 , 𝑓 ) .

This condition is also known as the curvature-dimension condition CD(𝛼, ∞).

We will prove the following theorem in Chapter 2.

Theorem 1.2.28 (Bakry–Émery). Consider a diffusion Markov semigroup. Assume


that the curvature-dimension condition CD(𝛼, ∞) holds. Then, a log-Sobolev inequality
holds with constant 𝐶 LSI ≤ 1/𝛼.

We have not explained yet what a diffusion Markov semigroup is, but for now we can
think of the Langevin diffusion as a fundamental example. The key point is that once
the (iterated) carré du champ operators are known, the curvature-dimension condition
amounts to an algebraic condition which can be easily checked, which in turn implies
the log-Sobolev inequality (and hence the Poincaré inequality by Exercise 1.8). For the
Langevin diffusion, this condition amounts to the following theorem.

Theorem 1.2.29. For the Langevin diffusion (1.E.1), the curvature-dimension condition
CD(𝛼, ∞) holds if and only if the potential 𝑉 is 𝛼-strongly convex.
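
The algebraic nature of the criterion can be illustrated with exact polynomial arithmetic. The sketch below works in dimension one with the hypothetical potential 𝑉(𝑥) = 𝑥⁴/4 + 𝑥²/2 (so 𝑉″ ≥ 1) and a hypothetical cubic test function, and it checks the one-dimensional formulas Γ(𝑓, 𝑓) = (𝑓′)² and Γ₂(𝑓, 𝑓) = (𝑓″)² + 𝑉″ (𝑓′)², which follow from the definitions by direct differentiation. Polynomials are stored as coefficient lists, lowest degree first; all helper names are assumptions of this sketch.

```python
from fractions import Fraction

def trim(p):
    """Drop trailing zero coefficients so that equal polynomials compare equal."""
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p

def add(p, q):
    n = max(len(p), len(q))
    return trim([(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0) for i in range(n)])

def scale(c, p):
    return trim([c * a for a in p])

def mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return trim(out)

def deriv(p):
    return trim([i * p[i] for i in range(1, len(p))] or [Fraction(0)])

DV = [Fraction(c) for c in (0, 1, 0, 1)]   # V'(x) = x + x^3
D2V = deriv(DV)                            # V''(x) = 1 + 3 x^2

def gen(h):
    """Langevin generator in one dimension: L h = h'' - V' h'."""
    return add(deriv(deriv(h)), scale(-1, mul(DV, deriv(h))))

def gamma(f, g):
    """Carre du champ, computed directly from its definition."""
    s = add(gen(mul(f, g)), scale(-1, add(mul(f, gen(g)), mul(g, gen(f)))))
    return scale(Fraction(1, 2), s)

def gamma2(f, g):
    """Iterated carre du champ, computed directly from its definition."""
    s = add(gen(gamma(f, g)), scale(-1, add(gamma(f, gen(g)), gamma(g, gen(f)))))
    return scale(Fraction(1, 2), s)

f = [Fraction(c) for c in (0, 1, -2, 1)]   # f(x) = x - 2 x^2 + x^3
df, d2f = deriv(f), deriv(deriv(f))
gamma_ff = gamma(f, f)
gamma2_ff = gamma2(f, f)
expected_gamma2 = add(mul(d2f, d2f), mul(D2V, mul(df, df)))
```

Since 𝑉″ ≥ 1 here, the difference Γ₂(𝑓, 𝑓) − Γ(𝑓, 𝑓) = (𝑓″)² + 3𝑥² (𝑓′)² is pointwise non-negative, in line with CD(1, ∞).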

Although we have deferred the Markov semigroup proofs of these results to Chapter 2,
we will shortly prove these results using the calculus of optimal transport.

Another point to address is the origin of the name “curvature-dimension condition”. In


fact this is part of a rich story in which Markov diffusions on Riemannian manifolds capture
the geometric features of the ambient space, such as its curvature. A picture emerges
in which curvature, concentration, and mixing of the diffusion all intertwine, and only
in this context is it appreciated that the curvature-dimension condition is appropriately
named. This is discussed more fully in Chapter 2.

1.3 The Geometry of Optimal Transport


In this section, we explain how the space of probability measures equipped with the
2-Wasserstein distance from optimal transport can be formally viewed as a Rieman-
nian manifold. The textbook [Vil03] is a standard reference for optimal transport; see
also [AGS08; Vil09; San15] for more detailed treatments of Wasserstein calculus. We
remind readers that the “proofs” in this section are only sketched for intuition.

1.3.1 Introduction and Duality Theory


The optimal transport problem can be defined in great generality. Throughout this section,
P (X) denotes the space of probability measures on a space X.

Definition 1.3.1. Let X and Y be complete separable metric spaces, and consider a
cost functional 𝑐 : X × Y → [0, ∞]. The optimal transport cost from 𝜇 ∈ P (X) to
𝜈 ∈ P (Y) with cost 𝑐 is

$$\mathcal{T}_c(\mu, \nu) := \inf_{\gamma \in \mathcal{C}(\mu, \nu)} \int c(x, y) \, \mathrm{d}\gamma(x, y) , \tag{1.3.2}$$

where C(𝜇, 𝜈) is the space of couplings of (𝜇, 𝜈), i.e. the space of probability measures
𝛾 ∈ P (X × Y) whose marginals are 𝜇 and 𝜈 respectively.
A minimizer in this problem is known as an optimal transport plan.

An equivalent probabilistic formulation is that T𝑐 (𝜇, 𝜈) is the infimum of E 𝑐 (𝑋, 𝑌 )


over all pairs of jointly defined random variables (𝑋, 𝑌 ) such that 𝑋 ∼ 𝜇 and 𝑌 ∼ 𝜈.

Theorem 1.3.3. If the cost 𝑐 is lower semicontinuous, then an optimal transport plan
always exists.

Proof. One can show that the functional 𝛾 ↦ ∫ 𝑐 d𝛾 is lower semicontinuous and that

C(𝜇, 𝜈) is compact, where we use the weak topology on P (X × Y). It is a general fact that
lower semicontinuous functions attain their minima on compact sets. 

Historically, optimal transport began with Monge who considered the Euclidean cost
𝑐(𝑥, 𝑦) ≔ ‖𝑥 − 𝑦‖ on R𝑑 × R𝑑. Moreover, he considered a slightly different problem
in which, rather than searching over all couplings in C(𝜇, 𝜈), he restricted attention to
couplings which are induced by a mapping 𝑇 : R𝑑 → R𝑑 satisfying 𝑇# 𝜇 = 𝜈; this is known
as the Monge problem. In the probabilistic interpretation, this corresponds to a pair of
random variables (𝑋,𝑇 (𝑋 )) with 𝑋 ∼ 𝜇 and 𝑇 (𝑋 ) ∼ 𝜈. The physical interpretation of this
additional constraint is that no mass from 𝜇 be split up before it is transported, which
may be reasonable from a modelling perspective but leads to an ill-posed mathematical
problem. Indeed, there may not even exist any such mappings 𝑇 , as is the case when
𝜇 = 𝛿𝑥 places all of its mass on a single point and 𝜈 does not. Consequently, the solution
to the Monge problem remained unknown for centuries.
The breakthrough arrived when Kantorovich formulated the relaxation of the Monge
problem introduced in Definition 1.3.1, which is therefore known as the Kantorovich
problem. As the product measure 𝜇 ⊗ 𝜈 always belongs to C(𝜇, 𝜈), we at least know that
the constraint set is non-empty, and Theorem 1.3.3 shows that the Kantorovich problem
is well-behaved. Moreover, the Kantorovich problem is actually a convex problem on
P (X × Y); indeed, the objective is linear and the constraint set C(𝜇, 𝜈) is convex. Hence,
one can bring to bear the power of convex duality to study the Kantorovich problem
(historically, this study was actually the origin of linear programming).
Although a large part of optimal transport theory can be developed in a general
framework as above, for the rest of the section we will focus on the case 𝑐(𝑥, 𝑦) ≔ ‖𝑥 − 𝑦‖²
on R𝑑 × R𝑑 for the sake of simplicity.

Definition 1.3.4. The 2-Wasserstein distance between 𝜇 and 𝜈, denoted 𝑊2 (𝜇, 𝜈),
is defined via

$$W_2^2(\mu, \nu) := \inf_{\gamma \in \mathcal{C}(\mu, \nu)} \int \|x - y\|^2 \, \mathrm{d}\gamma(x, y) . \tag{1.3.5}$$

Write P2 (R𝑑 ) ≔ {𝜇 ∈ P (R𝑑 ) | ∫ ‖·‖² d𝜇 < ∞} for the space of probability measures
on R𝑑 with finite second moment.
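
For intuition, the definition can be evaluated by brute force when 𝜇 and 𝜈 are uniform measures on the same number of points of R. The sketch below relies on two standard facts not proven here: for such measures the infimum over couplings is attained at a permutation, and in one dimension the sorted matching is optimal for the quadratic cost. The sample points and function names are hypothetical.

```python
import itertools

def w2_squared_bruteforce(xs, ys):
    """Minimize the average squared displacement over all matchings (permutations)."""
    n = len(xs)
    return min(
        sum((xs[i] - ys[sigma[i]]) ** 2 for i in range(n)) / n
        for sigma in itertools.permutations(range(n))
    )

def w2_squared_sorted(xs, ys):
    """Cost of the monotone (sorted) matching, which is optimal in one dimension."""
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def product_coupling_cost(xs, ys):
    """Cost of the independent coupling mu (x) nu, which is always feasible."""
    return sum((a - b) ** 2 for a in xs for b in ys) / (len(xs) * len(ys))

xs = [0.3, -1.2, 2.5, 0.9, -0.4]
ys = [1.1, 0.0, -2.0, 3.4, 0.6]

w2_brute = w2_squared_bruteforce(xs, ys)
w2_sorted = w2_squared_sorted(xs, ys)
w2_product = product_coupling_cost(xs, ys)
```

The brute-force search is only feasible for a handful of points, since it enumerates all 𝑛! matchings; it is meant purely as a check of the definition.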

Duality and optimality. For this section, it will actually be convenient to consider the
cost 𝑐(𝑥, 𝑦) ≔ ½ ‖𝑥 − 𝑦‖² instead, i.e. we consider ½ 𝑊₂²(𝜇, 𝜈) instead of 𝑊₂²(𝜇, 𝜈).

The key to solving the Kantorovich problem is duality. First observe that the constraint
that the first marginal of 𝛾 is 𝜇 can be written as follows: for every function 𝑓 ∈ 𝐿¹(𝜇),
it holds that ∫ 𝑓(𝑥) d𝛾(𝑥, 𝑦) = ∫ 𝑓(𝑥) d𝜇(𝑥). Doing the same for the constraint on the
second marginal of 𝛾, we can write the Kantorovich problem as an unconstrained min-max
problem
$$\frac{1}{2} W_2^2(\mu, \nu) = \inf_{\gamma \in \mathcal{M}_+(\mathbb{R}^d \times \mathbb{R}^d)} \; \sup_{\substack{f \in L^1(\mu) \\ g \in L^1(\nu)}} \Bigl\{ \int \frac{\|x - y\|^2}{2} \, \mathrm{d}\gamma(x, y) + \int f \, \mathrm{d}\mu - \int f(x) \, \mathrm{d}\gamma(x, y) + \int g \, \mathrm{d}\nu - \int g(y) \, \mathrm{d}\gamma(x, y) \Bigr\} .$$

Here, M+ (R𝑑 × R𝑑 ) denotes the space of non-negative finite measures on R𝑑 × R𝑑 . Next,


if we switch the order of the infimum and the supremum, we arrive at the dual optimal
transport problem:
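$$\sup_{\substack{f \in L^1(\mu) \\ g \in L^1(\nu)}} \; \inf_{\gamma \in \mathcal{M}_+(\mathbb{R}^d \times \mathbb{R}^d)} \Bigl\{ \int f \, \mathrm{d}\mu + \int g \, \mathrm{d}\nu + \int \Bigl( \frac{\|x - y\|^2}{2} - f(x) - g(y) \Bigr) \, \mathrm{d}\gamma(x, y) \Bigr\} = \sup_{(f, g) \in \mathcal{D}(\mu, \nu)} \Bigl\{ \int f \, \mathrm{d}\mu + \int g \, \mathrm{d}\nu \Bigr\}$$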

where D(𝜇, 𝜈) is the set of dual feasible potentials


$$\mathcal{D}(\mu, \nu) := \Bigl\{ (f, g) \in L^1(\mu) \times L^1(\nu) \Bigm| f(x) + g(y) \le \frac{\|x - y\|^2}{2} \text{ for all } x, y \in \mathbb{R}^d \Bigr\} .$$

Definition 1.3.6. Let 𝜇, 𝜈 ∈ P2 (R𝑑 ). The dual optimal transport problem from 𝜇
to 𝜈 is the optimization problem
$$\sup_{(f, g) \in \mathcal{D}(\mu, \nu)} \Bigl\{ \int f \, \mathrm{d}\mu + \int g \, \mathrm{d}\nu \Bigr\} . \tag{1.3.7}$$

Since inf sup ≥ sup inf, the value of the dual problem is always at most ½ 𝑊₂²(𝜇, 𝜈).
On the other hand, if we find a transport plan 𝛾★ and feasible dual potentials 𝑓★, 𝑔★ such
that ∫ ½ ‖𝑥 − 𝑦‖² d𝛾★(𝑥, 𝑦) = ∫ 𝑓★ d𝜇 + ∫ 𝑔★ d𝜈, it implies that the primal and dual values
coincide and that 𝛾★, 𝑓★, and 𝑔★ are all optimal.


By carefully studying the dual problem, we will obtain a wealth of information about
the optimal transport problem. Our main goal now is to sketch the following theorem.

Theorem 1.3.8 (fundamental theorem of optimal transport). Let 𝜇, 𝜈 ∈ P2 (R𝑑 ). Then,
the following assertions hold.

1. (strong duality) The value of the dual optimal transport problem from 𝜇 to 𝜈 equals
½ 𝑊₂²(𝜇, 𝜈).

2. (existence of optimal dual potentials) There exists an optimal pair (𝑓★, 𝑔★) for the
dual optimal transport problem.

3. (characterization of optimality) The optimal dual potentials are of the form

$$f^\star = \frac{\|\cdot\|^2}{2} - \varphi , \qquad g^\star = \frac{\|\cdot\|^2}{2} - \varphi^* , \tag{1.3.9}$$
where 𝜑 : R𝑑 → R ∪ {∞} is a proper, convex, lower semicontinuous function and
𝜑* is its convex conjugate. If 𝛾★ denotes the optimal transport plan, then for 𝛾★-a.e.
(𝑥, 𝑦) ∈ R𝑑 × R𝑑, it holds that 𝜑(𝑥) + 𝜑*(𝑦) = ⟨𝑥, 𝑦⟩, i.e., 𝛾★ is supported on the
subdifferential of 𝜑.

4. (Brenier’s theorem) Suppose in addition that 𝜇 is absolutely continuous w.r.t. the
Lebesgue measure on R𝑑. Then, the optimal transport plan is unique, and moreover
it is induced by an optimal transport map 𝑇. The mapping 𝑇 is characterized as
the (𝜇-almost surely) unique gradient of a proper convex lower semicontinuous
function 𝜑 which pushes forward 𝜇 to 𝜈: 𝑇 = ∇𝜑 and (∇𝜑)_# 𝜇 = 𝜈.

Various parts of this theorem can be proven separately; for example, strong duality
can be established by rigorously justifying the interchange of infimum and supremum via
a high-powered minimax theorem. Instead, we will outline a proof of the theorem which
simultaneously establishes all of the above facts.
Outline. In the proof, we abbreviate “proper convex lower semicontinuous function” to
simply “closed convex function”.
1. Optimal transport plans are cyclically monotone. Let 𝛾 ★ be an optimal transport
plan, and suppose that the pairs (𝑥 1, 𝑦1 ), . . . , (𝑥𝑛 , 𝑦𝑛 ) lie in the support of 𝛾 ★. Then,
it should be the case that we cannot “rematch” these points to lower the optimal
transport cost, i.e. for every permutation 𝜎 of [𝑛] we should have
$$\sum_{i=1}^n \|x_i - y_i\|^2 \le \sum_{i=1}^n \|x_i - y_{\sigma(i)}\|^2 .$$

Equivalently,
$$\sum_{i=1}^n \langle x_i, y_i \rangle \ge \sum_{i=1}^n \langle x_i, y_{\sigma(i)} \rangle . \tag{1.3.10}$$

Indeed, if this condition fails, then it is possible to construct a new transport plan
from 𝛾 ★ by slightly rearranging the mass which has strictly smaller transport cost,
which is a contradiction; see [GM96, Theorem 2.3].
A subset 𝑆 ⊆ R𝑑 × R𝑑 is said to be cyclically monotone if for all 𝑛 ∈ N+, all
pairs (𝑥₁, 𝑦₁), …, (𝑥ₙ, 𝑦ₙ) ∈ 𝑆, and all permutations 𝜎 of [𝑛], the condition (1.3.10) holds.
Thus, optimal transport plans are supported on cyclically monotone sets.

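2. Characterization of cyclically monotone sets. Remarkably, a complete characterization
of cyclically monotone sets is known. Suppose 𝜑 is convex and differentiable,
let 𝑥₁, …, 𝑥ₙ ∈ R𝑑, and let 𝜎 be a permutation of [𝑛]. Then, from convexity,

$$\varphi(x_{\sigma^{-1}(i)}) - \varphi(x_i) \ge \langle \nabla \varphi(x_i), x_{\sigma^{-1}(i)} - x_i \rangle . \tag{1.3.11}$$

Summing this over 𝑖 ∈ [𝑛], the left-hand side sums to zero because 𝜎⁻¹ is a bijection of [𝑛], and we obtain

$$\sum_{i=1}^n \langle \nabla \varphi(x_i), x_i \rangle \ge \sum_{i=1}^n \langle \nabla \varphi(x_i), x_{\sigma^{-1}(i)} \rangle = \sum_{i=1}^n \langle \nabla \varphi(x_{\sigma(i)}), x_i \rangle .$$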

More generally, if 𝜑 is not differentiable, then the subdifferential of 𝜑 at 𝑥ᵢ is defined
to be the set of vectors 𝑦ᵢ ∈ R𝑑 such that (1.3.11) holds with 𝑦ᵢ replacing ∇𝜑(𝑥ᵢ). This
reasoning shows that the set 𝜕𝜑 ≔ {(𝑥, 𝑦) ∈ R𝑑 × R𝑑 | 𝑦 ∈ 𝜕𝜑(𝑥)} is a cyclically
monotone subset of R𝑑 × R𝑑.
The converse is also true: if 𝑆 ⊆ R𝑑 × R𝑑 is cyclically monotone, then it is contained
in the subdifferential of a closed convex function 𝜑. To prove this, one can pick any
(𝑥 0, 𝑦0 ) ∈ 𝑆 and consider
$$\varphi(x) := \sup\Bigl\{ \sum_{i=0}^n \langle y_i, x_{i+1} - x_i \rangle \Bigm| n \in \mathbb{N}^+, \; (x_1, y_1), \dots, (x_n, y_n) \in S, \; x_{n+1} = x \Bigr\} .$$

This characterization is due to Rockafellar.

3. Characterization of dual optimality. Now that we see the connection between con-
vexity and the primal problem, it is time to do the same for the dual problem.

Suppose (𝑓, 𝑔) is a feasible dual pair; if we hold 𝑓 fixed, can we improve 𝑔? The
constraint on 𝑔 says that for all 𝑥, 𝑦 ∈ R𝑑,

$$g(y) \le \frac{\|x - y\|^2}{2} - f(x) .$$
Hence, writing 𝜑 ≔ ‖·‖²/2 − 𝑓, the optimal choice is

$$g(y) = \inf_{x \in \mathbb{R}^d} \Bigl\{ \frac{\|x - y\|^2}{2} - f(x) \Bigr\} = \frac{\|y\|^2}{2} - \sup_{x \in \mathbb{R}^d} \{\langle x, y \rangle - \varphi(x)\} .$$

The function 𝜑* defined by 𝜑*(𝑦) ≔ sup_{𝑥 ∈ R𝑑} {⟨𝑥, 𝑦⟩ − 𝜑(𝑥)} is known as the convex
conjugate of 𝜑. To summarize, we have shown that for fixed 𝑓 = ‖·‖²/2 − 𝜑, the
optimal choice for 𝑔 is ‖·‖²/2 − 𝜑*. Similarly, if we fix 𝑔 = ‖·‖²/2 − 𝜑*, the optimal
choice for 𝑓 is ‖·‖²/2 − 𝜑**.
We have not yet established existence, but suppose for the moment that an optimal
dual pair (𝑓★, 𝑔★) exists. The preceding reasoning shows that 𝑓★ = ‖·‖²/2 − 𝜑 and
𝑔★ = ‖·‖²/2 − 𝜑*, where 𝜑** = 𝜑; otherwise the dual pair could be improved. Next,
it is known from convex analysis that 𝜑 = 𝜑** if and only if 𝜑 is a closed convex
function. Thus, optimal dual potentials have the representation (1.3.9).

4. Proof of strong duality. Now consider the optimal transport plan 𝛾 ★ (which exists;
see Theorem 1.3.3). We know that 𝛾 ★ is supported on a cyclically monotone set,
which in turn is contained in the subdifferential of a closed convex function 𝜑.
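Define the functions 𝑓★ ≔ ‖·‖²/2 − 𝜑 and 𝑔★ ≔ ‖·‖²/2 − 𝜑*; these are dual feasible
potentials. Also, it is a standard fact of convex analysis that (𝑥, 𝑦) ∈ 𝜕𝜑 if and only
if 𝜑(𝑥) + 𝜑*(𝑦) = ⟨𝑥, 𝑦⟩. Since the support of 𝛾★ is contained in 𝜕𝜑,

$$\begin{aligned}
\int \frac{1}{2} \|x - y\|^2 \, \mathrm{d}\gamma^\star(x, y)
&= \int \Bigl( \frac{\|x\|^2}{2} + \frac{\|y\|^2}{2} - \langle x, y \rangle \Bigr) \, \mathrm{d}\gamma^\star(x, y) \\
&= \int \Bigl( \frac{\|x\|^2}{2} + \frac{\|y\|^2}{2} - \varphi(x) - \varphi^*(y) \Bigr) \, \mathrm{d}\gamma^\star(x, y) \\
&= \int \Bigl( \frac{\|\cdot\|^2}{2} - \varphi \Bigr) \, \mathrm{d}\mu + \int \Bigl( \frac{\|\cdot\|^2}{2} - \varphi^* \Bigr) \, \mathrm{d}\nu \\
&= \int f^\star \, \mathrm{d}\mu + \int g^\star \, \mathrm{d}\nu .
\end{aligned}$$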

This simultaneously proves that strong duality holds and that (𝑓 ★, 𝑔★) is a pair of
optimal dual potentials.

5. For regular measures, optimal transport plans are induced by transport maps. Another
fact from convex analysis is that convex functions enjoy some regularity:
a closed convex function 𝜑 is differentiable at Lebesgue-a.e. point of the interior of
its domain. Consequently, if 𝜇 is absolutely continuous w.r.t. Lebesgue measure,
then 𝜑 is differentiable 𝜇-a.e. This says that for 𝜇-a.e. 𝑥 ∈ R𝑑, the gradient ∇𝜑(𝑥)
exists and 𝜕𝜑(𝑥) = {∇𝜑(𝑥)}. Therefore, we can write 𝛾★ = (id, ∇𝜑)_# 𝜇. In particular,
(∇𝜑)_# 𝜇 = 𝜈, and ∇𝜑 is the optimal transport map from 𝜇 to 𝜈.
In our entire discussion so far, we started off with an arbitrary optimal transport
plan 𝛾★. Hence, we have shown that every optimal transport plan is of the form
(id, ∇𝜑)_# 𝜇 for some closed convex function 𝜑.
6. Uniqueness of the optimal transport map. So far, we have not discussed uniqueness
of the solution to the Kantorovich problem, and in general uniqueness does not
hold. However, in the setting we are currently dealing with (the cost is the squared
Euclidean distance and 𝜇 is absolutely continuous), we can use additional arguments
to establish uniqueness. We will show that if 𝛾̄★ = (id, ∇𝜑̄)_# 𝜇 is another optimal
transport plan where 𝜑̄ is a closed convex function, then ∇𝜑 = ∇𝜑̄ (𝜇-a.e.). Note that
in particular, this implies that there is only one gradient of a closed convex function
which pushes forward 𝜇 to 𝜈.
From our above arguments, we see that (‖·‖²/2 − 𝜑̄, ‖·‖²/2 − 𝜑̄*) is a dual optimal
pair. Therefore,
$$\begin{aligned}
\int \{\bar\varphi(x) + \bar\varphi^*(y)\} \, \mathrm{d}\gamma^\star(x, y)
&= \int \bar\varphi \, \mathrm{d}\mu + \int \bar\varphi^* \, \mathrm{d}\nu = \int \varphi \, \mathrm{d}\mu + \int \varphi^* \, \mathrm{d}\nu \\
&= \int \{\varphi(x) + \varphi^*(y)\} \, \mathrm{d}\gamma^\star(x, y) = \int \langle x, y \rangle \, \mathrm{d}\gamma^\star(x, y) .
\end{aligned}$$

Using 𝛾★ = (id, ∇𝜑)_# 𝜇, this yields

$$\int \{\bar\varphi(x) + \bar\varphi^*(\nabla \varphi(x)) - \langle x, \nabla \varphi(x) \rangle\} \, \mathrm{d}\mu(x) = 0 .$$

On the other hand, by the definition of 𝜑̄*, we have 𝜑̄(𝑥) + 𝜑̄*(𝑦) ≥ ⟨𝑥, 𝑦⟩ for all
𝑥, 𝑦 ∈ R𝑑, with equality if and only if 𝑦 ∈ 𝜕𝜑̄(𝑥). So, the integrand of the above
expression is always non-negative but the integral is zero, which combined with
the previous fact shows that ∇𝜑(𝑥) ∈ 𝜕𝜑̄(𝑥) for 𝜇-a.e. 𝑥. But for 𝜇-a.e. 𝑥, we also
know that 𝜕𝜑̄(𝑥) = {∇𝜑̄(𝑥)}, and we conclude that ∇𝜑 = ∇𝜑̄ (𝜇-a.e.).
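
The convex conjugate and the Fenchel–Young (in)equality used in steps 3, 4, and 6 of the outline can be explored numerically. The following sketch discretizes the supremum over a grid in dimension one for the hypothetical choice 𝜑(𝑥) = 𝑎𝑥²/2, for which 𝜑*(𝑦) = 𝑦²/(2𝑎) and ∇𝜑(𝑥) = 𝑎𝑥; the grid and the function names are assumptions of this sketch.

```python
def conjugate(func, grid):
    """Discrete Legendre transform: y -> max over grid points x of x * y - func(x)."""
    return lambda y: max(x * y - func(x) for x in grid)

a = 2.0
grid = [i / 20.0 for i in range(-200, 201)]      # grid on [-10, 10] with spacing 0.05

phi = lambda x: 0.5 * a * x * x
phi_star = conjugate(phi, grid)

# Tabulate phi* once on the grid so that the biconjugate is cheap to evaluate.
star_table = {y: phi_star(y) for y in grid}
phi_star_star = conjugate(lambda y: star_table[y], grid)

# Fenchel-Young gap phi(x) + phi*(y) - x * y: zero exactly when y = phi'(x) = a * x.
gap_on_graph = phi(0.8) + phi_star(a * 0.8) - 0.8 * (a * 0.8)
gap_off_graph = phi(0.8) + phi_star(0.2) - 0.8 * 0.2
```

On the graph of ∇𝜑 the gap vanishes, off the graph it is strictly positive, and the biconjugate recovers 𝜑 at the tested grid points, as expected for a closed convex function.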
We refer to 𝜑 as a Brenier potential. From convex duality, ∇𝜑* = (∇𝜑)⁻¹. So, if 𝜈 is
also absolutely continuous, then the optimal transport map from 𝜈 to 𝜇 is ∇𝜑 ∗ . We often
write 𝑇𝜇→𝜈 = ∇𝜑 for the optimal transport map from 𝜇 to 𝜈.
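
A concrete instance, sketched under assumptions that go beyond this section: in dimension one, the optimal transport map for the quadratic cost is the monotone rearrangement 𝑇 = 𝐹𝜈⁻¹ ∘ 𝐹𝜇 (a nondecreasing map, hence the derivative of a convex function), and for two Gaussians it reduces to an affine map with 𝑊₂²(𝜇, 𝜈) = (𝑚₁ − 𝑚₂)² + (𝑠₁ − 𝑠₂)². The parameters below are hypothetical; `NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

mu = NormalDist(mu=1.0, sigma=0.5)
nu = NormalDist(mu=-2.0, sigma=1.5)

def transport_map(x):
    """Monotone rearrangement F_nu^{-1}(F_mu(x))."""
    return nu.inv_cdf(mu.cdf(x))

def affine_map(x):
    """Closed form for Gaussians: m2 + (s2 / s1) * (x - m1)."""
    return nu.mean + (nu.stdev / mu.stdev) * (x - mu.mean)

def w2_squared(n=6000, width=6.0):
    """Trapezoidal approximation of the transport cost int |T(x) - x|^2 dmu(x)."""
    lo = mu.mean - width * mu.stdev
    hi = mu.mean + width * mu.stdev
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * (transport_map(x) - x) ** 2 * mu.pdf(x)
    return total * dx

points = [0.0, 0.5, 1.0, 1.8, 2.4]
max_gap = max(abs(transport_map(x) - affine_map(x)) for x in points)
images = [transport_map(x) for x in points]
w2sq = w2_squared()
```

With these parameters the closed form gives 𝑊₂² = 3² + 1² = 10, which the quadrature reproduces up to truncation error.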

In light of this discussion, it is natural to focus on the following class of measures.

Definition 1.3.12. The space P2,ac (R𝑑 ) is the set of measures in P2 (R𝑑 ) which are
absolutely continuous w.r.t. Lebesgue measure.

Remarks on other costs. Many of the arguments can be generalized to other costs
𝑐. For example, the supports of optimal transport plans can be characterized via 𝑐-
cyclical monotonicity (generalizing cyclical monotonicity) and optimal dual potentials
can be characterized via 𝑐-concavity (generalizing convexity). Arguing that the optimal
transport plan is induced by a transport map requires additional information about the
differentiability of 𝑐.

1.3.2 Riemannian Structure of the Wasserstein Space


Wasserstein space as a metric space. The following lemma is used to establish that
the triangle inequality holds for 𝑊2 .

Lemma 1.3.13 (gluing lemma). If 𝛾 1,2 ∈ P (X1 × X2 ) and 𝛾 2,3 ∈ P (X2 × X3 ) have the
same marginal distribution on X2 , then there exists 𝛾 ∈ P (X1 × X2 × X3 ) such that its
first two marginals are 𝛾 1,2 and its last two marginals are 𝛾 2,3 .

Proof. Let 𝜇 denote the common X2 -marginal of 𝛾 1,2 and 𝛾 2,3 . The idea is to first draw
𝑋 2 ∼ 𝜇. Then, draw 𝑋 1 from its conditional distribution given 𝑋 2 (according to 𝛾 1,2 ), and
similarly draw 𝑋 3 from its conditional distribution given 𝑋 2 (according to 𝛾 2,3 ). Then, take
𝛾 to be the law of the triple (𝑋 1, 𝑋 2, 𝑋 3 ).
The way to formalize this argument is via disintegration of measure. 

Proposition 1.3.14. The space (P2 (R𝑑 ),𝑊2 ) is a metric space.

Proof. Clearly, 𝑊2 is symmetric in its two arguments. It is also clear that 𝜇 = 𝜈 implies
𝑊2 (𝜇, 𝜈) = 0. Conversely, if 𝑊2 (𝜇, 𝜈) = 0, then there exists a coupling (𝑋, 𝑌 ) of (𝜇, 𝜈) such
that k𝑋 − 𝑌 k 2 = 0 a.s., or equivalently 𝑋 = 𝑌 a.s., which gives 𝜇 = 𝜈.
To verify the triangle inequality, we use the gluing lemma. Let 𝜇₁, 𝜇₂, 𝜇₃ ∈ P2 (R𝑑 ), let
𝛾★₁,₂ be optimal for (𝜇₁, 𝜇₂), and let 𝛾★₂,₃ be optimal for (𝜇₂, 𝜇₃). Let 𝛾 be obtained by gluing
𝛾★₁,₂ and 𝛾★₂,₃, and let 𝛾₁,₃ ∈ C(𝜇₁, 𝜇₃) denote the (1, 3)-marginal of 𝛾. Then,
$$W_2(\mu_1, \mu_3) \le \sqrt{\int \|x_1 - x_3\|^2 \, \mathrm{d}\gamma_{1,3}(x_1, x_3)} \le \sqrt{\int \{\|x_1 - x_2\| + \|x_2 - x_3\|\}^2 \, \mathrm{d}\gamma(x_1, x_2, x_3)} .$$

Let 𝑓(𝑥₁, 𝑥₂, 𝑥₃) ≔ ‖𝑥₁ − 𝑥₂‖ and 𝑔(𝑥₁, 𝑥₂, 𝑥₃) ≔ ‖𝑥₂ − 𝑥₃‖. The above expression can be
written as ‖𝑓 + 𝑔‖_{𝐿²(𝛾)}. By applying the triangle inequality in 𝐿²(𝛾),

$$\begin{aligned}
W_2(\mu_1, \mu_3) &\le \|f\|_{L^2(\gamma)} + \|g\|_{L^2(\gamma)} \\
&= \sqrt{\int \|x_1 - x_2\|^2 \, \mathrm{d}\gamma(x_1, x_2, x_3)} + \sqrt{\int \|x_2 - x_3\|^2 \, \mathrm{d}\gamma(x_1, x_2, x_3)} \\
&= \sqrt{\int \|x_1 - x_2\|^2 \, \mathrm{d}\gamma^\star_{1,2}(x_1, x_2)} + \sqrt{\int \|x_2 - x_3\|^2 \, \mathrm{d}\gamma^\star_{2,3}(x_2, x_3)} \\
&= W_2(\mu_1, \mu_2) + W_2(\mu_2, \mu_3) .
\end{aligned}$$

Since the next result is technical, we omit the proof.

Proposition 1.3.15. The metric space (P2 (R𝑑 ),𝑊2 ) is complete and separable. Also,
we have 𝑊₂(𝜇ₙ, 𝜇) → 0 if and only if 𝜇ₙ → 𝜇 weakly and ∫ ‖·‖² d𝜇ₙ → ∫ ‖·‖² d𝜇.

The continuity equation. Next, we are going to consider dynamics in the space of
measures, i.e., curves of measures 𝑡 ↦→ 𝜇𝑡 . Throughout, we assume these curves are
sufficiently nice, in the following sense.

Definition 1.3.16 (informal). We say that a curve 𝑡 ↦ 𝜇𝑡 ∈ P2,ac (R𝑑 ) is absolutely
continuous if for all 𝑡,
$$|\dot\mu|(t) := \lim_{s \to t} \frac{W_2(\mu_s, \mu_t)}{|s - t|} < \infty .$$
The quantity |𝜇̇| is called the metric derivative of the curve.

More generally, the metric derivative can be defined on any metric space and represents
the magnitude of the velocity of the curve, see [AGS08].
It is helpful to adopt a fluid dynamics analogy in which we think of 𝜇𝑡 as the mass
density of a fluid at time 𝑡. There are two complementary perspectives on fluid flows:

the Lagrangian perspective which emphasizes particle trajectories, and the Eulerian
perspective which tracks the evolution of the fluid density.
Suppose that 𝑋₀ ∼ 𝜇₀ and that 𝑡 ↦ 𝑋𝑡 evolves according to the ODE 𝑋̇𝑡 = 𝑣𝑡(𝑋𝑡). Here,
(𝑣𝑡 )𝑡 ≥0 is a family of vector fields, i.e. mappings R𝑑 → R𝑑 . Since the ODE describes the
evolution of the particle trajectory, it is the Lagrangian description of the dynamics. The
corresponding Eulerian description is the continuity equation.

Theorem 1.3.17. Let 𝑡 ↦→ 𝑣𝑡 be a family of vector fields and suppose that the random
variables 𝑡 ↦ 𝑋𝑡 evolve according to 𝑋̇𝑡 = 𝑣𝑡(𝑋𝑡). Then, the law 𝜇𝑡 of 𝑋𝑡 evolves
according to the continuity equation

𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 . (1.3.18)

Proof. Given a test function 𝜑 : R𝑑 → R,


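$$\begin{aligned}
\int \varphi \, \partial_t \mu_t = \partial_t \int \varphi \, \mu_t = \partial_t \, \mathbb{E} \, \varphi(X_t) &= \mathbb{E} \langle \nabla \varphi(X_t), \dot{X}_t \rangle = \mathbb{E} \langle \nabla \varphi(X_t), v_t(X_t) \rangle \\
&= \int \langle \nabla \varphi, v_t \rangle \, \mu_t = -\int \varphi \, \operatorname{div}(\mu_t v_t) .
\end{aligned}$$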

Since this holds for every 𝜑, we obtain 𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0. 
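
The continuity equation can be verified numerically in a simple case. The sketch below assumes dimension one, the hypothetical vector field 𝑣𝑡(𝑥) = −𝑥, and 𝜇₀ standard Gaussian, so that 𝑋𝑡 = e^{−𝑡}𝑋₀ and 𝜇𝑡 is a centered Gaussian with standard deviation e^{−𝑡}; the residual 𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡) is then approximated by central differences.

```python
import math

def density(t, x):
    """Density of X_t = exp(-t) * X_0 with X_0 ~ N(0, 1), i.e. N(0, exp(-2t))."""
    s = math.exp(-t)
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def velocity(t, x):
    """Vector field driving the ODE dX_t/dt = v_t(X_t) (assumed for this sketch)."""
    return -x

def continuity_residual(t, x, h=1e-4):
    """Central-difference approximation of d/dt mu_t(x) + d/dx (mu_t(x) v_t(x))."""
    dt_mu = (density(t + h, x) - density(t - h, x)) / (2.0 * h)
    flux = lambda y: density(t, y) * velocity(t, y)
    div_flux = (flux(x + h) - flux(x - h)) / (2.0 * h)
    return dt_mu + div_flux

residuals = [
    abs(continuity_residual(t, x))
    for t in (0.0, 0.5, 1.0)
    for x in (-1.0, 0.2, 0.8)
]
```

The residuals are zero up to the finite-difference error, matching the Lagrangian description 𝑋̇𝑡 = 𝑣𝑡(𝑋𝑡) with its Eulerian counterpart.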


The punchline is that every nice curve of measures 𝑡 ↦→ 𝜇𝑡 can be interpreted as the
fluid flow along a family of vector fields, i.e., we can find vector fields 𝑡 ↦→ 𝑣𝑡 such that
the continuity equation (1.3.18) holds. First, however, note that there is no uniqueness:
if 𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 and for each 𝑡, 𝑤𝑡 is a vector field satisfying div(𝜇𝑡 𝑤𝑡 ) = 0, then
the continuity equation also holds with the new vector fields 𝑣̃𝑡 ≔ 𝑣𝑡 + 𝑤𝑡. We will show
how to pick a distinguished choice of vector fields 𝑡 ↦ 𝑣𝑡 which can be described in
two equivalent ways. First, among all vector fields making the continuity equation hold
true, we can choose 𝑣𝑡 to minimize ∫ ‖𝑣𝑡‖² d𝜇𝑡, which has the physical interpretation of
minimizing kinetic energy. Second, we can choose 𝑣𝑡 to be the gradient of a function; we
will see that this is natural in light of the characterization of optimal transport maps.

Theorem 1.3.19 (curves of measures as fluid flows). Let 𝑡 ↦ 𝜇𝑡 be an absolutely
continuous curve of measures.

1. For any family of vector fields 𝑡 ↦ 𝑣̃𝑡 such that the continuity equation (1.3.18)
holds, we have |𝜇̇|(𝑡) ≤ ‖𝑣̃𝑡‖_{𝐿²(𝜇𝑡)} for all 𝑡.

2. Conversely, there exists a unique choice of vector fields 𝑡 ↦ 𝑣𝑡 such that the
continuity equation (1.3.18) holds and ‖𝑣𝑡‖_{𝐿²(𝜇𝑡)} ≤ |𝜇̇|(𝑡) for all 𝑡. The choice
of vector fields is also characterized by the following property: the continuity
equation (1.3.18) holds and for each 𝑡, 𝑣𝑡 = ∇𝜓𝑡 for a function 𝜓𝑡 : R𝑑 → R.

Moreover, the distinguished vector field 𝑣𝑡 satisfies

$$v_t = \lim_{\delta \searrow 0} \frac{T_{\mu_t \to \mu_{t+\delta}} - \mathrm{id}}{\delta} \tag{1.3.20}$$

where 𝑇_{𝜇𝑡→𝜇𝑡+𝛿} is the optimal transport map from 𝜇𝑡 to 𝜇𝑡+𝛿.

Proof. 1. Proof of the first statement. Let 𝛿 > 0 and consider the flow map 𝐹𝑡,𝑡+𝛿 defined
as follows. Given any initial point 𝑥𝑡 ∈ R𝑑, consider the ODE 𝑥̇𝑡 = 𝑣̃𝑡(𝑥𝑡)
started at 𝑥𝑡. Then, 𝐹𝑡,𝑡+𝛿 maps 𝑥𝑡 to the solution 𝑥𝑡+𝛿 of the ODE at time 𝑡 + 𝛿.
If 𝑋𝑡 ∼ 𝜇𝑡, then the continuity equation implies 𝐹𝑡,𝑡+𝛿(𝑋𝑡) ∼ 𝜇𝑡+𝛿, i.e., 𝐹𝑡,𝑡+𝛿 is a valid
transport map from 𝜇𝑡 to 𝜇𝑡+𝛿. Hence, we can estimate
$$\frac{W_2(\mu_t, \mu_{t+\delta})}{\delta} \le \sqrt{\int \frac{\|F_{t,t+\delta} - \mathrm{id}\|^2}{\delta^2} \, \mathrm{d}\mu_t} .$$

However, 𝐹𝑡,𝑡+𝛿 − id = 𝛿 𝑣̃𝑡 + 𝑜(𝛿), so letting 𝛿 ↘ 0 we obtain |𝜇̇|(𝑡) ≤ ‖𝑣̃𝑡‖_{𝐿²(𝜇𝑡)}.
(Actually, to prove this statement we should also consider the limit 𝛿 ↗ 0 for
negative 𝛿, but it is clear that the same argument works.)

2. Uniqueness of the optimal vector field. Suppose we find $t \mapsto v_t$ satisfying the
continuity equation and such that $\|v_t\|_{L^2(\mu_t)} \le |\dot\mu|(t)$. In light of the first statement,
this implies that the zero vector field is the minimizer of $\|v_t + w_t\|_{L^2(\mu_t)}$ among all vector
fields $w_t$ such that $\operatorname{div}(\mu_t w_t) = 0$. Since the squared norm is strictly convex and the
constraint is linear, the minimizer is unique, meaning that the family $t \mapsto v_t$ is uniquely
determined.

3. Gradient vector fields are optimal. Here, we show that if the continuity equation
holds for the family of vector fields $t \mapsto v_t$ and $v_t = \nabla\psi_t$ for all $t$, then the
vector fields are optimal.
There are at least two ways of seeing why gradient vector fields should be optimal.
First, the continuity equation is equivalent to requiring that for all test functions
$\varphi : \mathbb{R}^d \to \mathbb{R}$, it holds that $\partial_t \int \varphi \, \mathrm{d}\mu_t = \int \langle \nabla\varphi, v_t \rangle \, \mathrm{d}\mu_t$. In this expression, the vector
field $v_t$ only enters through inner products with gradients. To put it another way, if
we consider the space $S := \{\nabla\psi \mid \psi : \mathbb{R}^d \to \mathbb{R}\}$ of gradients, viewed as a subspace
of $L^2(\mu_t)$, then we can write $L^2(\mu_t) = S \oplus S^\perp$ (actually, to make this valid we should
take the closure of $S$, but we will ignore this detail). If we decompose $v_t$ according
to this direct sum, then $v_t = \nabla\psi_t + w_t$ for some function $\psi_t$ and some $w_t$ which is
orthogonal (in $L^2(\mu_t)$) to $S$. If we replace $v_t$ by $\nabla\psi_t$, then the continuity equation
continues to hold, but we have only made the norm $\|v_t\|_{L^2(\mu_t)}$ smaller, hence the
optimal choice of $v_t$ should be a gradient.
The second line of reasoning comes from the proof of the first statement: the reason
why the metric derivative $|\dot\mu|(t)$ was only upper bounded by $\|\tilde v_t\|_{L^2(\mu_t)}$ is that the
flow map corresponding to $t \mapsto \tilde v_t$ furnishes a possibly suboptimal transport map.
To fix this, the flow map for the optimal $t \mapsto v_t$ should be approximately equal to the
optimal transport map, i.e., $v_t \approx (T_{\mu_t \to \mu_{t+\delta}} - \mathrm{id})/\delta$. From the fundamental theorem of
optimal transport, however, $T_{\mu_t \to \mu_{t+\delta}}$ is the gradient of a convex function, so in the
limit $v_t$ should be a gradient as well.
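Rather than relying on these heuristic arguments, we provide a proof based on direct
computation. If $v_t = \nabla\psi_t$, the continuity equation shows that

\[
  \frac{\int \psi_t \, \mathrm{d}\mu_{t+\delta} - \int \psi_t \, \mathrm{d}\mu_t}{\delta}
  = \int \psi_t \, \partial_t \mu_t + o(1)
  = -\int \psi_t \operatorname{div}(\mu_t \nabla\psi_t) + o(1)
  = \int \|\nabla\psi_t\|^2 \, \mathrm{d}\mu_t + o(1) \, .
\]

On the other hand,

\begin{align*}
  \frac{\int \psi_t \, \mathrm{d}\mu_{t+\delta} - \int \psi_t \, \mathrm{d}\mu_t}{\delta}
  &= \int \frac{\psi_t \circ T_{\mu_t \to \mu_{t+\delta}} - \psi_t}{\delta} \, \mathrm{d}\mu_t \\
  &= \int \Bigl\langle \nabla\psi_t, \frac{T_{\mu_t \to \mu_{t+\delta}} - \mathrm{id}}{\delta} \Bigr\rangle \, \mathrm{d}\mu_t + o(1) \\
  &\le \sqrt{\int \|\nabla\psi_t\|^2 \, \mathrm{d}\mu_t} \; \frac{W_2(\mu_t, \mu_{t+\delta})}{\delta} + o(1) \, ,
\end{align*}

where the last step is the Cauchy–Schwarz inequality. Comparing the two displays and
taking $\delta \searrow 0$ yields $\|v_t\|_{L^2(\mu_t)} = \|\nabla\psi_t\|_{L^2(\mu_t)} \le |\dot\mu|(t)$.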
4. Existence of optimal vector fields. Finally, one can show for instance that vector
fields defined via limits of transport maps as in (1.3.20) indeed satisfy the continuity
equation and are gradient vector fields, and are therefore optimal. However, the
details are omitted. □

From the theorem, we learn that the optimal vector field $v_t$ satisfies $\|v_t\|_{L^2(\mu_t)} = |\dot\mu|(t)$.
On the other hand, the metric derivative is supposed to be the “magnitude of the velocity”.
Our next goal is to interpret $v_t$ as the velocity vector to the curve, and $\|v_t\|_{L^2(\mu_t)}$ as its
norm, all through the lens of Riemannian geometry.

Background on Riemannian geometry. In the spirit of informality, we give a description
of what a Riemannian manifold entails, rather than a precise definition. A manifold
M is a space which is locally homeomorphic to a Euclidean space. At each point 𝑝 ∈ M,
there is an associated vector space 𝑇𝑝 M, called the tangent space at 𝑝, which is the
space of all possible velocities of curves passing through 𝑝. The whole structure should
be smooth: the tangent spaces should vary smoothly in a suitable sense.
A Riemannian metric is a smoothly varying choice of inner products $p \mapsto \langle \cdot, \cdot \rangle_p$ on
the tangent spaces. The metric allows us to, e.g., locally measure the angles between two
intersecting curves. For our purposes, it is important to note that the metric allows us to
define the steepest descent direction for an objective function, which in turn allows us to
consider gradient flows.
The Riemannian metric induces a distance function (in the sense of metric spaces) via

\[
  \mathsf{d}(p, q) := \inf\Bigl\{ \int_0^1 \|\dot\gamma(t)\|_{\gamma(t)} \, \mathrm{d}t \Bigm| \gamma : [0, 1] \to \mathcal{M}, \; \gamma(0) = p, \; \gamma(1) = q \Bigr\} \, . \tag{1.3.21}
\]

Here, $\dot\gamma(t)$ denotes the tangent vector to the curve at time $t$. Note that the norm of the
tangent vector is measured w.r.t. the inner product on the tangent space $T_{\gamma(t)}\mathcal{M}$, hence
we write $\|\dot\gamma(t)\|_{\gamma(t)}$. If the infimum is achieved by a curve $\gamma$, then $\gamma$ is referred to as a
geodesic (a shortest path); if $t \mapsto \|\dot\gamma(t)\|_{\gamma(t)}$ is constant, then it is called a constant-speed
geodesic. From now on, we will only consider constant-speed geodesics, and the words
“constant speed” will be dropped for brevity.
Given a functional $\mathcal{F} : \mathcal{M} \to \mathbb{R}$, the gradient of $\mathcal{F}$ at $p$ is defined to be the unique
element $\nabla\mathcal{F}(p) \in T_p\mathcal{M}$ such that for all curves $(p_t)_{t \in \mathbb{R}}$ passing through $p$ at time $0$ with
velocity $v \in T_p\mathcal{M}$, it holds that $\partial_t|_{t=0} \mathcal{F}(p_t) = \langle \nabla\mathcal{F}(p), v \rangle_p$.

Wasserstein space as a Riemannian manifold. Based on our discussion thus far, it
is natural to define the tangent space at $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ as

\[
  T_\mu \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) := \overline{\{\nabla\psi \mid \psi \in C_c^\infty(\mathbb{R}^d)\}}^{L^2(\mu)} \, ,
\]

where the notation denotes taking the $L^2(\mu)$ closure. Equivalently,

\[
  T_\mu \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) = \overline{\{\lambda \, (T - \mathrm{id}) \mid \lambda > 0, \; T \text{ is an optimal transport map}\}}^{L^2(\mu)} \, .
\]

We equip the tangent space 𝑇𝜇 P2,ac (R𝑑 ) with the 𝐿 2 (𝜇) norm, which gives a Riemannian
metric. This does not define a genuine Riemannian manifold (e.g., it is not locally homeo-
morphic to a Euclidean space or even a Hilbert space), but we will treat it as one for the
purpose of developing calculation rules.
If the continuity equation 𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 holds and 𝑣𝑡 ∈ 𝑇𝜇𝑡 P2,ac (R𝑑 ), then 𝑣𝑡 is
the tangent vector to the curve at time 𝑡. The condition 𝑣𝑡 ∈ 𝑇𝜇𝑡 P2,ac (R𝑑 ) is equivalent to
saying that 𝑣𝑡 is the optimal vector field considered in Theorem 1.3.19.
There are two questions to address. First, is this Riemannian structure compatible
with the 2-Wasserstein distance? In other words, we know that a Riemannian metric
induces a distance function; is the distance function induced by the Riemannian structure
of P2,ac (R𝑑 ) equal to 𝑊2 ? Second, what are the geodesics? We answer both questions via
the following theorem.

Theorem 1.3.22 (Wasserstein geodesics). Let $\mu_0, \mu_1 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. Then,

\[
  W_2(\mu_0, \mu_1) = \inf\Bigl\{ \int_0^1 \|v_t\|_{L^2(\mu_t)} \, \mathrm{d}t \Bigm| \partial_t \mu_t + \operatorname{div}(\mu_t v_t) = 0 \Bigr\} \, . \tag{1.3.23}
\]

The infimum is achieved as follows. Let $X_0 \sim \mu_0$ and $X_1 \sim \mu_1$ be optimally coupled
random variables, let $X_t := (1 - t) \, X_0 + t \, X_1$, and let $\mu_t := \operatorname{law}(X_t)$. Then, $t \mapsto \mu_t$ is the
unique constant-speed geodesic joining $\mu_0$ to $\mu_1$.

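Proof. Suppose that $\partial_t \mu_t + \operatorname{div}(\mu_t v_t) = 0$. Then, by Theorem 1.3.19,
$\int_0^1 \|v_t\|_{L^2(\mu_t)} \, \mathrm{d}t \ge \int_0^1 |\dot\mu|(t) \, \mathrm{d}t$. For a
partition $0 = t_0 < t_1 < \cdots < t_k = 1$, the triangle inequality gives

\[
  W_2(\mu_0, \mu_1) \le \sum_{i=1}^k W_2(\mu_{t_{i-1}}, \mu_{t_i})
  = \sum_{i=1}^k \frac{W_2(\mu_{t_{i-1}}, \mu_{t_i})}{t_i - t_{i-1}} \, (t_i - t_{i-1}) \, .
\]

As the size of the partition tends to zero, we obtain $W_2(\mu_0, \mu_1) \le \int_0^1 |\dot\mu|(t) \, \mathrm{d}t$. This shows
that $W_2(\mu_0, \mu_1)$ is at most the value of the infimum.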
To show that equality holds, let $X_t$ be defined as in the theorem statement and note that
$\mathbb{E}[\|\dot X_t\|^2] = \|v_t\|_{L^2(\mu_t)}^2$ by the correspondence of the Lagrangian and Eulerian perspectives.
(This can be verified by writing the vector field explicitly as $v_t = (T_1 - \mathrm{id}) \circ T_t^{-1}$, where
$T_t := (1 - t) \, \mathrm{id} + t \, T_{\mu_0 \to \mu_1}$.) Since $\mathbb{E}[\|\dot X_t\|^2] = \mathbb{E}[\|X_1 - X_0\|^2] = W_2^2(\mu_0, \mu_1)$ does not depend
on time, the curve has constant speed, and $\int_0^1 \|v_t\|_{L^2(\mu_t)} \, \mathrm{d}t = W_2(\mu_0, \mu_1)$.
To show uniqueness, again work in the Lagrangian perspective: suppose we have an
evolution $t \mapsto X_t$ of random variables such that $t \mapsto \mathbb{E}[\|\dot X_t\|^2]$ is constant, and $X_0 \sim \mu_0$,
$X_1 \sim \mu_1$. Then, we have

\[
  W_2^2(\mu_0, \mu_1) \le \mathbb{E}[\|X_1 - X_0\|^2]
  = \mathbb{E}\Bigl[\Bigl\| \int_0^1 \dot X_t \, \mathrm{d}t \Bigr\|^2\Bigr]
  \le \mathbb{E} \int_0^1 \|\dot X_t\|^2 \, \mathrm{d}t
  = \Bigl( \int_0^1 \|v_t\|_{L^2(\mu_t)} \, \mathrm{d}t \Bigr)^2
\]

where the last equality follows from the constant-speed assumption. In order for the first
inequality to be an equality, $(X_0, X_1)$ must be an optimal coupling. In order for the second
inequality to be an equality, strict convexity of $\|\cdot\|^2$ implies that $\dot X_t$ is constant in time
and equal to its average $\int_0^1 \dot X_t \, \mathrm{d}t = X_1 - X_0$. □

Definition 1.3.24. Let $\mu_0, \mu_1 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, and let $X_0 \sim \mu_0$, $X_1 \sim \mu_1$ be optimally
coupled. Let $X_t := (1 - t) \, X_0 + t \, X_1$, and let $\mu_t := \operatorname{law}(X_t)$. Then, the curve $t \mapsto \mu_t$ is
called the Wasserstein geodesic joining $\mu_0$ to $\mu_1$. It is also called the displacement
interpolation or McCann’s interpolation.
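As a sanity check on this definition, consider one-dimensional Gaussians, where everything is explicit. The sketch below relies on two standard facts that are not derived in this section and should be read as assumptions of the example: the optimal map between $N(m_0, s_0^2)$ and $N(m_1, s_1^2)$ is the increasing affine map $T(x) = m_1 + (s_1/s_0)(x - m_0)$, and $W_2^2 = (m_0 - m_1)^2 + (s_0 - s_1)^2$. The function names are hypothetical.

```python
import math

def w2_gaussian_1d(m0, s0, m1, s1):
    """W2 distance between N(m0, s0^2) and N(m1, s1^2) on the real line."""
    return math.sqrt((m0 - m1) ** 2 + (s0 - s1) ** 2)

def geodesic_point(m0, s0, m1, s1, t):
    """Mean and standard deviation of X_t = (1 - t) X_0 + t T(X_0), where
    T(x) = m1 + (s1 / s0) (x - m0) is the increasing (hence optimal) map."""
    return ((1 - t) * m0 + t * m1, (1 - t) * s0 + t * s1)

m0, s0, m1, s1 = -1.0, 0.5, 2.0, 2.0
total = w2_gaussian_1d(m0, s0, m1, s1)
ma, sa = geodesic_point(m0, s0, m1, s1, 0.25)
mb, sb = geodesic_point(m0, s0, m1, s1, 0.75)
partial = w2_gaussian_1d(ma, sa, mb, sb)
print(total, partial)  # partial is half of total: the curve has constant speed
```

The identity $W_2(\mu_s, \mu_t) = |t - s| \, W_2(\mu_0, \mu_1)$ observed here is exactly what one expects of a constant-speed geodesic.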

Exponential map and logarithmic map. On a Riemannian manifold $\mathcal{M}$, the Riemannian
exponential map $\exp_p : T_p\mathcal{M} \to \mathcal{M}$ takes a tangent vector $v \in T_p\mathcal{M}$ to the
endpoint at time $1$ of the constant-speed geodesic emanating from $p$ with velocity $v$. The
logarithmic map is then defined to be the inverse mapping $\log_p : \mathcal{M} \to T_p\mathcal{M}$. Actually,
in general, the exponential map is only defined on a subset of the tangent space, because in
many manifolds (e.g., the sphere), geodesics cannot continue indefinitely while remaining
shortest paths between their endpoints. On Euclidean space $\mathbb{R}^d$, we have $\exp_p(v) = p + v$
and $\log_p(q) = q - p$.
We can identify these maps for (P2,ac (R𝑑 ),𝑊2 ). If (𝜇𝑡 )𝑡 ∈[0,1] is a Wasserstein geodesic
and ∇𝜑 𝜇0 →𝜇1 is the optimal transport map from 𝜇0 to 𝜇1 , then the tangent vector to the
geodesic at time 0 is ∇𝜑 𝜇0 →𝜇1 − id. This implies that log𝜇0 (𝜇 1 ) = ∇𝜑 𝜇0 →𝜇1 − id. The inverse
mapping is then given as follows: if ∇𝜓 ∈ 𝑇𝜇0 P2,ac (R𝑑 ) is such that id + ∇𝜓 is the gradient
of a convex function, then exp𝜇0 (∇𝜓 ) = (id + ∇𝜓 ) # 𝜇 0 .

Geodesically convex functionals. Over a Riemannian manifold $\mathcal{M}$, the correct way
to define convexity is as follows.

Definition 1.3.25. Let $\mathcal{M}$ be a Riemannian manifold and let $\mathcal{F} : \mathcal{M} \to \mathbb{R} \cup \{\infty\}$ be
smooth. For $\alpha \in \mathbb{R}$, we say that $\mathcal{F}$ is $\alpha$-geodesically convex if one of the following
equivalent conditions holds:

1. For all geodesics $(p_t)_{t \in [0,1]}$ and $t \in [0, 1]$,

\[
  \mathcal{F}(p_t) \le (1 - t) \, \mathcal{F}(p_0) + t \, \mathcal{F}(p_1) - \frac{\alpha \, t \, (1 - t)}{2} \, \mathsf{d}(p_0, p_1)^2 \, ,
\]

where $\mathsf{d}$ is the induced Riemannian distance (1.3.21).

2. For all $p, q \in \mathcal{M}$,

\[
  \mathcal{F}(q) \ge \mathcal{F}(p) + \langle \nabla\mathcal{F}(p), \log_p(q) \rangle_p + \frac{\alpha}{2} \, \mathsf{d}(p, q)^2 \, .
\]

Here, $\nabla$ denotes the Riemannian gradient.

3. For all constant-speed geodesics $(p_t)_{t \in [0,1]}$ with tangent vector $v_0 \in T_{p_0}\mathcal{M}$ at
time $0$, it holds that

\[
  \partial_t^2 \big|_{t=0} \mathcal{F}(p_t) \ge \alpha \, \|v_0\|_{p_0}^2 \, .
\]


1.4 The Langevin SDE as a Wasserstein Gradient Flow


We are now ready to interpret the Langevin diffusion (1.E.1) as a gradient flow in the
Wasserstein space (P2,ac (R𝑑 ),𝑊2 ). Once we have done so, we will quickly deduce conver-
gence results for the Langevin diffusion based on gradient flow computations.

1.4.1 Derivation of the Gradient Flow


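Let $\mathcal{F} : \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \to \mathbb{R} \cup \{\infty\}$ be a functional over the Wasserstein space. We now compute
the Wasserstein gradient of $\mathcal{F}$ at $\mu$, i.e., the element $\nabla_{W_2}\mathcal{F}(\mu) \in T_\mu \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ such that for
every curve $t \mapsto \mu_t$ with $\mu_0 = \mu$, if $v_0$ is the tangent vector to the curve at time $0$, then
$\partial_t|_{t=0} \mathcal{F}(\mu_t) = \langle \nabla_{W_2}\mathcal{F}(\mu), v_0 \rangle_\mu$, where $\langle \cdot, \cdot \rangle_\mu$ is the inner product on $T_\mu \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$.
We will give a formula in terms of the first variation of $\mathcal{F}$ at $\mu$, denoted $\delta\mathcal{F}(\mu)$.
The first variation satisfies $\partial_t|_{t=0} \mathcal{F}(\mu_t) = \int \delta\mathcal{F}(\mu) \, \partial_t|_{t=0} \mu_t$. This is almost the same as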
the 𝐿 2 (𝔪) gradient of F, where 𝔪 is the Lebesgue measure on R𝑑 , except for a few
differences: (1) there is no guarantee that 𝛿F(𝜇) ∈ 𝐿 2 (𝔪); (2) in order to consider the
𝐿 2 (𝔪) gradient, we would want F to be a functional defined over all of 𝐿 2 (𝔪), not just
probability densities, and similarly we would have to consider all curves in 𝐿 2 (𝔪) rather
than curves of probability densities.
As a consequence of looking only at probability densities, the first variation is only
defined up to an additive constant. Indeed, $\partial_t|_{t=0} \mu_t$ always integrates to $0$, so we can add
any constant to the first variation. This does not cause any ambiguity, as we now see.
Recall that $v_t$ is the tangent vector to the curve of measures at time $t$ if the continuity
equation $\partial_t \mu_t + \operatorname{div}(\mu_t v_t) = 0$ holds and $v_t \in T_{\mu_t} \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. Using the continuity equation
with a curve such that $\mu_0 = \mu$,

\[
  \partial_t \big|_{t=0} \mathcal{F}(\mu_t)
  = \int \delta\mathcal{F}(\mu) \, \partial_t \big|_{t=0} \mu_t
  = -\int \delta\mathcal{F}(\mu) \operatorname{div}(v_0 \, \mu)
  = \int \langle \nabla\delta\mathcal{F}(\mu), v_0 \rangle \, \mathrm{d}\mu \, .
\]

(Here, the ∇ is the Euclidean gradient.) Since ∇𝛿F(𝜇) is the gradient of a function, from
our characterization of the tangent space we know that ∇𝛿F(𝜇) ∈ 𝑇𝜇 P2,ac (R𝑑 ). Therefore,
the equation above says that the Wasserstein gradient of F at 𝜇 is ∇𝛿F(𝜇).

Theorem 1.4.1. Let F : P2,ac (R𝑑 ) → R ∪ {∞} be a functional. Then, its Wasserstein
gradient at 𝜇 is

∇𝑊2 F(𝜇) = ∇𝛿F(𝜇) ,

where 𝛿F(𝜇) is a first variation of F at 𝜇.

Since we take the Euclidean gradient of the first variation, the fact that the first variation
is only defined up to an additive constant does not bother us.
The Wasserstein gradient flow of $\mathcal{F}$ is by definition a curve of measures $t \mapsto \mu_t$ such
that its tangent vector $v_t$ at time $t$ is $v_t = -\nabla_{W_2}\mathcal{F}(\mu_t)$. Substituting this into the continuity
equation (1.3.18), we obtain the gradient flow equation

\[
  \partial_t \mu_t = \operatorname{div}\bigl(\mu_t \, \nabla_{W_2}\mathcal{F}(\mu_t)\bigr) = \operatorname{div}\bigl(\mu_t \, \nabla\delta\mathcal{F}(\mu_t)\bigr) \, .
\]

Example 1.4.2. Consider $\mathcal{F} = \mathrm{KL}(\cdot \,\|\, \pi)$ where $\pi = \exp(-V)$. This functional can
also be written as

\[
  \mathcal{F}(\mu) = \int \mu \ln \frac{\mu}{\pi} = \int V \, \mathrm{d}\mu + \int \mu \ln \mu \, .
\]

These two terms have the interpretation of energy and (negative) entropy. From this,
we can compute that

\[
  \delta\mathcal{F}(\mu) = V + \ln \mu + \text{constant}
\]

and therefore

\[
  \nabla_{W_2}\mathcal{F}(\mu) = \nabla V + \nabla \ln \mu = \nabla \ln \frac{\mu}{\pi} \, .
\]

The Wasserstein gradient flow of $\mathcal{F}$ satisfies

\[
  \partial_t \mu_t = \operatorname{div}\Bigl(\mu_t \, \nabla \ln \frac{\mu_t}{\pi}\Bigr) \, .
\]

Comparing with the Fokker–Planck equation $\partial_t \pi_t = \mathscr{L}^* \pi_t$ and the form of the
adjoint generator $\mathscr{L}^*$ for the Langevin diffusion (see Example 1.2.8), we obtain a
truly remarkable fact: the law $t \mapsto \pi_t$ of the Langevin diffusion with potential $V$ is the
Wasserstein gradient flow of $\mathrm{KL}(\cdot \,\|\, \pi)$.

The calculus of optimal transport was introduced by Otto in [Ott01], and it is often
known as Otto calculus; the interpretation of the Langevin diffusion in this context
was put forth in the seminal work [JKO98]. The paper [Ott01] also raises and answers a
salient question: given that we can view dynamics as gradient flows in different ways (e.g.
the Langevin diffusion can be either viewed as the gradient flow of the Dirichlet energy
in 𝐿 2 (𝜋), or the gradient flow of the KL divergence in P2,ac (R𝑑 )), what makes us prefer
one gradient flow structure over another? Otto argues that the Wasserstein geometry
is particularly natural because it cleanly separates out two aspects of the problem: the
geometry of the ambient space, which is reflected in the metric on P2,ac (R𝑑 ), and the
objective functional. Moreover, the objective functional in the Wasserstein perspective is
physically intuitive because it has an interpretation in thermodynamics. From a sampling
standpoint, the Wasserstein geometry is undoubtedly more compelling and useful.
In our exposition, we focused on the calculation rules for Wasserstein gradient flows,
but this is not how they are normally defined. Instead, one usually considers a sequence
of discrete approximations to the gradient flow and proves that there is a limiting curve;
this is called the minimizing movements scheme and it is developed in detail in [AGS08].

1.4.2 Convexity of the KL Divergence


The key to studying gradient flows is to understand the convexity properties of the
objective functional. For the specific functional $\mathcal{F} := \mathrm{KL}(\cdot \,\|\, \pi)$ with target $\pi = \exp(-V)$,
our next goal is therefore to compute the Wasserstein Hessian of $\mathcal{F}$. When we computed
Wasserstein gradients, we were free to differentiate $\mathcal{F}$ along any curve $t \mapsto \mu_t$ of measures,
but we have to be more careful when computing the Hessian. If we take a function
$f : \mathbb{R}^d \to \mathbb{R}$ on Euclidean space and a curve $t \mapsto x_t$, then $\partial_t f(x_t) = \langle \nabla f(x_t), \dot x_t \rangle$ and

\[
  \partial_t^2 f(x_t) = \langle \nabla^2 f(x_t) \, \dot x_t, \dot x_t \rangle + \langle \nabla f(x_t), \ddot x_t \rangle \, .
\]

Here, instead of just obtaining the Hessian, we have an additional term. However, if
$t \mapsto x_t$ is a constant-speed geodesic, then it has no acceleration ($\ddot x_t = 0$), and the extra
term vanishes.
In the same way, let $(\mu_t)_{t \in [0,1]}$ denote a Wasserstein geodesic. Explicitly, if $T$ denotes
the optimal transport map from $\mu_0$ to $\mu_1$, then $\mu_t = [(1 - t) \, \mathrm{id} + t \, T]_\# \mu_0$. We will calculate
$\partial_t^2|_{t=0} \mathcal{F}(\mu_t)$ as a function of the tangent vector $T - \mathrm{id} \in T_{\mu_0} \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$; this is interpreted
as $\langle \nabla_{W_2}^2 \mathcal{F}(\mu_0) \, (T - \mathrm{id}), T - \mathrm{id} \rangle_{\mu_0}$. If we can lower bound this by $\alpha \, \|T - \mathrm{id}\|_{\mu_0}^2$ for all $\mu_0$ and
all optimal transport maps $T$, it means that $\mathcal{F}$ is $\alpha$-strongly convex.
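Write $\mathcal{E}(\mu) := \int V \, \mathrm{d}\mu$ for the energy and $\mathcal{H}(\mu) := \int \mu \ln \mu$ for the entropy. We deal
with the two terms separately. First, for $X_t = (1 - t) \, X_0 + t \, T(X_0)$ and $X_0 \sim \mu_0$,

\begin{align*}
  \partial_t \mathcal{E}(\mu_t) &= \partial_t \, \mathbb{E} V(X_t) = \mathbb{E}\langle \nabla V(X_t), \dot X_t \rangle = \mathbb{E}\langle \nabla V(X_t), T(X_0) - X_0 \rangle \, , \\
  \partial_t^2 \big|_{t=0} \mathcal{E}(\mu_t) &= \mathbb{E}\langle \nabla^2 V(X_0) \, (T(X_0) - X_0), T(X_0) - X_0 \rangle \, .
\end{align*}

If $V$ is $\alpha$-strongly convex, then this is lower bounded by $\alpha \, \|T - \mathrm{id}\|_{\mu_0}^2$, which means that $\mathcal{E}$
is $\alpha$-strongly convex.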
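The entropy is slightly trickier. Write $T_t := (1 - t) \, \mathrm{id} + t \, T$. Since $(T_t)_\# \mu_0 = \mu_t$, the
change of variables formula shows that

\[
  \frac{\mu_0}{\mu_t \circ T_t} = \det \nabla T_t \, .
\]

Therefore,

\begin{align*}
  \mathcal{H}(\mu_t) &= \int \mu_t \ln \mu_t = \int \mu_0 \ln(\mu_t \circ T_t) = \int \mu_0 \ln \frac{\mu_0}{\det \nabla T_t} \\
  &= \mathcal{H}(\mu_0) - \int \mu_0 \ln \det\bigl((1 - t) \, I_d + t \, \nabla T\bigr) \, .
\end{align*}

Already from the fact that $-\ln\det$ is convex on the space of positive semidefinite matrices,
we can see that $\partial_t^2 \mathcal{H}(\mu_t) \ge 0$. A more careful computation based on the derivatives of
$-\ln\det$ shows that (exercise!)

\[
  \partial_t^2 \big|_{t=0} \mathcal{H}(\mu_t) = \int \|\nabla T - I_d\|_{\mathrm{HS}}^2 \, \mathrm{d}\mu_0 \ge 0 \, . \tag{1.4.3}
\]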

We have obtained the following result.


Theorem 1.4.4. If $\pi \propto \exp(-V)$, where $V$ is $\alpha$-strongly convex, then $\mathrm{KL}(\cdot \,\|\, \pi)$ is also
$\alpha$-strongly convex along Wasserstein geodesics.
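The theorem can be tested numerically in a case where every quantity is explicit. The following sketch is an illustration under stated assumptions, not part of the development above: it takes $d = 1$, $V(x) = \alpha x^2 / 2$ so that $\pi = N(0, 1/\alpha)$, restricts to Gaussian $\mu_0, \mu_1$ (whose Wasserstein geodesic interpolates means and standard deviations linearly), and uses the standard closed form for the KL divergence between Gaussians. It then evaluates the gap in condition 1 of Definition 1.3.25.

```python
import math

def kl_to_target(m, s, alpha):
    """KL(N(m, s^2) || N(0, 1/alpha)) in one dimension (closed form)."""
    u = alpha * s * s
    return 0.5 * (u + alpha * m * m - 1.0 - math.log(u))

def convexity_gap(mu0, mu1, alpha, t):
    """Right-hand side minus left-hand side of condition 1 in Definition 1.3.25."""
    (m0, s0), (m1, s1) = mu0, mu1
    # Wasserstein geodesic between 1D Gaussians: linear in (mean, std).
    mt, st = (1 - t) * m0 + t * m1, (1 - t) * s0 + t * s1
    w2_sq = (m0 - m1) ** 2 + (s0 - s1) ** 2
    chord = (1 - t) * kl_to_target(m0, s0, alpha) + t * kl_to_target(m1, s1, alpha)
    return chord - 0.5 * alpha * t * (1 - t) * w2_sq - kl_to_target(mt, st, alpha)

alpha = 2.0
gaps = [convexity_gap((-1.0, 0.3), (2.0, 1.5), alpha, k / 10) for k in range(11)]
print(min(gaps))  # nonnegative up to rounding
```

In this example the energy contributes exactly the quadratic term $\frac{\alpha t (1 - t)}{2} W_2^2$, and the remaining gap comes from the convexity of $s \mapsto -\ln s$, mirroring the split into energy and entropy used in the proof.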

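Consequences of strong convexity. The strong convexity of the KL divergence implies
the following statement.

Theorem 1.4.5. If $\pi \propto \exp(-V)$, where $V$ is $\alpha$-strongly convex, then

\[
  \mathrm{KL}(\nu \,\|\, \pi) \ge \mathrm{KL}(\mu \,\|\, \pi) + \Bigl\langle \nabla \ln \frac{\mu}{\pi}, T_{\mu \to \nu} - \mathrm{id} \Bigr\rangle_\mu + \frac{\alpha}{2} \, W_2^2(\mu, \nu)
\]

for all $\mu, \nu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$.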

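We now explore the implications of this fact for the gradient flow. If $t \mapsto \mu_t$ is the gradient
flow for a functional $\mathcal{F}$ with $\inf \mathcal{F} = 0$, then

\[
  \partial_t \mathcal{F}(\mu_t) = \bigl\langle \nabla_{W_2}\mathcal{F}(\mu_t), \underbrace{-\nabla_{W_2}\mathcal{F}(\mu_t)}_{\text{tangent vector of the curve}} \bigr\rangle_{\mu_t} = -\|\nabla_{W_2}\mathcal{F}(\mu_t)\|_{\mu_t}^2 \, .
\]

So, $t \mapsto \mathcal{F}(\mu_t)$ is always decreasing, and if the condition

\[
  \|\nabla_{W_2}\mathcal{F}(\mu)\|_\mu^2 \ge 2\alpha \, \mathcal{F}(\mu) \qquad \text{for all } \mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)
\]

holds, then the gradient flow converges exponentially fast, $\mathcal{F}(\mu_t) \le \exp(-2\alpha t) \, \mathcal{F}(\mu_0)$, as a
consequence of Grönwall’s lemma (Lemma 1.1.20). This condition is known as a gradient
domination condition, or a Polyak–Łojasiewicz (PL) inequality. It is implied by strong
convexity: starting from

\[
  \mathcal{F}(\nu) \ge \mathcal{F}(\mu) + \langle \nabla_{W_2}\mathcal{F}(\mu), T_{\mu \to \nu} - \mathrm{id} \rangle_\mu + \frac{\alpha}{2} \, W_2^2(\mu, \nu) \, , \tag{1.4.6}
\]

we take $\nu = \mu_\star := \arg\min \mathcal{F}$ so that $\mathcal{F}(\nu) = 0$, yielding

\[
  \mathcal{F}(\mu) \le -\langle \nabla_{W_2}\mathcal{F}(\mu), T_{\mu \to \mu_\star} - \mathrm{id} \rangle_\mu - \frac{\alpha}{2} \, W_2^2(\mu, \mu_\star) \le \frac{1}{2\alpha} \, \|\nabla_{W_2}\mathcal{F}(\mu)\|_\mu^2 \, ,
\]

where the last inequality uses Young’s inequality and $\|T_{\mu \to \mu_\star} - \mathrm{id}\|_\mu = W_2(\mu, \mu_\star)$. Therefore,
strong convexity implies exponentially fast convergence of the gradient flow. If we apply
this to the Langevin diffusion, we deduce that $\alpha$-strong convexity of $V$ implies

\[
  \mathrm{KL}(\mu \,\|\, \pi) \le \frac{1}{2\alpha} \, \Bigl\| \nabla \ln \frac{\mu}{\pi} \Bigr\|_\mu^2 = \frac{1}{2\alpha} \, \mathrm{FI}(\mu \,\|\, \pi) \qquad \text{for all } \mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \, . \tag{1.4.7}
\]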

This is precisely the log-Sobolev inequality (see Example 1.2.25); we have recovered the
Bakry–Émery theorem (Theorem 1.2.28) that CD($\alpha$, $\infty$) implies an LSI, as well as
Theorem 1.2.24 which asserted that an LSI yields exponentially fast decay in the KL divergence.
Next, starting from the strong convexity inequality (1.4.6), we take $\mu = \mu_\star$ so that
$\nabla_{W_2}\mathcal{F}(\mu_\star) = 0$, and hence

\[
  \mathcal{F}(\nu) \ge \frac{\alpha}{2} \, W_2^2(\nu, \mu_\star) \qquad \text{for all } \nu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \, .
\]

This is a quadratic growth inequality; as the name suggests, it asserts that $\mathcal{F}$ must grow
quadratically away from its minimizer. For the Langevin diffusion, it says

\[
  \mathrm{KL}(\mu \,\|\, \pi) \ge \frac{\alpha}{2} \, W_2^2(\mu, \pi) \qquad \text{for all } \mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \, . \tag{1.4.8}
\]
This is known as Talagrand’s T2 inequality and it is an example of a transportation
inequality. Such inequalities have been closely studied in relation to the concentration of
measure phenomenon (see [Han16, Chapter 4]).
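Both (1.4.7) and (1.4.8) can be checked by hand for Gaussians. The sketch below is an illustration under assumptions not made in the text: $d = 1$, $\pi = N(0, 1/\alpha)$, and $\mu = N(m, s^2)$, for which the KL divergence, the relative Fisher information $\mathrm{FI}(\mu \,\|\, \pi) = \mathbb{E}_\mu |\nabla \ln(\mu/\pi)|^2$, and $W_2^2(\mu, \pi)$ all have closed forms. The helper names are hypothetical.

```python
import math

def kl(m, s, alpha):
    """KL(N(m, s^2) || N(0, 1/alpha))."""
    u = alpha * s * s
    return 0.5 * (u + alpha * m * m - 1.0 - math.log(u))

def fisher_information(m, s, alpha):
    """FI(mu || pi) = E_mu |d/dx ln(mu/pi)|^2 for mu = N(m, s^2), pi = N(0, 1/alpha).
    With x = m + s z, the score difference equals alpha m + (alpha s - 1/s) z."""
    return (alpha * m) ** 2 + (alpha * s - 1.0 / s) ** 2

def w2_squared_to_target(m, s, alpha):
    """W2^2 between N(m, s^2) and N(0, 1/alpha) on the real line."""
    return m * m + (s - 1.0 / math.sqrt(alpha)) ** 2

alpha = 1.5
checks = []
for m in (-2.0, 0.0, 0.7):
    for s in (0.2, 1.0, 3.0):
        lsi_ok = kl(m, s, alpha) <= fisher_information(m, s, alpha) / (2 * alpha) + 1e-12
        t2_ok = kl(m, s, alpha) >= 0.5 * alpha * w2_squared_to_target(m, s, alpha) - 1e-12
        checks.append(lsi_ok and t2_ok)
print(all(checks))
```

In this family, the log-Sobolev inequality reduces to $\ln u \ge 1 - 1/u$ with $u = \alpha s^2$, and the T2 inequality reduces to $r - 1 \ge \ln r$ with $r = \sqrt{\alpha} \, s$, so both hold for every choice of $m$ and $s > 0$.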
It is a general fact that the PL inequality implies the quadratic growth inequality.
When applied to the Langevin diffusion, it says that the LSI implies the T2 inequality,
which is known as the Otto–Villani theorem [OV00]. See Exercise 1.16 for a proof.
Strong convexity also implies another fact: the gradient flow contracts exponentially
fast. Namely, if we have two gradient flows $t \mapsto \mu_t$ and $t \mapsto \nu_t$ for a strongly convex
functional $\mathcal{F}$, then

\[
  W_2^2(\mu_t, \nu_t) \le \exp(-2\alpha t) \, W_2^2(\mu_0, \nu_0) \, . \tag{1.4.9}
\]

In particular, if we take $\nu_t = \mu_\star$ for all $t$, then we obtain exponentially fast convergence to
the minimizer in Wasserstein distance. For the Langevin diffusion, inequality (1.4.9) is
implied by the following theorem (see Exercise 1.17).

Theorem 1.4.10. Suppose that $\nabla^2 V \succeq \alpha I_d$ for some $\alpha \in \mathbb{R}$. If $(Z_t)_{t \ge 0}$ and $(Z_t')_{t \ge 0}$
denote two copies of the Langevin diffusion (1.E.1) with potential $V$ and driven by the
same Brownian motion, then

\[
  \mathbb{E}[\|Z_t - Z_t'\|^2] \le \exp(-2\alpha t) \, \mathbb{E}[\|Z_0 - Z_0'\|^2] \, .
\]
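The synchronous coupling in this theorem is easy to simulate. The sketch below makes several assumptions that go beyond the statement: it takes the Langevin diffusion (1.E.1) in the standard form $\mathrm{d}Z_t = -\nabla V(Z_t) \, \mathrm{d}t + \sqrt{2} \, \mathrm{d}B_t$, specializes to $d = 1$ and $V(z) = \alpha z^2 / 2$, and replaces the continuous-time dynamics with an Euler–Maruyama discretization, which is only an approximation. The function name and parameters are hypothetical.

```python
import math
import random

def synchronous_coupling(z0, z0_prime, alpha, step, n_steps, seed=0):
    """Euler-Maruyama for dZ = -alpha Z dt + sqrt(2) dB, i.e. V(z) = alpha z^2 / 2,
    run from two starting points with the SAME Gaussian increments."""
    rng = random.Random(seed)
    z, zp = z0, z0_prime
    for _ in range(n_steps):
        xi = rng.gauss(0.0, 1.0) * math.sqrt(2.0 * step)   # shared noise
        z = z - step * alpha * z + xi
        zp = zp - step * alpha * zp + xi
    return z, zp

alpha, step, n_steps = 2.0, 1e-3, 1000          # total time t = 1
z_end, zp_end = synchronous_coupling(3.0, 0.0, alpha, step, n_steps)
lhs = (z_end - zp_end) ** 2
rhs = math.exp(-2 * alpha * step * n_steps) * (3.0 - 0.0) ** 2
print(lhs, rhs)   # lhs <= rhs
```

Because both chains receive the same noise, it cancels in the difference, which evolves deterministically as $(1 - h\alpha)^k (Z_0 - Z_0')$ with step size $h$; since $1 - h\alpha \le e^{-h\alpha}$, the discrete chain inherits the contraction in this quadratic example.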

Finally, in the case $\alpha = 0$, so that $\mathcal{F}$ is weakly convex, we can also obtain a convergence
result by considering the Lyapunov functional $\mathcal{L}_t := t \, \mathcal{F}(\mu_t) + \frac{1}{2} \, W_2^2(\mu_t, \mu_\star)$. In order to
differentiate this Lyapunov functional, we need the following theorem.

Theorem 1.4.11. For 𝜈 ∈ P2,ac (R𝑑 ), the Wasserstein gradient of 𝜇 ↦→ 𝑊22 (𝜇, 𝜈) at 𝜇 is
given by −2 (𝑇𝜇→𝜈 − id).

Proof. See [Vil09, Theorem 23.9]. □

In general, on a Riemannian manifold, the gradient of $\mathsf{d}(\cdot, q)^2$ at $p$ is $-2 \log_p(q)$. Check
that this formula makes sense on Euclidean space $\mathbb{R}^d$.
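The suggested check on Euclidean space can also be carried out numerically. The following sketch (with hypothetical helper names and arbitrary test points) compares a central-difference gradient of $p \mapsto \|p - q\|^2$ with $-2 \log_p(q) = -2 \, (q - p)$.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as lists."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def numerical_gradient(f, p, h=1e-6):
    """Central-difference gradient of f at the point p."""
    grad = []
    for i in range(len(p)):
        up = list(p)
        up[i] += h
        dn = list(p)
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

p, q = [1.0, -2.0, 0.5], [0.0, 3.0, 2.0]
grad = numerical_gradient(lambda x: squared_distance(x, q), p)
log_p_q = [b - a for a, b in zip(p, q)]          # log_p(q) = q - p on R^d
expected = [-2.0 * v for v in log_p_q]
print(grad, expected)  # the two lists agree up to rounding
```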
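Differentiating in time and applying (1.4.6) with $\alpha = 0$ together with Theorem 1.4.11,

\[
  \partial_t \mathcal{L}_t = \mathcal{F}(\mu_t) - t \, \|\nabla_{W_2}\mathcal{F}(\mu_t)\|_{\mu_t}^2 + \underbrace{\langle \nabla_{W_2}\mathcal{F}(\mu_t), T_{\mu_t \to \mu_\star} - \mathrm{id} \rangle_{\mu_t}}_{\le -\mathcal{F}(\mu_t)} \le 0 \, .
\]

Hence, $\mathcal{L}_t \le \mathcal{L}_0$, which implies

\[
  \mathcal{F}(\mu_t) \le \frac{1}{2t} \, W_2^2(\mu_0, \mu_\star) \, . \tag{1.4.12}
\]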

1.5 Overview of the Convergence Results


1.5.1 Convergence Results
The main convergence results we have developed can be summarized as follows.

• $\mathrm{KL}(\cdot \,\|\, \pi)$ is $\alpha$-strongly convex along $W_2$ geodesics if and only if $V$ is $\alpha$-strongly convex,
if and only if: for all $\mu_0, \nu_0 \in \mathcal{P}_2(\mathbb{R}^d)$, if $(\mu_t)_{t \ge 0}$, $(\nu_t)_{t \ge 0}$ are the laws of Langevin diffusions
started at $\mu_0$ and $\nu_0$ respectively, then $W_2^2(\mu_t, \nu_t) \le \exp(-2\alpha t) \, W_2^2(\mu_0, \nu_0)$.

• The target $\pi$ satisfies the log-Sobolev inequality (LSI) with constant $1/\alpha$ if and only
if for all $\pi_0 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, along the Langevin dynamics $t \mapsto \pi_t$ started at $\pi_0$ it holds
that $\mathrm{KL}(\pi_t \,\|\, \pi) \le \exp(-2\alpha t) \, \mathrm{KL}(\pi_0 \,\|\, \pi)$. The LSI is a gradient domination condition
in Wasserstein space.

• The target $\pi$ satisfies the Poincaré inequality (PI) with constant $1/\alpha$ if and only if
for all $\pi_0 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, along the Langevin dynamics $t \mapsto \pi_t$ started at $\pi_0$ it holds
that $\chi^2(\pi_t \,\|\, \pi) \le \exp(-2\alpha t) \, \chi^2(\pi_0 \,\|\, \pi)$. The Poincaré inequality is a spectral gap
condition for the generator of the Langevin diffusion.

The conditions are listed from strongest to weakest: $\alpha$-strong log-concavity implies
$\alpha^{-1}$-LSI, which implies $\alpha^{-1}$-Poincaré. In addition:

• If the target $\pi$ is log-concave, then along the Langevin dynamics $t \mapsto \pi_t$ it holds
that $\mathrm{KL}(\pi_t \,\|\, \pi) \le \frac{1}{2t} \, W_2^2(\pi_0, \pi)$.
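These rates can be observed in a case where the law of the diffusion is known exactly. The sketch below is an illustration under assumptions outside the text: $d = 1$, $V(z) = \alpha z^2 / 2$, the Langevin diffusion in the standard form $\mathrm{d}Z_t = -\alpha Z_t \, \mathrm{d}t + \sqrt{2} \, \mathrm{d}B_t$ (an Ornstein–Uhlenbeck process), and a Gaussian initialization, so that $\pi_t$ stays Gaussian with explicit mean and variance. It compares $\mathrm{KL}(\pi_t \,\|\, \pi)$ with the exponential bound and with the $\frac{1}{2t} W_2^2$ bound.

```python
import math

def kl_to_target(m, s2, alpha):
    """KL(N(m, s2) || N(0, 1/alpha)), with s2 the variance."""
    u = alpha * s2
    return 0.5 * (u + alpha * m * m - 1.0 - math.log(u))

def langevin_law(m0, s2_0, alpha, t):
    """Mean and variance at time t of dZ = -alpha Z dt + sqrt(2) dB, Z_0 ~ N(m0, s2_0)."""
    decay = math.exp(-alpha * t)
    return m0 * decay, 1.0 / alpha + (s2_0 - 1.0 / alpha) * decay ** 2

alpha, m0, s2_0 = 1.0, 3.0, 0.25
kl0 = kl_to_target(m0, s2_0, alpha)
w2sq0 = m0 ** 2 + (math.sqrt(s2_0) - 1.0 / math.sqrt(alpha)) ** 2
rows = []
for t in (0.1, 0.5, 1.0, 2.0, 4.0):
    m, s2 = langevin_law(m0, s2_0, alpha, t)
    kl_t = kl_to_target(m, s2, alpha)
    # (time, KL at time t, exponential bound, bound for log-concave targets)
    rows.append((t, kl_t, math.exp(-2 * alpha * t) * kl0, w2sq0 / (2 * t)))
for row in rows:
    print(row)
```

For small $t$ the exponential bound is the informative one, while the $\frac{1}{2t} W_2^2$ bound requires no strong convexity and is correspondingly looser in this strongly log-concave example.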

When we turn towards discretization analysis, there are two main ways in which the
continuous-time result affects the analysis: the strength of the continuous-time result,
and the metric in which we must perform the analysis.
Regarding the first point, the first two results are generally more useful because at
initialization, we typically have $W_2^2(\pi_0, \pi), \, \mathrm{KL}(\pi_0 \,\|\, \pi) = O(d)$ (at least when $\pi$ is strongly
log-concave). Hence, exponential convergence in $W_2^2$ and KL both imply that the amount
of time it takes for the Langevin diffusion to reach $\varepsilon$ error is $O(\log(d/\varepsilon))$. In contrast, the
chi-squared divergence is typically much larger at initialization: $\chi^2(\pi_0 \,\|\, \pi) = \exp(O(d))$.
Therefore, the chi-squared result implies that the Langevin diffusion takes $O(d \vee \log(1/\varepsilon))$
time to reach $\varepsilon$ error.
Regarding the second point, the $W_2$ contraction under strong log-concavity is the
easiest to turn into a sampling guarantee for the discretized algorithm. This is because
to bound the $W_2$ distance, we can use straightforward coupling techniques. On the other
hand, a continuous-time result in KL or $\chi^2$ often requires the discretization analysis to
also be carried out in KL or $\chi^2$, which is substantially trickier.

1.5.2 Appendix: Divergences between Probability Measures


As we have already seen, the analysis of Langevin introduces many different notions of
divergences between probability measures. Therefore, it is important to develop a healthy
understanding of the relationships between these divergences.
First of all, there is a distinction between the Wasserstein metric, which is a transport
distance (measuring how far we must move the mass of one measure to the other), and
information divergences which are defined directly in terms of the densities such as the
KL divergence and the chi-squared divergence. Note that the latter two divergences are
infinite unless the first argument is absolutely continuous w.r.t. the second, which is
certainly not the case for the Wasserstein metric.
We introduce another important metric.

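Definition 1.5.1. The total variation (TV) distance between probability measures
$\mu, \nu \in \mathcal{P}(\mathcal{X})$ is defined via

\begin{align*}
  \|\mu - \nu\|_{\mathrm{TV}} &:= \sup_{A \subseteq \mathcal{X}} |\mu(A) - \nu(A)| = \sup_{f : \mathcal{X} \to [0,1]} \Bigl| \int f \, \mathrm{d}\mu - \int f \, \mathrm{d}\nu \Bigr| \\
  &= \inf_{\gamma \in \mathcal{C}(\mu, \nu)} \int \mathbb{1}\{x \ne y\} \, \mathrm{d}\gamma(x, y) = \frac{1}{2} \int \Bigl| \frac{\mathrm{d}\mu}{\mathrm{d}\lambda} - \frac{\mathrm{d}\nu}{\mathrm{d}\lambda} \Bigr| \, \mathrm{d}\lambda \, ,
\end{align*}

where $\lambda$ is a common dominating measure for $\mu$ and $\nu$.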

The TV metric is indeed a metric on the space P (X) (in fact, it can be extended to a
norm on the space M (X) of signed measures). The TV distance can be thought of as both
a transport metric (with cost (𝑥, 𝑦) ↦→ 1{𝑥 ≠ 𝑦}; in fact, the TV distance is a special case
of the 𝑊1 metric introduced in Exercise 1.11) and an information divergence.
The family of information divergences can be further expanded by introducing the
following definition.

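Definition 1.5.2. Let $\mu, \nu \in \mathcal{P}(\mathcal{X})$, and let $f : \mathbb{R}_+ \to \mathbb{R}$ be a convex function with
$f(1) = 0$. Then, the $f$-divergence of $\mu$ from $\nu$ is

\[
  D_f(\mu \,\|\, \nu) := \int f\Bigl(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\Bigr) \, \mathrm{d}\nu \, , \qquad \text{if } \mu \ll \nu \, .
\]

In general, if $\mu \not\ll \nu$, we let $p_\mu$, $p_\nu$ denote the respective densities of $\mu$ and $\nu$ w.r.t. a
common dominating measure. Then,

\[
  D_f(\mu \,\|\, \nu) := \int_{p_\nu > 0} f\Bigl(\frac{p_\mu}{p_\nu}\Bigr) \, \mathrm{d}\nu + f'(\infty) \, \mu\{p_\nu = 0\} \, .
\]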

For example, the TV distance corresponds to $f(x) = \frac{1}{2} \, |x - 1|$, the KL divergence
corresponds to $f(x) = x \ln x$, and the $\chi^2$ divergence corresponds to $f(x) = (x - 1)^2$. When $f$
has superlinear growth, then $f'(\infty) = \infty$ and hence $D_f(\mu \,\|\, \nu) = \infty$ unless $\mu \ll \nu$, but the
second, more general definition given above is necessary to recover the TV distance.
We always have $\chi^2 \ge \ln(1 + \chi^2) \ge \mathrm{KL} \ge 2 \, \|\cdot\|_{\mathrm{TV}}^2$ (the last inequality is Pinsker’s
inequality, see Exercise 2.10), and under a T2 transport inequality with constant $\alpha^{-1}$
(which is implied by $\alpha^{-1}$-LSI) we have $\mathrm{KL} \ge \frac{\alpha}{2} \, W_2^2$. This chain of inequalities helps to
explain why, if the KL divergence is of order $d$, then the $\chi^2$ divergence is of order $\exp d$.
In Section 2.2.4, we will also introduce the closely related family of divergences known
as Rényi divergences.
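On a finite space the definition and the chain of inequalities are easy to evaluate directly. The sketch below uses two arbitrary distributions on three points (hypothetical numbers, with $\nu$ strictly positive so that only the first formula of Definition 1.5.2 is needed) and computes the TV distance, the KL divergence, and the $\chi^2$ divergence as $f$-divergences.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) for finite distributions with q > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def tv_f(x):
    return 0.5 * abs(x - 1.0)

def kl_f(x):
    return x * math.log(x) if x > 0 else 0.0

def chi2_f(x):
    return (x - 1.0) ** 2

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
tv = f_divergence(p, q, tv_f)
kl = f_divergence(p, q, kl_f)
chi2 = f_divergence(p, q, chi2_f)
print(tv, kl, chi2)
print(chi2 >= math.log(1 + chi2) >= kl >= 2 * tv ** 2)
```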
We conclude by stating a few key facts (without complete proofs) about 𝑓 -divergences.
The first is the data-processing inequality.
Theorem 1.5.3 (data-processing inequality). Suppose that $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and that $P$ is
any Markov kernel. Then, for any $f$-divergence, it holds that

\[
  D_f(\mu P \,\|\, \nu P) \le D_f(\mu \,\|\, \nu) \, .
\]

Equivalently, $D_f(\cdot \,\|\, \cdot)$ is jointly convex in its two arguments.

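Proof sketch. To simplify, we will abuse notation and identify all probability measures
with densities. Then, by Jensen’s inequality,

\begin{align*}
  D_f(\mu \,\|\, \nu)
  &= \iint f\Bigl(\frac{\mu(x) \, P(x, y)}{\nu(x) \, P(x, y)}\Bigr) \, \nu(\mathrm{d}x) \, P(x, \mathrm{d}y) \\
  &= \int \Bigl[ \int f\Bigl(\frac{\mu(x) \, P(x, y)}{\nu(x) \, P(x, y)}\Bigr) \, \frac{\nu(\mathrm{d}x) \, P(x, y)}{\nu P(y)} \Bigr] \, \nu P(\mathrm{d}y) \\
  &\ge \int f\Bigl( \int \frac{\mu(x) \, P(x, y)}{\nu(x) \, P(x, y)} \, \frac{\nu(\mathrm{d}x) \, P(x, y)}{\nu P(y)} \Bigr) \, \nu P(\mathrm{d}y)
  = \int f\Bigl(\frac{\mu P(y)}{\nu P(y)}\Bigr) \, \nu P(\mathrm{d}y) \, . \qquad \square
\end{align*}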
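The inequality can be observed on a finite state space, where a Markov kernel is a row-stochastic matrix. The sketch below uses arbitrary illustrative numbers (two distributions on three points and a kernel into a two-point space) and checks the data-processing inequality for the KL divergence and the TV distance.

```python
import math

def push_forward(mu, kernel):
    """(mu P)(y) = sum_x mu(x) P(x, y) for a row-stochastic matrix P."""
    n_out = len(kernel[0])
    return [sum(mu[x] * kernel[x][y] for x in range(len(mu))) for y in range(n_out)]

def kl(p, q):
    """KL divergence between finite distributions, assuming q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance between finite distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

mu = [0.7, 0.2, 0.1]
nu = [0.2, 0.3, 0.5]
P = [[0.9, 0.1],
     [0.5, 0.5],
     [0.2, 0.8]]
mu_P, nu_P = push_forward(mu, P), push_forward(nu, P)
print(kl(mu_P, nu_P), kl(mu, nu))   # first value is no larger than the second
print(tv(mu_P, nu_P), tv(mu, nu))   # likewise
```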

The remaining facts are specific to the KL divergence. The Donsker–Varadhan theorem
expresses the KL divergence via a variational principle.

Theorem 1.5.4 (Donsker–Varadhan variational principle). Suppose that $\mu, \nu \in \mathcal{P}(\mathcal{X})$,
where $\mathcal{X}$ is a Polish space. Then,

\[
  \mathrm{KL}(\mu \,\|\, \nu) = \sup\bigl\{ \mathbb{E}_\mu \, g - \ln \mathbb{E}_\nu \exp g \bigm| g : \mathcal{X} \to \mathbb{R} \text{ is bounded and measurable} \bigr\} \, .
\]

The theorem asserts that the functionals $\mu \mapsto \mathrm{KL}(\mu \,\|\, \nu)$ and $g \mapsto \ln \mathbb{E}_\nu \exp g$ are convex
conjugates of each other. See [DZ10, Lemma 6.2.13] or [RS15, Theorem 5.4] for careful
proofs, or see the remark after Lemma 2.3.4.
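On a finite space with $\mu \ll \nu$ and full support, the supremum is attained at $g = \ln(\mathrm{d}\mu/\mathrm{d}\nu)$, which makes the variational principle easy to probe numerically. The sketch below uses hypothetical distributions and compares the objective at this maximizer with its value at randomly drawn bounded functions $g$.

```python
import math
import random

def kl(p, q):
    """KL divergence between finite distributions, assuming q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_objective(g, mu, nu):
    """E_mu[g] - ln E_nu[exp g] on a finite space."""
    mean_g = sum(m * gi for m, gi in zip(mu, g))
    log_mgf = math.log(sum(n * math.exp(gi) for n, gi in zip(nu, g)))
    return mean_g - log_mgf

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.5, 0.3]
g_star = [math.log(m / n) for m, n in zip(mu, nu)]   # log density ratio
rng = random.Random(1)
random_values = [dv_objective([rng.uniform(-3, 3) for _ in mu], mu, nu)
                 for _ in range(1000)]
best_random = max(random_values)
print(kl(mu, nu), dv_objective(g_star, mu, nu), best_random)
```

The objective is invariant under adding a constant to $g$, so the maximizer is only unique up to additive constants.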
Lastly, we have the chain rule for the KL divergence.

Lemma 1.5.5 (chain rule for KL divergence). Let $\mathcal{X}_1$, $\mathcal{X}_2$ be Polish spaces and suppose
we are given two probability measures $\mu, \nu \in \mathcal{P}(\mathcal{X}_1 \times \mathcal{X}_2)$ with $\mu \ll \nu$. Let $\mu_1$ be the $\mathcal{X}_1$
marginal of $\mu$, and let $\mu_{2|1}(\cdot \mid \cdot)$ be the conditional distribution for $\mu$ on $\mathcal{X}_2$ conditioned
on $\mathcal{X}_1$; likewise define $\nu_1$ and $\nu_{2|1}$. Then, it holds that

\[
  \mathrm{KL}(\mu \,\|\, \nu) = \mathrm{KL}(\mu_1 \,\|\, \nu_1) + \int \mathrm{KL}\bigl(\mu_{2|1}(\cdot \mid x_1) \bigm\| \nu_{2|1}(\cdot \mid x_1)\bigr) \, \mu_1(\mathrm{d}x_1) \, .
\]

We invite the reader to prove the chain rule in the discrete case (X1 and X2 are finite
sets), free of measure-theoretic guilt.
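Before attempting the proof, it can help to see the identity hold numerically in the discrete case. The sketch below uses two hypothetical joint distributions on a $2 \times 3$ product space, stored as nested lists whose rows index $x_1$, and compares both sides of the chain rule.

```python
import math

def kl(p, q):
    """KL divergence between finite distributions, assuming q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint laws on X1 x X2 with |X1| = 2 and |X2| = 3 (illustrative numbers).
mu = [[0.10, 0.20, 0.10], [0.25, 0.05, 0.30]]
nu = [[0.20, 0.10, 0.10], [0.10, 0.30, 0.20]]

def marginal(joint):
    """X1-marginal of a joint distribution."""
    return [sum(row) for row in joint]

def conditional(joint, x1):
    """Conditional distribution on X2 given the value x1."""
    total = sum(joint[x1])
    return [v / total for v in joint[x1]]

joint_kl = kl([v for row in mu for v in row], [v for row in nu for v in row])
mu1, nu1 = marginal(mu), marginal(nu)
conditional_part = sum(mu1[x1] * kl(conditional(mu, x1), conditional(nu, x1))
                       for x1 in range(len(mu)))
chain_rule_rhs = kl(mu1, nu1) + conditional_part
print(joint_kl, chain_rule_rhs)  # equal up to rounding
```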

Bibliographical Notes
Much of the material in this chapter is foundational, with entire textbooks giving com-
prehensive treatments of the topics. For stochastic calculus, there is of course a long list
of textbooks, but as a starting place we suggest [Ste01; Le 16]. For Markov semigroup
theory, see [BGL14; Han16]. For optimal transport, the core theory is developed in [Vil03;
San15], and for a rigorous development of Otto calculus see [AGS08; Vil09].
The notion of solution used in Section 1.1.3 is more typically called a strong solution to
the SDE, because given any Brownian motion 𝐵 we can find a process 𝑋 which is driven
by 𝐵 and which satisfies the SDE. There is also a notion of weak solution, in which we are
allowed to construct the probability space (Ω, ℱ, (ℱ𝑡 )𝑡 ≥0, P) and the Brownian motion 𝐵
together with the solution 𝑋 . We will not worry about the distinction in this book, since
strong solutions suffice for our purposes.
See [San15, §1.6.3] for an elegant proof of strong duality for the optimal transport
problem via convex duality.
The perspective of the Langevin diffusion as a Wasserstein gradient flow was intro-
duced in [JKO98]; the application of Otto calculus to functional inequalities was given
in [OV00]; and the calculation rules for Otto calculus were set out in [Ott01]. These three
papers are seminal and are worth reading carefully. An alternative (but related) approach
to functional inequalities via optimal transport is given in [Cor02]. The formal proof of
the Otto–Villani theorem in Exercise 1.16 was made rigorous via entropic interpolations
in [Gen+20]; see [BB18] for a generalization.
The Efron–Stein inequality in Exercise 1.2 is just one example of the use of martingales
to derive concentration inequalities; see [BLM13; Han16] for more on this topic. We will
also revisit the martingale method in the next chapter; see Exercise 2.12.
The upper bound (1.E.2) in Exercise 1.10 is surprisingly sharp: it holds that

\[
  \frac{1}{2} \, \|\Sigma_0^{1/2} - \Sigma_1^{1/2}\|_{\mathrm{HS}}^2 \le W_2^2(\mu_0, \mu_1) - \|m_0 - m_1\|^2 \le \|\Sigma_0^{1/2} - \Sigma_1^{1/2}\|_{\mathrm{HS}}^2 \, ,
\]

see [CV21, Lemma 3.5].

The proof of the dynamical formulation of dual optimal transport in Exercise 1.12 is
carried out rigorously in [Vil03, §8.1]. We mention that the Hamilton–Jacobi equation
has a close connection with classical mechanics; in particular, the characteristics of the
Hamilton–Jacobi equation are precisely Hamilton’s equations of motion [Eva10, §3.3].
In the context of optimal transport, the Hamiltonian consists only of kinetic energy (no
potential energy) and hence the characteristics are straight lines traversed at constant
speed; this is of course consistent with the description of Wasserstein geodesics. The
Hamilton–Jacobi equation, the Hopf–Lax semigroup, and their connection with optimal
transport can also be generalized to other costs; see [Vil03, §5.4].

Exercises
A Primer on Stochastic Calculus

Exercise 1.1 (Doob's $L^p$ maximal inequality)

Prove Doob's $L^p$ maximal inequality (Corollary 1.1.30) for discrete-time submartingales.
Hint: Start with the inequality $\lambda\, \mathbb{P}(M_N^* \ge \lambda) \le \mathbb{E}[M_N \mathbb{1}\{M_N^* \ge \lambda\}]$ from the proof
of Theorem 1.1.29, where $M_N^* := \max_{k=0,1,\dots,N} M_k$. Compute the $L^p(\mathbb{P})$ norm of $M_N^*$ by
integrating the tails, and apply the above inequality together with Hölder's inequality.

Exercise 1.2 (orthogonality of martingale increments)

Let $(M_n)_{n\in\mathbb{N}}$ be a discrete-time martingale which is adapted to a filtration $(\mathcal{F}_n)_{n\in\mathbb{N}}$ and
satisfies $\mathbb{E}[M_n^2] < \infty$ for all $n \in \mathbb{N}$. Let $\Delta_n := M_{n+1} - M_n$ denote the martingale increment.

1. Prove that for $m, n \in \mathbb{N}$ with $m \ne n$, $\mathbb{E}[\Delta_m \Delta_n] = 0$: the martingale increments are
orthogonal. In particular, if $M_0 = 0$, then $\mathbb{E}[M_n^2] = \sum_{k=0}^{n-1} \mathbb{E}[\Delta_k^2]$.

2. Let $(X_i)_{i=1}^n$ be independent random variables taking values in some space $\mathcal{X}$, and
suppose that the function $f : \mathcal{X}^n \to \mathbb{R}$ is bounded and measurable. Check that if
$M_k := \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_k]$, then the Doob martingale $(M_k)_{k=1}^n$ is indeed
a martingale. Then, using the previous part, prove the following tensorization
property of the variance:
\[
\operatorname{var} f(X_1, \dots, X_n) \le \mathbb{E} \sum_{k=1}^n \operatorname{var}\bigl(f(X_1, \dots, X_n) \bigm| X_{-k}\bigr) \,,
\]
where $X_{-k} := (X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_n)$.



3. Define the discrete derivative
\[
D_k f(x) := \sup_{x_k' \in \mathcal{X}} f(x_1, \dots, x_{k-1}, x_k', x_{k+1}, \dots, x_n) - \inf_{x_k' \in \mathcal{X}} f(x_1, \dots, x_{k-1}, x_k', x_{k+1}, \dots, x_n) \,.
\]
Prove the inequality
\[
\operatorname{var} f(X_1, \dots, X_n) \le \frac{1}{4}\, \mathbb{E} \sum_{k=1}^n \{D_k f(X_1, \dots, X_n)\}^2 \,.
\]
This inequality, known as the Efron–Stein inequality, expresses the fact that
a function $f$ of independent random variables which is not too sensitive to any
individual coordinate has controlled variance. This is a concentration inequality
which has useful consequences in many probabilistic settings; see, e.g., [BLM13].
Hint: First prove that a random variable which takes values in $[a, b]$ has variance
bounded by $\frac{1}{4}(b - a)^2$.
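
The inequality of part 3 can be sanity-checked by exact enumeration in a toy setting. The sketch below assumes $X_1, \dots, X_n$ i.i.d. uniform on $\{0, 1\}$; the function names and the test function `longest_run` are hypothetical choices made for illustration, not part of the exercise.

```python
from itertools import product

def efron_stein_check(f, n):
    """Return (var f(X), (1/4) E sum_k (D_k f(X))^2) for X uniform on {0,1}^n."""
    points = list(product((0, 1), repeat=n))
    weight = 1.0 / len(points)
    mean = sum(f(x) for x in points) * weight
    variance = sum((f(x) - mean) ** 2 for x in points) * weight
    bound = 0.0
    for x in points:
        for k in range(n):
            # Values of f as the k-th coordinate ranges over {0, 1}.
            values = [f(x[:k] + (b,) + x[k + 1:]) for b in (0, 1)]
            bound += weight * (max(values) - min(values)) ** 2
    return variance, bound / 4.0

def longest_run(x):
    """Length of the longest run of ones in the binary tuple x."""
    best = current = 0
    for bit in x:
        current = current + 1 if bit else 0
        best = max(best, current)
    return best

variance, bound = efron_stein_check(longest_run, 8)
print(variance, bound)
```

For `f = sum` every discrete derivative equals 1 and both sides equal $n/4$, so the constant $\frac{1}{4}$ cannot be improved in this setting.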

Exercise 1.3 ($L^2$ bounded martingale convergence theorem)

Let $(M_n)_{n\in\mathbb{N}}$ be a discrete-time martingale which is adapted to a filtration $(\mathcal{F}_n)_{n\in\mathbb{N}}$. Assume
that the martingale is bounded in $L^2(\mathbb{P})$: $\sup_{n\in\mathbb{N}} \mathbb{E}[M_n^2] \le B^2 < \infty$. Prove that the
martingale converges a.s. and in $L^2(\mathbb{P})$ to a limit $M_\infty$ with $\mathbb{E}[M_\infty^2] \le B^2$.

Hint: For the a.s. convergence, it suffices to show that for all $\varepsilon > 0$, the event
\[
\Bigl\{ \lim_{m\to\infty} \sup_{n\in\mathbb{N},\, n\ge m} |M_m - M_n| \ge \varepsilon \Bigr\}
\]
has probability zero. To do so, use Doob's maximal inequality (Theorem 1.1.29) and
orthogonality of martingale increments (Exercise 1.2).

Exercise 1.4 (explosion of ODEs)

Solve the following ODE on $\mathbb{R}$: $\dot{x}_t = b(x_t)$ with initial condition $x_0 \in \mathbb{R}$, where $b(x) = |x|^\alpha$,
$\alpha > 0$. Show that

1. when $0 < \alpha < 1$, there are multiple solutions to the ODE (with initial condition
$x_0 = 0$) so that uniqueness fails;

2. when $\alpha = 1$ (and hence $b$ is globally Lipschitz), there is a unique solution to the
ODE which is finite for all time;

3. when $\alpha > 1$, the solution to the ODE blows up in finite time.
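
The three regimes can be explored numerically. The sketch below uses candidate closed forms obtained by separation of variables (stated here as assumptions, for $x_0 > 0$ when $\alpha > 1$ and for $x_0 = 0$ when $0 < \alpha < 1$) and checks them against the ODE by finite differences; all function names are hypothetical.

```python
def solution_superlinear(t, x0, alpha):
    """Candidate solution of dx/dt = |x|^alpha for x0 > 0, alpha > 1.

    Only valid for t < blow_up_time(x0, alpha).
    """
    return (x0 ** (1.0 - alpha) - (alpha - 1.0) * t) ** (-1.0 / (alpha - 1.0))

def blow_up_time(x0, alpha):
    """Time at which the candidate solution above becomes infinite."""
    return x0 ** (1.0 - alpha) / (alpha - 1.0)

def nonzero_solution_sublinear(t, alpha):
    """A nonzero solution started from x0 = 0 when 0 < alpha < 1.

    The constant function 0 is another solution, so uniqueness fails.
    """
    return ((1.0 - alpha) * t) ** (1.0 / (1.0 - alpha))

def ode_residual(x, t, alpha, h=1e-6):
    """Central-difference estimate of dx/dt - |x|^alpha at time t."""
    derivative = (x(t + h) - x(t - h)) / (2.0 * h)
    return derivative - abs(x(t)) ** alpha

print(blow_up_time(1.0, 2.0))
print(ode_residual(lambda t: solution_superlinear(t, 1.0, 2.0), 0.5, 2.0))
print(ode_residual(lambda t: nonzero_solution_sublinear(t, 0.5), 1.0, 0.5))
```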

Exercise 1.5 (Ornstein–Uhlenbeck process)

One of the most important diffusions that we will encounter is the Ornstein–Uhlenbeck
(OU) process, which solves the SDE
\[
dX_t = -X_t \, dt + \sqrt{2} \, dB_t \,.
\]
Give an explicit expression for $X_t$ in terms of $X_0$ and an Itô integral involving $(B_t)_{t\ge 0}$.
From this expression, can you read off the stationary distribution of this process?
Hint: Apply Itô's formula to $f(t, X_t) = X_t \exp t$. To find the stationary distribution,
justify the following fact: if $(\eta_t)_{t\ge 0}$ is a deterministic function, then $\int_0^T \eta_t \, dB_t$ is a Gaussian
with mean zero and variance $\int_0^T \eta_t^2 \, dt$.

Markov Semigroup Theory

Exercise 1.6 (basic properties of the Markov semigroup)

Let $(P_t)_{t\ge 0}$ be a Markov semigroup with carré du champ $\Gamma$.

1. Prove that if $\phi : \mathbb{R} \to \mathbb{R}$ is convex, then $P_t \phi(f) \ge \phi(P_t f)$ and $\mathcal{L}\phi(f) \ge \phi'(f)\, \mathcal{L} f$
whenever the expressions are well-defined.

2. If $(X_t)_{t\ge 0}$ denotes the Markov process associated with the semigroup and $f$ is smooth,
then the process $t \mapsto f(X_t) - \int_0^t \mathcal{L} f(X_s) \, ds$ is a continuous local martingale. In
particular, $(f(X_t))_{t\ge 0}$ is a continuous local martingale if and only if $\mathcal{L} f = 0$.

3. Prove that for $f, g \in L^2(\pi)$ in the domain of the carré du champ, we have the
Cauchy–Schwarz inequality $\Gamma(f, g) \le \sqrt{\Gamma(f, f)\, \Gamma(g, g)}$.

Hint: For $\lambda \in \mathbb{R}$, consider $0 \le \Gamma(f + \lambda g, f + \lambda g)$ and use bilinearity.

Exercise 1.7 (functional inequalities and exponential decay)


Prove the equivalence between the Poincaré inequality and exponential decay of variance
(Theorem 1.2.19), and the equivalence between the log-Sobolev inequality and exponential
decay of entropy (Theorem 1.2.20).

Exercise 1.8 (log-Sobolev implies Poincaré)

Linearize the log-Sobolev inequality to obtain the Poincaré inequality.
Hint: Argue that if $f \in C_c^\infty(\mathbb{R}^d)$ satisfies $\int f \, d\pi = 0$, then
\[
\mathrm{KL}\bigl((1 + \varepsilon f)\, \pi \bigm\| \pi\bigr) = \frac{\varepsilon^2}{2} \int f^2 \, d\pi + o(\varepsilon^2) \,. \tag{1.E.1}
\]
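
The expansion (1.E.1) can be observed numerically on a finite space. The sketch below is only an illustration under simplifying assumptions: $\pi$ is taken to be uniform on a handful of atoms (rather than a measure on $\mathbb{R}^d$), and the helper names are hypothetical.

```python
import math

def kl_of_perturbation(f_values, eps):
    """KL((1 + eps f) pi || pi) for pi uniform on len(f_values) atoms.

    Assumes sum(f_values) == 0 and 1 + eps * f > 0 on every atom.
    """
    n = len(f_values)
    return sum((1.0 + eps * f) * math.log1p(eps * f) for f in f_values) / n

def quadratic_prediction(f_values, eps):
    """Leading term (eps^2 / 2) * integral of f^2 d(pi) from (1.E.1)."""
    n = len(f_values)
    return 0.5 * eps ** 2 * sum(f * f for f in f_values) / n

f_values = [1.0, -2.0, 0.5, 0.5]  # mean zero under the uniform measure
for eps in (1e-1, 1e-2, 1e-3):
    ratio = kl_of_perturbation(f_values, eps) / quadratic_prediction(f_values, eps)
    print(eps, ratio)  # the ratio should approach 1 as eps decreases
```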

Exercise 1.9 (mixing of the Ornstein–Uhlenbeck process)

Consider the Ornstein–Uhlenbeck process $(X_t)_{t\ge 0}$ introduced in Exercise 1.5. Note that
this is just an instance of the Langevin diffusion with potential $V(x) = \frac{\|x\|^2}{2}$.

1. Using the explicit solution of the OU process, show that the semigroup has the
explicit expression
\[
P_t f(x) = \mathbb{E} f\bigl(\exp(-t)\, x + \sqrt{1 - \exp(-2t)}\; \xi\bigr) \,, \qquad \xi \sim \mathrm{normal}(0, 1) \,.
\]
Using this expression for the semigroup, compute the generator by hand and check
that it agrees with the general formula obtained in Example 1.2.4.

2. Show that for the OU process, $\nabla P_t f = \exp(-t)\, P_t \nabla f$. Next, by differentiating the
Dirichlet energy $t \mapsto \mathcal{E}(P_t f, P_t f)$, show that $\mathcal{E}(P_t f, P_t f) \le \exp(-2t)\, \mathcal{E}(f, f)$. Explain
why this implies a Poincaré inequality for the standard Gaussian distribution.
Hint: For a general Markov semigroup, show that $\operatorname{var}_\pi f = 2 \int_0^\infty \mathcal{E}(P_t f, P_t f) \, dt$ by
differentiating $t \mapsto \operatorname{var}_\pi(P_t f)$.

In Chapter 2, we will generalize these calculations to prove Theorem 1.2.28.
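
The explicit semigroup formula of part 1 lends itself to a quick numerical experiment in one dimension. The sketch below evaluates the Gaussian expectation by the trapezoidal rule and estimates the generator by a finite difference in time; the quadrature parameters and function names are hypothetical choices, and for the Langevin diffusion with $V(x) = x^2/2$ the estimate should be close to $f''(x) - x f'(x)$.

```python
import math

def ou_semigroup(f, x, t, num_nodes=4001, cutoff=10.0):
    """Approximate P_t f(x) = E f(e^{-t} x + sqrt(1 - e^{-2t}) xi), xi ~ N(0, 1).

    The expectation is computed by the trapezoidal rule on [-cutoff, cutoff].
    """
    scale = math.sqrt(1.0 - math.exp(-2.0 * t))
    shift = math.exp(-t) * x
    step = 2.0 * cutoff / (num_nodes - 1)
    total = 0.0
    for i in range(num_nodes):
        xi = -cutoff + i * step
        weight = step * (0.5 if i in (0, num_nodes - 1) else 1.0)
        density = math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi)
        total += weight * density * f(shift + scale * xi)
    return total

def generator_estimate(f, x, h=1e-4):
    """Finite-difference estimate of the generator, (P_h f - f)(x) / h."""
    return (ou_semigroup(f, x, h) - f(x)) / h

square = lambda x: x * x
print(ou_semigroup(square, 1.5, 0.7))
print(generator_estimate(square, 1.5))  # compare with f'' - x f' = 2 - 2 x^2
```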

The Geometry of Optimal Transport

Exercise 1.10 (optimal transport between Gaussians)

Let $\mu_0 := \mathrm{normal}(m_0, \Sigma_0)$ and $\mu_1 := \mathrm{normal}(m_1, \Sigma_1)$; assume that $\Sigma_0 \succ 0$. Compute the
optimal transport map from $\mu_0$ to $\mu_1$, as well as the cost $W_2(\mu_0, \mu_1)$. [By Brenier's theorem,
it suffices to find the gradient of a convex function which pushes forward $\mu_0$ to $\mu_1$.]
Also, exhibit a coupling to prove the upper bound
\[
W_2^2(\mu_0, \mu_1) \le \|m_0 - m_1\|^2 + \|\Sigma_0^{1/2} - \Sigma_1^{1/2}\|_{\mathrm{HS}}^2 \,. \tag{1.E.2}
\]
Finally, suppose that $\nu_0$, $\nu_1$ are probability measures, and suppose that $\mu_0$, $\mu_1$ are
Gaussians whose means and covariances match those of $\nu_0$ and $\nu_1$ respectively. Then,
prove that $W_2(\nu_0, \nu_1) \ge W_2(\mu_0, \mu_1)$.

Hint: For the last statement, use the fact that the dual potentials for optimal transport
between Gaussians are quadratic functions.

Exercise 1.11 (optimal transport with other costs)

1. Consider optimal transport with a general cost function $c$ as in (1.3.2). By following
the proof of Theorem 1.3.8, argue that the optimal dual potentials $(f, g)$ are $c$-conjugates,
i.e., $f = g^c$ and $g = f^c$ where
\[
g^c(x) := \inf_{y \in \mathcal{Y}} \{c(x, y) - g(y)\} \,, \qquad f^c(y) := \inf_{x \in \mathcal{X}} \{c(x, y) - f(x)\} \,, \tag{1.E.3}
\]
and that
\[
\mathcal{T}_c(\mu, \nu) \ge \sup_{f \in L^1(\mu)} \Bigl\{ \int f \, d\mu + \int f^c \, d\nu \Bigr\} \,.
\]
Under general conditions, equality holds; see [Vil09, Theorem 5.10].
Functions of the form (1.E.3) are called $c$-concave.

2. Let $\mathcal{X} = \mathcal{Y}$ be a metric space with metric $\mathsf{d}$. For all $p \ge 1$, we can define the
$p$-Wasserstein distance
\[
W_p^p(\mu, \nu) = \inf_{\gamma \in \mathcal{C}(\mu, \nu)} \int \mathsf{d}(x, y)^p \, d\gamma(x, y) \,.
\]
Let $\mathcal{P}_p(\mathcal{X})$ denote the space of probability measures $\mu$ over $\mathcal{X}$ such that for some
$x_0 \in \mathcal{X}$, $\int \mathsf{d}(x_0, \cdot)^p \, d\mu < \infty$. Show that $(\mathcal{P}_p(\mathcal{X}), W_p)$ is a metric space.

3. In the case $p = 1$, show that if $(f, g)$ are $\mathsf{d}$-conjugates, then $f = -g$ and $f$ is
1-Lipschitz. Deduce the duality formula
\[
W_1(\mu, \nu) = \sup \Bigl\{ \int f \, d\mu - \int f \, d\nu \Bigm| f : \mathcal{X} \to \mathbb{R} \text{ is 1-Lipschitz} \Bigr\} \,. \tag{1.E.4}
\]

Exercise 1.12 (dynamical formulations of optimal transport)

The formula (1.3.23) shows that the $W_2$ distance between $\mu_0$ and $\mu_1$ equals the smallest
arc length of any curve joining $\mu_0$ and $\mu_1$. It is also true that the squared $W_2$ distance
minimizes the energy or action of any curve joining $\mu_0$ and $\mu_1$, in the following sense:
\[
W_2^2(\mu_0, \mu_1) = \inf \Bigl\{ \int_0^1 \|v_t\|_{L^2(\mu_t)}^2 \, dt \Bigm| \partial_t \mu_t + \operatorname{div}(\mu_t v_t) = 0 \Bigr\} \,. \tag{1.E.5}
\]

Although the problems (1.3.23) and (1.E.5) both identify geodesics in the Wasserstein space,
the latter problem has more favorable properties. Namely, the minimizing curves in (1.3.23)
are geodesics, but they may not have constant speed (indeed, the arc length functional
is invariant under time reparameterization of the curve); in contrast, minimizing curves
in (1.E.5) necessarily have constant speed. Also, we can reparameterize problem (1.E.5)
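by introducing the momentum density $p_t := \mu_t v_t$ and write
\[
W_2^2(\mu_0, \mu_1) = \inf \Bigl\{ \int_0^1 \Bigl( \int \frac{\|p_t\|^2}{\mu_t} \Bigr) \, dt \Bigm| \partial_t \mu_t + \operatorname{div} p_t = 0 \Bigr\} \,, \tag{1.E.6}
\]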
which is now a strictly convex problem in the variables (𝜇, 𝑝). This convenient reformula-
tion is known as the Benamou–Brenier formula [BB99].
Just as (1.E.5) describes the dynamical version of the static optimal transport prob-
lem (1.3.5), there is a dynamical formulation of the dual optimal transport problem (1.3.7),
in which the dual potential evolves according to the Hamilton–Jacobi equation
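\[
\partial_t u_t + \frac{1}{2}\, \|\nabla u_t\|^2 = 0 \,. \tag{1.E.7}
\]
Then, it holds that
\[
\frac{1}{2}\, W_2^2(\mu_0, \mu_1) = \sup \Bigl\{ \int u_1 \, d\mu_1 - \int u_0 \, d\mu_0 \Bigm| \partial_t u_t + \frac{1}{2}\, \|\nabla u_t\|^2 = 0 \Bigr\} \,. \tag{1.E.8}
\]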
The goal of this exercise is to justify and understand these facts.
1. Show that the mapping $\mathbb{R}_{>0} \times \mathbb{R}^d \to \mathbb{R}$, $(\mu, p) \mapsto \|p\|^2/\mu$ is strictly convex. Also,
compute the convex conjugate of this mapping. Deduce that the Benamou–Brenier
reformulation (1.E.6) is a strictly convex problem.

2. Ignoring issues of regularity, show that the solution $u_t$ of the Hamilton–Jacobi
equation with initial condition $u_0 = f$ is described by the Hopf–Lax semigroup
\[
u_t(x) = Q_t f(x) := \inf_{y \in \mathbb{R}^d} \Bigl\{ f(y) + \frac{1}{2t}\, \|y - x\|^2 \Bigr\} \,.
\]

3. Following the proof of Theorem 1.3.8, show that the dual optimal transport problem
(1.3.7) can be written
\[
\frac{1}{2}\, W_2^2(\mu_0, \mu_1) = \sup_{f \in L^1(\mu_0)} \Bigl\{ \int Q_1 f \, d\mu_1 - \int f \, d\mu_0 \Bigr\} \,,
\]
where $Q_1$ denotes the Hopf–Lax semigroup at time 1. From this, deduce that the
formula (1.E.8) holds.

4. Although the previous part gives a proof of the dynamical formulation (1.E.8), it
is unsatisfactory because it only involves an analysis of the static primal and dual
problems. Here, we present a purely dynamical proof. The continuity constraint
𝜕𝑡 𝜇𝑡 + div 𝑝𝑡 = 0 in (1.E.6) can be reformulated as follows: for any curve of functions
[0, 1] × R𝑑 → R, (𝑡, 𝑥) ↦→ 𝑢𝑡 (𝑥),
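\[
\begin{aligned}
\int u_1 \, d\mu_1 - \int u_0 \, d\mu_0
&= \int_0^1 \partial_t \Bigl( \int u_t \, d\mu_t \Bigr) \, dt
= \int_0^1 \Bigl( \int (\partial_t u_t \, \mu_t + u_t \, \partial_t \mu_t) \Bigr) \, dt \\
&= \int_0^1 \Bigl( \int \Bigl( \partial_t u_t + \Bigl\langle \nabla u_t, \frac{p_t}{\mu_t} \Bigr\rangle \Bigr) \, d\mu_t \Bigr) \, dt \,.
\end{aligned}
\]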

This can be incorporated as a Lagrange multiplier in (1.E.6):

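\[
\begin{aligned}
\frac{1}{2}\, W_2^2(\mu_0, \mu_1)
= \inf_{\substack{\mu : [0,1] \times \mathbb{R}^d \to \mathbb{R}_+ \\ p : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d}} \; \sup_{u : [0,1] \times \mathbb{R}^d \to \mathbb{R}}
\Bigl\{ & \int_0^1 \Bigl( \int \frac{\|p_t\|^2}{2 \mu_t} \Bigr) \, dt + \int u_1 \, d\mu_1 - \int u_0 \, d\mu_0 \\
& - \int_0^1 \Bigl( \int \Bigl( \partial_t u_t + \Bigl\langle \nabla u_t, \frac{p_t}{\mu_t} \Bigr\rangle \Bigr) \, d\mu_t \Bigr) \, dt \Bigr\} \,.
\end{aligned}
\]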

Assume that the infimum and supremum can be interchanged (here, we are invoking
an abstract minimax theorem, which crucially relies on the convexity of the problem
established in the first part). Use this to prove that
\[
\frac{1}{2}\, W_2^2(\mu_0, \mu_1) = \sup \Bigl\{ \int u_1 \, d\mu_1 - \int u_0 \, d\mu_0 \Bigm| \partial_t u_t + \frac{1}{2}\, \|\nabla u_t\|^2 \le 0 \Bigr\}
\]
and that equality holds only if the Hamilton–Jacobi equation (1.E.7) holds, and
if ∇𝑢𝑡 = 𝑣𝑡 = 𝑝𝑡 /𝜇𝑡 . Note that this also establishes that the optimal vector fields
(𝑣𝑡 )𝑡 ∈[0,1] are gradients of functions.

5. Let $t \mapsto \mu_t$ be a Wasserstein geodesic and let $t \mapsto v_t$ be its associated curve of
tangent vectors. Prove that $\partial_t v_t + \nabla v_t \, v_t = 0$. (This statement follows from the
previous part by differentiating the Hamilton–Jacobi equation in space. Try to also
give a more direct proof of this equation.)
Hint: If $\dot{x}_t = v_t(x_t)$, then because particles travel with constant velocity along
Wasserstein geodesics, $t \mapsto \dot{x}_t$ is constant.
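
The Hopf–Lax formula of part 2 can be tested numerically in one dimension. The sketch below is a rough illustration under stated assumptions: the initial condition $f$ is convex with its minimizer in a fixed search interval, the infimum is located by ternary search, and the Hamilton–Jacobi residual $\partial_t u_t + \frac{1}{2} |\partial_x u_t|^2$ is estimated by central differences. The function names and the test case $f(y) = y^2/2$, for which a direct computation gives $Q_t f(x) = \frac{x^2}{2(1+t)}$, are hypothetical choices.

```python
def hopf_lax(f, x, t, lo=-50.0, hi=50.0, iters=200):
    """Q_t f(x) = inf_y { f(y) + (y - x)^2 / (2 t) } in 1D via ternary search.

    Assumes f is convex and the minimizer lies in [lo, hi].
    """
    objective = lambda y: f(y) + (y - x) ** 2 / (2.0 * t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return objective(0.5 * (lo + hi))

def hamilton_jacobi_residual(f, x, t, h=1e-3):
    """Central-difference estimate of d/dt u + (1/2) (d/dx u)^2 for u = Q_t f."""
    du_dt = (hopf_lax(f, x, t + h) - hopf_lax(f, x, t - h)) / (2.0 * h)
    du_dx = (hopf_lax(f, x + h, t) - hopf_lax(f, x - h, t)) / (2.0 * h)
    return du_dt + 0.5 * du_dx ** 2

quadratic = lambda y: 0.5 * y * y
print(hopf_lax(quadratic, 1.2, 0.5))                  # compare with x^2 / (2 (1 + t))
print(hamilton_jacobi_residual(quadratic, 1.2, 0.5))  # should be close to 0
```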

Exercise 1.13 (Wasserstein space has non-negative curvature)

Let $t \mapsto \mu_t$ denote a Wasserstein geodesic. By finding an appropriate coupling, prove that
for all $t \in [0, 1]$ and all $\nu \in \mathcal{P}_2(\mathbb{R}^d)$,
\[
W_2^2(\mu_t, \nu) \ge (1 - t)\, W_2^2(\mu_0, \nu) + t\, W_2^2(\mu_1, \nu) - t\,(1 - t)\, W_2^2(\mu_0, \mu_1) \,. \tag{1.E.9}
\]
Compare this to the following equality on $\mathbb{R}^d$: if $x_t = (1 - t)\, x_0 + t\, x_1$, then
\[
\|x_t - y\|^2 = (1 - t)\, \|x_0 - y\|^2 + t\, \|x_1 - y\|^2 - t\,(1 - t)\, \|x_0 - x_1\|^2 \,. \tag{1.E.10}
\]
The equality (1.E.10) expresses the fact that $\mathbb{R}^d$ is flat, whereas the inequality (1.E.9)
expresses the fact that $\mathcal{P}_2(\mathbb{R}^d)$ (equipped with the $W_2$ metric) is non-negatively curved,
like a sphere. See Section 2.6.2.
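
The flat identity (1.E.10) is easy to confirm numerically for arbitrary points. A minimal sketch, with points of $\mathbb{R}^d$ represented as Python lists and hypothetical helper names:

```python
def squared_norm(u):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(c * c for c in u)

def flat_identity_gap(x0, x1, y, t):
    """Left side minus right side of (1.E.10); should vanish up to rounding."""
    diff = lambda u, v: [a - b for a, b in zip(u, v)]
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    lhs = squared_norm(diff(xt, y))
    rhs = ((1.0 - t) * squared_norm(diff(x0, y))
           + t * squared_norm(diff(x1, y))
           - t * (1.0 - t) * squared_norm(diff(x0, x1)))
    return lhs - rhs

print(flat_identity_gap([1.0, 2.0, -1.0], [0.5, -3.0, 4.0], [2.0, 0.0, 1.0], 0.3))
```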

The Langevin SDE as a Wasserstein Gradient Flow

Exercise 1.14 (reconciling the SDE and Wasserstein perspectives)

Let $\varphi : \mathbb{R}^d \to \mathbb{R}$ be a smooth test function and let $\delta > 0$. First, consider the Langevin
diffusion $dZ_t = -\nabla V(Z_t) \, dt + \sqrt{2} \, dB_t$ started at $Z_0 \sim \mu_0$, and compute $\mathbb{E}\, \varphi(Z_\delta)$ up to first
order in $\delta$.
Next, if $\partial_t \mu_t = \operatorname{div}(\mu_t \nabla \ln(\mu_t/\pi))$, with $\pi \propto \exp(-V)$, then we can interpret this as a
fluid flow: let $X_0 \sim \mu_0$, and $\dot{X}_t = -\nabla \ln(\mu_t/\pi)(X_t)$, so that $X_t \sim \mu_t$. Compute the quantity
$\mathbb{E}\, \varphi(X_\delta)$ up to first order in $\delta$.
Check that the two expressions you computed match (up to first order in $\delta$). Note that
these calculations are implicit in §1.2 and §1.4, but it is illuminating to directly connect
the Langevin diffusion to the Wasserstein gradient flow.

Exercise 1.15 (Wasserstein calculus for $f$-divergences)

Compute the Wasserstein gradient of the functional $\chi^2(\cdot \,\|\, \pi)$. Use the rules of Wasserstein
calculus to compute $\partial_t \chi^2(\pi_t \,\|\, \pi)$, where $t \mapsto \pi_t$ is the law of the Langevin diffusion with
stationary distribution $\pi$. Check that the result agrees with a calculation based on Markov
semigroup theory.
More generally, let $f : \mathbb{R} \to \mathbb{R}_+$ and consider the $f$-divergence
\[
D_f(\mu \,\|\, \pi) := \int f\Bigl( \frac{\mu}{\pi} \Bigr) \, d\pi \,.
\]
Compute the Wasserstein gradient of $D_f(\cdot \,\|\, \pi)$. For bonus points, calculate the Wasserstein
Hessian as well.

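Exercise 1.16 (Otto–Villani theorem)

Consider the gradient flow $t \mapsto \mu_t$ of a functional $\mathcal{F}$ with $\inf \mathcal{F} = 0$. Assume that the
PL inequality $\|\nabla_{W_2} \mathcal{F}(\mu)\|_\mu^2 \ge 2\alpha\, \mathcal{F}(\mu)$ holds, and that the gradient flow converges to the
minimizer of $\mathcal{F}$. Argue that $\partial_t W_2(\mu_t, \mu_0) \le \|\nabla_{W_2} \mathcal{F}(\mu_t)\|_{\mu_t}$, and then show that
\[
\partial_t \Bigl( \sqrt{\frac{\alpha}{2}}\; W_2(\mu_t, \mu_0) + \sqrt{\mathcal{F}(\mu_t)} \Bigr) \le 0 \,.
\]
Conclude that a quadratic growth inequality holds.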

Exercise 1.17 (contraction of the Langevin diffusion)

1. Suppose that $V : \mathbb{R}^d \to \mathbb{R}$ is $\alpha$-strongly convex. Let $t \mapsto x_t$, $t \mapsto y_t$ be two gradient
flows for $V$. Show that $\|x_t - y_t\|^2 \le \exp(-2\alpha t)\, \|x_0 - y_0\|^2$.

2. Next, prove Theorem 1.4.10.
Hint: Apply Itô's formula (Theorem 1.1.18) to $f(z, z') := \|z - z'\|^2$.

3. Let $\mathcal{F} : \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \to \mathbb{R}$ be an $\alpha$-convex functional, and let $(\mu_t)_{t\ge 0}$, $(\nu_t)_{t\ge 0}$ be gradient
flows for $\mathcal{F}$. Prove that
\[
W_2^2(\mu_t, \nu_t) \le \exp(-2\alpha t)\, W_2^2(\mu_0, \nu_0)
\]
using the following steps. First, compute the derivative of $t \mapsto W_2^2(\mu_t, \nu_t)$ using
Theorem 1.4.11. Next, apply the strong convexity inequality (1.4.6) to obtain
two inequalities $\mathcal{F}(\mu_t) \ge \mathcal{F}(\nu_t) + \cdots$ and $\mathcal{F}(\nu_t) \ge \mathcal{F}(\mu_t) + \cdots$. Adding these two
inequalities, deduce that $\partial_t W_2^2(\mu_t, \nu_t) \le -2\alpha\, W_2^2(\mu_t, \nu_t)$.
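
The contraction in part 1 can be observed on a discretized gradient flow. The sketch below is an illustration only: it assumes the one-dimensional potential $V(x) = \frac{\alpha}{2} x^2 + \ln\cosh x$ (so that $\alpha \le V'' \le \alpha + 1$), replaces the continuous-time flow by explicit Euler steps with a small step size, and uses hypothetical names throughout.

```python
import math

ALPHA = 0.5  # strong convexity parameter of the illustrative potential

def grad_potential(x):
    """V'(x) for V(x) = ALPHA x^2 / 2 + log cosh(x), so ALPHA <= V'' <= ALPHA + 1."""
    return ALPHA * x + math.tanh(x)

def gradient_flow(x0, total_time, num_steps):
    """Explicit Euler discretization of dx/dt = -V'(x) up to time total_time."""
    h = total_time / num_steps
    x = x0
    for _ in range(num_steps):
        x -= h * grad_potential(x)
    return x

def contraction_ratio(x0, y0, total_time, num_steps=10_000):
    """|x_t - y_t|^2 divided by exp(-2 ALPHA t) |x_0 - y_0|^2; expected <= 1."""
    xt = gradient_flow(x0, total_time, num_steps)
    yt = gradient_flow(y0, total_time, num_steps)
    return (xt - yt) ** 2 / (math.exp(-2.0 * ALPHA * total_time) * (x0 - y0) ** 2)

print(contraction_ratio(3.0, -1.0, 2.0))
```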

Overview of the Convergence Results

Exercise 1.18 (divergences at initialization)

Let $\pi \propto \exp(-V)$ where $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Assume in addition that $V$ is minimized at
0. Let $\kappa := \beta/\alpha$ denote the condition number. Show that if we initialize at the measure
$\mu_0 = \mathrm{normal}(0, \beta^{-1} I_d)$, then $\ln \sup(\mu_0/\pi) \le \frac{d}{2} \ln \kappa$. What does this imply about the size of
$\mathrm{KL}(\mu_0 \,\|\, \pi)$ and $\chi^2(\mu_0 \,\|\, \pi)$ at initialization?
What can you say about $W_2^2(\mu_0, \pi)$?
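
As a sanity check of the density-ratio bound (an illustrative special case, not the general argument), consider the Gaussian target $\pi = \mathrm{normal}(0, \alpha^{-1} I_d)$, i.e., $V(x) = \frac{\alpha}{2} \|x\|^2$, which satisfies the assumptions whenever $\alpha \le \beta$. Comparing the two Gaussian densities directly gives

\[
\frac{\mu_0(x)}{\pi(x)} = \Bigl( \frac{\beta}{\alpha} \Bigr)^{d/2} \exp\Bigl( -\frac{\beta - \alpha}{2}\, \|x\|^2 \Bigr) \le \kappa^{d/2} \,,
\]

with equality at $x = 0$. Hence $\ln \sup(\mu_0/\pi) = \frac{d}{2} \ln \kappa$ in this case, so the bound in the exercise cannot be improved in general.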
CHAPTER 2

Functional Inequalities

In this chapter, we explore the connection between functional inequalities, such as the
Poincaré and log-Sobolev inequalities, and the concentration of measure phenomenon.

2.1 Overview of the Inequalities


2.1.1 Relationships between the Inequalities
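The main inequalities that we study in this chapter are the following:

• the Poincaré inequality (PI), as specialized to the Langevin diffusion (see Example 1.2.21):
\[
\operatorname{var}_\pi(f) \le C_{\mathrm{PI}}\, \mathbb{E}_\pi[\|\nabla f\|^2] \,, \qquad \text{for all smooth } f : \mathbb{R}^d \to \mathbb{R} \,,
\]

• the log-Sobolev inequality (LSI), as specialized to the Langevin diffusion (see Example 1.2.25):
\[
\operatorname{ent}_\pi(f^2) \le 2\, C_{\mathrm{LSI}}\, \mathbb{E}_\pi[\|\nabla f\|^2] \,, \qquad \text{for all smooth } f : \mathbb{R}^d \to \mathbb{R} \,,
\]

• and Talagrand's T2 inequality
\[
\mathrm{KL}(\mu \,\|\, \pi) \ge \frac{1}{2\, C_{\mathrm{T}_2}}\, W_2^2(\mu, \pi) \,, \qquad \text{for all } \mu \in \mathcal{P}_2(\mathbb{R}^d) \,.
\]

In addition, using the $W_1$ metric introduced in Exercise 1.11, we consider

• Talagrand's T1 inequality
\[
\mathrm{KL}(\mu \,\|\, \pi) \ge \frac{1}{2\, C_{\mathrm{T}_1}}\, W_1^2(\mu, \pi) \,, \qquad \text{for all } \mu \in \mathcal{P}_1(\mathbb{R}^d) \,.
\]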

In many cases, arguments involving Poincaré and log-Sobolev inequalities hold more
generally in the context of reversible Markov processes, and when this is the case we will try
to use notation from Markov semigroup theory (e.g., writing $\mathbb{E}_\pi \Gamma(f, f)$ or $\mathcal{E}(f, f)$ instead
of $\mathbb{E}_\pi[\|\nabla f\|^2]$) to indicate it. However, for clarity of exposition, we do
not dwell on this point, and we urge readers to focus on the case in which the Markov
process is the Langevin diffusion.
Although the Poincaré and log-Sobolev inequalities are stated above for smooth func-
tions 𝑓 : R𝑑 → R, once established they can be extended to a wider class of functions (e.g.,
locally Lipschitz functions) by arguing that smooth functions are dense w.r.t. appropriate
norms. Throughout the chapter, we omit mention of such approximation arguments.
Write PI(𝐶) to denote that the Poincaré inequality holds with constant 𝐶, and similarly
for the other inequalities. We have the following relationships.

• The Bakry–Émery theorem (Theorem 1.2.28) shows that $\alpha$-strong log-concavity of
$\pi$ implies that $\pi$ satisfies $\mathrm{LSI}(\alpha^{-1})$.

• The Otto–Villani theorem (Exercise 1.16) shows that $\mathrm{LSI}(C)$ implies $\mathrm{T}_2(C)$.

• Since $W_1 \le W_2$, $\mathrm{T}_2(C)$ obviously implies $\mathrm{T}_1(C)$. On the other hand, we will
show below that $\mathrm{T}_2(C)$ implies $\mathrm{PI}(C)$ as well. Combined with the previous point,
this shows that $\mathrm{LSI}(C)$ implies $\mathrm{PI}(C)$, which was shown directly in Exercise 1.8.

• In general, PI and T1 are incomparable (Exercise 2.11).

2.1.2 Linearization of Transport Inequalities


To prove that T2 (𝐶) implies PI(𝐶), we linearize the transport cost. It will be convenient
for future purposes to prove a more general version of the linearization principle.

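Proposition 2.1.1 (linearization of transport cost). Let $c : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ be a lower
semicontinuous cost function. Assume that $c(x, x) = 0$ for all $x \in \mathbb{R}^d$, that there exists
$\delta > 0$ for which $c(x, y) \ge \delta\, \|x - y\|^2$ for all $x, y \in \mathbb{R}^d$, and that there is a measurable
mapping $x \mapsto H_x \succ 0$ such that for each compact $K \subseteq \mathbb{R}^d$,
\[
\sup_{x \in K}\, \Bigl| c(x + h, x) - \frac{1}{2}\, \langle h, H_x h \rangle \Bigr| = o(\|h\|^2) \qquad \text{as } h \to 0 \,.
\]
Then, for any $\mu \in \mathcal{P}(\mathbb{R}^d)$ and $f \in C_c^\infty(\mathbb{R}^d)$ with $\int f \, d\mu = 0$, it holds that
\[
\liminf_{\varepsilon \searrow 0}\, \frac{1}{\varepsilon^2}\, \mathcal{T}_c\bigl(\mu, (1 + \varepsilon f)\, \mu\bigr) \ge \frac{(\int f^2 \, d\mu)^2}{2 \int \langle \nabla f(x), H_x^{-1} \nabla f(x) \rangle \, d\mu(x)} \,.
\]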

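Proof sketch. Fix $\lambda \in \mathbb{R}$ and write $\nu := (1 + \varepsilon f)\, \mu$. Using the dual formulation given in Exercise 1.11,
\[
\mathcal{T}_c(\mu, \nu) \ge \int \lambda \varepsilon f \, d\mu + \int (\lambda \varepsilon f)^c \, d\nu = \int (\lambda \varepsilon f)^c \, d\nu \,.
\]
Here, $(\lambda \varepsilon f)^c(x) = \inf_{h \in \mathbb{R}^d} \{c(x + h, x) - \lambda \varepsilon f(x + h)\}$, and using the assumption on $c$ together
with the compact support of $f$, one can justify that the infimum is attained at a point $h$
with $\|h\| = O(\varepsilon)$. Then,
\[
\begin{aligned}
(\lambda \varepsilon f)^c(x)
&= \inf_{h \in \mathbb{R}^d} \Bigl\{ \frac{1}{2}\, \langle h, H_x h \rangle - \lambda \varepsilon f(x) - \lambda \varepsilon\, \langle \nabla f(x), h \rangle \Bigr\} + o(\varepsilon^2) \\
&\ge -\lambda \varepsilon f(x) - \frac{\lambda^2 \varepsilon^2}{2}\, \langle \nabla f(x), H_x^{-1} \nabla f(x) \rangle + o(\varepsilon^2) \,.
\end{aligned}
\]
Hence,
\[
\mathcal{T}_c\bigl(\mu, (1 + \varepsilon f)\, \mu\bigr) \ge -\lambda \varepsilon^2 \int f^2 \, d\mu - \frac{\lambda^2 \varepsilon^2}{2} \int \langle \nabla f(x), H_x^{-1} \nabla f(x) \rangle \, d\mu(x)
\]
and the result follows by optimizing over $\lambda$. □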

Corollary 2.1.2 (T2 implies PI). If 𝜋 satisfies T2 (𝐶), then it satisfies PI(𝐶).

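Proof. Let $f \in C_c^\infty(\mathbb{R}^d)$ with $\int f \, d\pi = 0$ and apply the linearization in the preceding proposition to the
quadratic cost $c(x, y) = \frac{1}{2}\, \|x - y\|^2$ with $H_x = I_d$ for all $x \in \mathbb{R}^d$. Then, $\mathrm{T}_2(C)$ yields
\[
2C\, \mathrm{KL}\bigl((1 + \varepsilon f)\, \pi \bigm\| \pi\bigr) \ge W_2^2\bigl(\pi, (1 + \varepsilon f)\, \pi\bigr) \ge \varepsilon^2\, \frac{(\int f^2 \, d\pi)^2}{\int \|\nabla f\|^2 \, d\pi} + o(\varepsilon^2) \,.
\]
On the other hand, the linearization (1.E.1) of the KL divergence in Exercise 1.8 yields
\[
\mathrm{KL}\bigl((1 + \varepsilon f)\, \pi \bigm\| \pi\bigr) = \frac{\varepsilon^2}{2} \int f^2 \, d\pi + o(\varepsilon^2) \,.
\]
Comparing terms proves the result. □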
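
Spelled out, the comparison of the $\varepsilon^2$ terms reads as follows (a worked version of the last step, assuming $\int f \, d\pi = 0$ and $f \not\equiv 0$). Substituting the expansion (1.E.1) into the left-hand side of the inequality furnished by $\mathrm{T}_2(C)$ gives

\[
C\, \varepsilon^2 \int f^2 \, d\pi + o(\varepsilon^2) \ge \varepsilon^2\, \frac{(\int f^2 \, d\pi)^2}{\int \|\nabla f\|^2 \, d\pi} + o(\varepsilon^2) \,.
\]

Dividing by $\varepsilon^2 \int f^2 \, d\pi$ and letting $\varepsilon \searrow 0$ yields

\[
\int f^2 \, d\pi \le C \int \|\nabla f\|^2 \, d\pi \,,
\]

which is $\mathrm{PI}(C)$ for centered $f$.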
In Exercise 2.1, we explore a perhaps more intuitive approach to linearizing the 2-
Wasserstein distance via the Monge–Ampère equation.

2.2 Proofs via Markov Semigroup Theory


2.2.1 Commutation and Curvature
In Section 1.2.3, we introduced the iterated carré du champ operator $\Gamma_2$, as well as the
curvature-dimension condition $\Gamma_2(f, f) \ge \alpha\, \Gamma(f, f)$ (denoted $\mathrm{CD}(\alpha, \infty)$). Since this condition
plays a key role in the subsequent calculations, our goal is to demystify this idea.
By definition, the iterated carré du champ is
\[
\Gamma_2(f, f) = \frac{1}{2}\, \{\mathcal{L}\Gamma(f, f) - 2\, \Gamma(f, \mathcal{L} f)\} \,.
\]
For the case of the Langevin diffusion with carré du champ $\Gamma(f, f) = \|\nabla f\|^2$,
\[
\Gamma_2(f, f) = \frac{1}{2}\, \{\mathcal{L}(\|\nabla f\|^2) - 2\, \langle \nabla f, \nabla \mathcal{L} f \rangle\} \,. \tag{2.2.1}
\]
Recall that $\mathcal{L} f = \Delta f - \langle \nabla V, \nabla f \rangle$, where $V$ is the potential. Let us begin with the simple
case in which $V = 0$, so $\mathcal{L}$ is the Laplacian $\Delta$ (the generator of $\sqrt{2}\, B$, where $B$ is standard
Brownian motion). In this case, the iterated carré du champ turns out to simply be the
operator $\Gamma_2(f, f) = \|\nabla^2 f\|_{\mathrm{HS}}^2$, which is known as the Bochner identity:
\[
\frac{1}{2}\, \Delta(\|\nabla f\|^2) = \langle \nabla \Delta f, \nabla f \rangle + \|\nabla^2 f\|_{\mathrm{HS}}^2 \,. \tag{2.2.2}
\]
Consequently, $\Delta$ satisfies $\mathrm{CD}(0, \infty)$.
It may seem strange at first sight to give such a fancy name to the seemingly innocuous
identity (2.2.2), which is a simple exercise in calculus. However, the importance of the
Bochner identity begins to reveal itself through the following fact: the identity continues
to make sense on a Riemannian manifold, except that there is an extra term involving the
Ricci curvature of the manifold:
\[
\frac{1}{2}\, \Delta(\|\nabla f\|^2) = \langle \nabla \Delta f, \nabla f \rangle + \|\nabla^2 f\|_{\mathrm{HS}}^2 + \mathrm{Ric}(\nabla f, \nabla f) \,.
\]

We will defer a fuller discussion of Riemannian geometry for later, but for now we can
get a hint at the role of the curvature by observing that the Bochner identity (2.2.2) on $\mathbb{R}^d$
follows from the equation¹
\[
\nabla \Delta f - \Delta \nabla f = 0 \tag{2.2.3}
\]
by taking the inner product with $\nabla f$ and applying the identity
\[
\frac{1}{2}\, \Delta(\|\nabla f\|^2) = \operatorname{div}(\nabla^2 f \, \nabla f) = \langle \Delta \nabla f, \nabla f \rangle + \|\nabla^2 f\|_{\mathrm{HS}}^2 \,.
\]
In turn, the equation (2.2.3) shows that the Laplacian commutes with the gradient operator,
which is true because partial derivatives commute on $\mathbb{R}^d$; this is a manifestation of the
fact that $\mathbb{R}^d$ is flat. In contrast, the very definition of curvature on a Riemannian manifold
is usually based upon the lack of commutativity of differential operators.²
Turning now to the Langevin generator $\mathcal{L}$, the identity (2.2.3) is replaced by
\[
\nabla \mathcal{L} f - \mathcal{L} \nabla f = -\nabla^2 V \, \nabla f \,. \tag{2.2.4}
\]
Hence, the commutator of $\nabla$ and $\mathcal{L}$ brings out the curvature of the measure $\pi \propto \exp(-V)$,
and the plan is to exploit this in order to prove functional inequalities. The identity (2.2.4)
then yields the following formula for the iterated carré du champ:
\[
\Gamma_2(f, f) = \|\nabla^2 f\|_{\mathrm{HS}}^2 + \langle \nabla f, \nabla^2 V \, \nabla f \rangle \,. \tag{2.2.5}
\]
In particular, if $\nabla^2 V \succeq \alpha I_d \succ 0$, then the curvature-dimension condition $\mathrm{CD}(\alpha, \infty)$ holds,
which was asserted as Theorem 1.2.29.
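
For completeness, here is one way to pass from the commutation identity (2.2.4) to the formula (2.2.5); this is a sketch of the calculation, with $\mathcal{L}$ acting on $\nabla f$ component by component. Since $\mathcal{L} = \Delta - \langle \nabla V, \nabla \cdot \rangle$ and $\frac{1}{2}\, \nabla(\|\nabla f\|^2) = \nabla^2 f \, \nabla f$,

\[
\frac{1}{2}\, \mathcal{L}(\|\nabla f\|^2)
= \langle \Delta \nabla f, \nabla f \rangle + \|\nabla^2 f\|_{\mathrm{HS}}^2 - \langle \nabla V, \nabla^2 f \, \nabla f \rangle
= \langle \mathcal{L} \nabla f, \nabla f \rangle + \|\nabla^2 f\|_{\mathrm{HS}}^2 \,.
\]

Substituting this into (2.2.1) and using (2.2.4),

\[
\Gamma_2(f, f)
= \|\nabla^2 f\|_{\mathrm{HS}}^2 + \langle \mathcal{L} \nabla f - \nabla \mathcal{L} f, \nabla f \rangle
= \|\nabla^2 f\|_{\mathrm{HS}}^2 + \langle \nabla^2 V \, \nabla f, \nabla f \rangle \,.
\]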

2.2.2 The Brascamp–Lieb Inequality


As a first illustration of the use of curvature, we prove the Brascamp–Lieb inequality,
which is a strong form of the Poincaré inequality. This inequality will also gain a natural
interpretation via a diffusion process in Section 10.2.
The proof method in this section is known as Hörmander’s 𝐿 2 method. The starting
point is to write down a dual form of the Poincaré inequality.³

¹ Here, $\Delta$ acts on $\nabla u$ component by component.

² Loosely speaking, the idea of curvature is that travelling in direction $u$ and then direction $v$ is not
exactly the same as travelling in direction $v$ and direction $u$. Algebraically, this is captured by studying the
difference between differentiating along vector field $X$ and then vector field $Y$, or vice versa.

³ The idea of dualizing the Poincaré inequality also appears in Exercise 2.1, in which the Poincaré
inequality is deduced from an inequality on $(-\mathcal{L})^{-1}$.



Lemma 2.2.6. Let $\pi \propto \exp(-V)$ be a probability measure on $\mathbb{R}^d$, where $V$ is continuously
differentiable; let $\mathcal{L}$ be the corresponding Langevin generator. Suppose that
$A : \mathbb{R}^d \to \mathrm{PD}(d)$ is a matrix-valued function mapping into the space of symmetric
positive definite matrices such that for all smooth $u : \mathbb{R}^d \to \mathbb{R}$,
\[
\mathbb{E}_\pi[(\mathcal{L} u)^2] \ge \mathbb{E}_\pi \langle \nabla u, A \, \nabla u \rangle \,. \tag{2.2.7}
\]
Then, for all smooth $f : \mathbb{R}^d \to \mathbb{R}$ it holds that
\[
\operatorname{var}_\pi f \le \mathbb{E}_\pi \langle \nabla f, A^{-1} \, \nabla f \rangle \,.
\]

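Proof. Assume that $\mathbb{E}_\pi f = 0$. Recall that $\mathbb{E}_\pi \mathcal{L} u = 0$ for any $u$, so that $\mathbb{E}_\pi f = 0$ is a
necessary condition for the solvability of the Poisson equation $-\mathcal{L} u = f$. For simplicity,
we will assume that this condition is also sufficient.
If we express $\mathbb{E}_\pi[f^2]$ in terms of $u$ and apply integration by parts (Theorem 1.2.14) and
Cauchy–Schwarz, we obtain
\[
\begin{aligned}
\mathbb{E}_\pi[f^2]
&= 2\, \mathbb{E}_\pi[f\, (-\mathcal{L})\, u] - \mathbb{E}_\pi[(\mathcal{L} u)^2] \\
&\le 2\, \mathbb{E}_\pi \langle \nabla f, \nabla u \rangle - \mathbb{E}_\pi \langle \nabla u, A \, \nabla u \rangle \\
&\le 2 \sqrt{\mathbb{E}_\pi \langle \nabla f, A^{-1} \, \nabla f \rangle \; \mathbb{E}_\pi \langle \nabla u, A \, \nabla u \rangle} - \mathbb{E}_\pi \langle \nabla u, A \, \nabla u \rangle
\le \mathbb{E}_\pi \langle \nabla f, A^{-1} \, \nabla f \rangle \,. \qquad \square
\end{aligned}
\]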
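
The last two inequalities in this chain rest on two elementary facts, recorded here as a brief worked step. First, writing $\langle \nabla f, \nabla u \rangle = \langle A^{-1/2} \nabla f, A^{1/2} \nabla u \rangle$ and applying Cauchy–Schwarz (pointwise and then in $L^2(\pi)$) gives

\[
\mathbb{E}_\pi \langle \nabla f, \nabla u \rangle \le \sqrt{\mathbb{E}_\pi \langle \nabla f, A^{-1} \, \nabla f \rangle \; \mathbb{E}_\pi \langle \nabla u, A \, \nabla u \rangle} \,.
\]

Second, for any $a, b \ge 0$,

\[
2 \sqrt{a b} - b = a - (\sqrt{a} - \sqrt{b})^2 \le a \,,
\]

applied with $a = \mathbb{E}_\pi \langle \nabla f, A^{-1} \, \nabla f \rangle$ and $b = \mathbb{E}_\pi \langle \nabla u, A \, \nabla u \rangle$.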

The point is that the condition (2.2.7) can now be checked with the help of curvature.
Suppose that 𝜋 ∝ exp(−𝑉 ) where 𝑉 is twice continuously differentiable and strictly
convex. Then, using integration by parts (Theorem 1.2.14),
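\[
\begin{aligned}
\mathbb{E}_\pi[(\mathcal{L} u)^2]
&= -\mathbb{E}_\pi \langle \nabla u, \nabla \mathcal{L} u \rangle \\
&= \mathbb{E}_\pi\Bigl[ \Gamma_2(u, u) - \frac{1}{2}\, \mathcal{L}(\|\nabla u\|^2) \Bigr] && \text{by (2.2.1)} \\
&= \mathbb{E}_\pi \Gamma_2(u, u) && \text{because } \mathbb{E}_\pi \mathcal{L} = 0 \\
&= \mathbb{E}_\pi[\|\nabla^2 u\|_{\mathrm{HS}}^2 + \langle \nabla u, \nabla^2 V \, \nabla u \rangle] && \text{by (2.2.5)} \,.
\end{aligned}
\]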

Applying the lemma, we obtain the following result.

Theorem 2.2.8 (Brascamp–Lieb inequality). Let $\pi \propto \exp(-V)$, where $V$ is strictly
convex on $\mathbb{R}^d$ and twice continuously differentiable. Then, for all $f : \mathbb{R}^d \to \mathbb{R}$,
\[
\operatorname{var}_\pi f \le \mathbb{E}_\pi \langle \nabla f, (\nabla^2 V)^{-1} \, \nabla f \rangle \,.
\]

When $\nabla^2 V \succeq \alpha I_d \succ 0$, then this implies that a Poincaré inequality holds for $\pi$ with
constant $C_{\mathrm{PI}} \le 1/\alpha$. However, the Brascamp–Lieb inequality is much stronger, as it allows
us to take advantage of non-uniform convexity.
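
To see the gain from non-uniform convexity in a concrete case, the following sketch compares both sides of the Brascamp–Lieb inequality in one dimension. It assumes the illustrative potential $V(x) = x^4/4 + x^2/2$ (so $V''(x) = 3x^2 + 1 \ge 1$) and the test function $f(x) = x$, and it approximates expectations under $\pi$ by a Riemann sum; the names and quadrature parameters are hypothetical.

```python
import math

def expectation(g, potential, lo=-8.0, hi=8.0, num_nodes=8001):
    """E_pi[g] for pi proportional to exp(-potential), via a Riemann sum on [lo, hi]."""
    step = (hi - lo) / (num_nodes - 1)
    numerator = normalizer = 0.0
    for i in range(num_nodes):
        x = lo + i * step
        weight = math.exp(-potential(x))
        numerator += weight * g(x)
        normalizer += weight
    return numerator / normalizer

def brascamp_lieb_sides():
    """Return (var_pi f, E_pi[f'^2 / V'']) for V(x) = x^4/4 + x^2/2 and f(x) = x."""
    potential = lambda x: x ** 4 / 4.0 + x ** 2 / 2.0
    mean = expectation(lambda x: x, potential)
    variance = expectation(lambda x: (x - mean) ** 2, potential)
    bl_bound = expectation(lambda x: 1.0 / (3.0 * x * x + 1.0), potential)
    return variance, bl_bound

variance, bl_bound = brascamp_lieb_sides()
print(variance, bl_bound)  # compare with the Poincare bound 1/alpha = 1
```

In this example the Brascamp–Lieb bound should fall strictly between the variance and the uniform bound $1/\alpha = 1$.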
In Exercise 2.3, we give another proof of Theorem 2.2.8 by linearizing a transport
inequality. First, we introduce the transport cost.

Definition 2.2.9. The Bregman transport cost for the potential $V$, denoted $\mathcal{D}_V$, is
the transport cost associated with the Bregman divergence
\[
D_V(x, y) := V(x) - V(y) - \langle \nabla V(y), x - y \rangle \,,
\]
i.e., we set
\[
\mathcal{D}_V(\mu, \nu) := \inf_{\gamma \in \mathcal{C}(\mu, \nu)} \int D_V(x, y) \, d\gamma(x, y) \,.
\]

The Bregman transport cost will also play a key role in Section 10.2, in which we will
prove the following transport inequality.

Theorem 2.2.10 (Bregman transport inequality). Suppose that $V : \mathbb{R}^d \to \mathbb{R}$ is continuously
differentiable. Then, for $\pi \propto \exp(-V)$ and all $\mu \in \mathcal{P}(\mathbb{R}^d)$,
\[
\mathcal{D}_V(\mu, \pi) \le \mathrm{KL}(\mu \,\|\, \pi) \,.
\]

Actually, convexity of $V$ is not necessary for the theorem to hold, although the Bregman
transport cost $\mathcal{D}_V$ is only guaranteed to be non-negative when $V$ is convex. Notice that
when $V$ is strongly convex, $\nabla^2 V \succeq \alpha I_d \succ 0$, then $D_V(x, y) \ge \frac{\alpha}{2}\, \|x - y\|^2$, so the Bregman
transport inequality implies $\mathrm{T}_2(\alpha^{-1})$.

2.2.3 Proof of the Bakry–Émery Theorem


In this section, we generalize the calculation in Exercise 1.9 from the Ornstein–Uhlenbeck
diffusion to the Langevin diffusion, and thereby prove the Bakry–Émery theorem
(Theorem 1.2.28). Recall from that exercise that the Ornstein–Uhlenbeck semigroup satisfies the
identity $\nabla P_t f = \exp(-t)\, P_t \nabla f$. In the next result, we show that more generally, $\mathrm{CD}(\alpha, \infty)$
implies $\Gamma(P_t f, P_t f) \le \exp(-2\alpha t)\, P_t \Gamma(f, f)$.

Theorem 2.2.11 (local Poincaré inequality). Assume the Markov semigroup (𝑃𝑡 )𝑡 ≥0 is
reversible and let 𝛼 ∈ R. Then, the following are equivalent.

1. The curvature-dimension condition CD(𝛼, ∞) holds.

2. For all 𝑓 and 𝑡 ≥ 0,

Γ(𝑃𝑡 𝑓 , 𝑃𝑡 𝑓 ) ≤ exp(−2𝛼𝑡) 𝑃𝑡 Γ(𝑓 , 𝑓 ) .

3. For all $f$ and $t \ge 0$,
\[
P_t(f^2) - (P_t f)^2 \le \frac{1 - \exp(-2\alpha t)}{\alpha}\, P_t \Gamma(f, f) \,.
\]

Proof. (2) $\Longrightarrow$ (3): Markov semigroup calculus yields
\[
\partial_s [P_s((P_{t-s} f)^2)] = P_s\bigl[ \mathcal{L}((P_{t-s} f)^2) - 2\, P_{t-s} f \, \mathcal{L} P_{t-s} f \bigr] \,.
\]
On the other hand, recall the definition of the carré du champ: $\mathcal{L}(f^2) - 2 f \mathcal{L} f = 2\, \Gamma(f, f)$.
Using this along with (2),
\[
\partial_s [P_s((P_{t-s} f)^2)] = 2\, P_s \Gamma(P_{t-s} f, P_{t-s} f) \le 2 \exp\bigl(-2\alpha\, (t - s)\bigr)\, P_t \Gamma(f, f) \,.
\]
Integrating this from $s = 0$ to $s = t$ yields (3).

(1) $\Longrightarrow$ (2): Similarly, differentiating
\[
\partial_s [P_s \Gamma(P_{t-s} f, P_{t-s} f)] = P_s\bigl[ \mathcal{L}\Gamma(P_{t-s} f, P_{t-s} f) - 2\, \Gamma(P_{t-s} f, \mathcal{L} P_{t-s} f) \bigr]
\]
and applying the definition of the iterated carré du champ yields
\[
\partial_s [P_s \Gamma(P_{t-s} f, P_{t-s} f)] = 2\, P_s \Gamma_2(P_{t-s} f, P_{t-s} f) \ge 2\alpha\, P_s \Gamma(P_{t-s} f, P_{t-s} f) \,.
\]
Integrating this from $s = 0$ to $s = t$ yields $P_t \Gamma(f, f) \ge \exp(2\alpha t)\, \Gamma(P_t f, P_t f)$.
(3) $\Longrightarrow$ (1): We leave this as Exercise 2.5. □
Observe that if we take expectations of both sides of the third statement above w.r.t.
$\pi$ and send $t \to \infty$, then we see that $\mathrm{CD}(\alpha, \infty)$ for $\alpha > 0$ implies $\mathrm{PI}(\alpha^{-1})$. However, the
local decay asserted above is much stronger than $\mathrm{PI}(\alpha^{-1})$.
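
As a quick consistency check of the third statement (a worked example, using the one-dimensional Ornstein–Uhlenbeck semigroup of Exercise 1.9, for which $V(x) = x^2/2$ and hence $\mathrm{CD}(1, \infty)$ holds by (2.2.5)), take $f(x) = x$. The explicit expression for the semigroup gives

\[
P_t f(x) = \exp(-t)\, x \,, \qquad P_t(f^2)(x) = \exp(-2t)\, x^2 + 1 - \exp(-2t) \,, \qquad P_t \Gamma(f, f) = 1 \,,
\]

so that

\[
P_t(f^2) - (P_t f)^2 = 1 - \exp(-2t) = \frac{1 - \exp(-2\alpha t)}{\alpha}\, P_t \Gamma(f, f) \qquad \text{with } \alpha = 1 \,.
\]

Thus the local Poincaré inequality holds with equality for linear functions in this example.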
To proceed further, we introduce the notion of a diffusion semigroup.

Definition 2.2.12. The Markov semigroup $(P_t)_{t\ge 0}$ is a diffusion semigroup if for
all functions $f, g \in L^2(\pi)$ in the domain of the carré du champ $\Gamma$ and all $\phi : \mathbb{R} \to \mathbb{R}$,
the chain rule holds:
\[
\Gamma(\phi \circ f, g) = \phi'(f)\, \Gamma(f, g) \,.
\]
More generally, for functions $f_1, \dots, f_k$ and $\Psi : \mathbb{R}^k \to \mathbb{R}$,
\[
\Gamma\bigl(\Psi(f_1, \dots, f_k), g\bigr) = \sum_{i=1}^k (\partial_i \Psi)(f_1, \dots, f_k)\, \Gamma(f_i, g) \,.
\]

The chain rule is satisfied for the Langevin diffusion whose carré du champ is given by
$\Gamma(f, g) = \langle \nabla f, \nabla g \rangle$, and more generally this assumption encodes the fact that the Markov
process is a diffusion. Since we are mainly interested in diffusion processes, this is not
a restrictive assumption, but it indicates that the following proof will fail for Markov
processes on discrete state spaces.

Proof of the Bakry–Émery theorem (Theorem 1.2.28). Given a smooth positive function $f$
and a function $\phi : \mathbb{R}_+ \to \mathbb{R}$, we differentiate $t \mapsto \int \phi(P_t f) \, d\pi$. We are primarily interested
in the case $\phi(x) := x \ln x$, but carrying out the calculation for a general $\phi$ clarifies the
structure of the argument. Using the Markov semigroup calculus,
\[
\partial_t \int \phi(P_t f) \, d\pi = \int \phi'(P_t f)\, \mathcal{L} P_t f \, d\pi = -\mathcal{E}(\phi' \circ P_t f, P_t f) \,.
\]
This yields the representation
\[
\operatorname{ent}_\pi^\phi f := \int \phi(f) \, d\pi - \phi\Bigl( \int f \, d\pi \Bigr)
= -\int_0^\infty \partial_t \Bigl( \int \phi(P_t f) \, d\pi \Bigr) \, dt
= \int_0^\infty \mathcal{E}(\phi' \circ P_t f, P_t f) \, dt \,.
\]

We now specialize our calculations to the entropy function $\phi(x) = x \ln x$ and use
reversibility of the semigroup.
\[
\begin{aligned}
\mathcal{E}(\phi' \circ P_t f, P_t f)
&= \int (\ln P_t f)\, (-\mathcal{L}) P_t f \, d\pi = \int (\ln P_t f)\, P_t(-\mathcal{L} f) \, d\pi \\
&= \int P_t \ln P_t f \; (-\mathcal{L}) f \, d\pi = \int \Gamma(P_t \ln P_t f, f) \, d\pi \\
&\le \sqrt{\int \frac{\Gamma(f, f)}{f} \, d\pi \int f \, \Gamma(P_t \ln P_t f, P_t \ln P_t f) \, d\pi}
\end{aligned}
\]
where the last line uses the Cauchy–Schwarz inequality (Exercise 1.6). By the chain rule
for the carré du champ, we have
\[
\Gamma(\ln f, f) = \frac{\Gamma(f, f)}{f} \,.
\]
By applying the local Poincaré inequality (Theorem 2.2.11) and the chain rule,
√︄∫ ∫
ℰ(ln 𝑃𝑡 𝑓 , 𝑃𝑡 𝑓 ) ≤ exp(−𝛼𝑡) Γ(ln 𝑓 , 𝑓 ) d𝜋 𝑓 𝑃𝑡 Γ(ln 𝑃𝑡 𝑓 , ln 𝑃𝑡 𝑓 ) d𝜋
√︄∫ ∫
= exp(−𝛼𝑡) Γ(ln 𝑓 , 𝑓 ) d𝜋 𝑃𝑡 𝑓 Γ(ln 𝑃𝑡 𝑓 , ln 𝑃𝑡 𝑓 ) d𝜋
√︄∫ ∫
= exp(−𝛼𝑡) Γ(ln 𝑓 , 𝑓 ) d𝜋 Γ(ln 𝑃𝑡 𝑓 , 𝑃𝑡 𝑓 ) d𝜋

which is rearranged to yield


ℰ(ln 𝑃𝑡 𝑓 , 𝑓 ) ≤ exp(−2𝛼𝑡) ℰ(ln 𝑓 , 𝑓 ) .
This shows that under CD(𝛼, ∞), the Fisher information (introduced in Example 1.2.25)
decays exponentially fast.
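Substituting this into the representation above,
\[
\operatorname{ent}_\pi f = \int_0^\infty \mathcal{E}(\ln P_t f, P_t f) \, dt
\le \mathcal{E}(\ln f, f) \int_0^\infty \exp(-2\alpha t) \, dt
\le \frac{1}{2\alpha} \, \mathcal{E}(\ln f, f) ,
\]
which is the log-Sobolev inequality. □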
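As a sanity check on the final inequality, the following sketch evaluates both sides numerically for the one-dimensional standard Gaussian, assuming Γ(𝑓, 𝑔) = 𝑓′𝑔′ and 𝛼 = 1. The helper names (`gauss_expect`, `entropy`, `fisher`) and the trapezoid quadrature are illustrative choices of ours, not part of the text.

```python
import math

def gauss_expect(h, lo=-12.0, hi=12.0, n=24000):
    """Trapezoid-rule approximation of E[h(Z)] for Z ~ N(0, 1)."""
    step = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        x = lo + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * h(x) * math.exp(-0.5 * x * x)
    return total * step / math.sqrt(2.0 * math.pi)

def entropy(f):
    """ent_pi(f) = E[f ln f] - E[f] ln E[f] for a positive function f."""
    mean = gauss_expect(f)
    return gauss_expect(lambda x: f(x) * math.log(f(x))) - mean * math.log(mean)

def fisher(f, df):
    """E(ln f, f) = E[f'(Z)^2 / f(Z)], assuming Gamma(f, g) = f' g'."""
    return gauss_expect(lambda x: df(x) ** 2 / f(x))

alpha = 1.0  # assumption: the standard Gaussian satisfies CD(1, infinity)
a = 0.7      # arbitrary slope for the exponential test function

# f(x) = exp(a x) saturates the inequality; f(x) = 1 + x^2 does not.
ent_exp = entropy(lambda x: math.exp(a * x))
fis_exp = fisher(lambda x: math.exp(a * x), lambda x: a * math.exp(a * x))
ent_quad = entropy(lambda x: 1.0 + x * x)
fis_quad = fisher(lambda x: 1.0 + x * x, lambda x: 2.0 * x)

print("exp test function:  ", ent_exp, "<=", fis_exp / (2 * alpha))
print("quadratic test func:", ent_quad, "<=", fis_quad / (2 * alpha))
```

For the exponential test function both sides agree up to quadrature error, which is consistent with exponentials being extremal for the Gaussian log-Sobolev inequality.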

2.2.4 Convergence in Rényi Divergence


One curiosity is that the log-Sobolev inequality directly implies a Poincaré inequality
(Exercise 1.8), and yet the convergence guarantees implied by these inequalities for the
Langevin diffusion are incomparable, because they apply to different metrics (𝜒 2 vs. KL).
It turns out that these convergence guarantees can be placed in the same framework
by introducing the family of Rényi divergences. Rényi divergences have also gained
importance in recent research due to applications to differential privacy [Mir17].

Definition 2.2.13. For 𝑞 > 1, the Rényi divergence of order 𝑞 between 𝜇 and 𝜋 is
defined by
\[
\mathcal{R}_q(\mu \,\|\, \pi) := \frac{1}{q-1} \ln \int \Bigl(\frac{d\mu}{d\pi}\Bigr)^{q} \, d\pi \tag{2.2.14}
\]
if 𝜇 ≪ 𝜋, and R𝑞 (𝜇 ‖ 𝜋) := +∞ otherwise.

Rényi divergences are monotonic in the order: if 1 < 𝑞 ≤ 𝑞′ < ∞, then R𝑞 ≤ R𝑞′ (this
follows from Jensen’s inequality). Some notable special cases include:
1. For 𝑞 ↘ 1, we have R𝑞 → KL.
2. For 𝑞 = 2, we have R𝑞 = ln(1 + 𝜒²).
3. For 𝑞 ↗ ∞, we have R𝑞 ↗ R∞, where $\mathcal{R}_\infty(\mu \,\|\, \pi) := \ln \bigl\| \frac{d\mu}{d\pi} \bigr\|_{L^\infty(\pi)}$.
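The three special cases can be checked on a finite state space, where the integral in (2.2.14) becomes a sum. The sketch below is only an illustration: the two probability vectors and the function names are arbitrary choices of ours.

```python
import math

def renyi(mu, pi, q):
    """R_q(mu || pi) = ln(sum_i pi_i (mu_i / pi_i)^q) / (q - 1), for q > 1."""
    return math.log(sum(p * (m / p) ** q for m, p in zip(mu, pi))) / (q - 1.0)

def kl(mu, pi):
    """KL(mu || pi) = sum_i mu_i ln(mu_i / pi_i)."""
    return sum(m * math.log(m / p) for m, p in zip(mu, pi) if m > 0)

def chi2(mu, pi):
    """chi^2(mu || pi) = sum_i (mu_i - pi_i)^2 / pi_i."""
    return sum((m - p) ** 2 / p for m, p in zip(mu, pi))

# Illustrative distributions on three points (mu << pi holds trivially).
mu = [0.5, 0.3, 0.2]
pi = [0.2, 0.3, 0.5]

orders = [1.0001, 1.5, 2.0, 4.0, 16.0, 64.0]
values = [renyi(mu, pi, q) for q in orders]
r_inf = math.log(max(m / p for m, p in zip(mu, pi)))

for q, r in zip(orders, values):
    print(f"q = {q:8.4f}   R_q = {r:.6f}")
print("KL =", kl(mu, pi), "  ln(1 + chi^2) =", math.log(1.0 + chi2(mu, pi)), "  R_inf =", r_inf)
```

The printed values increase with 𝑞, start near the KL divergence, pass through ln(1 + 𝜒²) at 𝑞 = 2, and stay below R∞.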
Remarkably, Vempala and Wibisono [VW19] show that either a Poincaré inequality or a
log-Sobolev inequality implies convergence of the Langevin diffusion in Rényi divergence.
We will prove the following theorem.

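Theorem 2.2.15 ([VW19]). Let (𝑃𝑡 )𝑡 ≥0 be a reversible diffusion Markov semigroup, and
let (𝜋𝑡 )𝑡 ≥0 denote the law of the Markov process associated with the semigroup.

1. Suppose that a log-Sobolev inequality holds with constant 𝐶 LSI . Then, for all 𝑞 ≥ 1,
\[
\mathcal{R}_q(\pi_t \,\|\, \pi) \le \exp\Bigl(-\frac{2t}{q C_{\mathrm{LSI}}}\Bigr) \, \mathcal{R}_q(\pi_0 \,\|\, \pi) .
\]

2. Suppose that a Poincaré inequality holds with constant 𝐶 PI . Then, for all 𝑞 ≥ 2,
\[
\mathcal{R}_q(\pi_t \,\|\, \pi) \le
\begin{cases}
\mathcal{R}_q(\pi_0 \,\|\, \pi) - \dfrac{2t}{q C_{\mathrm{PI}}} , & \text{if } \mathcal{R}_q(\pi_t \,\|\, \pi) \ge 1 , \\[2ex]
\exp\Bigl(-\dfrac{2t}{q C_{\mathrm{PI}}}\Bigr) \, \mathcal{R}_q(\pi_0 \,\|\, \pi) , & \text{if } \mathcal{R}_q(\pi_0 \,\|\, \pi) \le 1 .
\end{cases}
\]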

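Proof. We begin by differentiating the Rényi divergence in time. Let 𝜌𝑡 := d𝜋𝑡 /d𝜋 = 𝑃𝑡 𝜌0 .
Applying the chain rule for the carré du champ,
\[
\begin{aligned}
\partial_t \mathcal{R}_q(\pi_t \,\|\, \pi)
&= \frac{1}{q-1} \, \frac{\partial_t \int \rho_t^q \, d\pi}{\int \rho_t^q \, d\pi}
= \frac{q}{q-1} \, \frac{\int \rho_t^{q-1} \, \mathcal{L} \rho_t \, d\pi}{\int \rho_t^q \, d\pi}
= -\frac{q}{q-1} \, \frac{\int \Gamma(\rho_t^{q-1}, \rho_t) \, d\pi}{\int \rho_t^q \, d\pi} \\
&= -\frac{4}{q} \, \frac{\mathcal{E}(\rho_t^{q/2}, \rho_t^{q/2})}{\int \rho_t^q \, d\pi} .
\end{aligned}
\]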

Log-Sobolev case. The log-Sobolev inequality reads (due to the chain rule) as

2𝐶 LSI ℰ(𝑓 , 𝑓 ) ≥ ent𝜋 (𝑓 2 ) .

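Applying this to 𝑓 = 𝜌^{𝑞/2}, we obtain
\[
\begin{aligned}
2 C_{\mathrm{LSI}} \, \mathcal{E}(\rho^{q/2}, \rho^{q/2})
&\ge q \int \rho^q \ln \rho \, d\pi - \Bigl(\int \rho^q \, d\pi\Bigr) \ln \Bigl(\int \rho^q \, d\pi\Bigr) \\
&= q \, \partial_q \int \rho^q \, d\pi - \Bigl(\int \rho^q \, d\pi\Bigr) \ln \Bigl(\int \rho^q \, d\pi\Bigr)
\end{aligned}
\]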

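and hence
\[
\begin{aligned}
\frac{4}{q} \, \frac{\mathcal{E}(\rho^{q/2}, \rho^{q/2})}{\int \rho^q \, d\pi}
&\ge \frac{2}{C_{\mathrm{LSI}}} \, \partial_q \ln \int \rho^q \, d\pi - \frac{2}{q C_{\mathrm{LSI}}} \ln \int \rho^q \, d\pi \\
&= \frac{2}{C_{\mathrm{LSI}}} \, \partial_q \bigl[(q-1) \, \mathcal{R}_q(\rho\pi \,\|\, \pi)\bigr] - \frac{2 \, (q-1)}{q C_{\mathrm{LSI}}} \, \mathcal{R}_q(\rho\pi \,\|\, \pi) \\
&= \frac{2}{C_{\mathrm{LSI}}} \, \mathcal{R}_q(\rho\pi \,\|\, \pi) + \frac{2 \, (q-1)}{C_{\mathrm{LSI}}} \, \underbrace{\partial_q \mathcal{R}_q(\rho\pi \,\|\, \pi)}_{\ge 0} - \frac{2 \, (q-1)}{q C_{\mathrm{LSI}}} \, \mathcal{R}_q(\rho\pi \,\|\, \pi) \\
&\ge \frac{2}{q C_{\mathrm{LSI}}} \, \mathcal{R}_q(\rho\pi \,\|\, \pi)
\end{aligned}
\]
where we used the fact that the Rényi divergence is monotonic in the order.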
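Poincaré case. Next, applying a Poincaré inequality to 𝑓 = 𝜌^{𝑞/2},
\[
\begin{aligned}
C_{\mathrm{PI}} \, \mathcal{E}(\rho^{q/2}, \rho^{q/2}) \ge \operatorname{var}_\pi(\rho^{q/2})
&= \int \rho^q \, d\pi - \Bigl(\int \rho^{q/2} \, d\pi\Bigr)^2 \\
&= \int \rho^q \, d\pi \, \Bigl[ 1 - \frac{\exp((q-2) \, \mathcal{R}_{q/2}(\rho\pi \,\|\, \pi))}{\exp((q-1) \, \mathcal{R}_q(\rho\pi \,\|\, \pi))} \Bigr] \\
&\ge \int \rho^q \, d\pi \, \bigl\{ 1 - \exp\bigl(-\mathcal{R}_q(\rho\pi \,\|\, \pi)\bigr) \bigr\}
\end{aligned}
\]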


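where we used the monotonicity of the Rényi divergence in the order. Hence,
\[
\begin{aligned}
\frac{4}{q} \, \frac{\mathcal{E}(\rho^{q/2}, \rho^{q/2})}{\int \rho^q \, d\pi}
&\ge \frac{4}{q C_{\mathrm{PI}}} \, \bigl\{ 1 - \exp\bigl(-\mathcal{R}_q(\rho\pi \,\|\, \pi)\bigr) \bigr\} \\
&\ge \frac{2}{q C_{\mathrm{PI}}}
\begin{cases}
1 , & \text{if } \mathcal{R}_q(\rho\pi \,\|\, \pi) \ge 1 , \\
\mathcal{R}_q(\rho\pi \,\|\, \pi) , & \text{if } \mathcal{R}_q(\rho\pi \,\|\, \pi) \le 1 .
\end{cases}
\end{aligned}
\]
□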

To interpret this theorem, the Poincaré result states that after an initial waiting period
of time 𝑂 (𝑞𝐶 PI R𝑞 (𝜋0 k 𝜋)), the Rényi divergence starts decaying exponentially fast. On
the other hand, the log-Sobolev inequality implies exponentially fast convergence from
the outset. In particular, for 𝑞 = 2, we see that whereas a Poincaré inequality implies
exponential decay of 𝜒 2 , a log-Sobolev inequality implies exponential decay of ln(1 + 𝜒 2 ),
which is substantially stronger.
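To make the two regimes concrete, the following sketch tabulates the bounds of Theorem 2.2.15 as functions of 𝑡. The function names are hypothetical, and stitching the two Poincaré cases into a single curve (linear decay until the bound reaches 1, then exponential decay restarted from that time) is our reading of the theorem rather than a statement taken from it.

```python
import math

def renyi_bound_lsi(t, q, c_lsi, r0):
    """Bound on R_q(pi_t || pi) under a log-Sobolev inequality."""
    return math.exp(-2.0 * t / (q * c_lsi)) * r0

def renyi_bound_pi(t, q, c_pi, r0):
    """Bound on R_q(pi_t || pi) under a Poincare inequality (stitched cases)."""
    rate = 2.0 / (q * c_pi)
    if r0 <= 1.0:
        return math.exp(-rate * t) * r0
    # Waiting period: the linear bound r0 - rate * t reaches 1 at time t_wait.
    t_wait = (r0 - 1.0) / rate
    if t <= t_wait:
        return r0 - rate * t
    # Afterwards, restart the exponential bound from a divergence of at most 1.
    return math.exp(-rate * (t - t_wait))

q, const, r0 = 2.0, 1.0, 5.0  # illustrative values only
for t in [0.0, 1.0, 2.0, 4.0, 6.0, 8.0]:
    print(t, renyi_bound_lsi(t, q, const, r0), renyi_bound_pi(t, q, const, r0))
```

With equal constants, the log-Sobolev curve lies below the Poincaré curve at every time, and the waiting period has length 𝑞 𝐶PI (R𝑞 (𝜋0 ‖ 𝜋) − 1)/2, in line with the 𝑂(𝑞 𝐶PI R𝑞 (𝜋0 ‖ 𝜋)) estimate above.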

2.3 Operations Preserving Functional Inequalities


To further expand the class of distributions known to satisfy functional inequalities, we
will show in this section that functional inequalities are stable under various common
operations on probability measures.
We let 𝐶 PI (𝜋) denote the Poincaré constant of a probability measure 𝜋, and similarly
write 𝐶 LSI (𝜋), 𝐶 T2 (𝜋), etc.

2.3.1 Bounded Perturbation


Suppose that 𝜋 satisfies a functional inequality, and that 𝜇 is another probability measure
satisfying 0 < 𝑐 ≤ d𝜇/d𝜋 ≤ 𝐶 < ∞. Then, it often follows that 𝜇 also satisfies the same
functional inequality, with a worse constant. This furnishes a large class of examples of
non-log-concave measures satisfying functional inequalities.

Proposition 2.3.1 (Holley–Stroock perturbation). Suppose that 𝜋 satisfies either a
Poincaré or log-Sobolev inequality. Then, if 𝜇 satisfies 0 < 𝑐 ≤ d𝜇/d𝜋 ≤ 𝐶 < ∞, then 𝜇 also
satisfies the corresponding functional inequality with constant
\[
C_{\mathrm{PI}}(\mu) \le \frac{C}{c} \, C_{\mathrm{PI}}(\pi) \qquad \text{or} \qquad C_{\mathrm{LSI}}(\mu) \le \frac{C}{c} \, C_{\mathrm{LSI}}(\pi)
\]
respectively.

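Proof. The key is to find a variational principle for the variance or for the entropy. For
the variance, for any 𝜈 ∈ P (R𝑑 ) and 𝑓 : R𝑑 → R,
\[
\operatorname{var}_\nu f = \inf_{m \in \mathbb{R}} \mathbb{E}_\nu[|f - m|^2] .
\]
From this,
\[
\begin{aligned}
\operatorname{var}_\mu f = \inf_{m \in \mathbb{R}} \mathbb{E}_\mu[|f - m|^2]
&\le C \inf_{m \in \mathbb{R}} \mathbb{E}_\pi[|f - m|^2] = C \operatorname{var}_\pi f \\
&\le C \, C_{\mathrm{PI}}(\pi) \, \mathbb{E}_\pi \Gamma(f, f) \le \frac{C}{c} \, C_{\mathrm{PI}}(\pi) \, \mathbb{E}_\mu \Gamma(f, f) .
\end{aligned}
\]
The proof for the log-Sobolev inequality is similar once we have the variational principle
\[
\operatorname{ent}_\nu f = \inf_{t > 0} \mathbb{E}_\nu \Bigl[ \underbrace{f \ln \frac{f}{t} - f + t}_{\ge 0} \Bigr]
\]
for any 𝑓 : R𝑑 → R+ , which we leave for the reader to check. □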

One can also state a perturbation principle for the T2 inequality, but it is more involved
and the constants are less precise, see [BGL14, Proposition 9.6.3].

Proposition 2.3.2. Suppose that 𝜋 satisfies a T2 inequality. If 𝜇 ∈ P2 (R𝑑 ) satisfies
0 < 𝐶⁻¹ ≤ d𝜇/d𝜋 ≤ 𝐶 < ∞, then 𝜇 also satisfies a T2 inequality, where 𝐶 T2 (𝜇) is bounded
in terms of 𝐶 and 𝐶 T2 (𝜋) only.

2.3.2 Contractive Mapping


Another simple but useful condition which enables us to transfer functional inequalities
from 𝜋 to 𝜇 is the existence of a Lipschitz mapping which pushes forward 𝜋 to 𝜇.

Proposition 2.3.3. Suppose that 𝜋 ∈ P (R𝑑 ) satisfies either a Poincaré or a log-Sobolev


inequality, and that there exists an 𝐿-Lipschitz map 𝑇 : R𝑑 → R𝑑 such that 𝜇 = 𝑇# 𝜋.
Then, 𝜇 also satisfies the corresponding functional inequality with constant

𝐶 PI (𝜇) ≤ 𝐿 2 𝐶 PI (𝜋) or 𝐶 LSI (𝜇) ≤ 𝐿 2 𝐶 LSI (𝜋)

respectively.

Proof. Assume for simplicity that 𝑇 is continuously differentiable, so that ‖∇𝑇‖op ≤ 𝐿.
Then, for 𝑓 : R𝑑 → R, by applying the Poincaré inequality for 𝜋,
\[
\begin{aligned}
\operatorname{var}_\mu f = \operatorname{var}_\pi(f \circ T)
&\le C_{\mathrm{PI}}(\pi) \, \mathbb{E}_\pi[\|\nabla (f \circ T)\|^2]
\le C_{\mathrm{PI}}(\pi) \, \mathbb{E}_\pi[\|\nabla T\|_{\mathrm{op}}^2 \, \|\nabla f \circ T\|^2] \\
&\le C_{\mathrm{PI}}(\pi) \, L^2 \, \mathbb{E}_\pi[\|\nabla f \circ T\|^2] = C_{\mathrm{PI}}(\pi) \, L^2 \, \mathbb{E}_\mu[\|\nabla f\|^2] .
\end{aligned}
\]
The proof for the log-Sobolev inequality is similar. □


This result becomes particularly powerful when combined with Caffarelli’s contrac-
tion theorem, which states that the optimal transport map from the standard Gaussian
to an 𝛼-strongly log-concave measure is 𝛼 −1/2 -Lipschitz. As it is often easier to prove
functional inequalities for the standard Gaussian, this principle then quickly implies
Poincaré and log-Sobolev inequalities (as well as many other functional inequalities) for
strongly log-concave measures. We will return to this in Chapter 3.

2.3.3 Convolution
Next, we show that if 𝜋 = 𝜋1 ∗ 𝜋2 is a convolution of two measures, where both 𝜋1 and 𝜋2
satisfy a functional inequality, then so does 𝜋. This is a consequence of the subadditivity
of variance and entropy. We begin with a variational principle for the entropy.

Lemma 2.3.4 (variational principle for entropy). For 𝑓 : R𝑑 → R>0 ,

ent𝜋 𝑓 = sup{E𝜋 [𝑓 𝑔] | 𝑔 : R𝑑 → R such that E𝜋 exp 𝑔 ≤ 1} .

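Proof. We may assume that E𝜋 exp 𝑔 = 1, and define 𝜇 via d𝜇/d𝜋 := exp 𝑔. Then,
\[
\operatorname{ent}_\pi f = \mathbb{E}_\pi \Bigl[ f \ln \frac{f}{\mathbb{E}_\pi f} \Bigr]
= \underbrace{\mathbb{E}_\mu \Bigl[ f \exp(-g) \ln \frac{f \exp(-g)}{\mathbb{E}_\mu[f \exp(-g)]} \Bigr]}_{= \operatorname{ent}_\mu(f \exp(-g)) \ge 0} + \mathbb{E}_\pi[f g] .
\]
Equality holds if 𝑔 = ln(𝑓 /E𝜋 𝑓 ). □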


Remark 2.3.5. The variational principle above is essentially a reformulation of the
Donsker–Varadhan variational principle (Theorem 1.5.4).

Lemma 2.3.6 (subadditivity of variance and entropy). If 𝑋1, . . . , 𝑋𝑛 are independent
random variables, then
\[
\begin{aligned}
\operatorname{var} f(X_1, \dots, X_n) &\le \mathbb{E} \sum_{i=1}^n \operatorname{var}\bigl( f(X_1, \dots, X_n) \bigm| X_{-i} \bigr) , \\
\operatorname{ent} f(X_1, \dots, X_n) &\le \mathbb{E} \sum_{i=1}^n \operatorname{ent}\bigl( f(X_1, \dots, X_n) \bigm| X_{-i} \bigr) .
\end{aligned}
\]
Here, var(· | 𝑋−𝑖 ) and ent(· | 𝑋−𝑖 ) denote the conditional variance and entropy respectively
when all variables except 𝑋𝑖 are held fixed, i.e.,
\[
\begin{aligned}
\operatorname{var}\bigl( f(X_1, \dots, X_n) \bigm| X_{-i} = x_{-i} \bigr) &= \operatorname{var} f(x_1, \dots, x_{i-1}, X_i, x_{i+1}, \dots, x_n) , \\
\operatorname{ent}\bigl( f(X_1, \dots, X_n) \bigm| X_{-i} = x_{-i} \bigr) &= \operatorname{ent} f(x_1, \dots, x_{i-1}, X_i, x_{i+1}, \dots, x_n) .
\end{aligned}
\]




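Proof. The subadditivity of the variance was established in Exercise 1.2, so we turn towards
the entropy. Let 𝑍 := 𝑓 (𝑋1, . . . , 𝑋𝑛 ) and
\[
\Delta_i := \ln \mathbb{E}[Z \mid X_1, \dots, X_i] - \ln \mathbb{E}[Z \mid X_1, \dots, X_{i-1}] ,
\]
so that the sum of the Δ𝑖 telescopes to ln 𝑍 − ln E 𝑍 and
\[
\operatorname{ent} Z = \mathbb{E}[Z \, (\ln Z - \ln \mathbb{E} Z)] = \mathbb{E}\Bigl[ Z \sum_{i=1}^n \Delta_i \Bigr] .
\]
Since
\[
\mathbb{E}[\exp \Delta_i \mid X_{-i}] = \frac{\mathbb{E}[\mathbb{E}[Z \mid X_1, \dots, X_i] \mid X_{-i}]}{\mathbb{E}[Z \mid X_1, \dots, X_{i-1}]} = 1 ,
\]
the variational principle yields
\[
\mathbb{E}[Z \Delta_i] = \mathbb{E} \, \mathbb{E}[Z \Delta_i \mid X_{-i}] \le \mathbb{E} \operatorname{ent}(Z \mid X_{-i}) .
\]
□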
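Both inequalities of Lemma 2.3.6 can be verified by exact enumeration when the variables take finitely many values. The sketch below does this for 𝑛 = 2; the laws `p1`, `p2` and the table `f` are arbitrary illustrative choices, and all names are ours.

```python
import math

# Independent X1, X2 on small finite sets, with laws given as probability vectors.
p1 = [0.2, 0.5, 0.3]
p2 = [0.6, 0.4]
# An arbitrary positive function f(x1, x2), stored as a table f[x1][x2].
f = [[1.0, 4.0], [2.5, 0.5], [3.0, 2.0]]

def var(values, probs):
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

def ent(values, probs):
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * v * math.log(v) for v, p in zip(values, probs)) - mean * math.log(mean)

def subadditivity_sides(functional):
    """Return (functional of f(X1, X2), sum over i of E[functional(f | X_{-i})])."""
    n1, n2 = len(p1), len(p2)
    joint_values = [f[i][j] for i in range(n1) for j in range(n2)]
    joint_probs = [p1[i] * p2[j] for i in range(n1) for j in range(n2)]
    total = functional(joint_values, joint_probs)
    # i = 1: freeze X2 = j and let X1 vary, then average over X2.
    term1 = sum(p2[j] * functional([f[i][j] for i in range(n1)], p1) for j in range(n2))
    # i = 2: freeze X1 = i and let X2 vary, then average over X1.
    term2 = sum(p1[i] * functional(f[i], p2) for i in range(n1))
    return total, term1 + term2

var_total, var_sum = subadditivity_sides(var)
ent_total, ent_sum = subadditivity_sides(ent)
print("variance:", var_total, "<=", var_sum)
print("entropy: ", ent_total, "<=", ent_sum)
```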

Proposition 2.3.7. Suppose that 𝜋 = 𝜋1 ∗ 𝜋2 ∈ P (R𝑑 ), where 𝜋1 and 𝜋2 both satisfy


either a Poincaré or a log-Sobolev inequality. Then, 𝜋 also satisfies the corresponding
functional inequality with constant

𝐶 PI (𝜋) ≤ 𝐶 PI (𝜋1 ) + 𝐶 PI (𝜋2 ) or 𝐶 LSI (𝜋) ≤ 𝐶 LSI (𝜋1 ) + 𝐶 LSI (𝜋2 )

respectively.

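Proof. Let 𝑋 ∼ 𝜋1 and 𝑌 ∼ 𝜋2 be independent, and let 𝑓 : R𝑑 → R. Using the subadditivity
of the variance,
\[
\begin{aligned}
\operatorname{var}_\pi f = \operatorname{var} f(X + Y)
&\le \mathbb{E} \operatorname{var}\bigl( f(X + Y) \bigm| Y \bigr) + \mathbb{E} \operatorname{var}\bigl( f(X + Y) \bigm| X \bigr) \\
&\le \{ C_{\mathrm{PI}}(\pi_1) + C_{\mathrm{PI}}(\pi_2) \} \, \mathbb{E}[\|\nabla f(X + Y)\|^2] ,
\end{aligned}
\]
and a similar argument holds for the entropy. □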

2.3.4 Mixtures
Suppose that 𝜋 is a mixture, 𝜋 = 𝜇𝑃 := ∫ 𝑃𝑥 d𝜇 (𝑥), where 𝜇 ∈ P (X) is the mixing measure
and (𝑃𝑥 )𝑥 ∈X is a family of probability measures on R𝑑 indexed by X (in other words, a
Markov kernel). For example, when X = [𝑘], then 𝜇𝑃 is a mixture of 𝑘 distributions
𝑃1, . . . , 𝑃𝑘 with mixing weights given by 𝜇. When X = R𝑑 and 𝑃𝑥 is the translation of a
fixed probability measure 𝜈 ∈ P (R𝑑 ) by 𝑥, then 𝜇𝑃 = 𝜇 ∗ 𝜈 is the convolution of 𝜇 and 𝜈.
Under general conditions on the mixture, it turns out that if each 𝑃𝑥 satisfies a func-
tional inequality, then so does the mixture 𝜇𝑃. The simplest demonstration of this idea is
for the Poincaré inequality. Although the arguments in this section apply more generally
to mixtures 𝜇𝑃 on arbitrary state spaces, we focus on the R𝑑 case for simplicity.

Proposition 2.3.8 (PI for mixtures, [Bar+18]). Let 𝜇𝑃 be a mixture and assume that
each 𝑃𝑥 satisfies a Poincaré inequality with constant 𝐶 PI (𝑃). Also, assume that
\[
C_{\chi^2} := \sup_{x, x' \in \operatorname{supp}(\mu)} \chi^2(P_x \,\|\, P_{x'}) < \infty . \tag{2.3.9}
\]
Then, 𝜇𝑃 satisfies a Poincaré inequality with constant
\[
C_{\mathrm{PI}}(\mu P) \le \Bigl( 1 + \frac{C_{\chi^2}}{2} \Bigr) \, C_{\mathrm{PI}}(P) .
\]

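Proof. Let 𝑓 : R𝑑 → R, and let 𝑋, 𝑋′ be i.i.d. draws from 𝜇. By the law of total variance,
\[
\operatorname{var}_{\mu P} f = \mathbb{E} \operatorname{var}_{P_X} f + \operatorname{var} \mathbb{E}_{P_X} f .
\]
The first term is easy to control, because we can apply the Poincaré inequality for 𝑃𝑋
inside the expectation, so the main difficulty lies in the second term. Here,
\[
\begin{aligned}
\operatorname{var} \mathbb{E}_{P_X} f
&= \frac{1}{2} \, \mathbb{E}[|\mathbb{E}_{P_X} f - \mathbb{E}_{P_{X'}} f|^2]
= \frac{1}{2} \, \mathbb{E}\Bigl[ \Bigl| \int f \, \Bigl( \frac{dP_X}{dP_{X'}} - 1 \Bigr) \, dP_{X'} \Bigr|^2 \Bigr] \\
&\le \frac{1}{2} \, \mathbb{E}[(\operatorname{var}_{P_{X'}} f) \, \chi^2(P_X \,\|\, P_{X'})]
\le \frac{C_{\chi^2}}{2} \, \mathbb{E} \operatorname{var}_{P_X} f .
\end{aligned}
\]
Hence,
\[
\operatorname{var}_{\mu P} f \le \Bigl( 1 + \frac{C_{\chi^2}}{2} \Bigr) \, \mathbb{E} \operatorname{var}_{P_X} f
\le \Bigl( 1 + \frac{C_{\chi^2}}{2} \Bigr) \, C_{\mathrm{PI}}(P) \, \underbrace{\mathbb{E} \, \mathbb{E}_{P_X} \Gamma(f, f)}_{= \mathbb{E}_{\mu P} \Gamma(f, f)} .
\]
□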
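The variance decomposition and the 𝜒² bound on the between-component term do not involve the carré du champ, so they can be checked exactly on a finite state space. In the sketch below, the weights `mu`, the components `P`, and the test function `f` are arbitrary illustrative choices of ours.

```python
# Two mixture components on a three-point space, with mixing weights mu.
mu = [0.3, 0.7]
P = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
f = [1.0, -2.0, 4.0]

def mean(g, p):
    return sum(w * v for v, w in zip(g, p))

def var(g, p):
    m = mean(g, p)
    return sum(w * (v - m) ** 2 for v, w in zip(g, p))

def chi2(p, q):
    """chi^2(p || q) = sum_k (p_k - q_k)^2 / q_k."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

components = range(len(mu))
# The mixture mu P as a probability vector on the three-point space.
mix = [sum(mu[x] * P[x][k] for x in components) for k in range(len(f))]

within = sum(mu[x] * var(f, P[x]) for x in components)   # E var_{P_X} f
between = var([mean(f, P[x]) for x in components], mu)   # var E_{P_X} f
c_chi2 = max(chi2(P[x], P[y]) for x in components for y in components)

print("law of total variance:", var(f, mix), "=", within + between)
print("between-component bound:", between, "<=", 0.5 * c_chi2 * within)
```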

Our aim is to extend this idea to the log-Sobolev inequality, which will require a few
preliminaries. Rather than aiming to directly prove a log-Sobolev inequality, we will
instead prove a defective log-Sobolev inequality: for all 𝑓 : R𝑑 → R,

ent𝜋 (𝑓 2 ) ≤ 2𝐶 E𝜋 Γ(𝑓 , 𝑓 ) + 𝐷 E𝜋 [𝑓 2 ] . (2.3.10)

Although the defective LSI involves an extra term on the right-hand side of the inequality,
the extra term can be removed via a Poincaré inequality. The following two results show
how this is achieved.

Lemma 2.3.11 (Rothaus lemma). Let 𝑓 : R𝑑 → R. For all 𝑐 ∈ R,
\[
\operatorname{ent}_\pi\bigl( (f + c)^2 \bigr) \le \operatorname{ent}_\pi(f^2) + 2 \, \mathbb{E}_\pi[f^2] .
\]


Proof. Omitted; see [BGL14, Lemma 5.1.4]. 

Lemma 2.3.12 (tightening a defective LSI). Suppose that 𝜋 satisfies the defective
log-Sobolev inequality (2.3.10), together with a Poincaré inequality. Then, 𝜋 satisfies a
log-Sobolev inequality with constant
\[
C_{\mathrm{LSI}} \le C + C_{\mathrm{PI}} \, \Bigl( 1 + \frac{D}{2} \Bigr) .
\]

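Proof. Using the Rothaus lemma, the defective log-Sobolev inequality, and the Poincaré
inequality, we obtain
\[
\begin{aligned}
\operatorname{ent}_\pi(f^2)
&\le \operatorname{ent}_\pi\bigl( (f - \mathbb{E}_\pi f)^2 \bigr) + 2 \operatorname{var}_\pi(f)
\le 2C \, \mathbb{E}_\pi \Gamma(f, f) + (2 + D) \operatorname{var}_\pi(f) \\
&\le \{ 2C + C_{\mathrm{PI}} \, (2 + D) \} \, \mathbb{E}_\pi \Gamma(f, f) .
\end{aligned}
\]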



We also need one change of measure lemma.

Lemma 2.3.13 ([CCN21]). Suppose that 𝜇, 𝜈 are probability measures and 𝑓 is a positive
function. Then,
\[
\mathbb{E}_\mu(f) \ln \frac{\mathbb{E}_\mu(f)}{\mathbb{E}_\nu(f)} \le \operatorname{ent}_\mu(f) + \mathbb{E}_\mu(f) \ln\bigl( 1 + \chi^2(\mu \,\|\, \nu) \bigr) .
\]

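Proof. By rescaling, we may assume E𝜇 𝑓 = 1. Recall the Donsker–Varadhan variational
principle (Theorem 1.5.4), which states
\[
\mathrm{KL}(\eta \,\|\, \eta') = \sup\bigl\{ \mathbb{E}_\eta g - \ln \mathbb{E}_{\eta'} \exp g \bigm| g : \mathcal{X} \to \mathbb{R} \text{ is bounded and measurable} \bigr\} .
\]
If we take 𝜂 = 𝑓 𝜇, 𝜂′ = 𝜈, and 𝑔 = ln(𝑓 /E𝜈 𝑓 ), then
\[
\mathbb{E}_\mu \Bigl[ f \ln \frac{f}{\mathbb{E}_\nu f} \Bigr] = \mathbb{E}_\eta \ln \frac{f}{\mathbb{E}_\nu f}
\le \mathrm{KL}(\eta \,\|\, \nu) + \underbrace{\ln \mathbb{E}_\nu \frac{f}{\mathbb{E}_\nu f}}_{= 0}
= \mathbb{E}_\mu \Bigl[ f \ln \Bigl( f \, \frac{d\mu}{d\nu} \Bigr) \Bigr] .
\]
By subtracting E𝜇 (𝑓 ln 𝑓 ) from both sides, we obtain
\[
\ln \frac{1}{\mathbb{E}_\nu f} = \mathbb{E}_\mu \Bigl[ f \ln \frac{1}{\mathbb{E}_\nu f} \Bigr] \le \mathbb{E}_\mu \Bigl[ f \ln \frac{d\mu}{d\nu} \Bigr] .
\]
Next, applying the Donsker–Varadhan principle a second time with 𝜂 = 𝑓 𝜇, 𝜂′ = 𝜇,
and 𝑔 = ln(d𝜇/d𝜈) yields
\[
\mathbb{E}_\mu \Bigl[ f \ln \frac{d\mu}{d\nu} \Bigr] = \mathbb{E}_\eta \ln \frac{d\mu}{d\nu}
\le \mathrm{KL}(\eta \,\|\, \mu) + \ln \mathbb{E}_\mu \frac{d\mu}{d\nu}
= \operatorname{ent}_\mu f + \ln\bigl( 1 + \chi^2(\mu \,\|\, \nu) \bigr) ,
\]
which is what we wanted to show. □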
We can now prove the log-Sobolev inequality for mixtures.

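Proposition 2.3.14 (LSI for mixtures, [CCN21]). Let 𝜇𝑃 be a mixture and assume that
each 𝑃𝑥 satisfies a log-Sobolev inequality with constant 𝐶 LSI (𝑃). Also, assume that
\[
C_{\chi^2} := \sup_{x, x' \in \operatorname{supp}(\mu)} \chi^2(P_x \,\|\, P_{x'}) < \infty .
\]
Then, 𝜇𝑃 satisfies a log-Sobolev inequality with constant
\[
\begin{aligned}
C_{\mathrm{LSI}}(\mu P)
&\le C_{\mathrm{LSI}}(P) \, \Bigl\{ 2 + \Bigl( 1 + \frac{C_{\chi^2}}{2} \Bigr) \Bigl( 1 + \frac{\ln(1 + C_{\chi^2})}{2} \Bigr) \Bigr\} \\
&\le 6 \, C_{\mathrm{LSI}}(P) \, \bigl\{ 1 \vee C_{\chi^2} \ln(1 + C_{\chi^2}) \bigr\} .
\end{aligned}
\]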

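Proof. We begin, as in the proof of Proposition 2.3.8, with a decomposition of the entropy:
\[
\operatorname{ent}_{\mu P}(f^2) = \mathbb{E} \operatorname{ent}_{P_X}(f^2) + \operatorname{ent} \mathbb{E}_{P_X}(f^2) .
\]
As before, it is the second term that is difficult to control.
Applying Lemma 2.3.13,
\[
\begin{aligned}
\operatorname{ent} \mathbb{E}_{P_X}(f^2)
&= \mathbb{E}\Bigl[ \mathbb{E}_{P_X}(f^2) \ln \frac{\mathbb{E}_{P_X}(f^2)}{\mathbb{E}_{\mu P}(f^2)} \Bigr]
\le \mathbb{E}\bigl[ \operatorname{ent}_{P_X}(f^2) + \mathbb{E}_{P_X}(f^2) \ln\bigl( 1 + \chi^2(P_X \,\|\, \mu P) \bigr) \bigr] \\
&\le \mathbb{E} \operatorname{ent}_{P_X}(f^2) + \ln(1 + C_{\chi^2}) \, \mathbb{E}_{\mu P}(f^2) .
\end{aligned}
\]
Hence,
\[
\begin{aligned}
\operatorname{ent}_{\mu P}(f^2)
&\le 2 \, \mathbb{E} \operatorname{ent}_{P_X}(f^2) + \ln(1 + C_{\chi^2}) \, \mathbb{E}_{\mu P}(f^2) \\
&\le 4 \, C_{\mathrm{LSI}}(P) \, \mathbb{E}_{\mu P} \Gamma(f, f) + \ln(1 + C_{\chi^2}) \, \mathbb{E}_{\mu P}(f^2) ,
\end{aligned}
\]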

where we have applied the log-Sobolev inequality for 𝑃𝑋 . This is a defective log-Sobolev in-
equality for 𝜇𝑃; by applying the Poincaré inequality from Proposition 2.3.8 and tightening
the inequality via Lemma 2.3.12, we conclude the proof. 

Example 2.3.15 (LSI for Gaussian mixtures). Suppose that 𝜇 is supported on a ball
B(0, 𝑅), and that for each 𝑥 ∈ R𝑑 , 𝑃𝑥 = normal(𝑥, 𝜎²𝐼𝑑 ). Then, 𝜇𝑃 is the convolution
𝜇 ∗ normal(0, 𝜎²𝐼𝑑 ). Since 𝐶 LSI (𝑃) = 𝜎² and
\[
\chi^2(P_x \,\|\, P_{x'}) = \exp\Bigl( \frac{\|x - x'\|^2}{\sigma^2} \Bigr) - 1 \le \exp\Bigl( \frac{4R^2}{\sigma^2} \Bigr) - 1
\]
for 𝑥, 𝑥′ ∈ B(0, 𝑅), we deduce that 𝜇𝑃 satisfies a log-Sobolev inequality with constant
\[
C_{\mathrm{LSI}}(\mu P) \lesssim \sigma^2 \vee R^2 \exp\Bigl( \frac{4R^2}{\sigma^2} \Bigr) .
\]
Hence, Gaussian convolutions of measures with bounded support satisfy a log-Sobolev
inequality. The exponential dependence on 𝑅²/𝜎² is unavoidable in general.
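To get a feel for the size of these constants, the following sketch evaluates the two bounds of Proposition 2.3.14 in the setting of this example, using 𝐶 LSI (𝑃) = 𝜎² and the worst-case value 𝐶𝜒² = exp(4𝑅²/𝜎²) − 1. The function name and the sample values of 𝜎² and 𝑅 are hypothetical.

```python
import math

def lsi_bound_gaussian_mixture(sigma2, radius):
    """Return (sharper bound, simplified bound) on C_LSI(mu P) from Proposition 2.3.14."""
    c_chi2 = math.expm1(4.0 * radius ** 2 / sigma2)  # exp(4 R^2 / sigma^2) - 1
    log_term = math.log1p(c_chi2)                    # ln(1 + C_chi2) = 4 R^2 / sigma^2
    sharper = sigma2 * (2.0 + (1.0 + c_chi2 / 2.0) * (1.0 + log_term / 2.0))
    simplified = 6.0 * sigma2 * max(1.0, c_chi2 * log_term)
    return sharper, simplified

for sigma2, radius in [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (0.25, 1.0), (4.0, 1.0)]:
    sharper, simplified = lsi_bound_gaussian_mixture(sigma2, radius)
    print(f"sigma^2 = {sigma2:5.2f}  R = {radius:4.2f}  sharper = {sharper:.4g}  simplified = {simplified:.4g}")
```

When 𝑅 = 0 the mixture is a single Gaussian and the sharper bound evaluates to 3𝜎², which overestimates the true constant 𝜎² by a constant factor; the bounds are informative mainly for their dependence on 𝑅²/𝜎².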

We extend the results of this section in Exercise 2.8.

2.3.5 Tensorization
A key feature of these functional inequalities which makes them crucial for the study of
high-dimensional (or even infinite-dimensional) phenomena is that they often hold with
dimension-free constants, as demonstrated in the next result.

Theorem 2.3.16 (tensorization). Suppose that 𝜋1, . . . , 𝜋𝑁 ∈ P (R𝑑 ) satisfy either a
Poincaré inequality or a log-Sobolev inequality. Then, for any 𝑁 ∈ N+ , the product
measure $\pi := \bigotimes_{i=1}^N \pi_i$ also satisfies the corresponding functional inequality with constant
\[
C_{\mathrm{PI}}(\pi) = \max_{i \in [N]} C_{\mathrm{PI}}(\pi_i) \qquad \text{or} \qquad C_{\mathrm{LSI}}(\pi) = \max_{i \in [N]} C_{\mathrm{LSI}}(\pi_i)
\]
respectively.

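Proof. The proof is a straightforward consequence of subadditivity (Lemma 2.3.6). Indeed,
if 𝑓 : R𝑁𝑑 → R and if 𝑋𝑖 ∼ 𝜋𝑖 are independent for 𝑖 ∈ [𝑁 ],
\[
\begin{aligned}
\operatorname{var}_\pi f = \operatorname{var} f(X_1, \dots, X_N)
&\le \mathbb{E} \sum_{i=1}^N \operatorname{var}\bigl( f(X_1, \dots, X_N) \bigm| X_{-i} \bigr) \\
&\le \max_{i \in [N]} C_{\mathrm{PI}}(\pi_i) \; \mathbb{E} \sum_{i=1}^N \mathbb{E}[\|\nabla_i f(X_1, \dots, X_N)\|^2 \mid X_{-i}] \\
&= \max_{i \in [N]} C_{\mathrm{PI}}(\pi_i) \; \mathbb{E}_\pi[\|\nabla f\|^2] .
\end{aligned}
\]
The proof is the same for the log-Sobolev inequality. □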


There is also a tensorization principle for transport inequalities, which however re-
quires some additional work to prove. We formulate a general result which applies to
many different transport inequalities (not just the T1 and T2 inequalities).

Theorem 2.3.17 (Marton’s tensorization). Let X1, . . . , X𝑁 be Polish spaces equipped
with probability measures 𝜋1, . . . , 𝜋𝑁 respectively. Let X := X1 × · · · × X𝑁 be equipped
with the product measure 𝜋 := 𝜋1 ⊗ · · · ⊗ 𝜋𝑁 .
Let 𝜑 : [0, ∞) → [0, ∞) be convex and for 𝑖 ∈ [𝑁 ], let 𝑐𝑖 : X𝑖 × X𝑖 → [0, ∞) be a
lower semicontinuous cost function. Suppose that
\[
\inf_{\gamma_i \in \mathcal{C}(\pi_i, \nu_i)} \varphi\Bigl( \int c_i \, d\gamma_i \Bigr) \le 2\sigma^2 \, \mathrm{KL}(\nu_i \,\|\, \pi_i) , \qquad \forall \nu_i \in \mathcal{P}(\mathcal{X}_i) , \; \forall i \in [N] .
\]
Then, it holds that
\[
\inf_{\gamma \in \mathcal{C}(\pi, \nu)} \sum_{i=1}^N \varphi\Bigl( \int c_i(x_i, y_i) \, d\gamma(x_{1:N}, y_{1:N}) \Bigr) \le 2\sigma^2 \, \mathrm{KL}(\nu \,\|\, \pi) , \qquad \forall \nu \in \mathcal{P}(\mathcal{X}) .
\]

Proof. The proof goes by induction on 𝑁 , with 𝑁 = 1 being trivial. So, assume that the
result is true in dimension 𝑁 , and let us prove it for dimension 𝑁 + 1.
Let 𝜈 ∈ P (X) = P (X1 × · · · × X𝑁 +1 ), let 𝜈 1:𝑁 denote its X1 × · · · × X𝑁 marginal,
and let 𝜈 𝑁 +1|1:𝑁 denote the corresponding conditional kernel (and similarly for 𝜋). Let
K denote the set of conditional kernels 𝑦1:𝑁 ↦→ 𝛾 𝑁 +1|1:𝑁 (· | 𝑦1:𝑁 ) such that for 𝜈 1:𝑁 -a.e.
𝑦1:𝑁 ∈ X1 × · · · × X𝑁 , it holds that 𝛾 𝑁 +1|1:𝑁 (· | 𝑦1:𝑁 ) ∈ C(𝜋 𝑁 +1, 𝜈 𝑁 +1|1:𝑁 (· | 𝑦1:𝑁 )). Instead
of minimizing over all 𝛾 ∈ C(𝜈, 𝜋), we can minimize over couplings 𝛾 such that for all
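bounded 𝑓 ∈ C(X × X),
\[
\int f \, d\gamma = \int \Bigl( \int f(x_{1:N+1}, y_{1:N+1}) \, \gamma_{N+1|1:N}(dx_{N+1}, dy_{N+1} \mid y_{1:N}) \Bigr) \, \gamma_{1:N}(dx_{1:N}, dy_{1:N}) ,
\]
for some $\gamma_{1:N} \in \mathcal{C}(\pi_{1:N}, \nu_{1:N})$ and $\gamma_{N+1|1:N} \in \mathcal{K}$.⁴ Thus,
\[
\begin{aligned}
&\inf_{\gamma \in \mathcal{C}(\pi, \nu)} \sum_{i=1}^{N+1} \varphi\Bigl( \int c_i(x_i, y_i) \, d\gamma(x_{1:N+1}, y_{1:N+1}) \Bigr) \\
&\qquad \le \inf_{\gamma_{1:N} \in \mathcal{C}(\pi_{1:N}, \nu_{1:N})} \Bigl[ \sum_{i=1}^{N} \varphi\Bigl( \int c_i(x_i, y_i) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \Bigr) \\
&\qquad\qquad + \inf_{\gamma_{N+1|1:N} \in \mathcal{K}} \varphi\Bigl( \int \Bigl( \int c_{N+1} \, d\gamma_{N+1|1:N}(\cdot \mid y_{1:N}) \Bigr) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \Bigr) \Bigr] \\
&\qquad \le \inf_{\gamma_{1:N} \in \mathcal{C}(\pi_{1:N}, \nu_{1:N})} \Bigl[ \sum_{i=1}^{N} \varphi\Bigl( \int c_i(x_i, y_i) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \Bigr) \\
&\qquad\qquad + \inf_{\gamma_{N+1|1:N} \in \mathcal{K}} \int \varphi\Bigl( \int c_{N+1} \, d\gamma_{N+1|1:N}(\cdot \mid y_{1:N}) \Bigr) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \Bigr] .
\end{aligned}
\]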

⁴ Suppose 𝑁 = 2 and (𝑋1, 𝑋2 ) ∼ 𝜋 and (𝑌1, 𝑌2 ) ∼ 𝜈. Observe that a general coupling 𝑝 ∈ C(𝜋, 𝜈) factorizes
as $p(x_1, x_2, y_1, y_2) = p_{X_1}(x_1) \, p_{X_2}(x_2) \, p_{Y_1, Y_2 \mid X_1, X_2}(y_1, y_2 \mid x_1, x_2)$. In contrast, we are restricting to couplings
of the form $p(x_1, x_2, y_1, y_2) = p_{X_1}(x_1) \, p_{Y_1 \mid X_1}(y_1 \mid x_1) \, p_{X_2}(x_2) \, p_{Y_2 \mid X_2, Y_1}(y_2 \mid x_2, y_1)$.

Then, after checking that the integrands are indeed measurable,
\[
\begin{aligned}
&\inf_{\gamma_{N+1|1:N} \in \mathcal{K}} \int \varphi\Bigl( \int c_{N+1} \, d\gamma_{N+1|1:N}(\cdot \mid y_{1:N}) \Bigr) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \\
&\qquad = \int \inf_{\gamma_{N+1|1:N} \in \mathcal{C}(\pi_{N+1}, \nu_{N+1|1:N}(\cdot \mid y_{1:N}))} \varphi\Bigl( \int c_{N+1} \, d\gamma_{N+1|1:N}(\cdot \mid y_{1:N}) \Bigr) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \\
&\qquad \le 2\sigma^2 \int \mathrm{KL}\bigl( \nu_{N+1|1:N}(\cdot \mid y_{1:N}) \bigm\| \pi_{N+1} \bigr) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \\
&\qquad = 2\sigma^2 \int \mathrm{KL}\bigl( \nu_{N+1|1:N}(\cdot \mid y_{1:N}) \bigm\| \pi_{N+1} \bigr) \, d\nu_{1:N}(y_{1:N}) ,
\end{aligned}
\]

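where we used the assumption. On the other hand, the inductive hypothesis is
\[
\inf_{\gamma_{1:N} \in \mathcal{C}(\pi_{1:N}, \nu_{1:N})} \sum_{i=1}^{N} \varphi\Bigl( \int c_i(x_i, y_i) \, d\gamma_{1:N}(x_{1:N}, y_{1:N}) \Bigr) \le 2\sigma^2 \, \mathrm{KL}(\nu_{1:N} \,\|\, \pi_{1:N}) .
\]
The chain rule for the KL divergence (Lemma 1.5.5) yields
\[
\mathrm{KL}(\nu \,\|\, \pi) = \mathrm{KL}(\nu_{1:N} \,\|\, \pi_{1:N}) + \int \mathrm{KL}\bigl( \nu_{N+1|1:N}(\cdot \mid y_{1:N}) \bigm\| \pi_{N+1} \bigr) \, d\nu_{1:N}(y_{1:N}) .
\]
Therefore, we have proven
\[
\inf_{\gamma \in \mathcal{C}(\pi, \nu)} \sum_{i=1}^{N+1} \varphi\Bigl( \int c_i(x_i, y_i) \, d\gamma(x_{1:N+1}, y_{1:N+1}) \Bigr) \le 2\sigma^2 \, \mathrm{KL}(\nu \,\|\, \pi) .
\]
□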

The preceding proof is supposed to be a straightforward proof by induction, but it is


rather cumbersome to write out precisely.
As our first application of the tensorization principle, we will examine the tensorization
properties of the T1 inequality. Recall that on a general metric space, the Wasserstein
distances are defined as in Exercise 1.11.

Example 2.3.18 (tensorization of T1 ). We will use the cost 𝑐𝑖 = d𝑖 , where d𝑖 is a lower
semicontinuous metric on X𝑖 , and we take the convex function 𝜑 (𝑥) := 𝑥². Suppose
that for each 𝑖 ∈ [𝑁 ], the measure 𝜋𝑖 ∈ P (X𝑖 ) satisfies the T1 inequality
\[
W_1^2(\nu_i, \pi_i) \le 2\sigma^2 \, \mathrm{KL}(\nu_i \,\|\, \pi_i) , \qquad \forall \nu_i \in \mathcal{P}(\mathcal{X}_i) .
\]
Let 𝜋 := 𝜋1 ⊗ · · · ⊗ 𝜋𝑁 be the product measure and let 𝜈 ∈ P (X1 × · · · × X𝑁 ). Suppose
also that 𝛼1, . . . , 𝛼𝑁 > 0 are numbers with $\sum_{i=1}^N \alpha_i^2 = 1$. Then, Marton’s tensorization
(Theorem 2.3.17) yields
\[
\begin{aligned}
2\sigma^2 \, \mathrm{KL}(\nu \,\|\, \pi)
&\ge \inf_{\gamma \in \mathcal{C}(\nu, \pi)} \Bigl( \sum_{i=1}^N \alpha_i^2 \Bigr) \sum_{i=1}^N \Bigl( \int \mathsf{d}_i(x_i, y_i) \, d\gamma(x_{1:N}, y_{1:N}) \Bigr)^2 \\
&\ge \inf_{\gamma \in \mathcal{C}(\nu, \pi)} \Bigl( \int \sum_{i=1}^N \alpha_i \, \mathsf{d}_i(x_i, y_i) \, d\gamma(x_{1:N}, y_{1:N}) \Bigr)^2 ,
\end{aligned}
\]
where we used the Cauchy–Schwarz inequality. This is a T1 inequality for the weighted
distance $\mathsf{d}_\alpha(x_{1:N}, y_{1:N}) := \sum_{i=1}^N \alpha_i \, \mathsf{d}_i(x_i, y_i)$.

Together with results from the next section, this tensorization result is already powerful
enough to recover the bounded differences concentration inequality (see Exercise 2.12),
but it is not fully satisfactory as it yields a transport inequality for a weighted metric.
A more satisfactory result is obtained by applying Marton’s tensorization (Theo-
rem 2.3.17) with the convex function 𝜑 (𝑥) B 𝑥 and cost functions 𝑐𝑖 B d𝑖2 , with d𝑖 a lower
semicontinuous metric on X𝑖 . This immediately yields the following corollary.

Corollary 2.3.19 (tensorization of T2 ). Suppose that for each 𝑖 ∈ [𝑁 ], 𝜋𝑖 ∈ P (X𝑖 )
satisfies a T2 inequality with parameter 𝜎² with respect to the metric d𝑖 . Then, the
product measure 𝜋1 ⊗ · · · ⊗ 𝜋𝑁 satisfies a T2 inequality with the same parameter 𝜎²
with respect to the metric $\mathsf{d}(x_{1:N}, y_{1:N})^2 := \sum_{i=1}^N \mathsf{d}_i(x_i, y_i)^2$ on X1 × · · · × X𝑁 .

2.4 Concentration of Measure


We now turn towards the close relationship between functional inequalities and the
concentration of measure phenomenon, the latter of which is an indispensable tool in
high-dimensional probability and statistics. Since many of the arguments hold on a general
Polish space (that is, a complete separable metric space) (X, d) equipped with a probability
measure 𝜋, we will work in this setting unless explicitly stated otherwise.

2.4.1 Blow-Up of Sets and Concentration of Lipschitz Functions


Loosely speaking, the concentration of measure phenomenon holds when a huge fraction
of the mass of 𝜋 is concentrated on a relatively small set. Another way of capturing
this idea is to assert that whenever a set has a non-trivial amount of mass under 𝜋, then
expanding the set slightly causes it to capture almost all of the mass of 𝜋. The following
definitions formalize this idea.

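Definition 2.4.1. For a Borel subset 𝐴 ⊆ X and 𝜀 > 0, we let 𝐴𝜀 denote the 𝜀-blow-up
of 𝐴, defined by
\[
A^{\varepsilon} := \{ x \in \mathcal{X} \mid \mathsf{d}(x, A) < \varepsilon \} .
\]
The concentration function 𝛼𝜋 : R+ → [0, 1] is defined via
\[
\alpha_\pi(\varepsilon) := \sup\Bigl\{ \pi\bigl( (A^{\varepsilon})^{\mathsf{c}} \bigr) \Bigm| A \subseteq \mathcal{X} \text{ is a Borel subset with } \pi(A) \ge \frac{1}{2} \Bigr\} .
\]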

Typically, we have 𝛼𝜋 (𝜀) ≤ 𝐶0 exp(−𝜀/𝐶1 ) or 𝛼𝜋 (𝜀) ≤ 𝐶0 exp(−𝜀²/𝐶1 ) for some
constants 𝐶0, 𝐶1 > 0; hence, as we increase 𝜀, the blow-up 𝐴𝜀 captures an (often substantially)
larger fraction of the mass of 𝜋. In the next sections, we will develop tools to upper bound
concentration functions. For now, however, we wish to develop an equivalence between
the formulation of concentration of measure via blow-up of sets and another formulation
involving concentration of Lipschitz functions. Although the former has a more striking
geometric interpretation, the latter is often how concentration is used in applications.
Given a real-valued random variable 𝑋 , we abuse notation and let med 𝑋 denote any
median of 𝑋 , that is, any number 𝑚 such that P{𝑋 ≤ 𝑚} ∧ P{𝑋 ≥ 𝑚} ≥ 1/2.

Theorem 2.4.2 (blow-up and Lipschitz functions). Suppose that (X, d, 𝜋) has concen-
tration function 𝛼 𝜋 . Then, for any 1-Lipschitz function 𝑓 : X → R and 𝜀 ≥ 0,

𝜋 {𝑓 ≥ med 𝑓 + 𝜀} ≤ 𝛼 𝜋 (𝜀) .

Conversely, suppose that for all 1-Lipschitz functions 𝑓 : X → R, it holds that

𝜋 {𝑓 ≥ med 𝑓 + 𝜀} ≤ 𝛽 (𝜀) .

Then, the concentration function 𝛼 𝜋 of (X, d, 𝜋) satisfies 𝛼 𝜋 ≤ 𝛽.

Proof. (⟹) Consider the set 𝐴 := {𝑓 ≤ med 𝑓 }, which satisfies 𝜋 (𝐴) ≥ 1/2 by the
definition of the median. We claim that 𝐴𝜀 ⊆ {𝑓 − med 𝑓 < 𝜀}.
To prove this, let 𝑥 ∈ 𝐴𝜀 . By definition, there exists 𝑦 ∈ 𝐴 such that d(𝑥, 𝑦) < 𝜀, so
𝑓 (𝑥) − med 𝑓 = 𝑓 (𝑦) − med 𝑓 + 𝑓 (𝑥) − 𝑓 (𝑦) ≤ d(𝑥, 𝑦) < 𝜀. Hence,
\[
\pi\{ f - \operatorname{med} f < \varepsilon \} \ge \pi(A^{\varepsilon}) \ge 1 - \alpha_\pi(\varepsilon) .
\]
(⟸) The function 𝑓 := d(·, 𝐴) is 1-Lipschitz, and if 𝜋 (𝐴) ≥ 1/2 then 0 is a median of
𝑓 . Thus, it holds that
\[
\pi(A^{\varepsilon}) = \pi\{ f - \operatorname{med} f < \varepsilon \} \ge 1 - \beta(\varepsilon) .
\]
□



More broadly, it is a general principle that many statements about sets have an
equivalent reformulation in terms of functions. We will see more instances of this idea
throughout the book.
Some statements regarding concentration, such as the theorem above, are more easily
phrased in terms of concentration around the median rather than around the mean. The
following result shows that, up to numerical constants, the mean and the median are
equivalent. To state the result in generality, we introduce the idea of an Orlicz norm.

Definition 2.4.3 (Orlicz norm). If 𝜓 : [0, ∞) → [0, ∞) is a convex strictly increasing
function with 𝜓 (0) = 0 and 𝜓 (𝑥) → ∞ as 𝑥 → ∞, then it is an Orlicz function.
For a real-valued random variable 𝑋 , its Orlicz norm is defined to be
\[
\|X\|_\psi := \inf\Bigl\{ t > 0 \Bigm| \mathbb{E} \, \psi\Bigl( \frac{|X|}{t} \Bigr) \le 1 \Bigr\} .
\]

Examples of Orlicz functions include $\psi(x) = x^p$ for 𝑝 ≥ 1, for which the corresponding
Orlicz norm is the $L^p(\mathbb{P})$ norm, and $\psi_2(x) := \exp(x^2) - 1$, for which the Orlicz norm $\|X\|_{\psi_2}$
captures the sub-Gaussianity of 𝑋 .

Lemma 2.4.4 (mean and median). Let 𝜓 be an Orlicz function and let 𝑋 be a real-valued
random variable. Then,
\[
\frac{1}{2} \, \|X - \mathbb{E} X\|_\psi \le \|X - \operatorname{med} X\|_\psi \le 3 \, \|X - \mathbb{E} X\|_\psi .
\]

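Proof. We can assume that 𝑋 is not constant; from the properties of Orlicz functions,
𝜓⁻¹(𝑡) is well-defined for any 𝑡 > 0. Then,
\[
\begin{aligned}
\|X - \mathbb{E} X\|_\psi
&\le \|X - \operatorname{med} X\|_\psi + \|\operatorname{med} X - \mathbb{E} X\|_\psi \\
&= \|X - \operatorname{med} X\|_\psi + |\operatorname{med} X - \mathbb{E} X| \, \|1\|_\psi \\
&\le \|X - \operatorname{med} X\|_\psi + \mathbb{E}|X - \operatorname{med} X| \, \|1\|_\psi .
\end{aligned}
\]
Since
\[
\mathbb{E} \, \psi\Bigl( \frac{|X - \operatorname{med} X|}{\mathbb{E}|X - \operatorname{med} X| \, \|1\|_\psi} \Bigr)
\ge \psi\Bigl( \frac{\mathbb{E}|X - \operatorname{med} X|}{\mathbb{E}|X - \operatorname{med} X| \, \|1\|_\psi} \Bigr)
= \psi\Bigl( \frac{1}{\|1\|_\psi} \Bigr) = 1 ,
\]
it implies $\mathbb{E}|X - \operatorname{med} X| \, \|1\|_\psi \le \|X - \operatorname{med} X\|_\psi$.
Next, assume that med 𝑋 ≥ E 𝑋 (or else replace 𝑋 by −𝑋 ). Then,
\[
\frac{1}{2} \le \mathbb{P}\{ X \ge \operatorname{med} X \} \le \mathbb{P}\{ |X - \mathbb{E} X| \ge \operatorname{med} X - \mathbb{E} X \}
\le \frac{1}{\psi\bigl( (\operatorname{med} X - \mathbb{E} X) / \|X - \mathbb{E} X\|_\psi \bigr)} ,
\]
so that
\[
|\operatorname{med} X - \mathbb{E} X| \le \psi^{-1}(2) \, \|X - \mathbb{E} X\|_\psi .
\]
Therefore,
\[
\|X - \operatorname{med} X\|_\psi \le \|X - \mathbb{E} X\|_\psi + \|\mathbb{E} X - \operatorname{med} X\|_\psi \le \bigl( 1 + \|1\|_\psi \, \psi^{-1}(2) \bigr) \, \|X - \mathbb{E} X\|_\psi .
\]
Note, however, that $\|1\|_\psi = 1/\psi^{-1}(1)$. Since $\psi(\psi^{-1}(2)/2) \le 1$ by convexity (and the
property 𝜓 (0) = 0), it implies $\psi^{-1}(2) \le 2 \, \psi^{-1}(1)$, and we obtain the result. □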
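For the Orlicz function 𝜓 (𝑥) = 𝑥^𝑝, the lemma reduces to a comparison of 𝐿𝑝 norms, which is easy to test on a discrete random variable. The following sketch uses an arbitrary skewed distribution of our choosing; the helper names are hypothetical.

```python
def lp_norm(values, probs, p):
    """L^p norm of a discrete random variable, i.e. the Orlicz norm for psi(x) = x^p."""
    return sum(w * abs(v) ** p for v, w in zip(values, probs)) ** (1.0 / p)

def median(xs, ps):
    """Smallest atom m with P{X <= m} >= 1/2 (xs must be sorted increasingly)."""
    cdf = 0.0
    for x, p in zip(xs, ps):
        cdf += p
        if cdf >= 0.5:
            return x
    return xs[-1]

# A skewed discrete distribution, so that mean and median differ noticeably.
xs = [0.0, 1.0, 2.0, 10.0]
ps = [0.4, 0.3, 0.2, 0.1]
mean = sum(p * x for x, p in zip(xs, ps))
med = median(xs, ps)

def centered_norms(p):
    """Return (||X - E X||_p, ||X - med X||_p)."""
    around_mean = lp_norm([x - mean for x in xs], ps, p)
    around_med = lp_norm([x - med for x in xs], ps, p)
    return around_mean, around_med

for p in (1, 2, 4):
    around_mean, around_med = centered_norms(p)
    print(p, 0.5 * around_mean, "<=", around_med, "<=", 3.0 * around_mean)
```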

2.4.2 The Herbst Argument


In this section, we specialize to the case where (X, d) is the Euclidean space R𝑑 .
To put it succinctly, the idea of the Herbst argument is to apply functional inequalities,
such as the Poincaré inequality or the log-Sobolev inequality, to the moment-generating
function of a 1-Lipschitz function 𝑓 : R𝑑 → R in order to deduce a concentration inequality
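for 𝑓 . We illustrate this with the log-Sobolev inequality, which implies, for any 𝜆 ∈ R,
\[
\begin{aligned}
\operatorname{ent}_\pi \exp(\lambda f)
&\le 2 \, C_{\mathrm{LSI}} \, \mathbb{E}_\pi\Bigl[ \Bigl\| \frac{\lambda \exp(\lambda f / 2)}{2} \, \nabla f \Bigr\|^2 \Bigr]
= \frac{C_{\mathrm{LSI}} \, \lambda^2}{2} \, \mathbb{E}_\pi[\exp(\lambda f) \, \|\nabla f\|^2] \\
&\le \frac{C_{\mathrm{LSI}} \, \lambda^2}{2} \, \mathbb{E}_\pi \exp(\lambda f) .
\end{aligned}
\tag{2.4.5}
\]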
The next lemma shows how to apply this inequality.

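Lemma 2.4.6 (Herbst argument). Suppose that a random variable 𝑋 satisfies
\[
\operatorname{ent} \exp(\lambda X) \le \frac{\lambda^2 \sigma^2}{2} \, \mathbb{E} \exp(\lambda X) \qquad \text{for all } \lambda \ge 0 .
\]
Then, it holds that
\[
\mathbb{E} \exp\{ \lambda \, (X - \mathbb{E} X) \} \le \exp\Bigl( \frac{\lambda^2 \sigma^2}{2} \Bigr) \qquad \text{for all } \lambda \ge 0 .
\]
In particular, via a standard Chernoff inequality,
\[
\mathbb{P}\{ X \ge \mathbb{E} X + t \} \le \exp\Bigl( -\frac{t^2}{2\sigma^2} \Bigr) \qquad \text{for all } t \ge 0 .
\]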

Proof. Let 𝜏 (𝜆) := 𝜆⁻¹ ln E exp{𝜆 (𝑋 − E 𝑋 )}. We leave it to the reader to check the
calculus identity
\[
\tau'(\lambda) = \frac{1}{\lambda^2} \, \frac{\operatorname{ent} \exp(\lambda X)}{\mathbb{E} \exp(\lambda X)} . \tag{2.4.7}
\]
Since 𝜏 (𝜆) → 0 as 𝜆 ↘ 0, the assumption of the lemma yields 𝜏 (𝜆) ≤ 𝜆𝜎²/2. □
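The calculus identity (2.4.7) can also be confirmed numerically. The sketch below compares a central finite difference of 𝜏 with the right-hand side of (2.4.7) for a three-point random variable; the distribution and the helper names are illustrative assumptions.

```python
import math

# A discrete random variable X with three atoms (illustrative choice).
xs = [-1.0, 0.5, 2.0]
ps = [0.3, 0.5, 0.2]
mean_x = sum(p * x for x, p in zip(xs, ps))

def mgf(lam):
    """E exp(lam X)."""
    return sum(p * math.exp(lam * x) for x, p in zip(xs, ps))

def tau(lam):
    """tau(lam) = ln E exp{lam (X - E X)} / lam."""
    return math.log(sum(p * math.exp(lam * (x - mean_x)) for x, p in zip(xs, ps))) / lam

def ent_exp(lam):
    """ent exp(lam X) = E[lam X exp(lam X)] - E[exp(lam X)] ln E[exp(lam X)]."""
    m = mgf(lam)
    return sum(p * lam * x * math.exp(lam * x) for x, p in zip(xs, ps)) - m * math.log(m)

def tau_prime_formula(lam):
    """Right-hand side of (2.4.7)."""
    return ent_exp(lam) / (lam ** 2 * mgf(lam))

def tau_prime_numeric(lam, h=1e-5):
    """Central finite-difference approximation of tau'(lam)."""
    return (tau(lam + h) - tau(lam - h)) / (2.0 * h)

for lam in (0.3, 1.0, 2.5):
    print(lam, tau_prime_formula(lam), tau_prime_numeric(lam))
```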

The calculation in (2.4.5) shows that the assumption of the Herbst argument is satisfied
for all 1-Lipschitz functions 𝑓 , with 𝜎 2 = 𝐶 LSI . Hence, we deduce a concentration inequality
for Lipschitz functions, which we formally state in the next theorem together with the
corresponding result under a Poincaré inequality. The Poincaré case is left as Exercise 2.9.

Theorem 2.4.8. Let 𝜋 ∈ P (R𝑑 ), and let 𝑓 : R𝑑 → R be a 1-Lipschitz function.

1. If 𝜋 satisfies a Poincaré inequality with constant 𝐶 PI , then for all 𝑡 ≥ 0,
\[
\pi\{ f - \mathbb{E}_\pi f \ge t \} \le 3 \exp\Bigl( -\frac{t}{\sqrt{C_{\mathrm{PI}}}} \Bigr) .
\]

2. If 𝜋 satisfies a log-Sobolev inequality with constant 𝐶 LSI , then for all 𝑡 ≥ 0,
\[
\pi\{ f - \mathbb{E}_\pi f \ge t \} \le \exp\Bigl( -\frac{t^2}{2 \, C_{\mathrm{LSI}}} \Bigr) .
\]

Example 2.4.9. Suppose that 𝛾 is the standard Gaussian measure on R𝑑 . From the
Bakry–Émery theorem (Theorem 1.2.28), 𝛾 satisfies the log-Sobolev inequality with
𝐶 LSI = 1. For 𝑍 ∼ 𝛾, since E[‖𝑍‖²] = 𝑑, the Poincaré inequality applied to the norm
‖·‖ shows that var ‖𝑍‖ ≤ 1, i.e., √(𝑑 − 1) ≤ E‖𝑍‖ ≤ √𝑑.
The concentration result above now shows that the standard Gaussian “lives” on
a thin spherical shell of radius √𝑑 and width 𝑂 (1).
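The bounds in this example can be compared with exact values. The sketch below relies on the standard formula E‖𝑍‖ = √2 Γ((𝑑 + 1)/2)/Γ(𝑑/2) for the mean of the chi distribution, which is not derived in the text and is used here as an outside fact; the function names are ours.

```python
import math

def mean_norm(d):
    """E||Z|| for Z ~ N(0, I_d), via the mean of the chi distribution with d degrees of freedom."""
    return math.sqrt(2.0) * math.exp(math.lgamma((d + 1) / 2.0) - math.lgamma(d / 2.0))

def var_norm(d):
    """var ||Z|| = E[||Z||^2] - (E||Z||)^2 = d - (E||Z||)^2."""
    return d - mean_norm(d) ** 2

for d in (1, 2, 10, 100, 1000):
    m = mean_norm(d)
    print(f"d = {d:5d}  sqrt(d-1) = {math.sqrt(d - 1):9.4f}  E||Z|| = {m:9.4f}  "
          f"sqrt(d) = {math.sqrt(d):9.4f}  var = {var_norm(d):.4f}")
```

The variance stays below 1 in every dimension (it approaches 1/2 as 𝑑 grows), so the width of the shell is indeed of constant order.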

2.4.3 Transport Inequalities and Concentration


Next, we will show that a T1 transport inequality is equivalent to sub-Gaussian concen-
tration of Lipschitz functions, which was proven by Bobkov and Götze. The proof shows
that in a sense, the two statements are dual to each other.

Theorem 2.4.10 (Bobkov–Götze). Let 𝜋 ∈ P1 (X). The following are equivalent.

1. The function 𝑓 is 𝜎 2 -sub-Gaussian with respect to 𝜋 for every 1-Lipschitz function


𝑓 : X → R which is mean-zero under 𝜋.

2. The measure 𝜋 satisfies T1 (𝜎 2 ).

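Proof. Let Lip1 (X) denote the space of 1-Lipschitz and mean-zero functions on X. Lipschitz
concentration can be stated as
\[
\sup_{\lambda \in \mathbb{R}} \sup_{f \in \mathrm{Lip}_1(\mathcal{X})} \Bigl\{ \ln \int \exp(\lambda f) \, d\pi - \frac{\lambda^2 \sigma^2}{2} \Bigr\} \le 0 .
\]
By Donsker–Varadhan duality (Theorem 1.5.4), this is equivalent to
\[
\sup_{\lambda \in \mathbb{R}} \sup_{f \in \mathrm{Lip}_1(\mathcal{X})} \sup_{\nu \in \mathcal{P}(\mathcal{X})} \Bigl\{ \lambda \, \Bigl( \int f \, d\nu - \int f \, d\pi \Bigr) - \mathrm{KL}(\nu \,\|\, \pi) - \frac{\lambda^2 \sigma^2}{2} \Bigr\} \le 0 ,
\]
where we recall that ∫ 𝑓 d𝜋 = 0 for 𝑓 ∈ Lip1 (X). If we first evaluate the supremum over
𝜆 ∈ R, then we obtain the statement
\[
\sup_{f \in \mathrm{Lip}_1(\mathcal{X})} \sup_{\nu \in \mathcal{P}(\mathcal{X})} \Bigl\{ \frac{1}{2\sigma^2} \, \Bigl( \int f \, d\nu - \int f \, d\pi \Bigr)^2 - \mathrm{KL}(\nu \,\|\, \pi) \Bigr\} \le 0 .
\]
If we next evaluate the supremum over functions 𝑓 ∈ Lip1 (X) using the Kantorovich
duality formula (1.E.4) for 𝑊1 from Exercise 1.11, we obtain
\[
\sup_{\nu \in \mathcal{P}(\mathcal{X})} \Bigl\{ \frac{W_1^2(\nu, \pi)}{2\sigma^2} - \mathrm{KL}(\nu \,\|\, \pi) \Bigr\} \le 0 ,
\]
which is the T1 inequality. □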


Using the fact that the 𝑊1 distance for the trivial metric d(𝑥, 𝑦) = 1{𝑥 ≠ 𝑦} coincides
with the TV distance⁵, the Bobkov–Götze theorem implies that two classical inequalities in
probability theory, Hoeffding’s inequality and Pinsker’s inequality, are in fact equivalent
to each other (see Exercise 2.10).
⁵ One has to be slightly careful since for the trivial metric, (X, d) is usually not separable.
Although the T1 inequality implies sub-Gaussian concentration for all Lipschitz func-
tions, it is in fact equivalent to sub-Gaussian concentration of a single function, the
distance function d(·, 𝑥 0 ) for some 𝑥 0 ∈ X. The next theorem is not used often because
the quantitative dependence of the equivalence can be crude, but it is worth knowing.

Theorem 2.4.11. Let 𝜋 ∈ P1 (X) and 𝑥 0 ∈ X. The following are equivalent:

1. 𝜋 satisfies a T1 inequality.

2. There exists 𝑐 > 0 such that E𝜋 exp(𝑐 d(·, 𝑥 0 ) 2 ) < ∞.

Transport inequalities offer a flexible and powerful method for characterizing and
proving concentration inequalities, as we will see in the next section. Before doing so,
however, we wish to also demonstrate how concentration of measure, formulated via
blow-up of sets, can be deduced directly from a T1 inequality.
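Suppose that $\mathrm{T}_1(\sigma^2)$ holds, i.e.,
\[
W_1^2(\mu, \pi) \le 2\sigma^2 \, \mathrm{KL}(\mu \,\|\, \pi) \qquad \text{for all } \mu \in \mathcal{P}_1(\mathcal{X}),\ \mu \ll \pi .
\]
For any disjoint sets $A$, $B$ with $\pi(A) \, \pi(B) > 0$, if we let $\pi(\cdot \mid A)$ (resp. $\pi(\cdot \mid B)$) denote
the distribution $\pi$ conditioned on $A$ (resp. $B$), then
\begin{align*}
\mathsf{d}(A, B) &\le W_1\bigl(\pi(\cdot \mid A), \pi(\cdot \mid B)\bigr) \le W_1\bigl(\pi(\cdot \mid A), \pi\bigr) + W_1\bigl(\pi(\cdot \mid B), \pi\bigr) \\
&\le \sqrt{2\sigma^2 \, \mathrm{KL}\bigl(\pi(\cdot \mid A) \,\|\, \pi\bigr)} + \sqrt{2\sigma^2 \, \mathrm{KL}\bigl(\pi(\cdot \mid B) \,\|\, \pi\bigr)} .
\end{align*}
However,
\[
\mathrm{KL}\bigl(\pi(\cdot \mid A) \,\|\, \pi\bigr) = \int_A \ln \frac{1}{\pi(A)} \, \frac{\pi(\mathrm{d}x)}{\pi(A)} = \ln \frac{1}{\pi(A)} ,
\]
so that
\[
\mathsf{d}(A, B) \le \sqrt{2\sigma^2 \ln \frac{1}{\pi(A)}} + \sqrt{2\sigma^2 \ln \frac{1}{\pi(B)}} .
\]
In particular, if we take $B = (A^\varepsilon)^{\mathsf{c}}$ where $\pi(A) \ge \frac{1}{2}$, then $\mathsf{d}(A, B) \ge \varepsilon$. Hence, for all
$\varepsilon \ge 2\sqrt{2\sigma^2 \ln 2}$, it holds that $\frac{\varepsilon}{2} \le \sqrt{2\sigma^2 \ln \frac{1}{\pi(B)}}$, or
\[
\pi\bigl((A^\varepsilon)^{\mathsf{c}}\bigr) \le \exp\Bigl(-\frac{\varepsilon^2}{8\sigma^2}\Bigr) \qquad \text{for all } \varepsilon \ge \sqrt{8 \ln 2} \, \sigma . \tag{2.4.12}
\]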
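As a numerical sanity check of (2.4.12), the following sketch compares the bound with an exact blow-up computation. It assumes, beyond what is stated above, that $\pi$ is the standard Gaussian on $\mathbb{R}$ (which satisfies $\mathrm{T}_2(1)$ by Talagrand's inequality, hence $\mathrm{T}_1(1)$) and takes the half-line $A = (-\infty, 0]$; the function names are illustrative only.

```python
import math


def gaussian_cdf(x: float) -> float:
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def complement_mass(eps: float) -> float:
    """pi((A^eps)^c) for pi = N(0, 1) and A = (-inf, 0]; this equals 1 - Phi(eps)."""
    return 1.0 - gaussian_cdf(eps)


def t1_bound(eps: float, sigma2: float = 1.0) -> float:
    """Right-hand side of (2.4.12): exp(-eps^2 / (8 sigma^2))."""
    return math.exp(-eps ** 2 / (8.0 * sigma2))


# (2.4.12) is only claimed for eps >= sqrt(8 ln 2) * sigma; here sigma = 1 (assumption).
eps_min = math.sqrt(8.0 * math.log(2.0))
for k in range(5):
    eps = eps_min + 0.5 * k
    print(f"eps={eps:.3f}  exact={complement_mass(eps):.3e}  bound={t1_bound(eps):.3e}")
```

At the threshold $\varepsilon = \sqrt{8 \ln 2}\,\sigma$ the bound equals $\frac{1}{2}$, which is why it carries no information for smaller $\varepsilon$.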

2.4.4 Tensorization and Gozlan’s Theorem


Our goal is now to investigate the relationship between concentration and tensoriza-
tion. Although results like the Bobkov–Götze theorem (Theorem 2.4.10) provide us with
powerful tools to establish concentration results, so far there is nothing inherently high-
dimensional about these phenomena.
Indeed, to discuss dimensionality, we should move to the product space X𝑁 and ask
when concentration results can hold independently of 𝑁 . If such a statement holds, then
the concentration inequality typically becomes stronger6 as 𝑁 becomes larger.
For instance, when X = R, then we know from Theorem 2.3.16 that the Poincaré and
log-Sobolev inequalities both tensorize: if they hold for 𝜋 ∈ P (R) with a constant 𝐶,
then they also hold for 𝜋 ⊗𝑁 ∈ P (R𝑁 ) with the same constant 𝐶. Since these inequalities
imply powerful concentration results (Theorem 2.4.8), they yield examples of genuinely
high-dimensional concentration.
For transport inequalities, the tensorization for the T1 inequality is unsatisfactory in
the sense that once we equip $\mathcal{X}^N$ with the product metric $\mathsf{d}(x_{1:N}, x'_{1:N})^2 := \sum_{i=1}^N \mathsf{d}(x_i, x_i')^2$,
the validity of $\mathrm{T}_1(C)$ for $\pi \in \mathcal{P}(\mathcal{X})$ does not imply the validity of $\mathrm{T}_1(C)$ for $\pi^{\otimes N} \in \mathcal{P}(\mathcal{X}^N)$
with the same constant $C$. In fact, from Example 2.3.18, we expect that the T1 constant
for $\pi^{\otimes N}$ can grow as $\sqrt{N}$. On the other hand, from Corollary 2.3.19, we know that the
T2 inequality tensorizes. Since the T2 inequality on $\mathcal{X}^N$ implies the T1 inequality on
$\mathcal{X}^N$ (trivially), it in turn implies high-dimensional concentration via the Bobkov–Götze
equivalence (Theorem 2.4.10).
In this section, we will prove the surprising fact that high-dimensional concentration
is actually equivalent to the T2 inequality, in a sense that we shall make precise shortly.
First, we need a few preliminary results, which we shall not prove. The first one is a
straightforward technical lemma (see Exercise 2.13).

Lemma 2.4.13. Let $\pi \in \mathcal{P}_2(\mathcal{X})$.

1. The mapping $(x_1, \dots, x_N) \mapsto W_2\bigl(N^{-1} \sum_{i=1}^N \delta_{x_i}, \pi\bigr)$ is $N^{-1/2}$-Lipschitz.

2. (Wasserstein law of large numbers) Suppose that $(X_i)_{i=1}^\infty \overset{\text{i.i.d.}}{\sim} \pi$, and that for some
$x_0 \in \mathcal{X}$ and some $\varepsilon > 0$, it holds that $\mathbb{E}[\mathsf{d}(x_0, X_1)^{2+\varepsilon}] < \infty$. Then,
\[
\mathbb{E} \, W_2\Bigl(\frac{1}{N} \sum_{i=1}^N \delta_{X_i}, \pi\Bigr) \to 0 \qquad \text{as } N \to \infty .
\]

6 Here, the word “stronger” is not precisely defined but it means something akin to “more useful” or
“produces more surprising consequences”.

The second result, Sanov’s theorem, is a foundational theorem from large deviations.
Although Sanov’s theorem is of fundamental importance in its own right, it would take
us too far afield to develop large deviations theory here, so we invoke it as a black box.

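Theorem 2.4.14 (Sanov’s theorem). Let $(X_i)_{i=1}^\infty \overset{\text{i.i.d.}}{\sim} \pi$ and let $\pi_N := N^{-1} \sum_{i=1}^N \delta_{X_i}$
denote the empirical measure. Then, for any Borel set $A \subseteq \mathcal{P}(\mathcal{X})$, it holds that
\begin{align*}
-\inf_{\operatorname{int} A} \mathrm{KL}(\cdot \,\|\, \pi) &\le \liminf_{N \to \infty} \frac{1}{N} \ln \mathbb{P}\{\pi_N \in A\} \\
&\le \limsup_{N \to \infty} \frac{1}{N} \ln \mathbb{P}\{\pi_N \in A\} \le -\inf_{\overline{A}} \mathrm{KL}(\cdot \,\|\, \pi) ,
\end{align*}
where $\operatorname{int} A$ and $\overline{A}$ denote the interior and the closure of $A$.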

We are now ready to establish the equivalence.

Theorem 2.4.15 (Gozlan). The measure 𝜋 ∈ P2 (X) satisfies T2 (𝜎 2 ) if and only if


for all 𝑁 ∈ N+ and all 1-Lipschitz 𝑓 : X𝑁 → R, the centered function 𝑓 − E𝜋 ⊗𝑁 𝑓 is
𝜎 2 -sub-Gaussian under 𝜋 ⊗𝑁 .

Proof. It remains to prove the converse implication. Fix $t > 0$ and apply the assumption
to the $N^{-1/2}$-Lipschitz function $(x_1, \dots, x_N) \mapsto W_2\bigl(N^{-1} \sum_{i=1}^N \delta_{x_i}, \pi\bigr)$. For all $N$ large
enough that $\mathbb{E} \, W_2(\pi_N, \pi) \le t$, it implies
\[
\mathbb{P}\{W_2(\pi_N, \pi) > t\} \le \exp\Bigl(-\frac{N \, \{t - \mathbb{E} \, W_2(\pi_N, \pi)\}^2}{2\sigma^2}\Bigr) ,
\]
where $\pi_N := N^{-1} \sum_{i=1}^N \delta_{X_i}$, with $(X_i)_{i \in \mathbb{N}_+} \overset{\text{i.i.d.}}{\sim} \pi$. On the other hand, the lower semicontinuity
of $W_2$ implies that $\{\nu \in \mathcal{P}(\mathcal{X}) \mid W_2(\nu, \pi) > t\}$ is open. By Sanov’s theorem
(Theorem 2.4.14), we obtain
\begin{align*}
-\inf\{\mathrm{KL}(\nu \,\|\, \pi) \mid W_2(\nu, \pi) > t\} &\le \liminf_{N \to \infty} \frac{1}{N} \ln \mathbb{P}\{W_2(\pi_N, \pi) > t\} \\
&\le -\limsup_{N \to \infty} \frac{\{t - \mathbb{E} \, W_2(\pi_N, \pi)\}^2}{2\sigma^2} = -\frac{t^2}{2\sigma^2} ,
\end{align*}
where the last equality comes from the Wasserstein law of large numbers (our assumption
implies that $\pi$ has sub-Gaussian tails, which in particular means $\mathbb{E}[\mathsf{d}(x, X_1)^p] < \infty$
for any $x \in \mathcal{X}$ and any $p \ge 1$).
We have proven that $W_2(\nu, \pi) > t$ implies $\mathrm{KL}(\nu \,\|\, \pi) \ge t^2/(2\sigma^2)$, which is seen to be
equivalent to the T2 inequality.
Observe in particular that this theorem implies the Otto–Villani theorem (Exercise 1.16):
due to tensorization (Theorem 2.3.16) and the Herbst argument (Lemma 2.4.6), a log-
Sobolev inequality implies high-dimensional sub-Gaussian concentration of Lipschitz
functions, which by Gozlan’s theorem is equivalent to a T2 inequality.

2.5 Isoperimetric Inequalities


In Section 2.4.1, we introduced the concentration function $\alpha_\pi$ of a measure $\pi$. Thus far, we
have provided tools to upper bound the concentration function; for example, in (2.4.12),
we showed that if $\pi$ satisfies $\mathrm{T}_1(\sigma^2)$, then
\[
\alpha_\pi(\varepsilon) \le \exp\Bigl(-\frac{\varepsilon^2}{8\sigma^2}\Bigr) \qquad \text{for all } \varepsilon \ge \sqrt{8 \ln 2} \, \sigma .
\]
Observe that this provides no information when $\varepsilon$ is small, whereas by definition we know
that $\alpha_\pi(0) = \frac{1}{2}$. In this section, we will study finer questions about the concentration
function. More generally, for $p \in (0, 1)$ and $\varepsilon > 0$, we introduce the quantity
\[
\omega_\pi(p, \varepsilon) := \inf\{\pi(A^\varepsilon) \mid A \text{ is a Borel set with } \pi(A) = p\} ,
\]

and we can ask about the deviation of 𝜔 𝜋 (𝑝, 𝜀) from 𝑝 when 𝜀 is small. In some special
cases, we can even determine the function 𝜔 𝜋 exactly. The study of this question will bring
us to the classical geometric problem of isoperimetry. In its simplest guise, it asks: among
all plane curves which enclose a region of prescribed area, which ones have the least
perimeter? Unsurprisingly, among regular curves, it is well-known that circles provide
the answer to this question. As we shall see, the isoperimetric question, once generalized
to abstract spaces, contains a wealth of information about concentration phenomena.

2.5.1 Classical Isoperimetry Results


The connection between concentration and isoperimetry began with the work of Lévy,
who found the isoperimetric inequality on the sphere.

Theorem 2.5.1 (spherical isoperimetry). Let $\sigma_d$ denote the uniform measure on the
$d$-dimensional unit sphere $\mathbb{S}^d$, and let $A \subseteq \mathbb{S}^d$ be a Borel subset with $\sigma_d(A) \notin \{0, 1\}$. Let
$C$ be a spherical cap with the same measure as $A$. Then, for all $\varepsilon > 0$,
\[
\sigma_d(A^\varepsilon) \ge \sigma_d(C^\varepsilon) .
\]

We give a few reminders about spherical geometry. We equip S𝑑 with its geodesic
metric d, so that the distance between two points 𝑥, 𝑦 ∈ S𝑑 is equal to the angle between
𝑥 and 𝑦. A spherical cap is a geodesic ball, that is, it is a set of the form B(𝑥 0, 𝑟 ) for some
𝑥 0 ∈ S𝑑 and 𝑟 > 0, where the balls are defined w.r.t. d.
Note that Theorem 2.5.1 identifies the exact function 𝜔 𝜎𝑑 .
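To see how an isoperimetric result naturally leads to a concentration result, suppose
that $A$ has measure $\frac{1}{2}$. Then, the corresponding spherical cap $C$ can be taken to be half of
the sphere, $C = \mathsf{B}(x_0, \frac{\pi}{2})$, and so
\[
\sigma_d(A^\varepsilon) \ge \sigma_d(C^\varepsilon) = \sigma_d\Bigl(\mathsf{B}\bigl(x_0, \frac{\pi}{2} + \varepsilon\bigr)\Bigr) .
\]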
To obtain an upper bound on the concentration function 𝛼𝜎𝑑 , it therefore suffices to lower
bound the volume of the spherical cap. It leads to the following result, which we leave as
an exercise (Exercise 2.14).

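Theorem 2.5.2 (concentration on the sphere). Let $\sigma_d$ be the uniform measure on the
unit sphere $\mathbb{S}^d$ in dimension $d \ge 2$, equipped with the geodesic distance $\mathsf{d}$. Then,
\[
\alpha_{\sigma_d}(\varepsilon) \le \exp\Bigl(-\frac{(d - 1) \, \varepsilon^2}{2}\Bigr) \qquad \text{for all } \varepsilon > 0 .
\]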

There is also an isoperimetric result for the standard Gaussian measure $\gamma_d$ on $\mathbb{R}^d$. In
this case, the optimal sets are given by half-spaces, i.e., sets of the form
\[
H_{x_0, t} := \{x \in \mathbb{R}^d \mid \langle x, x_0 \rangle \le t\} .
\]

Theorem 2.5.3 (Gaussian isoperimetry). Let $\gamma_d$ denote the standard Gaussian measure
on $\mathbb{R}^d$, and let $A \subseteq \mathbb{R}^d$ be a Borel subset with $\gamma_d(A) \notin \{0, 1\}$. Let $H_{x_0, t}$ be a half-space
with the same measure as $A$. Then, for all $\varepsilon > 0$,
\[
\gamma_d(A^\varepsilon) \ge \gamma_d(H_{x_0, t}^\varepsilon) .
\]

We can write this result more explicitly as follows. By rotational invariance of the
Gaussian, we can take $x_0$ to be any unit vector $e$, in which case the measure of $H_{e,t}$ is
$\gamma_d(H_{e,t}) = \Phi(t)$, where $\Phi$ is the Gaussian CDF. Since $H_{e,t}^\varepsilon = H_{e,t+\varepsilon}$, then
\[
\gamma_d(A^\varepsilon) \ge \Phi\bigl(\Phi^{-1}(\gamma_d(A)) + \varepsilon\bigr) . \tag{2.5.4}
\]
In particular, if $\gamma_d(A) = \frac{1}{2}$, then $\Phi^{-1}(\frac{1}{2}) = 0$, so
\[
\alpha_{\gamma_d}(\varepsilon) \le \Phi(-\varepsilon) \le \frac{1}{2} \exp\Bigl(-\frac{\varepsilon^2}{2}\Bigr) .
\]
We now pause to give a remark on proofs. Since these isoperimetric inequalities require
a detailed understanding of the measure (including the optimal sets in the inequality), they
are considerably more difficult to prove than the other results we have seen so far (e.g.,
a log-Sobolev inequality). In particular, usually they can only be established for measures
which are simple in some regard, e.g., they enjoy many symmetries. Hence, we will not
prove them here.
It is often convenient to pass to a differential form of the isoperimetric inequality,
which is obtained by sending $\varepsilon \searrow 0$. This is formalized as follows.

Definition 2.5.5 (Minkowski content). Given a non-empty Borel set $A$ and a measure
$\pi$ on a Polish space $(\mathcal{X}, \mathsf{d})$, the Minkowski content of $A$ under $\pi$ is
\[
\pi^+(A) := \liminf_{\varepsilon \searrow 0} \frac{\pi(A^\varepsilon) - \pi(A)}{\varepsilon} .
\]

Definition 2.5.6 (isoperimetric profile). For a measure $\pi$ on a Polish space $(\mathcal{X}, \mathsf{d})$,
the isoperimetric profile of $\pi$, denoted $\mathcal{I}_\pi : [0, 1] \to \mathbb{R}_+$, is the function
\[
\mathcal{I}_\pi(p) := \inf\{\pi^+(A) \mid A \text{ is measurable with } \pi(A) = p\} .
\]

For the standard Gaussian, a Taylor expansion of (2.5.4) yields
\[
\gamma_d(A^\varepsilon) \ge \gamma_d(A) + \phi\bigl(\Phi^{-1}(\gamma_d(A))\bigr) \, \varepsilon + o(\varepsilon) ,
\]
where $\phi = \Phi'$ is the Gaussian density. Hence, we can identify the isoperimetric profile of
the standard Gaussian as
\[
\mathcal{I}_{\gamma_d}(p) = \phi\bigl(\Phi^{-1}(p)\bigr) . \tag{2.5.7}
\]
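To make the passage from the integral form (2.5.4) to the differential form (2.5.7) concrete, here is a minimal one-dimensional sketch. It evaluates the difference quotient from the definition of the Minkowski content for a half-space of measure $p$, for which (2.5.4) holds with equality, and compares it with $\phi(\Phi^{-1}(p))$. The helper names are illustrative and not part of the text.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian on R


def gaussian_profile(p: float) -> float:
    """Differential form (2.5.7): I(p) = phi(Phi^{-1}(p))."""
    return STD.pdf(STD.inv_cdf(p))


def halfspace_blowup(p: float, eps: float) -> float:
    """Integral form (2.5.4) with equality: Phi(Phi^{-1}(p) + eps)."""
    return STD.cdf(STD.inv_cdf(p) + eps)


def minkowski_quotient(p: float, eps: float) -> float:
    """Difference quotient (gamma(H^eps) - gamma(H)) / eps for a half-space H of measure p."""
    return (halfspace_blowup(p, eps) - p) / eps


for p in (0.1, 0.3, 0.5):
    print(p, gaussian_profile(p), minkowski_quotient(p, 1e-6))
```

The two printed columns should agree up to an error of order $\varepsilon$, and the profile is symmetric around $\frac{1}{2}$ with maximum value $1/\sqrt{2\pi}$.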




Actually, the result (2.5.7) is equivalent to the Gaussian isoperimetric inequality (2.5.4).
Here, (2.5.4) is called the integral form of the inequality, whereas (2.5.7) is called the
differential form. The following theorem shows how to convert between the two forms.

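Theorem 2.5.8 ([BH97]). Let $\mathcal{I} : (0, 1) \to \mathbb{R}_{>0}$. Define the increasing function
$F$ such that $F(0) = \frac{1}{2}$ and $f \circ F^{-1} = \mathcal{I}$, where $f = F'$; equivalently, we can take
$F^{-1}(p) = \int_{1/2}^p \mathcal{I}(t)^{-1} \, \mathrm{d}t$. Then, the following statements are equivalent.

1. For all $\varepsilon > 0$ and all Borel $A$ with $\pi(A) \notin \{0, 1\}$,
\[
\pi(A^\varepsilon) \ge F\bigl(F^{-1}(\pi(A)) + \varepsilon\bigr) .
\]

2. For all Borel $A$ with $\pi(A) \notin \{0, 1\}$,
\[
\pi^+(A) \ge \mathcal{I}(\pi(A)) .
\]

Proof sketch. Let $\bar\omega(p, \varepsilon) := F(F^{-1}(p) + \varepsilon)$. Then, $\bar\omega$ satisfies the semigroup property
$\bar\omega(\bar\omega(p, \varepsilon), \varepsilon') = \bar\omega(p, \varepsilon + \varepsilon')$, and using this one can show that to prove $\pi(A^\varepsilon) \ge \bar\omega(\pi(A), \varepsilon)$
it suffices to consider $\varepsilon \searrow 0$. A Taylor expansion yields
\begin{align*}
\pi(A^\varepsilon) &\ge \pi(A) + \pi^+(A) \, \varepsilon + o(\varepsilon) , \\
\bar\omega(\pi(A), \varepsilon) &= \pi(A) + \mathcal{I}(\pi(A)) \, \varepsilon + o(\varepsilon) ,
\end{align*}
from which we deduce that $\pi^+(A) \ge \mathcal{I}(\pi(A))$ for all $A$ if and only if $\pi(A^\varepsilon) \ge \bar\omega(\pi(A), \varepsilon)$
for all $A$ and all $\varepsilon > 0$.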

2.5.2 Cheeger Isoperimetry


We now consider a class of probability measures which is characterized by a lower bound
on the isoperimetric profile.

Definition 2.5.9. A probability measure $\pi$ satisfies a Cheeger isoperimetric inequality
with constant $\mathrm{Ch} > 0$ if for all Borel sets $A \subseteq \mathcal{X}$,
\[
\pi^+(A) \ge \frac{1}{\mathrm{Ch}} \, \pi(A) \, \pi(A^{\mathsf{c}}) . \tag{2.5.10}
\]

For the two-sided exponential density $x \mapsto \mu(x) := \frac{1}{2} \exp(-|x|)$, the isoperimetric
profile is known to be $\mathcal{I}_\mu(p) = \min(p, 1 - p)$. Hence, the Cheeger isoperimetric inequality
roughly asserts that the isoperimetric properties of $\pi$ are at least as good as those of $\mu$.
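The following sketch checks the stated profile on half-lines only; it does not verify that half-lines are extremal. For $A = (-\infty, a]$ under the two-sided exponential law, the Minkowski content is the density at $a$, and the code confirms that it equals $\min(p, 1 - p)$ with $p = \mu(A)$, which in turn dominates $p \, (1 - p)$, i.e., (2.5.10) with $\mathrm{Ch} = 1$ on these sets. The function names are illustrative.

```python
import math


def laplace_pdf(a: float) -> float:
    """Density of the two-sided exponential law, mu(a) = exp(-|a|) / 2."""
    return 0.5 * math.exp(-abs(a))


def laplace_cdf(a: float) -> float:
    """CDF of the two-sided exponential law."""
    return 0.5 * math.exp(a) if a <= 0 else 1.0 - 0.5 * math.exp(-a)


for a in (-3.0, -1.0, 0.0, 0.5, 2.0):
    p = laplace_cdf(a)
    boundary = laplace_pdf(a)  # Minkowski content of the half-line (-inf, a]
    print(f"a={a:+.1f}  mu+(A)={boundary:.4f}  min(p,1-p)={min(p, 1 - p):.4f}  p(1-p)={p * (1 - p):.4f}")
```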
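The inequality in (2.5.10) is the differential form of the inequality. By applying
Theorem 2.5.8, one shows that the inequality (2.5.10) implies, for any $\varepsilon \in [0, \mathrm{Ch}]$,
\[
\pi(A^\varepsilon) - \pi(A) \ge \frac{\varepsilon}{2 \, \mathrm{Ch}} \, \pi(A) \, \pi(A^{\mathsf{c}}) . \tag{2.5.11}
\]
For all $\varepsilon > 0$, Theorem 2.5.8 also implies that
\[
\alpha_\pi(\varepsilon) \le \exp\Bigl(-\frac{\varepsilon}{\mathrm{Ch}}\Bigr) ,
\]
so $\pi$ enjoys at least subexponential concentration.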
Such isoperimetric inequalities will play a key role when we study Metropolis-adjusted
sampling algorithms in Chapter 7. For now, however, our goal is to establish an equivalence
between the Cheeger isoperimetric inequality and a functional version of it.
To pass from a functional inequality to an inequality involving sets, we can usually
apply the functional inequality to the indicator of a set. To go the other way around, we
need to represent a function via its level sets, which is achieved via the coarea inequality.

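Theorem 2.5.12 (coarea inequality). Let $f : \mathcal{X} \to \mathbb{R}$ be Lipschitz. Then,
\[
\int \|\nabla f\| \, \mathrm{d}\pi \ge \int_{-\infty}^\infty \pi^+\{f > t\} \, \mathrm{d}t .
\]

Remark 2.5.13. On a general metric space $(\mathcal{X}, \mathsf{d})$, we define
\[
\|\nabla f\|(x) := \limsup_{y \in \mathcal{X},\ \mathsf{d}(x, y) \searrow 0} \frac{|f(x) - f(y)|}{\mathsf{d}(x, y)} .
\]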

In “nice” spaces, the coarea inequality is actually an equality, but we will not need this.
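Proof. By an approximation argument we may assume that $f$ is bounded, and by adding
a constant to $f$ we may suppose $f \ge 0$. Let $f_\varepsilon(x) := \sup_{\mathsf{d}(x, \cdot) < \varepsilon} f$ and $A_t := \{f > t\}$. We
can check that $A_t^\varepsilon = \{f_\varepsilon > t\}$, and for $g \ge 0$ we have the formula $\int g \, \mathrm{d}\pi = \int_0^\infty \pi\{g > t\} \, \mathrm{d}t$.
By applying this to $g = f$ and $g = f_\varepsilon$,
\[
\int \frac{f_\varepsilon - f}{\varepsilon} \, \mathrm{d}\pi = \int_0^\infty \frac{\pi(A_t^\varepsilon) - \pi(A_t)}{\varepsilon} \, \mathrm{d}t .
\]
Now let $\varepsilon \searrow 0$ using Fatou’s lemma and dominated convergence.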

Theorem 2.5.14. Let 𝜋 ∈ P1 (X) and let Ch > 0. The following are equivalent.

1. 𝜋 satisfies a Cheeger isoperimetric inequality with constant Ch.

2. For all Lipschitz 𝑓 : X → R, it holds that

\[
\mathbb{E}_\pi |f - \mathbb{E}_\pi f| \le 2 \, \mathrm{Ch} \, \mathbb{E}_\pi \|\nabla f\| . \tag{2.5.15}
\]

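Proof sketch. (2) $\Longrightarrow$ (1): Apply (2.5.15) to an approximation $f$ of the indicator function
$\mathbb{1}_A$, so that $\mathbb{E}_\pi |f - \mathbb{E}_\pi f| \approx 2 \, \pi(A) \, (1 - \pi(A))$ and $\mathbb{E}_\pi \|\nabla f\| \approx \pi^+(A)$.
(1) $\Longrightarrow$ (2): Let $A_t := \{f > t\}$. Applying the coarea inequality and the Cheeger
isoperimetric inequality,
\begin{align*}
2 \, \mathrm{Ch} \, \mathbb{E}_\pi \|\nabla f\| &\ge 2 \, \mathrm{Ch} \int_{-\infty}^\infty \pi^+\{f > t\} \, \mathrm{d}t \\
&\ge 2 \int_{-\infty}^\infty \pi(A_t) \, \pi(A_t^{\mathsf{c}}) \, \mathrm{d}t = \int_{-\infty}^\infty \mathbb{E}_\pi |\mathbb{1}_{A_t} - \pi(A_t)| \, \mathrm{d}t \\
&\ge \sup_{\|g\|_{L^\infty(\pi)} \le 1} \int_{-\infty}^\infty \Bigl( \int g \, \{\mathbb{1}_{A_t} - \pi(A_t)\} \, \mathrm{d}\pi \Bigr) \, \mathrm{d}t \\
&= \sup_{\|g\|_{L^\infty(\pi)} \le 1} \int_{-\infty}^\infty \Bigl( \int \{g - \mathbb{E}_\pi g\} \, \mathbb{1}_{A_t} \, \mathrm{d}\pi \Bigr) \, \mathrm{d}t = \sup_{\|g\|_{L^\infty(\pi)} \le 1} \int \{g - \mathbb{E}_\pi g\} \, f \, \mathrm{d}\pi \\
&= \mathbb{E}_\pi |f - \mathbb{E}_\pi f| .
\end{align*}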

2.5.3 𝐿𝑝 –𝐿𝑞 Poincaré Inequalities


In this section, we work on Euclidean space for simplicity.
The inequality (2.5.15) can be considered an “𝐿 1 variant” of the Poincaré inequality.
More generally, we can define the following family of inequalities.

Definition 2.5.16 ($L^p$–$L^q$ Poincaré inequality). For $p, q \in [1, \infty]$ with $q \ge p$, the
$L^p$–$L^q$ Poincaré inequality asserts that for all smooth $f : \mathbb{R}^d \to \mathbb{R}$,
\[
\|f - \mathbb{E}_\pi f\|_{L^p(\pi)} \le C_{p,q} \, \|\nabla f\|_{L^q(\pi)} .
\]

In this new notation, the usual Poincaré inequality is an $L^2$–$L^2$ Poincaré inequality
with $C_{2,2} = \sqrt{C_{\mathrm{PI}}}$, whereas the inequality (2.5.15) is an $L^1$–$L^1$ Poincaré inequality.
These inequalities form a hierarchy via Hölder’s inequality.

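Proposition 2.5.17 ([Mil09]). Suppose $p, q, \hat p, \hat q \in [1, \infty]$ are such that $p \le \hat p$ and
$q \le \hat q$, and $p^{-1} - q^{-1} = \hat p^{-1} - \hat q^{-1}$. Then,
\[
C_{\hat p, \hat q} \lesssim \frac{\hat p}{p} \, C_{p,q} .
\]

Proof. Let $f$ satisfy $\operatorname{med}_\pi f = 0$, which we can arrange by adding a constant. Define the
function $g := (\operatorname{sgn} f) \, |f|^{\hat p / p}$, which still satisfies $\operatorname{med}_\pi g = 0$. By using the equivalence
between the mean and the median (Lemma 2.4.4) and applying the $L^p$–$L^q$ Poincaré inequality
to $g$ together with Hölder’s inequality,
\begin{align*}
\|f - \operatorname{med}_\pi f\|_{L^{\hat p}(\pi)}^{\hat p / p} &= \|g - \operatorname{med}_\pi g\|_{L^p(\pi)} \lesssim \|g - \mathbb{E}_\pi g\|_{L^p(\pi)} \le C_{p,q} \, \|\nabla g\|_{L^q(\pi)} \\
&= \frac{\hat p}{p} \, C_{p,q} \, \bigl\| |f|^{\hat p / p - 1} \, \|\nabla f\| \bigr\|_{L^q(\pi)} \\
&\le \frac{\hat p}{p} \, C_{p,q} \, \|f - \operatorname{med}_\pi f\|_{L^{\hat p}(\pi)}^{\hat p / p - 1} \, \|\nabla f\|_{L^{\hat q}(\pi)} ,
\end{align*}
where we leave it to the reader to check that the exponents work out correctly. If we
rearrange this inequality and apply Lemma 2.4.4 again, then
\[
\|f - \mathbb{E}_\pi f\|_{L^{\hat p}(\pi)} \lesssim \|f - \operatorname{med}_\pi f\|_{L^{\hat p}(\pi)} \lesssim \frac{\hat p}{p} \, C_{p,q} \, \|\nabla f\|_{L^{\hat q}(\pi)} .
\]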
Thus, we have the following implications: for any 𝑝 ∈ (2, ∞),

(𝐿 1 –𝐿 1 ) =⇒ (𝐿 2 –𝐿 2 ) =⇒ · · · =⇒ (𝐿𝑝 –𝐿𝑝 ) .

In particular, the first implication together with the equivalence in Theorem 2.5.14 shows
that the Cheeger isoperimetric inequality implies the Poincaré inequality.

Also, given any 𝐿𝑝 –𝐿𝑞 Poincaré inequality, by Jensen’s inequality we can trivially
make it weaker by decreasing 𝑝 or increasing 𝑞; hence, every 𝐿𝑝 –𝐿𝑞 Poincaré inequality
implies an 𝐿 1 –𝐿 ∞ Poincaré inequality. On the other hand, for any 1 ≤ 𝑝 ≤ 𝑞 < ∞, an
𝐿 1 –𝐿 1 Poincaré implies an 𝐿𝑝 –𝐿𝑝 Poincaré, which trivially implies an 𝐿𝑝 –𝐿𝑞 Poincaré
inequality. We conclude that among these inequalities, the 𝐿 1 –𝐿 1 inequality is the strongest
and the 𝐿 1 –𝐿 ∞ inequality is the weakest.
We now wish to sketch the proof of a deep result by E. Milman, which states that for
log-concave measures, the hierarchy can be reversed. The formal statement is as follows.

Theorem 2.5.18 (reversing the hierarchy). Let $\pi \in \mathcal{P}(\mathbb{R}^d)$ be log-concave. Then,
\[
C_{1,1} \lesssim C_{1,\infty} .
\]

As a consequence, suppose that $\pi$ is $\alpha$-strongly log-concave. By the Bakry–Émery
theorem (Theorem 1.2.28), $\pi$ satisfies a Poincaré inequality with $C_{2,2}^2 = C_{\mathrm{PI}} \le 1/\alpha$. By
reversing the hierarchy, we see that this implies a Cheeger isoperimetric inequality.

Corollary 2.5.19. If $\pi \in \mathcal{P}(\mathbb{R}^d)$ is $\alpha$-strongly log-concave, then $\pi$ satisfies a Cheeger
isoperimetric inequality with constant $\mathrm{Ch} \lesssim 1/\sqrt{\alpha}$.

The proof of Milman’s theorem will require some preparations. The first fact that
we need is a deep result in its own right. Typically it is proven with geometric measure
theory by studying the isoperimetric problem, and we omit the proof.

Theorem 2.5.20. If 𝜋 is log-concave, then its isoperimetric profile I𝜋 is concave.

The isoperimetric profile satisfies $\mathcal{I}_\pi(0) = 0$, and it is symmetric around $\frac{1}{2}$, so it suffices
to consider $p \in [0, \frac{1}{2}]$. By concavity,
\[
\mathcal{I}_\pi(p) \ge \mathcal{I}_\pi\bigl(\tfrac{1}{2}\bigr) \, p . \tag{2.5.21}
\]
Hence, in order to prove Cheeger’s isoperimetric inequality, we need only find a suitable
lower bound for $\mathcal{I}_\pi(\frac{1}{2})$.
The next idea is that instead of applying the $L^1$–$L^\infty$ Poincaré inequality directly to an indicator function
1𝐴 , we will first regularize 1𝐴 using the Langevin semigroup (𝑃𝑡 )𝑡 ≥0 with stationary
distribution 𝜋. We start with a semigroup calculation.

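Proposition 2.5.22. Assume that the Markov semigroup $(P_t)_{t \ge 0}$ is reversible and
satisfies the curvature-dimension condition $\mathrm{CD}(\alpha, \infty)$ for some $\alpha \in \mathbb{R}$.

1. For all $f$ and $t \ge 0$,
\[
P_t(f^2) - (P_t f)^2 \ge \frac{\exp(2\alpha t) - 1}{\alpha} \, \Gamma(P_t f, P_t f) , \tag{2.5.23}
\]
where we interpret $\frac{\exp(2\alpha t) - 1}{\alpha} = 2t$ when $\alpha = 0$.

2. If $\alpha = 0$, then for all $t > 0$ and $p \in [2, \infty]$,
\[
\bigl\| \sqrt{\Gamma(P_t f, P_t f)} \bigr\|_{L^p(\pi)} \le \frac{1}{\sqrt{2t}} \, \|f\|_{L^p(\pi)} \tag{2.5.24}
\]
and
\[
\|f - P_t f\|_{L^1(\pi)} \le \sqrt{2t} \, \bigl\| \sqrt{\Gamma(f, f)} \bigr\|_{L^1(\pi)} . \tag{2.5.25}
\]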

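Proof. We recall the calculation that we performed for the local Poincaré inequality
(Theorem 2.2.11):
\[
\partial_s \bigl[ P_s\bigl((P_{t-s} f)^2\bigr) \bigr] = 2 \, P_s \Gamma(P_{t-s} f, P_{t-s} f) .
\]
From the second statement of Theorem 2.2.11, we have
\[
\Gamma(P_s g, P_s g) \le \exp(-2\alpha s) \, P_s \Gamma(g, g) .
\]
Taking $g = P_{t-s} f$, we deduce that
\[
P_s \Gamma(P_{t-s} f, P_{t-s} f) \ge \exp(2\alpha s) \, \Gamma(P_t f, P_t f) .
\]
Integrating this from $s = 0$ to $s = t$,
\[
P_t(f^2) - (P_t f)^2 \ge 2 \, \Gamma(P_t f, P_t f) \int_0^t \exp(2\alpha s) \, \mathrm{d}s = \frac{\exp(2\alpha t) - 1}{\alpha} \, \Gamma(P_t f, P_t f) .
\]
This establishes the first inequality. In particular, for $\alpha = 0$,
\[
\sqrt{\Gamma(P_t f, P_t f)} \le \frac{1}{\sqrt{2t}} \sqrt{P_t(f^2)}
\]
so that
\[
\bigl\| \sqrt{\Gamma(P_t f, P_t f)} \bigr\|_{L^p(\pi)} \le \frac{1}{\sqrt{2t}} \, \{\mathbb{E}_\pi P_t(|f|^p)\}^{1/p} \le \frac{1}{\sqrt{2t}} \, \|f\|_{L^p(\pi)} .
\]
The last inequality is the dual of the $p = \infty$ case. Indeed, for $g$ with $\|g\|_{L^\infty(\pi)} \le 1$,
\[
\partial_t \int (f - P_t f) \, g \, \mathrm{d}\pi = -\int P_t \mathscr{L} f \, g \, \mathrm{d}\pi = \int \Gamma(f, P_t g) \, \mathrm{d}\pi
\]
and hence, by the Cauchy–Schwarz inequality for the carré du champ (Exercise 1.6),
\begin{align*}
\int (f - P_t f) \, g \, \mathrm{d}\pi &= \int_0^t \Bigl( \int \Gamma(f, P_s g) \, \mathrm{d}\pi \Bigr) \, \mathrm{d}s \le \int_0^t \Bigl( \int \sqrt{\Gamma(f, f) \, \Gamma(P_s g, P_s g)} \, \mathrm{d}\pi \Bigr) \, \mathrm{d}s \\
&\le \bigl\| \sqrt{\Gamma(f, f)} \bigr\|_{L^1(\pi)} \int_0^t \bigl\| \sqrt{\Gamma(P_s g, P_s g)} \bigr\|_{L^\infty(\pi)} \, \mathrm{d}s \\
&\le \bigl\| \sqrt{\Gamma(f, f)} \bigr\|_{L^1(\pi)} \, \|g\|_{L^\infty(\pi)} \int_0^t \frac{1}{\sqrt{2s}} \, \mathrm{d}s \le \sqrt{2t} \, \bigl\| \sqrt{\Gamma(f, f)} \bigr\|_{L^1(\pi)} .
\end{align*}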

Remark 2.5.26. The inequality (2.5.23) can be regarded as a “reverse Poincaré” inequality
because it upper bounds the size of the gradient via a variance term. Similarly to Theo-
rem 2.2.11, the inequality (2.5.23) can also be shown to be equivalent to CD(𝛼, ∞).
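The reverse Poincaré inequality (2.5.23) can be probed numerically in a simple case. The sketch below assumes the one-dimensional Ornstein–Uhlenbeck semigroup, $P_t f(x) = \mathbb{E} f(e^{-t} x + \sqrt{1 - e^{-2t}} \, Z)$ with $Z$ standard Gaussian, which corresponds to the standard Gaussian target, satisfies $\mathrm{CD}(1, \infty)$, and has $\Gamma(f, f) = (f')^2$. The quadrature rule, the finite-difference step, and all names are illustrative choices rather than anything prescribed by the text.

```python
import math


def ou_semigroup(f, x, t, n=4001, width=10.0):
    """Approximate P_t f(x) for the OU semigroup by a Riemann sum over z in [-width, width]."""
    a = math.exp(-t)
    s = math.sqrt(1.0 - a * a)
    h = 2.0 * width / (n - 1)
    total = 0.0
    for i in range(n):
        z = -width + i * h
        total += f(a * x + s * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def reverse_poincare_gap(f, x, t, dx=1e-4):
    """Left side minus right side of (2.5.23) with alpha = 1; nonnegative up to numerical error."""
    variance_term = ou_semigroup(lambda y: f(y) ** 2, x, t) - ou_semigroup(f, x, t) ** 2
    # Gamma(P_t f, P_t f) = (d/dx P_t f)^2, estimated by a central difference.
    grad = (ou_semigroup(f, x + dx, t) - ou_semigroup(f, x - dx, t)) / (2.0 * dx)
    return variance_term - (math.exp(2.0 * t) - 1.0) * grad ** 2


for x in (-1.0, 0.0, 0.7, 2.0):
    print(x, reverse_poincare_gap(math.sin, x, 0.5), reverse_poincare_gap(lambda y: y, x, 0.5))
```

For the linear function $f(x) = x$ the gap vanishes (up to rounding), so (2.5.23) holds with equality in this example, whereas for $f = \sin$ the gap is strictly positive.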
We are now ready to prove Milman’s theorem.

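Proof of Milman’s theorem, Theorem 2.5.18. By approximating the indicator function $\mathbb{1}_A$
with a smooth function and applying the inequality (2.5.25), we can justify the bound
\[
\sqrt{2t} \, \pi^+(A) \ge \|\mathbb{1}_A - P_t \mathbb{1}_A\|_{L^1(\pi)} .
\]
Next, a calculation shows that
\begin{align*}
\mathbb{E}_\pi |\mathbb{1}_A - P_t \mathbb{1}_A| &= 2 \, \bigl\{ \pi(A) \, \pi(A^{\mathsf{c}}) - \mathbb{E}_\pi\bigl[ (\mathbb{1}_A - \pi(A)) \, (P_t \mathbb{1}_A - \pi(A)) \bigr] \bigr\} \\
&\ge 2 \, \bigl\{ \pi(A) \, \pi(A^{\mathsf{c}}) - \underbrace{\|\mathbb{1}_A - \pi(A)\|_{L^\infty(\pi)}}_{\le 1} \, \|P_t \mathbb{1}_A - \pi(A)\|_{L^1(\pi)} \bigr\} .
\end{align*}
From the $L^1$–$L^\infty$ Poincaré inequality and (2.5.24),
\[
\|P_t \mathbb{1}_A - \pi(A)\|_{L^1(\pi)} \le C_{1,\infty} \, \|\nabla P_t \mathbb{1}_A\|_{L^\infty(\pi)} \le \frac{C_{1,\infty}}{\sqrt{2t}} \, \|\mathbb{1}_A\|_{L^\infty(\pi)} \le \frac{C_{1,\infty}}{\sqrt{2t}} .
\]
Hence, we have
\[
\sqrt{2t} \, \pi^+(A) \ge 2 \, \Bigl\{ \pi(A) \, \pi(A^{\mathsf{c}}) - \frac{C_{1,\infty}}{\sqrt{2t}} \Bigr\} .
\]
Now choose $\sqrt{2t} = 2 C_{1,\infty} / (\pi(A) \, \pi(A^{\mathsf{c}}))$ to obtain
\[
\pi^+(A) \ge \frac{1}{2 C_{1,\infty}} \, \pi(A)^2 \, \pi(A^{\mathsf{c}})^2 .
\]
This inequality is not fully satisfactory, but if we take $\pi(A) = \frac{1}{2}$ then we deduce from this
that $\mathcal{I}_\pi(\frac{1}{2}) \ge 1/(32 \, C_{1,\infty})$, and from (2.5.21) we conclude.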
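The choice of $t$ at the end of the proof can be double-checked with a few lines of arithmetic. In the sketch below, `s` stands for $\sqrt{2t}$, `m` for $\pi(A) \, \pi(A^{\mathsf{c}})$, and `c` for $C_{1,\infty}$; these variable names and the sample values are illustrative assumptions. The code evaluates the lower bound $\pi^+(A) \ge 2 \, (m - c/s) / s$, confirms that $s = 2c/m$ yields $m^2 / (2c)$, and compares against nearby values of $s$.

```python
def milman_lower_bound(s: float, m: float, c: float) -> float:
    """Lower bound on pi^+(A) implied by s * pi^+(A) >= 2 (m - c / s), where s = sqrt(2t)."""
    return 2.0 * (m - c / s) / s


def milman_choice(m: float, c: float) -> float:
    """The value sqrt(2t) = 2 c / m used in the proof."""
    return 2.0 * c / m


m, c = 0.25, 1.0  # pi(A) = 1/2 gives m = 1/4; c = 1 is an arbitrary illustrative constant
s_star = milman_choice(m, c)
print(s_star, milman_lower_bound(s_star, m, c), m ** 2 / (2.0 * c))
print(milman_lower_bound(0.5 * s_star, m, c), milman_lower_bound(2.0 * s_star, m, c))
```

With $\pi(A) = \frac{1}{2}$ and $c = 1$ the bound evaluates to $1/32$, matching the constant in the proof.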

2.5.4 Gaussian Isoperimetry


The Cheeger isoperimetric inequality asserts that for small $p$, $\mathcal{I}_\pi(p) \gtrsim p$. On the other
hand, one can check that the Gaussian isoperimetric profile $\mathcal{I}_{\gamma_d}(p) = \phi(\Phi^{-1}(p))$ has the
asymptotics $\mathcal{I}_{\gamma_d}(p) \sim p \sqrt{2 \ln(1/p)}$ as $p \searrow 0$.
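The stated asymptotics can be observed numerically, although the convergence is slow. The sketch below (illustrative names only) prints the ratio of $\phi(\Phi^{-1}(p))$ to $p \sqrt{2 \ln(1/p)}$ for decreasing $p$; the ratio creeps up towards $1$.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian on R


def profile_ratio(p: float) -> float:
    """Ratio of the Gaussian isoperimetric profile to its small-p asymptotic p * sqrt(2 ln(1/p))."""
    exact = STD.pdf(STD.inv_cdf(p))
    return exact / (p * math.sqrt(2.0 * math.log(1.0 / p)))


ratios = [profile_ratio(10.0 ** (-k)) for k in (2, 4, 8, 12)]
print(ratios)
```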

As with the Cheeger isoperimetric inequality, isoperimetry of Gaussian type can also
be captured via a functional inequality. The following result is due to Bobkov.

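Theorem 2.5.27 (Gaussian isoperimetry, functional form). Suppose that $\pi \in \mathcal{P}(\mathbb{R}^d)$
is $\alpha$-strongly log-concave for some $\alpha > 0$. Then, for all $f : \mathbb{R}^d \to [0, 1]$,
\[
\sqrt{\alpha} \, \mathcal{I}_{\gamma_d}(\mathbb{E}_\pi f) \le \mathbb{E}_\pi \sqrt{\alpha \, \mathcal{I}_{\gamma_d}(f)^2 + \Gamma(f, f)} . \tag{2.5.28}
\]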

As this formulation suggests, Theorem 2.5.27 has a proof via Markov semigroup theory,
for which we refer readers to [BGL14, §8.5.2]. By converting the functional inequality
back into an isoperimetric statement, one can deduce the following comparison theorem.

Theorem 2.5.29 (Gaussian isoperimetry comparison theorem). Suppose that the
measure $\pi \in \mathcal{P}(\mathbb{R}^d)$ is $\alpha$-strongly log-concave for some $\alpha > 0$. Then,
\[
\mathcal{I}_\pi \ge \sqrt{\alpha} \, \mathcal{I}_{\gamma_d} .
\]

We explore these results further in the exercises.



2.6 Metric Measure Spaces


In this section, we revisit the curvature-dimension condition. As we hinted at in Sec-
tion 2.2.1, the commutation relation which underlies the curvature-dimension condition
captures the underlying curvature of both the ambient space and the measure. This
observation leads not only to an extension of the ideas we have been considering thus far
to weighted Riemannian manifolds, but in fact provides an avenue towards developing
geometric analysis on non-smooth spaces which a priori have no differential structure.
The key to this program is that, whereas the Ricci curvature tensor cannot be defined on
such spaces, the convexity of the KL divergence w.r.t. an appropriate Wasserstein space
continues to make sense as a “synthetic” notion of a Ricci curvature lower bound.
To aid the reader who is unfamiliar with Riemannian geometry, we will provide a
brief review of the main concepts. In this section, we shall omit many of the proofs, as
the goal is simply to acquaint the reader with the general picture in a geometric context
without delving into the details.

2.6.1 Riemannian Geometry


Basic concepts. We recall some of the definitions from Section 1.3.2. A Riemannian
manifold $\mathcal{M}$ is a space which is locally homeomorphic to a Euclidean space, such that
at every point $p \in \mathcal{M}$ there is an associated vector space $T_p\mathcal{M}$, called the tangent space
to $\mathcal{M}$ at $p$, equipped with an inner product $\langle \cdot, \cdot \rangle_p$. The tangent space $T_p\mathcal{M}$ represents the
velocities of all curves passing through $p$. We can collect together the different tangent
spaces into a single object called the tangent bundle,
\[
T\mathcal{M} := \bigcup_{p \in \mathcal{M}} \bigl( \{p\} \times T_p\mathcal{M} \bigr) .
\]
The Riemannian metric $p \mapsto \langle \cdot, \cdot \rangle_p$ is required to be smooth in a suitable sense.


A smooth function $f : \mathcal{M} \to \mathbb{R}$ has a differential $\mathrm{d}f : T\mathcal{M} \to \mathbb{R}$, defined as follows.
Given a point $p \in \mathcal{M}$ and a tangent vector $v \in T_p\mathcal{M}$, let $(p_t)_{t \in \mathbb{R}}$ be a curve on $\mathcal{M}$ with
$p_0 = p$ and with velocity $v$ at time $0$. Then, $(\mathrm{d}f)_p v := \partial_t|_{t=0} f(p_t)$. One can check that
this definition does not depend on the choice of curve $(p_t)_{t \in \mathbb{R}}$ and that $(\mathrm{d}f)_p$ is a linear
function on $T_p\mathcal{M}$. Note that the differential can be defined on any manifold, even if it does
not have a Riemannian structure, but $(\mathrm{d}f)_p$ is not an element of $T_p\mathcal{M}$; it is an element
of the dual space $T_p^*\mathcal{M}$, called the cotangent space. The Riemannian metric allows us
to identify $(\mathrm{d}f)_p$ with an element of $T_p\mathcal{M}$: there is a unique vector $\nabla f(p) \in T_p\mathcal{M}$ such
that for all $v \in T_p\mathcal{M}$, it holds that $(\mathrm{d}f)_p v = \langle \nabla f(p), v \rangle_p$. The vector $\nabla f(p)$ is called the
gradient of $f$ at $p$. The gradient depends on the choice of the metric, and we can then
define the gradient flow of $f$ to be a curve $(p_t)_{t \ge 0}$ such that the velocity $\dot p_t$ of the curve
equals $-\nabla f(p_t)$ for all $t \ge 0$.
We pause to give a simple example. Suppose that $\mathcal{M} = \mathbb{R}^d$, and we pick a smooth
mapping $p \mapsto A_p$ where $A_p$ is a positive definite $d \times d$ matrix for each $p \in \mathbb{R}^d$. This
induces a Riemannian metric via $\langle u, v \rangle_p := \langle u, A_p v \rangle$, where $\langle \cdot, \cdot \rangle$ (without a subscript)
denotes the usual Euclidean inner product. If $(p_t)_{t \in \mathbb{R}}$ is a smooth curve in $\mathbb{R}^d$ and its usual
time derivative is $(\dot p_t)_{t \in \mathbb{R}}$, then we know that $\partial_t f(p_t) = \langle \nabla f(p_t), \dot p_t \rangle = \langle A_{p_t}^{-1} \nabla f(p_t), \dot p_t \rangle_{p_t}$.
Hence, the manifold gradient $\nabla_{\mathcal{M}} f$ is given by $\nabla_{\mathcal{M}} f(p) = A_p^{-1} \nabla f(p)$, where $\nabla f$ is the
Euclidean gradient. When $A_p = \nabla^2 \phi(p)$ is obtained as the Hessian of a mapping $\phi$, then
$\mathcal{M}$ is called a Hessian manifold.
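The identity defining the manifold gradient in this example can be verified on a small instance. The sketch below works in $\mathbb{R}^2$ with one fixed symmetric positive definite matrix standing in for $A_p$ at a single point; the specific matrix, vectors, and helper names are hypothetical. It checks that $\langle A_p^{-1} \nabla f, v \rangle_p = \langle \nabla f, v \rangle$.

```python
def mat_vec(A, v):
    """Product of a 2x2 matrix (nested tuples) with a 2-vector."""
    return (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])


def dot(u, v):
    """Euclidean inner product on R^2."""
    return u[0] * v[0] + u[1] * v[1]


def solve_2x2(A, b):
    """Solve A x = b by Cramer's rule (A is assumed invertible)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return ((b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det)


def metric_inner(A, u, v):
    """Riemannian inner product <u, v>_p := <u, A_p v>."""
    return dot(u, mat_vec(A, v))


def riemannian_gradient(A, euclidean_grad):
    """Manifold gradient A_p^{-1} grad f(p)."""
    return solve_2x2(A, euclidean_grad)


A_p = ((2.0, 1.0), (1.0, 3.0))  # symmetric positive definite (illustrative choice)
grad_f = (1.0, -2.0)            # Euclidean gradient of f at p (illustrative choice)
v = (0.5, 4.0)                  # an arbitrary tangent vector

manifold_grad = riemannian_gradient(A_p, grad_f)
print(manifold_grad, metric_inner(A_p, manifold_grad, v), dot(grad_f, v))
```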
A vector field on $\mathcal{M}$ is a mapping $X : \mathcal{M} \to T\mathcal{M}$ such that $X(p) \in T_p\mathcal{M}$ for all $p \in \mathcal{M}$.7
A single vector $v \in T_p\mathcal{M}$ can be thought of as a differential operator; for $f \in \mathcal{C}^\infty(\mathcal{M})$, we
can define the action of $v$ on $f$ via $v(f) := (\mathrm{d}f)_p v$. Similarly, a vector field $X$ acts on $f$ and
produces a function $Xf : \mathcal{M} \to \mathbb{R}$, defined by $Xf(p) = X(p) f$ for all $p \in \mathcal{M}$. For example,
on $\mathbb{R}^d$, a vector field can be identified with a mapping $\mathbb{R}^d \to \mathbb{R}^d$, and it differentiates
functions via $Xf(p) = \langle \nabla f(p), X(p) \rangle$.
We would also like to differentiate vector fields along other vector fields, and there are
two main ways of doing so. The first is called the Lie derivative, and it can be defined on
any smooth manifold without the need for a Riemannian metric, and is consequently less
important for our discussion. The second is the Levi–Civita connection, which given
vector fields 𝑋 and 𝑌 , outputs another vector field ∇𝑋 𝑌 . This connection is characterized
by various properties, including compatibility with the Riemannian metric: for all vector
fields $X$, $Y$, and $Z$, we have the product rule
\[
Z \langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla_Z Y \rangle . \tag{2.6.1}
\]
Here, the vector field $Z$ is differentiating the scalar function $p \mapsto \langle X(p), Y(p) \rangle_p$. Since we
do not aim to perform many Riemannian calculations here, we omit most of the other
properties for simplicity. However, we mention one key fact, which is that for any smooth
function $f$, it holds that $\nabla_{fX} Y = f \, \nabla_X Y$, where $fX$ is the vector field $(fX)(p) = f(p) \, X(p)$.
This property implies that the mapping $(X, Y) \mapsto \nabla_X Y$ is tensorial in its first argument,
that is, $(\nabla_X Y)(p)$ only depends on the value $X(p)$ of $X$ at $p$.
The tensorial property of the Levi–Civita connection allows us to compute the derivative
of a vector field $Y$ along a curve $c : \mathbb{R} \to \mathcal{M}$. Namely, for $t \in \mathbb{R}$, we can define
$D_c Y(t) := (\nabla_{\dot c(t)} Y)(c(t))$, which makes sense because we can extend $\dot c$ to a vector field
$X$ on $\mathcal{M}$ and deduce that $(\nabla_X Y)(c(t))$ only depends on $X(c(t)) = \dot c(t)$ (and not on the
choice of extension $X$). Then, $D_c Y$ is called the covariant derivative of $Y$ along the curve
$c$. From there, we can define the parallel transport of a vector $v_0 \in T_{c(0)}\mathcal{M}$ along the
curve $c$ to be the unique vector field $(v(t))_{t \in \mathbb{R}}$ defined along the curve $c$ with $v(0) = v_0$
such that the covariant derivative vanishes: $D_c v = 0$. The parallel transport is a canonical
way of identifying two different tangent spaces on $\mathcal{M}$. Due to compatibility with the
metric, it has the property that if $c(0) = p$, $c(1) = q$, and $P_c v \in T_q\mathcal{M}$ denotes the parallel
transport of $v \in T_p\mathcal{M}$ along $c$ for time $1$, then $P_c : T_p\mathcal{M} \to T_q\mathcal{M}$ is an isometry.

7 Geometers would say that $X$ is a section of the tangent bundle.

We have already seen the idea of a length-minimizing curve, or a geodesic. Recall
that the Riemannian metric induces a distance on $\mathcal{M}$ via
\[
\mathsf{d}(p, q) = \inf\Bigl\{ \int_0^1 \|\dot\gamma(t)\|_{\gamma(t)} \, \mathrm{d}t \Bigm| \gamma(0) = p,\ \gamma(1) = q \Bigr\} .
\]
If there is a minimizing constant-speed curve $\gamma$ in this variational problem, we say that $\gamma$
is a geodesic joining $p$ and $q$. By taking the first variation of this problem, one shows that
a necessary condition for $\gamma$ to be a geodesic is for the covariant derivative of its velocity to
vanish: $D_\gamma \dot\gamma = 0$. We will write this, however, with the more familiar notation $\ddot\gamma = 0$, which
in Euclidean space means that there is zero acceleration (and hence Euclidean geodesics
are straight lines). The converse is not true; if $\ddot\gamma = 0$, it does not imply that $\gamma$ must be a
shortest path between its endpoints (but it means that $\gamma$ is locally a shortest path).
If 𝑝 ∈ M and 𝑣 ∈ 𝑇𝑝 M, then exp𝑝 (𝑣) is defined to be the endpoint (at time 1) of a
constant-speed geodesic emanating from 𝑝 with velocity 𝑣, if such a geodesic exists. In
general, the exponential map may only be defined in a neighborhood of 0 on 𝑇𝑝 M. The
logarithmic map is the inverse of the exponential map: given 𝑞 ∈ M, log𝑝 (𝑞) is the unique
vector 𝑣 ∈ 𝑇𝑝 M, if this is well-defined, such that exp𝑝 (𝑣) = 𝑞.
Given a vector field $X$ on $\mathcal{M}$, the divergence of $X$ is the function $\operatorname{div} X : \mathcal{M} \to \mathbb{R}$
defined as follows: $(\operatorname{div} X)(p)$ is the trace8 of the linear mapping $v \mapsto (\nabla_v X)(p)$ on
$T_p\mathcal{M}$. Also, for $f \in \mathcal{C}^\infty(\mathcal{M})$, we define the Hessian of $f$ at $p$ to be the bilinear mapping
$\nabla^2 f(p) : T_p\mathcal{M} \times T_p\mathcal{M} \to \mathbb{R}$ given by
\[
\nabla^2 f(p)[v, w] = \langle \nabla_v \nabla f(p), w \rangle_p .
\]
Even though we have not given all of the definitions precisely, we will now work through
one example to give the reader the flavor of the computations. Suppose that $(p_t)_{t \in \mathbb{R}}$ is a
curve on $\mathcal{M}$; then we know that $\partial_t f(p_t) = \langle \nabla f(p_t), \dot p_t \rangle_{p_t}$. If $g(p) := \langle \nabla f(p), \dot p \rangle_p$, then by
definition we have $\partial_t^2 f(p_t) = \dot p_t(g)(p_t)$. By compatibility of the Levi–Civita connection
with the metric (2.6.1), this equals $\langle \nabla_{\dot p_t} \nabla f(p_t), \dot p_t \rangle_{p_t} + \langle \nabla f(p_t), \nabla_{\dot p_t} \dot p_t \rangle_{p_t}$. The first term is
$\nabla^2 f(p_t)[\dot p_t, \dot p_t]$. For the second term, if $p$ is a geodesic, then the term $\nabla_{\dot p_t} \dot p_t$ vanishes. This
shows that it is convenient to pick geodesic curves when computing Hessians.9

8 The trace is defined as usual, namely if $A$ is a linear mapping on $T_p\mathcal{M}$, then after choosing an arbitrary
orthonormal basis $e_1, \dots, e_d$ of $T_p\mathcal{M}$ (w.r.t. the Riemannian metric), we have $\operatorname{tr} A = \sum_{i=1}^d \langle e_i, A e_i \rangle_p$.

For 𝑓 ∈ C ∞ (M), the Laplacian of 𝑓 is the function Δ𝑓 : M → R defined by
Δ𝑓 B tr ∇2 𝑓 . In the Riemannian setting, Δ is usually called the Laplace–Beltrami
operator. The Riemannian metric induces a volume measure, which we always denote
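via $\mathfrak{m}$. Throughout, when we abuse notation to refer to the density of an absolutely
continuous measure $\mu \in \mathcal{P}(\mathcal{M})$, we always refer to the density w.r.t. the volume measure,
i.e., $\frac{\mathrm{d}\mu}{\mathrm{d}\mathfrak{m}}$. We have the integration by parts formula
\[
\int \Delta f \, g \, \mathrm{d}\mathfrak{m} = \int f \, \Delta g \, \mathrm{d}\mathfrak{m} = -\int \langle \nabla f, \nabla g \rangle \, \mathrm{d}\mathfrak{m} ,
\]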

provided that there are no boundary terms.

Curvature. For a two-dimensional surface, it is easier to define the notion of curva-


ture: one has the Gaussian curvature, which associates to each point 𝑝 ∈ M a single
number 𝐾 (𝑝) ∈ R. It is the product of the two principal curvatures at 𝑝. The celebrated
Theorema Egregium (“remarkable theorem”) of Gauss asserts that the Gaussian curvature
is unchanged under local isometries, i.e., the Gaussian curvature is intrinsic to the surface.
(In contrast, there are other extrinsic notions of curvature, such as the mean curvature,
which rely on the embedding of the manifold in Euclidean space.)
In higher dimensions, we are not so fortunate and it requires much more geometric
information to fully capture the idea of curvature. In fact, at each point 𝑝 ∈ M, we
associate to it a 4-tensor, called the Riemann curvature tensor. It is defined as follows:
given vector fields $W$, $X$, $Y$, and $Z$,

$$\mathrm{Riem}(W, X, Y, Z) := \langle \nabla_X \nabla_W Y - \nabla_W \nabla_X Y + \nabla_{[W,X]} Y, Z \rangle\,.$$

Here, $[W, X]$ is the Lie bracket of $W$ and $X$, which is the vector field $U$ defined as the
commutator: $U f := W X f - X W f$. This tensor is obviously an unwieldy object, and it is
unclear whether anyone fully understands its complexities. Nevertheless, we may begin
to get a handle on it by observing that at its core, it measures the lack of commutativity of
certain differential operators, which we stated was the basis for curvature in Section 2.2.1.
9 When computing first-order derivatives, it is only important that the first-order behavior of the curve is
correct (i.e., the curve has the correct tangent vector). When computing second-order derivatives, it should
come as no surprise that the second-order behavior of the curve begins to matter.
Incidentally, if ∇𝑓 (𝑝𝑡 ) = 0, i.e., we are at a stationary point, then the second term vanishes regardless
of the curve 𝑝. Hence, the Hessian of 𝑓 can be defined on any smooth manifold without the need for a
Riemannian metric, provided that we restrict ourselves to stationary points of 𝑓 . This observation is used
heavily in Morse theory.

On Euclidean space, it vanishes: Riem = 0. Also, the Riemann curvature tensor is fully
determined by the sectional curvatures of M: given a two-dimensional subspace 𝑆 of
𝑇𝑝 M, the sectional curvature of 𝑆 can be defined as the Gaussian curvature of the two-
dimensional surface obtained by following geodesics with directions in 𝑆. Thus, we can
view the Riemann curvature tensor as collecting together all of the curvature information
from two-dimensional slices.
Luckily, the Riemann curvature tensor contains information that is too detailed for our
purposes. With an eye towards probabilistic applications, we focus mainly on properties
such as the distortion of volumes of balls along geodesics, which only requires looking at
certain averages of the Riemann curvature. More specifically, for $u, v \in T_p\mathcal{M}$, let

$$\mathrm{Ric}_p(u, v) := \operatorname{tr} \mathrm{Riem}(u, \cdot, v, \cdot)\,.$$

The tensor Ric is called the Ricci curvature tensor. It is a powerful fact that many
useful geometric and probabilistic results, such as diameter bounds and functional
inequalities, follow from lower bounds on the Ricci curvature.
We also mention that one can further take the trace of the Ricci curvature tensor to
arrive at a single scalar function, known as the scalar curvature, but we shall not use it
in this book.

Diffusions on manifolds. Recall that on $\mathbb{R}^d$, the generator of the standard Brownian
motion is $\frac{1}{2}\Delta$, where $\Delta$ is the Laplacian operator. On a manifold M, we define standard
Brownian motion $(B_t)_{t \ge 0}$ to be the unique M-valued stochastic process with generator $\frac{1}{2}\Delta$,
where $\Delta$ is now the Laplace–Beltrami operator. This means that for all smooth functions
$f : \mathcal{M} \to \mathbb{R}$, we require $t \mapsto f(B_t) - f(B_0) - \int_0^t \frac{1}{2} \Delta f(B_s) \, \mathrm{d}s$ to be a local martingale.
More generally, a stochastic process $(Z_t)_{t \ge 0}$ has generator $\mathscr{L}$ if for all smooth functions
$f : \mathcal{M} \to \mathbb{R}$, the process $t \mapsto f(Z_t) - f(Z_0) - \int_0^t \mathscr{L} f(Z_s) \, \mathrm{d}s$ is a local martingale. When
the generator is $\mathscr{L} f = \Delta f - \langle \nabla V, \nabla f \rangle$ for a smooth function $V : \mathcal{M} \to \mathbb{R}$, this corresponds
to a Langevin diffusion on the manifold. We informally write $\mathrm{d}Z_t = -\nabla V(Z_t) \, \mathrm{d}t + \sqrt{2} \, \mathrm{d}B_t$,
although the “+” symbol has to be interpreted carefully. Under some assumptions, the
stationary distribution 𝜋 of the Langevin diffusion has density 𝜋 ∝ exp(−𝑉 ) w.r.t. the
volume measure 𝔪.
Under appropriate assumptions on ∇𝑉 , the existence and uniqueness of the diffusion
process on the manifold can be proven, e.g., via embedding the manifold in Euclidean
space and using similar arguments as in Section 1.1.3.
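
As a concrete illustration, the following minimal sketch simulates a Langevin diffusion on the simplest compact manifold, the circle $\mathbb{R}/2\pi\mathbb{Z}$, using an Euler–Maruyama discretization in which the "+" is interpreted modulo $2\pi$. The potential $V(\theta) = -\kappa \cos\theta$, the step size, and the function names are illustrative assumptions; the discretization is a stand-in for the continuous-time process and introduces a small bias. The time average of $\cos\theta$ is compared with its mean under $\pi \propto \exp(-V)$, computed by quadrature.

```python
import math
import random

def langevin_on_circle(kappa, step=0.01, n_steps=400_000, seed=0):
    """Euler-Maruyama for dZ = -V'(Z) dt + sqrt(2) dB with V(theta) = -kappa*cos(theta)."""
    rng = random.Random(seed)
    theta, running_sum = 0.0, 0.0
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        drift = -kappa * math.sin(theta)              # -V'(theta)
        theta += drift * step + noise_scale * rng.gauss(0.0, 1.0)
        theta = math.remainder(theta, 2.0 * math.pi)  # wrap back onto the circle
        running_sum += math.cos(theta)
    return running_sum / n_steps                      # time average of cos(theta)

def stationary_mean_cos(kappa, n=2000):
    """Mean of cos(theta) under the density proportional to exp(kappa*cos(theta))."""
    grid = [2.0 * math.pi * i / n for i in range(n)]
    weights = [math.exp(kappa * math.cos(x)) for x in grid]
    return sum(w * math.cos(x) for w, x in zip(weights, grid)) / sum(weights)

estimate = langevin_on_circle(1.0)
exact = stationary_mean_cos(1.0)
print(estimate, exact)
```

The two numbers should be close, up to Monte Carlo error and discretization bias.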

Optimal transport on Riemannian manifolds. We conclude this section by discussing
how the optimal transport problem can be generalized to Riemannian manifolds.

Recall from Exercise 1.11 that the optimal transport problem can be posed with other costs;
in particular, we take the cost to be $c(x, y) = \mathsf{d}(x, y)^2$, where $\mathsf{d}$ is the distance induced by
the Riemannian metric. Suppose, for simplicity, that M is compact and that 𝜇 is absolutely
continuous (w.r.t. the volume measure). Then, there is a unique optimal transport map $T$
from 𝜇 to 𝜈, which is of the form $T(x) = \exp_x(\nabla\psi(x))$, where $-\psi$ is $\mathsf{d}^2/2$-concave.
Moreover, there is a formal Riemannian structure on $\mathcal{P}_{2,\mathrm{ac}}(\mathcal{M})$. We can formally define
the tangent space at 𝜇 to be

$$T_\mu \mathcal{P}_{2,\mathrm{ac}}(\mathcal{M}) := \overline{\{\nabla\psi \mid \psi \in C^\infty(\mathcal{M})\}}^{L^2(\mu)}\,,$$

equipped with the norm $\|\nabla\psi\|_\mu := \sqrt{\int \|\nabla\psi\|^2 \, \mathrm{d}\mu}$. Also, curves of measures are again
characterized by the continuity equation

𝜕𝑡 𝜇𝑡 + div(𝜇𝑡 𝑣𝑡 ) = 0 ,

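where the equation is to be interpreted in a weak sense: for any test function $\varphi \in C^\infty(\mathcal{M})$,
for a.e. $t$, it holds that

$$\partial_t \int \varphi \, \mathrm{d}\mu_t = \int \langle \nabla\varphi, v_t \rangle \, \mathrm{d}\mu_t\,.$$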

In short, aside from new technicalities introduced in the Riemannian setting (such as
the presence of a cut locus10 ), most of the facts familiar to us from the Euclidean setting
continue to hold when generalized appropriately. We refer to [Vil09] for more details.

2.6.2 Metric Geometry


We now depart from the setting of smooth manifolds and consider metric spaces (X, d).

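Definition 2.6.2 (length). Given a continuous curve $\gamma : [0, 1] \to \mathcal{X}$, we define the
length of $\gamma$ to be

$$\operatorname{len} \gamma := \sup\Bigl\{ \sum_{i=1}^n \mathsf{d}\bigl(\gamma(t_i), \gamma(t_{i-1})\bigr) \Bigm| 0 \le t_0 < t_1 < \cdots < t_n \le 1 \Bigr\}\,.$$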

We can check that this definition agrees with the usual notion of length on R𝑑 . By the
triangle inequality, if 𝛾 (0) = 𝑝 and 𝛾 (1) = 𝑞, then d(𝑝, 𝑞) ≤ len 𝛾.
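
The supremum over partitions in Definition 2.6.2 can be explored numerically. The sketch below, with hypothetical helper names, evaluates the partition sums for a semicircle in $\mathbb{R}^2$ along nested uniform partitions: the coarsest partition gives the chord length $\mathsf{d}(p, q) = 2$, and refining the partition increases the sum towards the usual arc length $\pi$.

```python
import math

def euclid_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def partition_sum(gamma, dist, n):
    """Sum of d(gamma(t_i), gamma(t_{i-1})) over the uniform partition with n pieces."""
    ts = [i / n for i in range(n + 1)]
    return sum(dist(gamma(ts[i]), gamma(ts[i - 1])) for i in range(1, n + 1))

def semicircle(t):
    """Upper unit semicircle from (1, 0) to (-1, 0)."""
    return (math.cos(math.pi * t), math.sin(math.pi * t))

# Nested (dyadic) partitions: each refinement can only increase the sum.
lengths = [partition_sum(semicircle, euclid_dist, n) for n in (1, 2, 4, 8, 1024)]
print(lengths)
```

Monotonicity under refinement is exactly the triangle inequality, which is also why $\mathsf{d}(p, q) \le \operatorname{len} \gamma$.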
10 Loosely speaking, the presence of a cut locus means that there are multiple minimizing geodesics
connecting two points. Think for instance of the two poles of a sphere.

Definition 2.6.3. We say that (X, d) is a geodesic space if for all 𝑝, 𝑞 ∈ X, there is a
constant-speed curve 𝛾 : [0, 1] → X such that 𝛾 (0) = 𝑝, 𝛾 (1) = 𝑞, and d(𝑝, 𝑞) = len 𝛾.
Here, “constant speed” implies that for all 𝑠, 𝑡 ∈ [0, 1],

d(𝛾 (𝑠), 𝛾 (𝑡)) = |𝑠 − 𝑡 | d(𝑝, 𝑞) .

The curve 𝛾 is called the geodesic joining 𝑝 to 𝑞.

Geodesic spaces are a broader class of spaces than Riemannian manifolds. In particular,
they do not have to have a smooth structure, and they can have “kinks”. For example,
the Wasserstein space (P2,ac (R𝑑 ),𝑊2 ) is not truly a Riemannian manifold, as it is infinite-
dimensional (along with other issues, e.g., it is not locally homeomorphic to a Hilbert
space), but from the considerations in Section 1.3.2 it follows that the Wasserstein space
is a geodesic space. The study of geodesic spaces is called metric geometry, and a
comprehensive treatment of this subject can be found in [BBI01].
There is a way to generalize the idea of a uniform bound on the sectional curvature to
the setting of geodesic spaces. It is based on comparing the sizes of triangles in X with
the corresponding sizes in a model space.

Definition 2.6.4 (model space). Let $\kappa \in \mathbb{R}$. The model space $\mathbb{M}^2_\kappa$ of curvature $\kappa$ is the
standard two-dimensional Riemannian manifold with constant sectional curvature
equal to $\kappa$, that is:

1. the hyperbolic plane $\mathbb{H}^2$ of curvature $\kappa$ (that is, the usual hyperbolic plane but
with metric rescaled by $1/\sqrt{-\kappa}$) if $\kappa < 0$;

2. the Euclidean plane $\mathbb{R}^2$ if $\kappa = 0$;

3. the rescaled sphere $\mathbb{S}^2/\sqrt{\kappa}$ if $\kappa > 0$.

Definition 2.6.5 (Alexandrov curvature). Let $(\mathcal{X}, \mathsf{d})$ be a geodesic space and let $\kappa \in \mathbb{R}$.
We say that $(\mathcal{X}, \mathsf{d})$ has Alexandrov curvature bounded from below by $\kappa$ (resp.
from above by $\kappa$) if the following holds. For any triple of points $a, b, c \in \mathcal{X}$, and any
corresponding triple of points $\bar a, \bar b, \bar c$ in the model space $\mathbb{M}^2_\kappa$ such that

$$\mathsf{d}(a, b) = \mathsf{d}(\bar a, \bar b)\,, \qquad \mathsf{d}(a, c) = \mathsf{d}(\bar a, \bar c)\,, \qquad \mathsf{d}(b, c) = \mathsf{d}(\bar b, \bar c)\,,$$

for any $p \in \mathcal{X}$ in the geodesic joining $a$ to $c$, and any $\bar p \in \mathbb{M}^2_\kappa$ in the geodesic joining
$\bar a$ to $\bar c$ with $\mathsf{d}(a, p) = \mathsf{d}(\bar a, \bar p)$, it holds that $\mathsf{d}(b, p) \ge \mathsf{d}(\bar b, \bar p)$ (resp. $\mathsf{d}(b, p) \le \mathsf{d}(\bar b, \bar p)$).
If such a curvature bound holds, then $(\mathcal{X}, \mathsf{d})$ is called an Alexandrov space.

Thus, triangles in X are thicker (resp. thinner) than their counterparts in $\mathbb{M}^2_\kappa$. The
advantage of this definition is that it can be stated using only the metric (and geodesic)
structure of X. For the case when 𝜅 = 0, there is another useful reformulation.

Proposition 2.6.6. Let (X, d) be a geodesic space. Then, (X, d) has Alexandrov curva-
ture bounded below by 0 (resp. bounded above by 0) if and only if the following holds.
For any constant-speed geodesic (𝑝𝑡 )𝑡 ∈[0,1] in X, any 𝑞 ∈ X, and any 𝑡 ∈ [0, 1],

$$\mathsf{d}(p_t, q)^2 \ge \text{(resp. } \le \text{)} \; (1 - t) \, \mathsf{d}(p_0, q)^2 + t \, \mathsf{d}(p_1, q)^2 - t \, (1 - t) \, \mathsf{d}(p_0, p_1)^2\,.$$

We saw in Exercise 1.13 that (P2,ac (R𝑑 ),𝑊2 ) has non-negative Alexandrov curvature.
One can show that a Riemannian manifold has sectional curvature bounded (from below or
from above) by 𝜅 if and only if the corresponding Alexandrov curvature bound holds.
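
The inequality of Proposition 2.6.6 is easy to test numerically. The following minimal sketch (with hypothetical helper names) samples random geodesics and base points in two spaces: the Euclidean plane, which has zero curvature so that the inequality should hold with equality, and the unit sphere $\mathbb{S}^2$ with its geodesic distance, which has non-negative curvature so that the "$\ge$" direction should hold. Geodesics on the sphere are computed by spherical linear interpolation, which assumes that the endpoints are not antipodal.

```python
import math
import random

def euclid_dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def sphere_dist(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))   # great-circle distance

def euclid_geodesic(p0, p1, t):
    return tuple((1 - t) * a + t * b for a, b in zip(p0, p1))

def sphere_geodesic(p0, p1, t):
    """Constant-speed minimizing geodesic (spherical linear interpolation)."""
    angle = sphere_dist(p0, p1)
    if angle < 1e-9:
        return p0
    w0 = math.sin((1 - t) * angle) / math.sin(angle)
    w1 = math.sin(t * angle) / math.sin(angle)
    return tuple(w0 * a + w1 * b for a, b in zip(p0, p1))

def comparison_gap(dist, geodesic, p0, p1, q, t):
    """Left-hand side minus right-hand side of the inequality in Proposition 2.6.6."""
    lhs = dist(geodesic(p0, p1, t), q) ** 2
    rhs = ((1 - t) * dist(p0, q) ** 2 + t * dist(p1, q) ** 2
           - t * (1 - t) * dist(p0, p1) ** 2)
    return lhs - rhs

def random_unit_vector(rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

rng = random.Random(0)
sphere_gaps, plane_gaps = [], []
for _ in range(2000):
    t = rng.random()
    p0, p1, q = (random_unit_vector(rng) for _ in range(3))
    sphere_gaps.append(comparison_gap(sphere_dist, sphere_geodesic, p0, p1, q, t))
    a, b, c = (tuple(rng.uniform(-1, 1) for _ in range(2)) for _ in range(3))
    plane_gaps.append(comparison_gap(euclid_dist, euclid_geodesic, a, b, c, t))

print(min(sphere_gaps), max(abs(gap) for gap in plane_gaps))
```

On the sphere the smallest gap should be non-negative up to rounding, while in the plane all gaps should vanish up to rounding.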
Alexandrov curvature bounds enforce enough regularity that a satisfactory infinites-
imal theory can be developed for Alexandrov spaces. For instance, one can define the
notion of a tangent cone11 , and in the case of the Wasserstein space, its tangent cone
coincides with the definition of the tangent space that we gave in Section 1.3.2; see [AGS08,
§12.4] for details.

2.6.3 Geometry of Markov Semigroups


We now indicate how Markov semigroup proofs can be extended to the setting of a
weighted Riemannian manifold M with a reference measure 𝜋 which admits a density
𝜋 ∝ exp(−𝑉 ) w.r.t. the volume measure 𝔪.
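Consider the Langevin diffusion on M with generator ℒ given by

$$\mathscr{L} f := \Delta f - \langle \nabla V, \nabla f \rangle\,.$$

As before, we can compute the carré du champ to be

$$\Gamma(f, f) = \|\nabla f\|^2\,.$$

For the iterated carré du champ,

$$\Gamma_2(f, f) = \frac{1}{2} \, \{\mathscr{L}(\|\nabla f\|^2) - 2 \, \langle \nabla f, \nabla \mathscr{L} f \rangle\} = \frac{1}{2} \, \{\Delta(\|\nabla f\|^2) - 2 \, \langle \nabla f, \nabla \Delta f \rangle\} + \langle \nabla f, \nabla^2 V \, \nabla f \rangle\,.$$

Unlike in Section 2.2.1, however, we now have to apply the Bochner identity

$$\frac{1}{2} \, \Delta(\|\nabla f\|^2) = \langle \nabla f, \nabla \Delta f \rangle + \|\nabla^2 f\|_{\mathrm{HS}}^2 + \mathrm{Ric}(\nabla f, \nabla f) \tag{2.6.7}$$

which shows that

$$\Gamma_2(f, f) = \|\nabla^2 f\|_{\mathrm{HS}}^2 + \langle \nabla f, (\mathrm{Ric} + \nabla^2 V) \, \nabla f \rangle\,.$$

11 In general, this is only a cone and not a vector space, because of the possibility of kinks.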

Observe that in this formula, the curvature of the ambient space and the curvature of the
measure are placed on an equal footing through the tensor $\mathrm{Ric} + \nabla^2 V$. If $\mathrm{Ric} + \nabla^2 V \succeq \alpha$,
in the sense that $\mathrm{Ric}(X, X) + \langle X, \nabla^2 V \, X \rangle \ge \alpha \, \|X\|^2$ for any vector field $X$ on M, then
the curvature-dimension condition $\Gamma_2 \ge \alpha \, \Gamma$ holds. Since the proof of the Bakry–Émery
theorem (Theorem 1.2.28) only relied on the CD(𝛼, ∞) condition (together with calculus
rules for the Markov semigroup, such as the chain rule), the theorem continues to hold in
the setting of weighted Riemannian manifolds.
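
The formula for the iterated carré du champ can be verified exactly in the simplest flat case. The sketch below assumes the one-dimensional Euclidean setting, where $\mathrm{Ric} = 0$, $\mathscr{L} f = f'' - V' f'$, and the formula reduces to $\Gamma_2(f, f) = (f'')^2 + V'' (f')^2$. Polynomials are represented by coefficient lists (lowest degree first), so all computations are exact; the helper names and the choices of $f$ and $V$ are illustrative.

```python
from fractions import Fraction

def trim(p):
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p

def add(p, q):
    n = max(len(p), len(q))
    return trim([(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0) for i in range(n)])

def scale(c, p):
    return trim([c * a for a in p])

def mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return trim(out)

def der(p):
    return trim([i * p[i] for i in range(1, len(p))] or [0])

def generator(f, V):
    """L f = f'' - V' f' (one-dimensional Langevin generator)."""
    return add(der(der(f)), scale(-1, mul(der(V), der(f))))

def gamma(f):
    """Carre du champ: Gamma(f, f) = (f')^2."""
    return mul(der(f), der(f))

def gamma2(f, V):
    """Definition: Gamma_2(f, f) = (1/2) L(Gamma(f, f)) - f' (L f)'."""
    first = scale(Fraction(1, 2), generator(gamma(f), V))
    second = scale(-1, mul(der(f), der(generator(f, V))))
    return add(first, second)

def gamma2_closed_form(f, V):
    """Claimed formula in the flat case: (f'')^2 + V'' (f')^2."""
    return add(mul(der(der(f)), der(der(f))), mul(der(der(V)), gamma(f)))

f = [1, -2, 0, 1]      # f(x) = 1 - 2x + x^3
V = [0, 0, 1, 0, 1]    # V(x) = x^2 + x^4
print(gamma2(f, V) == gamma2_closed_form(f, V))
```

On a curved manifold the same computation produces the additional term $\mathrm{Ric}(\nabla f, \nabla f)$ through the Bochner identity, which this flat sketch cannot see.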
Actually, we can refine the condition further as follows. If dim M = 𝑑, then

$$\|\nabla^2 f\|_{\mathrm{HS}}^2 \ge \frac{1}{d} \, (\operatorname{tr} \nabla^2 f)^2 = \frac{1}{d} \, (\Delta f)^2\,.$$
This observation motivates the following definition.

Definition 2.6.8. A Markov semigroup is said to satisfy the curvature-dimension
condition with curvature lower bound 𝛼 and dimension bound 𝑑, denoted CD(𝛼, 𝑑),
if for all functions 𝑓 ,

$$\Gamma_2(f, f) \ge \alpha \, \Gamma(f, f) + \frac{1}{d} \, (\mathscr{L} f)^2\,. \tag{2.6.9}$$

As the name suggests, the following theorem holds.

Theorem 2.6.10. Let M be a complete Riemannian manifold with volume measure
𝔪, and let 𝛼 > 0, 𝑑 ≥ 1. Consider the Markov semigroup associated with standard
Brownian motion on M. Then, the following two statements are equivalent.

1. CD(𝛼, 𝑑) holds.

2. Ric ⪰ 𝛼 and dim M ≤ 𝑑.

As an example, one can show that the unit sphere S𝑑 satisfies Ric = 𝑑 − 1, so that the
CD(𝑑 − 1, 𝑑) condition holds. Then, using the Bakry–Émery theorem (Theorem 1.2.28), or
by using Markov semigroup calculus to prove that the curvature-dimension condition
implies Bobkov’s functional form of the Gaussian isoperimetric inequality (Theorem 2.5.27;
see [BGL14, Corollary 8.5.4]), one can now deduce results such as concentration on the
sphere (Theorem 2.5.2).
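
As a worked check of the constants, consider the restriction to the unit sphere of a linear function, $f(x) = \langle a, x \rangle$ with $\|a\| = 1$. The computation below assumes the generator $\mathscr{L} = \Delta$ (with no potential) and the standard facts $\nabla^2 f = -f \, \mathsf{g}$ and $\Delta f = -d \, f$ for such functions, where $\mathsf{g}$ denotes the metric; these facts are not derived in the text. Under these assumptions, the CD(𝑑 − 1, 𝑑) inequality holds with equality for $f$:

```latex
\begin{align*}
  \Gamma(f, f) &= \|\nabla f\|^2 = 1 - f^2\,, \\
  \|\nabla^2 f\|_{\mathrm{HS}}^2 &= \|{-f \, \mathsf{g}}\|_{\mathrm{HS}}^2 = d \, f^2\,, \\
  \Gamma_2(f, f) &= \|\nabla^2 f\|_{\mathrm{HS}}^2 + \mathrm{Ric}(\nabla f, \nabla f)
    = d \, f^2 + (d - 1) \, (1 - f^2)\,, \\
  (d - 1) \, \Gamma(f, f) + \frac{1}{d} \, (\Delta f)^2
    &= (d - 1) \, (1 - f^2) + \frac{1}{d} \, d^2 f^2
    = (d - 1) \, (1 - f^2) + d \, f^2\,.
\end{align*}
```

In particular, neither the curvature parameter nor the dimension parameter can be improved for the sphere.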
Besides providing an abstract framework for deriving functional inequalities, it is
worth noting that the condition (2.6.9) no longer makes any mention of the ambient space
except through the Markov semigroup (𝑃𝑡 )𝑡 ≥0 and its associated operators ℒ, Γ, and Γ2 .
This has led to a line of research investigating to what extent we can study the intrinsic
geometry associated with a Markov semigroup. Although we do not intend
to survey the literature here, we show one illustrative example to give the flavor of the
results. First, one shows that the CD(𝛼, 𝑑) condition implies a Sobolev inequality.

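Theorem 2.6.11. Consider a diffusion Markov semigroup satisfying the CD(𝛼, 𝑑) condition
for some 𝛼 > 0 and 𝑑 > 2. Then, for all $p \in [1, \frac{2d}{d-2}]$ and all functions 𝑓 ,

$$\frac{1}{p - 2} \, \Bigl\{ \Bigl( \int |f|^p \, \mathrm{d}\pi \Bigr)^{2/p} - \int f^2 \, \mathrm{d}\pi \Bigr\} \le \frac{d - 1}{\alpha d} \int \Gamma(f, f) \, \mathrm{d}\pi\,. \tag{2.6.12}$$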

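From this Sobolev inequality, one can then deduce a diameter bound for the Markov
semigroup. Here, the diameter is defined as follows:

$$\operatorname{diam} (P_t)_{t \ge 0} := \sup\Bigl\{ \pi\text{-}\operatorname*{ess\,sup}_{x, y \in \mathcal{X}} |f(x) - f(y)| \Bigm| \|\Gamma(f, f)\|_{L^\infty(\pi)} \le 1 \Bigr\}\,.$$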

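Theorem 2.6.13. Suppose that the Markov semigroup $(P_t)_{t \ge 0}$ satisfies the Sobolev
inequality (2.6.12). Then,

$$\operatorname{diam} (P_t)_{t \ge 0} \le \pi \, \sqrt{\frac{d - 1}{\alpha}}\,,$$

where, in this display only, $\pi \approx 3.14$ denotes the mathematical constant rather than the
reference measure.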

The diameter bound is sharp, as it is attained by the sphere, and together with Theo-
rem 2.6.11 it recovers the classical Bonnet–Myers diameter bound from Riemannian
geometry. Other geometric results obtained in this fashion include volume growth com-
parison results and heat kernel bounds.
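
The sharpness claim can be checked by direct arithmetic. The sketch below assumes the standard fact that a round sphere of dimension $d$ and radius $r$ has $\mathrm{Ric} = (d - 1)/r^2$ and geodesic diameter $\pi r$ (only the unit sphere is discussed in the text); the function names are hypothetical.

```python
import math

def diameter_bound(alpha, d):
    """Right-hand side of the diameter bound: pi * sqrt((d - 1) / alpha)."""
    return math.pi * math.sqrt((d - 1) / alpha)

def sphere_ricci_lower_bound(d, radius=1.0):
    """Ricci curvature of the round d-dimensional sphere of the given radius (assumed fact)."""
    return (d - 1) / radius ** 2

for d in (3, 4, 10):
    for radius in (0.5, 1.0, 3.0):
        bound = diameter_bound(sphere_ricci_lower_bound(d, radius), d)
        print(d, radius, bound, math.pi * radius)
```

In each case the bound coincides with the actual diameter $\pi r$, so the inequality is attained.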

2.6.4 The Lott–Sturm–Villani Theory of Synthetic Ricci Curvature


The other perspective with which we can encode geometry is the optimal transport
perspective. Namely, in Section 1.4, we informally argued that in the Euclidean context,
the 𝛼-strong convexity of the KL divergence $\mathrm{KL}(\cdot \,\|\, \pi)$ on $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ is equivalent to
the 𝛼-strong convexity of the potential 𝑉 . At this stage, it is perhaps an expected, although
still remarkable, fact that on a general weighted Riemannian manifold M, the 𝛼-strong
convexity of $\mathrm{KL}(\cdot \,\|\, \pi)$ on $(\mathcal{P}_2(\mathcal{M}), W_2)$ is equivalent to the CD(𝛼, ∞) condition, which in
turn is equivalent to $\mathrm{Ric} + \nabla^2 V \succeq \alpha$.
There are also ways to formulate the general CD(𝛼, 𝑑) condition via displacement
convexity, but they are considerably more complicated, and we omit them for simplicity.
If (X, d) is a geodesic space, then (P2 (X),𝑊2 ) is also a geodesic space, which is suffi-
cient to define displacement convexity. Hence, we can work in the setting of Section 2.6.2,
together with the additional data of a reference measure 𝜋 ∈ P (X). In general, technical
issues arise when geodesics on X can “branch” off into multiple geodesics, and so we ought
to impose a mild non-branching assumption; however, we will ignore this technicality.
We can then formulate the following definition.

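Definition 2.6.14. Let $(\mathcal{X}, \mathsf{d}, \pi)$ be a metric measure space, where $(\mathcal{X}, \mathsf{d})$ is a geodesic
space. Then, we say that $(\mathcal{X}, \mathsf{d}, \pi)$ satisfies the CD(𝛼, ∞) condition if for all measures
$\mu_0, \mu_1 \in \mathcal{P}_2(\mathcal{X})$, there exists a constant-speed geodesic $(\mu_t)_{t \in [0,1]}$ joining $\mu_0$ to $\mu_1$ with

$$\mathrm{KL}(\mu_t \,\|\, \pi) \le (1 - t) \, \mathrm{KL}(\mu_0 \,\|\, \pi) + t \, \mathrm{KL}(\mu_1 \,\|\, \pi) - \frac{\alpha \, t \, (1 - t)}{2} \, W_2^2(\mu_0, \mu_1)\,,$$

for all $t \in [0, 1]$.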
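
The definition can be tested in closed form for one-dimensional Gaussians. The sketch below takes $\pi = \mathcal{N}(0, 1)$, which corresponds to $V(x) = x^2/2$ and hence $\alpha = 1$, and uses the standard formulas $\mathrm{KL}(\mathcal{N}(m, s^2) \,\|\, \pi) = \frac{1}{2}(s^2 + m^2 - 1) - \ln s$ and $W_2^2 = (m_0 - m_1)^2 + (s_0 - s_1)^2$, together with the fact that the $W_2$ geodesic between two Gaussians interpolates the mean and the standard deviation linearly. These formulas are assumptions not derived in the text, and the function names are hypothetical.

```python
import math

def kl_to_standard_gaussian(m, s):
    """KL(N(m, s^2) || N(0, 1))."""
    return 0.5 * (s * s + m * m - 1.0) - math.log(s)

def w2_squared(m0, s0, m1, s1):
    """Squared 2-Wasserstein distance between N(m0, s0^2) and N(m1, s1^2)."""
    return (m0 - m1) ** 2 + (s0 - s1) ** 2

def geodesic(m0, s0, m1, s1, t):
    """Parameters (mean, std) of the W2 geodesic at time t."""
    return (1 - t) * m0 + t * m1, (1 - t) * s0 + t * s1

def convexity_gap(m0, s0, m1, s1, t, alpha=1.0):
    """Right-hand side minus left-hand side of the CD(alpha, infinity) inequality."""
    m_t, s_t = geodesic(m0, s0, m1, s1, t)
    rhs = ((1 - t) * kl_to_standard_gaussian(m0, s0)
           + t * kl_to_standard_gaussian(m1, s1)
           - 0.5 * alpha * t * (1 - t) * w2_squared(m0, s0, m1, s1))
    return rhs - kl_to_standard_gaussian(m_t, s_t)

endpoints = [(0.0, 1.0, 2.0, 1.0), (-1.0, 0.3, 3.0, 2.5), (0.5, 4.0, 0.5, 0.1)]
gaps = [convexity_gap(*e, t=k / 20) for e in endpoints for k in range(21)]
print(min(gaps))
```

The gap is non-negative for $\alpha = 1$. For a pure shift of the mean it vanishes identically, so the inequality fails for any $\alpha > 1$, matching the equivalence with the strong convexity of $V$.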

We now pause to discuss the motivation behind the introduction of this definition.
Unlike the statement Ric ⪰ 𝛼, which only makes sense on Riemannian manifolds (and
hence requires a smooth structure), the above definition makes sense on a wider class
of spaces, including non-smooth spaces. The question of to what extent the concept of
curvature makes sense on non-smooth spaces is perhaps an interesting question in its
own right, but it also arises even when one is solely interested in smooth Riemannian
manifolds. Suppose, for instance, that we have a sequence of Riemannian manifolds
(M𝑘 )𝑘∈N that is converging in some sense to a limit space M; what properties of the
sequence are preserved in the limit?
If we want to pass to the limit in the condition $\mathrm{Ric}_{\mathcal{M}_k} \succeq \alpha$, then typically we would
need the Ricci curvature tensors $\mathrm{Ric}_{\mathcal{M}_k}$ to be converging in the limit. Since curvature
involves two derivatives of the metric, this holds if the sequence converges in a $C^2$ sense.
However, for some applications, this notion of convergence is too strong. Instead, it is
common to work with Gromov–Hausdorff convergence, which is based on a notion
of distance between metric spaces. More specifically, it metrizes the space¹² of compact
metric spaces. Moreover, this notion of convergence is weak enough that it admits a
useful compactness theorem.
As a consequence of the compactness theorem, a sequence of Riemannian manifolds
(M𝑘 )𝑘∈N with a uniform upper bound on the diameter and a uniform lower bound on
the Ricci curvature converges to a limit space M in the Gromov–Hausdorff topology.
However, in this topology, the space of Riemannian manifolds with diameter $\le D$ and
with $\mathrm{Ric} \succeq \alpha$ is not closed; the limit space M is not necessarily a Riemannian manifold.
So what then is M? It is a geodesic space, but understanding whether it can be said to
satisfy “$\mathrm{Ric}_{\mathcal{M}} \succeq \alpha$” requires developing a theory of Ricci curvature lower bounds that
makes sense on such spaces.
An analogy is in order. For a function $f : \mathbb{R}^d \to \mathbb{R}$, convexity can be described via the
Hessian, $\nabla^2 f \succeq 0$, or via the property

$$f\bigl((1 - t) \, x + t \, y\bigr) \le (1 - t) \, f(x) + t \, f(y)\,, \qquad \text{for all } x, y \in \mathbb{R}^d\,, \; t \in [0, 1]\,.$$

The former definition only makes sense for $C^2$ functions, whereas the latter definition
makes sense for any function. The former is called the analytic definition, whereas the
latter is called the synthetic definition. Although the analytic definition is often more
intuitive, the synthetic definition is more general and more useful for technical arguments.
For example, from the synthetic definition it is apparent that convexity is preserved under
pointwise convergence, whereas from the analytic definition one needs the stronger
notion of $C^2$ convergence.
From this perspective, the definition of Alexandrov curvature bounds in Section 2.6.2
is the synthetic counterpart to sectional curvature bounds from Riemannian geometry.
However, as we have already seen, sectional curvature bounds are often too strong for
geometric purposes, as we can obtain a wide array of geometric consequences (spectral
gap estimates, log-Sobolev and Sobolev inequalities, diameter bounds, volume growth
estimates, heat kernel bounds, etc.) from Ricci curvature lower bounds. Here, the curvature-
dimension condition provides us with synthetic Ricci curvature lower bounds.
By deducing geometric facts from the CD(𝛼, ∞) condition, one shows that spaces
satisfying the CD(𝛼, ∞) condition, despite the lack of smoothness, enjoy many of the
good properties shared by Riemannian manifolds satisfying Ric ⪰ 𝛼. To complete the
program described in this section, we should ask whether synthetic Ricci curvature lower
bounds are preserved under a weak notion of convergence. The correct notion to consider
is an extension of Gromov–Hausdorff convergence to take into account the reference
measure, called measured Gromov–Hausdorff convergence.

12 The space of all compact metric spaces is too large to be a set (it is a proper class). However, if we
choose one representative from each isometry class of metric spaces, then this is a bona fide set.

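Definition 2.6.15. Let $(\mathcal{X}_k, \mathsf{d}_k, \pi_k)_{k \in \mathbb{N}}$ be a sequence of compact metric measure
spaces. We say that the sequence converges to $(\mathcal{X}, \mathsf{d}, \pi)$ in the measured Gromov–
Hausdorff topology if there is a sequence $(f_k)_{k \in \mathbb{N}}$ of maps $f_k : \mathcal{X}_k \to \mathcal{X}$ with:

1. $\sup_{x_k, x_k' \in \mathcal{X}_k} |\mathsf{d}(f_k(x_k), f_k(x_k')) - \mathsf{d}_k(x_k, x_k')| = o(1)$;

2. $\sup_{x \in \mathcal{X}} \inf_{x_k \in \mathcal{X}_k} \mathsf{d}(f_k(x_k), x) = o(1)$;

3. $(f_k)_\# \pi_k \to \pi$ weakly.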

The following stability result is a key achievement of the theory of synthetic Ricci
curvature, arrived at simultaneously by Lott and Villani [LV09] and Sturm [Stu06a; Stu06b].

Theorem 2.6.16 (stability of synthetic Ricci curvature bounds). Let (X𝑘 , d𝑘 , 𝜋𝑘 )𝑘∈N →
(X, d, 𝜋) in the measured Gromov–Hausdorff topology. Let 𝛼 ∈ R and 𝑑 ≥ 1. If each
(X𝑘 , d𝑘 , 𝜋𝑘 ) satisfies CD(𝛼, 𝑑), then so does (X, d, 𝜋).

Note that we have not defined the CD(𝛼, 𝑑) condition for 𝑑 < ∞ in this context; we
refer readers to the original sources for the full treatment.

2.6.5 Discussion
A remark on the settings of the results. Throughout this chapter, we have not been
careful to state in what generality the various results hold. Certainly the results hold
on the Euclidean space R𝑑 , and with appropriate modifications they continue to hold on
weighted Riemannian manifolds.
The results based on optimal transport (e.g., results on transport inequalities) typically
hold on general Polish spaces. The theory of synthetic Ricci curvature makes sense on
geodesic spaces (with mild regularity conditions).
The results based on Markov semigroup theory only require an abstract space X on
which there is a Markov semigroup (𝑃𝑡 )𝑡 ≥0 satisfying various properties (e.g., a chain rule
for the carré du champ). Although this usually arises from a diffusion on a Riemannian
manifold, one can also start with a Dirichlet energy functional on a metric space and
develop a theory of non-smooth analysis. See [AGS15] for further discussion on how the
two approaches may be reconciled in a quite general setting.

Comparison between the two approaches. The discussion thus far has been rather
abstract, and it may be difficult to grasp how the two main approaches (Bakry–Émery
theory and optimal transport) can capture geometric information such as the curvature.
Here, we will briefly provide some intuition for this connection following [Vil09, §14].
Starting with the optimal transport perspective, fix $x_0 \in \mathcal{M}$ and a mapping $\nabla\psi$. For
$t \ge 0$, let $x_t := \exp_{x_0}(t \, \nabla\psi(x_0))$, and let $\delta > 0$. If $e_1, \dots, e_d$ is an orthonormal basis of $T_{x_0}\mathcal{M}$,
in an abuse of notation let $x_0 + \delta e_i$ denote a point obtained by travelling along a curve
emanating from $x_0$ with velocity $e_i$ for time $\delta$. The points $(x_0 + \delta e_i)_{i=1}^d$ form the vertices
of a parallelepiped $A_0^\delta$. On the other hand, for $t > 0$, we can consider pushing the point
$x_0 + \delta e_i$ along the exponential map to obtain a new point $\exp_{x_0 + \delta e_i}(t \, \nabla\psi(x_0 + \delta e_i))$. These
points form the vertices of a new parallelepiped $A_t^\delta$.
In terms of measures, let $\mu_0^\delta$ denote the uniform measure on $A_0^\delta$, and $\mu_t^\delta = \exp(t \, \nabla\psi)_\# \mu_0^\delta$,
so that $\mu_t^\delta$ is approximately the uniform measure on $A_t^\delta$. Then, the displacement convexity
of entropy states that

$$\ln \frac{1}{\mathfrak{m}(A_t^\delta)} \le (1 - t) \ln \frac{1}{\mathfrak{m}(A_0^\delta)} + t \ln \frac{1}{\mathfrak{m}(A_1^\delta)} + o(1)$$

as $\delta \searrow 0$. On the other hand, the infinitesimal change in volume is governed by the
Jacobian determinant

$$\frac{\mathfrak{m}(A_t^\delta)}{\mathfrak{m}(A_0^\delta)} \to \mathcal{J}(t, x) := \det J(t, x)\,,$$

where $J_i(t, x) := \partial_\delta|_{\delta = 0} \exp_{x_0 + \delta e_i}(t \, \nabla\psi(x_0 + \delta e_i))$. Hence, the displacement convexity yields

$$\ln \mathcal{J}(t, x) \ge (1 - t) \ln \mathcal{J}(0, x) + t \ln \mathcal{J}(1, x)\,. \tag{2.6.17}$$

In Euclidean space, we have the formula $\mathcal{J}(t, x) = |\det(I_d + t \, \nabla^2\psi(x))|$, but the situation
is more complicated on a Riemannian manifold because there is also a change of volume
due to curvature. To account for this, one can derive an equation for $J$, known as the
Jacobi equation:

$$\ddot J(t, x) + R(t, x) \, J(t, x) = 0\,,$$

where $R(t, x) := \mathrm{Riem}_{x_t}(\dot x_t, \cdot, \dot x_t, \cdot)$. By taking the trace and performing some computations,
we arrive at

$$\partial_t^2 \ln \mathcal{J}(t, x) = -\|J^{-1}(t, x) \, \dot J(t, x)\|_{\mathrm{HS}}^2 - \mathrm{Ric}_{x_t}(\dot x_t, \dot x_t)\,. \tag{2.6.18}$$

By comparing (2.6.17) and (2.6.18), we now obtain a hint as to how optimal transport
captures curvature: displacement convexity of the entropy is related to concavity of the
logarithm of the Jacobian determinant, which in turn is tied to Ricci curvature lower bounds.
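
In the Euclidean case, where the Ricci term vanishes, the link between the two displays can be checked by hand. The following minimal sketch assumes $J(t) = I_d + t \, \nabla^2\psi$ and works in an eigenbasis of the Hessian, so that $\ln \det J(t) = \sum_i \ln(1 + t \lambda_i)$ and $\|J^{-1} \dot J\|_{\mathrm{HS}}^2 = \sum_i \lambda_i^2 / (1 + t \lambda_i)^2$. The eigenvalues below are illustrative, chosen so that $1 + t \lambda_i > 0$ on $[0, 1]$, and the function names are hypothetical.

```python
import math

def log_jacobian(eigenvalues, t):
    """ln det(I + t * Hess(psi)), in terms of the Hessian eigenvalues."""
    return sum(math.log(1.0 + t * lam) for lam in eigenvalues)

def second_derivative_formula(eigenvalues, t):
    """-||J^{-1} J'||_HS^2 for J(t) = I + t * Hess(psi)."""
    return -sum((lam / (1.0 + t * lam)) ** 2 for lam in eigenvalues)

def second_derivative_numeric(eigenvalues, t, h=1e-4):
    """Central finite difference for the second time derivative of ln det J(t)."""
    return (log_jacobian(eigenvalues, t + h) - 2.0 * log_jacobian(eigenvalues, t)
            + log_jacobian(eigenvalues, t - h)) / h ** 2

hessian_eigenvalues = [0.5, -0.3, 2.0]

for k in range(1, 10):
    t = k / 10
    # Concavity of t -> ln det J(t), i.e., the inequality (2.6.17).
    chord = ((1 - t) * log_jacobian(hessian_eigenvalues, 0.0)
             + t * log_jacobian(hessian_eigenvalues, 1.0))
    print(t, log_jacobian(hessian_eigenvalues, t) - chord,
          second_derivative_numeric(hessian_eigenvalues, t),
          second_derivative_formula(hessian_eigenvalues, t))
```

The second derivative of $\ln \det J(t)$ is non-positive, which is the concavity expressed by (2.6.17); on a manifold with $\mathrm{Ric} \succeq 0$ the extra Ricci term only reinforces this.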
The calculations above are performed with the Lagrangian description of fluid flows,
as they follow a single trajectory $t \mapsto x_t$. If we switch to the Eulerian perspective, then
we are led to define the vector field $\nabla\psi_t$ as follows: $\nabla\psi_t(x)$ is the velocity $\dot x_t$ of the
curve $t \mapsto \exp_x(t \, \nabla\psi(x))$ at time $t$. By reformulating the Jacobi equation in the Eulerian
perspective, we arrive precisely at the Bochner identity (2.6.7) for 𝜓 which, as we saw
in Section 2.6.3, underlies the curvature-dimension condition from the Bakry–Émery
perspective. In this sense, the two approaches to curvature are dual.

2.7 Discrete Space and Time


Up until this point, we have been focusing on continuous-time Markov processes on a
continuous state space. In this section, we give a few pointers on what may break down
in discrete space or discrete time. Our treatment here is far from comprehensive.

Discrete space. For Markov processes on a discrete state space, we can still define the
Markov semigroup, generator, carré du champ, and Dirichlet form. The main difference
is that the carré du champ is now a finite difference operator, rather than a differential
operator, and consequently it fails to satisfy a chain rule.
Crucially, this difference manifests itself for the log-Sobolev inequality, which we have
written in this chapter as

$$\operatorname{ent}_\pi(f^2) \le 2C \, \mathscr{E}(f, f) \qquad \text{for all } f\,. \tag{2.7.1}$$

On the other hand, recall from Theorem 1.2.20 (which still holds for discrete state spaces)
that the exponential decay of the KL divergence is equivalent to the inequality

$$\operatorname{ent}_\pi(f) \le \frac{C}{2} \, \mathscr{E}(f, \ln f) \qquad \text{for all } f \ge 0\,. \tag{2.7.2}$$
When the carré du champ satisfies a chain rule, then (2.7.1) and (2.7.2) are equivalent, but
in general the first inequality (2.7.1) is strictly stronger.

Lemma 2.7.3. The inequality (2.7.1) implies inequality (2.7.2).

See Exercise 2.16. The first inequality (2.7.1) is often simply called the log-Sobolev
inequality, whereas the second inequality (2.7.2) is called a modified log-Sobolev
inequality (MLSI). In many cases, the log-Sobolev inequality is too strong in that it does
not hold with a good constant 𝐶; hence, the modified log-Sobolev inequality is often the
more appropriate inequality for the discrete setting.
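
One route to Lemma 2.7.3 goes through the elementary inequality $4 \, (\sqrt{a} - \sqrt{b})^2 \le (a - b) \, (\ln a - \ln b)$ for $a, b > 0$. The sketch below checks it numerically under the assumption (not spelled out in the text) that the Dirichlet form of a reversible chain on a finite state space is $\mathscr{E}(f, g) = \frac{1}{2} \sum_{x, y} \pi(x) \, P(x, y) \, (f(x) - f(y)) \, (g(x) - g(y))$. Summing the elementary inequality then gives $\mathscr{E}(\sqrt{f}, \sqrt{f}) \le \frac{1}{4} \, \mathscr{E}(f, \ln f)$, so that applying (2.7.1) to $\sqrt{f}$ yields (2.7.2). The chain and the function names are illustrative.

```python
import math

def sqrt_term(a, b):
    return 4.0 * (math.sqrt(a) - math.sqrt(b)) ** 2

def log_term(a, b):
    return (a - b) * (math.log(a) - math.log(b))

def dirichlet(pi, P, f, g):
    """Assumed discrete Dirichlet form: (1/2) sum pi(x) P(x,y) (f(x)-f(y)) (g(x)-g(y))."""
    n = len(pi)
    return 0.5 * sum(pi[x] * P[x][y] * (f[x] - f[y]) * (g[x] - g[y])
                     for x in range(n) for y in range(n))

# Pointwise check of the elementary inequality on a logarithmic grid.
grid = [10.0 ** (k / 4.0) for k in range(-8, 9)]
worst = min(log_term(a, b) - sqrt_term(a, b) for a in grid for b in grid)

# Illustrative reversible chain: resample from pi at every step, P(x, y) = pi(y).
pi = [0.2, 0.3, 0.5]
P = [pi[:] for _ in pi]
f = [0.1, 2.0, 5.0]
energy_sqrt = dirichlet(pi, P, [math.sqrt(v) for v in f], [math.sqrt(v) for v in f])
energy_log = dirichlet(pi, P, f, [math.log(v) for v in f])
print(worst, energy_sqrt, 0.25 * energy_log)
```

The reverse comparison fails in general, which is why the modified inequality is strictly weaker in the absence of a chain rule.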
We have already seen a concentration inequality for discrete spaces in Exercise 2.12.
In general, concentration of measure on discrete spaces is a rich subject, with many
applications to computer science and probability, and at the same time subtle, involving
new ideas such as asymmetric transport inequalities or careful use of hypercontractivity.
See, e.g., [BLM13; Han16] for more detailed treatments.

Discrete time. Similarly, for discrete-time Markov chains we can no longer use semi-
group calculus, although the basic principles of Poincaré inequalities (spectral gap inequal-
ities) and modified log-Sobolev inequalities can be adapted to this setting. In addition,
there are new techniques based on the notion of conductance. As we shall need to study
discrete-time Markov chains in detail for sampling algorithms, we defer a fuller discussion
of this theory to Chapter 7.

Discrete curvature. Inspired by the geometric connections in Section 2.6, many re-
searchers have attempted to define notions of curvature on discrete spaces. We do not
attempt to survey this literature here, but we give a few pointers to the literature.
Ollivier [Oll07; Oll09] introduced the following notion of curvature.

Definition 2.7.4. A metric space $(\mathcal{X}, \mathsf{d})$ equipped with a Markov kernel $P$ is said to
have coarse Ricci curvature bounded below by $\kappa \in [0, 1]$ if for all $x, y \in \mathcal{X}$,

$$W_1\bigl(P(x, \cdot), P(y, \cdot)\bigr) \le (1 - \kappa) \, \mathsf{d}(x, y)\,.$$

In other words, the Markov chain with kernel $P$ is a $W_1$ contraction. The definition is
motivated by the following observation: on a $d$-dimensional Riemannian manifold with
$\mathrm{Ric} \succeq \alpha$, let $P(x, \cdot)$ be the uniform measure on $B(x, \varepsilon)$. Then, provided that $\mathsf{d}(x, y) = O(\varepsilon)$,
it holds that

$$W_1\bigl(P(x, \cdot), P(y, \cdot)\bigr) \le \Bigl(1 - \frac{\alpha \varepsilon^2}{2 \, (d + 2)} + O(\varepsilon^3)\Bigr) \, \mathsf{d}(x, y)\,.$$
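
A standard discrete example is the hypercube $\{0, 1\}^n$ with the Hamming distance. The sketch below is a hypothetical illustration: it takes the kernel that picks a uniformly random coordinate and resamples that bit uniformly, and couples two copies of the chain by using the same coordinate and the same new bit. Since $W_1$ is at most the expected distance under any coupling, the exact computation shows that this kernel has coarse Ricci curvature bounded below by $\kappa = 1/n$.

```python
from fractions import Fraction
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def coupled_expected_distance(x, y):
    """Expected Hamming distance after one step, using the same coordinate and bit for both chains."""
    n = len(x)
    total = Fraction(0)
    for i in range(n):
        for bit in (0, 1):
            x_new = x[:i] + (bit,) + x[i + 1:]
            y_new = y[:i] + (bit,) + y[i + 1:]
            total += Fraction(1, 2 * n) * hamming(x_new, y_new)
    return total

n = 4
kappa = Fraction(1, n)
contracts = all(
    coupled_expected_distance(x, y) == (1 - kappa) * hamming(x, y)
    for x in product((0, 1), repeat=n)
    for y in product((0, 1), repeat=n)
)
print(contracts)
```

The coupling only provides an upper bound on $W_1$, which is all that the definition requires.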

A lower bound on the coarse Ricci curvature is often too strong of an assumption for
the purpose of studying mixing times of Markov chains, although there are refinements
in [Oll07; Oll09]. However, when a lower bound on the coarse Ricci curvature holds,
then it implies a number of useful consequences, such as concentration estimates and
functional inequalities. We mention the following result in particular.

Theorem 2.7.5 ([Oll09]). Suppose that 𝑃 is a Markov kernel on a metric space (X, d),
and that 𝑃 has coarse Ricci curvature bounded below by 𝜅. Then, 𝑃 satisfies a Poincaré
inequality with constant at most 1/𝜅.

Refer to Chapter 7 for a precise definition of the Poincaré inequality used here.
Other approaches for studying the curvature of discrete Markov processes include:
studying the displacement convexity of entropy (using different interpolating curves rather
than $W_2$ geodesics) [OV12; Goz+14; Léo17]; using ideas from Bakry–Émery theory [Kla+16;
FS18]; and defining a modified 𝑊2 distance for which the Markov process becomes a
gradient flow of the KL divergence [Maa11; EM12; Mie13].
We emphasize that although we only described the coarse Ricci curvature approach
in any detail, there is not a single approach which supersedes the others in the discrete
setting. Each approach has its own merits and shortcomings.

Bibliographical Notes
The monographs [BGL14; Han16] are excellent sources to learn more about Markov
semigroup theory.
The Monge–Ampère equation introduced in Exercise 2.1, being a fully non-linear
PDE, is fairly difficult to study. See [Vil03, §4] for an overview of rigorous results on the
Monge–Ampère equation, including the celebrated regularity theory of Caffarelli. The
proofs of Proposition 2.1.1 and Exercise 2.3 are taken from [Cor17].
In the proof of Lemma 2.2.6, we assumed the solvability of the Poisson equation; this
can be avoided via a density argument, see [CFM04; BC13]. The proof of the dimensional
Brascamp–Lieb inequality in Exercise 2.4 is taken from the paper [BGG18]. The bound on
var𝜋 𝑉 obtained in the exercise was used in [Che21b] to show that the entropic barrier is an
optimal self-concordant barrier. Note that the Brascamp–Lieb inequality in Theorem 2.2.8
should not be confused with another family of inequalities, which are unfortunately also
known as Brascamp–Lieb inequalities, described in, e.g., [Vil03, §6.3].
Although the device in the proof of Theorem 2.2.11 of differentiating $s \mapsto P_s((P_{t-s} f)^2)$
may seem mysterious at first glance, it forms the basis for a great number of useful
inequalities. The key is that the chain rule for the carré du champ also implies a chain rule
for the generator: $\mathscr{L}(\phi \circ f) = \phi'(f) \, \mathscr{L} f + \phi''(f) \, \Gamma(f, f)$. Using this, one can differentiate
$s \mapsto P_s \phi(P_{t-s} f)$ for a general function $\phi : \mathbb{R} \to \mathbb{R}$, and obtain the nice identity

$$\partial_s [P_s \phi(P_{t-s} f)] = P_s \bigl[\mathscr{L} \phi(P_{t-s} f) - \phi'(P_{t-s} f) \, \mathscr{L} P_{t-s} f\bigr] = P_s \bigl[\phi''(P_{t-s} f) \, \Gamma(P_{t-s} f, P_{t-s} f)\bigr]\,.$$

The book [BGL14] is a treasure trove of applications of this principle.



The convergence in Rényi divergence of the Langevin diffusion was obtained earlier,
under stronger assumptions in [CLL19]. A natural question to ask is whether there are
functional inequalities that interpolate between the Poincaré and log-Sobolev inequalities,
which imply intermediate rates of convergence for the Langevin diffusion. One answer is
given by the family of Latała–Oleszkiewicz inequalities (LOI) [LO00]. The convergence
of the Langevin diffusion under an LO inequality is given in [Che+21a]. One can also
consider variants of Sobolev inequalities [Cha04].
In [KLS95], Kannan, Lovász, and Simonovits conjectured that any log-concave measure
𝜋 on $\mathbb{R}^d$ which is isotropic (i.e., if $X \sim \pi$ then $\operatorname{cov} X = I_d$) satisfies a Poincaré
inequality with a dimension-free constant $C_{\mathsf{PI}} \lesssim 1$. This is known as the Kannan–
Lovász–Simonovits (KLS) conjecture. By considering linear test functions of the form
$x \mapsto \langle a, x \rangle$, one has $C_{\mathsf{PI}} \ge 1$, so the conjecture asserts that linear functions nearly saturate
the spectral gap inequality for log-concave measures. The KLS conjecture has inspired
a considerable amount of research (including Theorem 2.5.18), see [GM11; Eld13; LV17;
Che21a], culminating in the current state-of-the-art result of [KL22] which asserts that
$C_{\mathsf{PI}} \le \operatorname{polylog}(d)$.
Exercise 2.8 essentially contains the main results of [CCN21] (actually the paper
assumes a slightly weaker condition than (2.E.1), namely that the 𝑝-th moment of the
chi-squared divergence is bounded for some 𝑝 > 1, but this is handled with the same
arguments as in Exercise 2.8).
There are many treatments of concentration of measure, e.g., [Led01; BLM13; BGL14;
Han16; Ver18]. The proof of Lemma 2.4.4 is from [Mil09]. A proof of the characterization
of the T1 inequality in Theorem 2.4.11 can be found in, e.g., [BV05].
The proof of Sanov’s theorem can be found in many textbooks on large deviations,
e.g., [DZ10; RS15].
The monographs [BH97; Led01; BGL14] are excellent sources to learn about isoperime-
try. The exposition of the functional form of Cheeger’s inequality (Theorem 2.5.14) as well
as Milman’s theorem (Theorem 2.5.18) were inspired by the treatment in [AB15]. It would
be hard to survey the various developments on this subject here, but we would like to
mention a few nice additions to the story. First, as we saw in Theorem 2.5.14 and Proposi-
tion 2.5.17, isoperimetric inequalities are typically stronger than their functional inequality
counterparts, and often strictly so. In order to obtain inequalities involving sets which
are equivalent to, say, the Poincaré and log-Sobolev inequalities, one should turn towards
measure capacity inequalities, for which we refer the reader to [BGL14, §8]. Also, more
refined “two-level” isoperimetric inequalities were pioneered by Talagrand in [Tal91],
and these have applications in their own right.
The Gaussian isoperimetric inequality is due to Sudakov and Tsirelson [SC74] and
Borell [Bor75]. It has since been extended and refined in various ways, e.g., in the context
of noise stability [Bor85; IM12; Eld15; MN15; KKO18].


Section 2.6 draws upon many resources on geometry, which we list here: [Car92] for
Riemannian geometry; [Hsu02] for diffusions on manifolds; [Vil09] for optimal trans-
port on manifolds, including synthetic Ricci curvature bounds and the discussion in
Section 2.6.5; [BBI01; Gro07] for metric geometry; and [Led00; BGL14] for the geometry
of Markov semigroups.
Although the bounded differences inequality from Exercise 2.12 is already quite pow-
erful, there are situations in which it does not give the correct answer, in which case we
must turn towards more powerful tools. Among these, we mention Talagrand’s convex
distance inequality [Tal96], which can be established via the tensorization argument
of Theorem 2.3.17 (see [Mar96]).

Exercises
Overview of the Inequalities

Exercise 2.1 (linearization of the Monge–Ampère equation)


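In general, when $\mu, \nu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ have smooth densities and $\nabla\varphi$ denotes the optimal
transport map from $\mu$ to $\nu$, then from the change of variables formula we expect
\[
  \frac{\mu}{\nu \circ \nabla\varphi} = \det \nabla^2 \varphi .
\]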
This is known as the Monge–Ampère equation. It is a non-linear PDE in the variable 𝜑,
which is a convex function (by Brenier’s theorem, see Theorem 1.3.8). In this exercise, we
linearize the Monge–Ampère equation to gain insight into the infinitesimal behavior of
the optimal transport problem.
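Let $\mu$ be a probability measure on $\mathbb{R}^d$ with a smooth density, and let $f \in C_c^\infty(\mathbb{R}^d)$
satisfy $\int f \, d\mu = 0$. Let $\mu_\varepsilon := (1 + \varepsilon f)\,\mu$, and let $\nabla\varphi_\varepsilon$ denote the optimal transport map
from $\mu$ to $\mu_\varepsilon$. Assuming that $\varphi_\varepsilon(x) = \frac{\|x\|^2}{2} + \varepsilon u(x) + o(\varepsilon)$ for some function $u : \mathbb{R}^d \to \mathbb{R}$,
perform an expansion of the Monge–Ampère equation in $\varepsilon$ and argue that $u$ satisfies the
following linear PDE, known as the Poisson equation:
\[
  -\mathcal{L} u = f , \qquad \text{where } \mathcal{L} u := \Delta u - \Bigl\langle \nabla \ln \frac{1}{\mu}, \nabla u \Bigr\rangle .
\]
Note that $\mathcal{L}$ is the generator of the Langevin diffusion with stationary distribution $\mu$
(see Example 1.2.4). Use this to formally argue that
\[
  \lim_{\varepsilon \searrow 0} \frac{1}{\varepsilon^2} \, W_2^2\bigl(\mu, (1 + \varepsilon f)\,\mu\bigr)
  = \int \|\nabla u\|^2 \, d\mu
  = \int u \, (-\mathcal{L}) u \, d\mu
  = \int f \, (-\mathcal{L})^{-1} f \, d\mu .
\]

Here, $\int \|\nabla u\|^2 \, d\mu = \int u \, (-\mathcal{L}) u \, d\mu$ is the squared Sobolev norm $\|u\|_{\dot{H}^1(\mu)}^2$, where the dot
is used to distinguish this from the usual Sobolev norm $\|u\|_{H^1(\mu)}^2 = \|u\|_{L^2(\mu)}^2 + \|u\|_{\dot{H}^1(\mu)}^2$.
Similarly, $\int f \, (-\mathcal{L})^{-1} f \, d\mu$ is the squared inverse Sobolev norm $\|f\|_{\dot{H}^{-1}(\mu)}^2$. Therefore,
the linearization result shows that $W_2^2(\mu, \nu) \sim \|\mu - \nu\|_{\dot{H}^{-1}(\mu)}^2$ as $\nu \to \mu$.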


Using the linearization (1.E.1) of the KL divergence from Exercise 1.8, deduce that the
$\mathrm{T}_2(C)$ inequality implies
\[
  C \int f^2 \, d\mu \ge \int f \, (-\mathcal{L})^{-1} f \, d\mu .
\]

In light of the spectral gap interpretation of the Poincaré inequality, why does the above
inequality suggest that T2 (𝐶) implies PI(𝐶)?
The astute reader should also work out how the Poisson equation can be obtained
starting with the continuity equation (1.3.18).

Proofs via Markov Semigroup Theory

Exercise 2.2 (curvature-dimension condition)


Verify the commutation identity (2.2.4) and deduce the formula (2.2.5) for the iterated
carré du champ operator.

Exercise 2.3 (Bregman transport inequality)


Let $\nabla\varphi$ denote the optimal transport map from $\pi$ to $\mu$, so that the Monge–Ampère equation
holds (see Exercise 2.1):
\[
  \frac{\pi}{\mu \circ \nabla\varphi} = \det \nabla^2 \varphi .
\]
Take logarithms of both sides of this equation and integrate w.r.t. 𝜋 to prove the Bregman
transport inequality (Theorem 2.2.10). Then, by applying Proposition 2.1.1, give another
proof of the Brascamp–Lieb inequality (Theorem 2.2.8).

Exercise 2.4 (dimensional improvement of the Brascamp–Lieb inequality)

In finite-dimensional space, one can improve upon the Brascamp–Lieb inequality
(Theorem 2.2.8) by subtracting a non-negative term from the right-hand side. There are different
ways to do this, but in this exercise we explore an approach which utilizes the extra term
$\|\nabla^2 u\|_{\mathrm{HS}}^2$ in the iterated carré du champ operator.

1. Let $\pi \propto \exp(-V)$, where as before we assume that $V$ is twice continuously
   differentiable and strictly convex. Let $f$ satisfy $\mathbb{E}_\pi f = 0$, and consider another function $u$
   (not necessarily the solution to $-\mathcal{L} u = f$). Show that
   \[
     \mathbb{E}_\pi[f^2] \le \mathbb{E}_\pi[(f + \mathcal{L} u)^2] + \mathbb{E}_\pi \langle \nabla f, (\nabla^2 V)^{-1} \nabla f \rangle - \mathbb{E}_\pi[\|\nabla^2 u\|_{\mathrm{HS}}^2] .
   \]

2. Prove that $\mathbb{E}_\pi[\|\nabla^2 u\|_{\mathrm{HS}}^2] \ge d^{-1} \, (\mathbb{E}_\pi \Delta u)^2$, and that
   \[
     \mathbb{E}_\pi \Delta u = \operatorname{cov}_\pi(f, V) - \mathbb{E}_\pi[(f + \mathcal{L} u) \, V] .
   \]

3. Choose $u$ to solve $-\mathcal{L} u = f + \lambda \, (V - \mathbb{E}_\pi V)$ for some $\lambda \ge 0$ and substitute this into
   the previous parts. Optimize over $\lambda$ and prove that
   \[
     \operatorname{var}_\pi f \le \mathbb{E}_\pi \langle \nabla f, (\nabla^2 V)^{-1} \nabla f \rangle - \frac{\operatorname{cov}_\pi(f, V)^2}{d - \operatorname{var}_\pi V} .
   \]

4. In particular, deduce that
   \[
     \operatorname{var}_\pi V \le \frac{d \, \mathbb{E}_\pi \langle \nabla V, (\nabla^2 V)^{-1} \nabla V \rangle}{d + \mathbb{E}_\pi \langle \nabla V, (\nabla^2 V)^{-1} \nabla V \rangle} \le d .
   \]

Exercise 2.5 (local Poincaré inequality)


Prove the implication (3) =⇒ (1) in Theorem 2.2.11.
Hint: Perform a Taylor expansion of both sides of (3) up to order 𝑜 (𝑡 2 ).

Exercise 2.6 (hypercontractivity)

Let $(P_t)_{t \ge 0}$ be a reversible Markov semigroup with stationary distribution $\pi$, and let $\alpha \ge 0$.
Show that the log-Sobolev inequality in the form
\[
  \operatorname{ent}_\pi(f^2) \le 2 C_{\mathsf{LSI}} \, \mathcal{E}(f, f)
\]
for all $f$ is equivalent to the following hypercontractivity statement: for all functions $f$,
$t \ge 0$, and $p \ge 1$, if we set $p(t) := 1 + (p - 1) \exp(2t/C_{\mathsf{LSI}})$, then
\[
  \|P_t f\|_{L^{p(t)}(\pi)} \le \|f\|_{L^p(\pi)} .
\]
This is a strengthening of the fact that the semigroup is a contraction on any $L^p(\pi)$ space
and shows that in fact the semigroup maps $L^p(\pi)$ into $L^{p'}(\pi)$ for some $p' > p$.
Hint: Differentiate $t \mapsto \ln \|P_t f\|_{L^{p(t)}(\pi)}$.

Operations Preserving Functional Inequalities

Exercise 2.7 (transport inequality in one dimension)

Let $\pi$ be the standard Gaussian on $\mathbb{R}$, and let $\mu \ll \pi$. In one dimension, the optimal
transport map $T$ from $\pi$ to $\mu$ is the monotone rearrangement that satisfies, for each $x \in \mathbb{R}$,
$\mu((-\infty, T(x)]) = \pi((-\infty, x])$.

1. Differentiate this relation to obtain a formula for $\frac{d\mu}{d\pi}(T(x))$.

2. Substitute this into the KL divergence $\mathrm{KL}(\mu \,\|\, \pi) = \int \ln \frac{d\mu}{d\pi}(T(x)) \, d\pi(x)$ and use the
   inequality $t - 1 - \ln t \ge 0$ for all $t > 0$ in order to prove the Gaussian $\mathrm{T}_2$ inequality
   in one dimension. Deduce the Gaussian $\mathrm{T}_2$ inequality in general dimension via a
   tensorization argument.

3. Can you generalize this calculation to a density $\pi \propto \exp(-V)$ on $\mathbb{R}^d$, where $V$ is
   smooth and $\alpha$-strongly convex for some $\alpha > 0$?
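
As a sanity check of what the exercise asks for, the following sketch verifies the one-dimensional Gaussian T2 inequality, W_2^2 ≤ 2 KL, in the special case where μ is itself Gaussian, so that the monotone map is affine and both sides have closed forms. The function names and the grid are illustrative choices, not part of the text, and the restriction to Gaussian μ is an assumption made only to keep the computation explicit.

```python
import math


def kl_gauss_vs_std(m: float, s: float) -> float:
    """KL( normal(m, s^2) || normal(0, 1) ) in closed form, for s > 0."""
    return 0.5 * (s * s + m * m - 1.0) - math.log(s)


def w2_sq_gauss_vs_std(m: float, s: float) -> float:
    """W_2^2 between normal(m, s^2) and normal(0, 1).

    The monotone rearrangement from normal(0, 1) is T(x) = m + s * x,
    so W_2^2 = E[(T(X) - X)^2] = m^2 + (s - 1)^2.
    """
    return m * m + (s - 1.0) ** 2


def t2_gap(m: float, s: float) -> float:
    """2 * KL - W_2^2, which the T2 inequality with constant 1 says is >= 0."""
    return 2.0 * kl_gauss_vs_std(m, s) - w2_sq_gauss_vs_std(m, s)


# Means in [-2, 2] and standard deviations in [0.25, 4].
grid = [(m / 4.0, s / 4.0) for m in range(-8, 9) for s in range(1, 17)]
worst_gap = min(t2_gap(m, s) for m, s in grid)
print(f"smallest value of 2*KL - W_2^2 on the grid: {worst_gap:.3e}")
```

In this special case the gap simplifies to 2 (s − 1 − ln s), which is exactly the elementary inequality suggested in part 2.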

Exercise 2.8 (generalizing the LSI for mixtures)


In this exercise, we generalize the log-Sobolev inequality for mixtures (Proposition 2.3.14).
1. First, show that Example 2.3.15 is sharp up to universal constants as follows.
   Consider the case when $\mu = \frac{1}{2} (\delta_{-R} + \delta_{+R})$ on $\mathbb{R}$, so that $\mu P$ is a mixture of two Gaussians.
   Construct a test function $f : \mathbb{R} \to \mathbb{R}$ for the Poincaré inequality which shows that
   $C_{\mathsf{PI}}(\mu P) \gtrsim R^2 \exp(\Omega(R^2/\sigma^2))$ if $R/\sigma \gtrsim 1$.

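2. Next, consider the setting of Proposition 2.3.8 except that we replace the
   assumption (2.3.9) with the weaker condition
   \[
     C_{\chi^2,2} := \sqrt{\mathbb{E}[\chi^2(P_X \,\|\, P_{X'})^2]} < \infty , \tag{2.E.1}
   \]
   where $X, X' \overset{\text{i.i.d.}}{\sim} \mu$. Now, rather than writing $\operatorname{var} \mathbb{E}_{P_X} f = \frac{1}{2} \, \mathbb{E}[|\mathbb{E}_{P_X} f - \mathbb{E}_{P_{X'}} f|^2]$,
   instead write $\operatorname{var} \mathbb{E}_{P_X} f = \mathbb{E}[|\mathbb{E}_{P_X} f - \mathbb{E}_{\mu P} f|^2]$. By bounding this quantity in two
   different ways deduce that
   \begin{align*}
     \operatorname{var} \mathbb{E}_{P_X} f
     &\le \mathbb{E} \min\bigl\{ (\operatorname{var}_{P_X} f) \, \chi^2(\mu P \,\|\, P_X), \; (\operatorname{var}_{\mu P} f) \, \chi^2(P_X \,\|\, \mu P) \bigr\} \\
     &\le \mathbb{E} \sqrt{ (\operatorname{var}_{P_X} f) \, \chi^2(\mu P \,\|\, P_X) \, (\operatorname{var}_{\mu P} f) \, \chi^2(P_X \,\|\, \mu P) } .
   \end{align*}
   Use this to prove that a Poincaré inequality holds for $\mu P$, and give an upper bound
   on $C_{\mathsf{PI}}(\mu P)$.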

3. Now consider the setting of Proposition 2.3.14 except that we again assume the
   weaker condition (2.E.1). Previously, we bounded
   \[
     \mathbb{E}\bigl[ \mathbb{E}_{P_X}(f^2) \ln\bigl(1 + \chi^2(P_X \,\|\, P_{X'})\bigr) \bigr] \le \mathbb{E}_{\mu P}(f^2) \ln(1 + C_{\chi^2}) ,
   \]
   which relies on $L^1$–$L^\infty$ duality. This time, we want to use duality between “$L \log L$”
   and “$\exp L$”. Namely, use the variational principle for the entropy (Lemma 2.3.4) to
   prove that for a suitable constant $C > 0$ (depending on $C_{\chi^2,2}$),
   \[
     2 \, \mathbb{E}\bigl[ \mathbb{E}_{P_X}(f^2) \, \bigl\{ \ln\bigl(1 + \chi^2(P_X \,\|\, P_{X'})\bigr) - C \bigr\} \bigr] \le \operatorname{ent}\bigl( \mathbb{E}_{P_X}(f^2) \bigr) .
   \]
   Use this to prove that a log-Sobolev inequality holds for $\mu P$, and give an upper
   bound on $C_{\mathsf{LSI}}(\mu P)$.

4. Consider Example 2.3.15 again, except instead of assuming that $\mu$ is supported on
   $B(0, R)$, we assume that $\mu$ has sub-Gaussian tails:
   \[
     \iint \exp\Bigl( \frac{\|x - x'\|^2}{\sigma_{\mathrm{sG}}^2} \Bigr) \, d\mu(x) \, d\mu(x') \le C_{\mathrm{sG}} .
   \]
   Prove that if $\sigma \gtrsim \sigma_{\mathrm{sG}}$ for a sufficiently large implied constant, then the Gaussian
   mixture $\mu P$ satisfies a log-Sobolev inequality, and give an upper bound on $C_{\mathsf{LSI}}(\mu P)$.
   Also, show how this can recover the result of Example 2.3.15.

Concentration of Measure

Exercise 2.9 (Herbst argument)


1. Verify the calculus identity (2.4.7) in the Herbst argument.

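2. Suppose that $X$ is a real-valued random variable satisfying the following condition:
   for all $\lambda \ge 0$, it holds that
   \[
     \operatorname{var} \exp(\lambda X/2) \le \frac{\lambda^2 \sigma^2}{4} \, \mathbb{E} \exp(\lambda X) .
   \]
   Let $\eta(\lambda) := \mathbb{E} \exp(\lambda X)$ and deduce an inequality for $\eta(\lambda)$ in terms of $\eta(\lambda/2)$. Solve
   this recursion to prove that for $\lambda < 2/\sigma$,
   \[
     \mathbb{E} \exp\{\lambda \, (X - \mathbb{E} X)\} \le \frac{2 + \lambda\sigma}{2 - \lambda\sigma} .
   \]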

3. Prove the Poincaré case of Theorem 2.4.8.

Exercise 2.10 (Hoeffding’s lemma and Pinsker’s inequality)


1. Hoeffding’s lemma states that for any mean-zero random variable $X$ with values
   in $[a, b]$ a.s., it holds that $X$ is $(b - a)^2/4$-sub-Gaussian. Prove this lemma as follows.
   For $\lambda \in \mathbb{R}$, let $\psi(\lambda) := \ln \mathbb{E} \exp(\lambda X)$. Differentiate $\psi$ twice and show that $\psi''(\lambda)$ can
   be interpreted as the variance of a random variable under a change of measure and
   hence $\psi''(\lambda) \le (b - a)^2/4$.

2. Pinsker’s inequality states that for any two probability measures $\mu$ and $\nu$ on the
   same space, $\|\mu - \nu\|_{\mathrm{TV}}^2 \le \frac{1}{2} \, \mathrm{KL}(\mu \,\|\, \nu)$. Prove this inequality as follows. First, by the
   data-processing inequality (Theorem 1.5.3), for any event $A$,
   \[
     \mathrm{KL}(\mu \,\|\, \nu) \ge \mathrm{KL}\bigl( (\mathbb{1}_A)_\# \mu \,\big\|\, (\mathbb{1}_A)_\# \nu \bigr) = \mathrm{KL}\bigl( \operatorname{Bernoulli}(\mu(A)) \,\big\|\, \operatorname{Bernoulli}(\nu(A)) \bigr) .
   \]
   Next, for any $q \in (0, 1)$, differentiate $p \mapsto k_q(p) := \mathrm{KL}(\operatorname{Bernoulli}(p) \,\|\, \operatorname{Bernoulli}(q))$
   twice to show that $k_q$ is $4$-strongly convex, and deduce that $k_q(p) \ge 2 \, |p - q|^2$.
   Finally, take the supremum over events $A$.

3. Apply the Bobkov–Götze theorem (Theorem 2.4.10) to show that Hoeffding’s lemma
and Pinsker’s inequality are equivalent to each other.
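
The Bernoulli reduction in part 2 is easy to probe numerically. The sketch below evaluates k_q(p) − 2 |p − q|^2 on a grid and reports its minimum, which should be non-negative if the claimed 4-strong convexity holds. The helper names and the grid resolution are illustrative assumptions, and a grid check is of course not a proof.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL( Bernoulli(p) || Bernoulli(q) ) for p in [0, 1] and q in (0, 1)."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def pinsker_gap(p: float, q: float) -> float:
    """k_q(p) - 2 (p - q)^2, which Pinsker's inequality says is non-negative."""
    return kl_bernoulli(p, q) - 2.0 * (p - q) ** 2


points = [i / 100.0 for i in range(1, 100)]
min_gap = min(pinsker_gap(p, q) for p in points for q in points)
print(f"minimum of k_q(p) - 2|p - q|^2 on the grid: {min_gap:.3e}")
```

The minimum is attained on the diagonal p = q, where both sides vanish; near p = q = 1/2 the gap is positive but very small, reflecting that the constant 2 cannot be improved.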

Exercise 2.11 (inequivalence between PI and T1)


In this exercise, we show that the Poincaré inequality and the T1 inequality are incompa-
rable, i.e., one does not necessarily imply the other.

1. Use Theorem 2.4.11 to provide an example of a measure 𝜋 ∈ P1 (R𝑑 ) which satisfies


a T1 inequality but which does not satisfy a Poincaré inequality.
Hint: Explain why a Poincaré inequality necessarily requires the support of the
measure to be connected.

2. For the converse direction, let $\mu$ be the exponential distribution on $\mathbb{R}$, so that the
   density is $\mu(x) = \exp(-x) \, \mathbb{1}\{x > 0\}$. Let $f : \mathbb{R}_+ \to \mathbb{R}$; we may assume that $f(0) = 0$.
   Now apply the identity $f(x)^2 = 2 \int_0^x f(s) \, f'(s) \, ds$ to the integral $\int f^2 \, d\mu$ and prove
   that $\mu$ satisfies $\mathrm{PI}(4)$. Explain why $\mu$ cannot satisfy a $\mathrm{T}_1$ inequality.



Exercise 2.12 (bounded differences inequality)


1. Prove the Azuma–Hoeffding inequality: let $(\mathcal{F}_i)_{i=0}^n$ be a filtration, let $(\Delta_i)_{i=1}^n$ be
   a martingale difference sequence (that is, $\Delta_i$ is $\mathcal{F}_i$-measurable and $\mathbb{E}[\Delta_i \mid \mathcal{F}_{i-1}] = 0$),
   and assume that for each $i$ there exist $\mathcal{F}_{i-1}$-measurable random variables $A_i$ and $B_i$
   such that $A_i \le \Delta_i \le B_i$ a.s. Then, $\sum_{i=1}^n \Delta_i$ is $\sum_{i=1}^n \|B_i - A_i\|_{L^\infty(\mathbb{P})}^2 / 4$-sub-Gaussian.
   Hint: Apply Hoeffding’s lemma from Exercise 2.10 conditionally.

2. Use this to prove the bounded differences inequality: if $X_1, \dots, X_n$ are
   independent, then $f(X_1, \dots, X_n) - \mathbb{E} f(X_1, \dots, X_n)$ is $\sum_{i=1}^n \|D_i f\|_{\sup}^2 / 4$-sub-Gaussian.
   Hint: Recall the proof of the Efron–Stein inequality from Exercise 1.2.
3. Next, apply Marton’s tensorization (Theorem 2.3.17) to Pinsker’s inequality from Ex-
ercise 2.10 (see Example 2.3.18) to obtain a transport inequality for the product
space X𝑁 . Using the Bobkov–Götze equivalence (Theorem 2.4.10), give a second
proof of the bounded differences inequality.

Exercise 2.13 (a loose end in Gozlan’s theorem)


Prove the first statement of Lemma 2.4.13.

Isoperimetric Inequalities

Exercise 2.14 (isoperimetry on the sphere)

Prove Theorem 2.5.2 from the spherical isoperimetric inequality (Theorem 2.5.1). To do so,
use the fact that the measure of $B(x_0, r)$ is
\[
  \sigma_d\bigl(B(x_0, r)\bigr) = \frac{\int_0^r (\sin\theta)^{d-1} \, d\theta}{\int_0^\pi (\sin\theta)^{d-1} \, d\theta} ,
\]
or prove this fact yourself. It is also acceptable to establish a weaker bound of the form
$\alpha_{\sigma_d}(\varepsilon) \le C \exp(-c d \varepsilon^2)$ for universal constants $c, C > 0$.

Exercise 2.15 (Gaussian isoperimetry)


1. In the spirit of Theorem 2.5.14, show that the functional inequality (2.5.28) is equiv-
alent to an isoperimetric statement. Consequently, deduce the comparison theorem
(Theorem 2.5.29) from Theorem 2.5.27.
2. Show that the functional form of the Gaussian isoperimetric inequality in (2.5.28) is
preserved (up to constants) under Lipschitz mappings (in other words, prove the
analogue of Proposition 2.3.3 for (2.5.28)).

Discrete Space and Time

Exercise 2.16 (LSI implies MLSI)

Prove Lemma 2.7.3.
Hint: Prove that $4 \, (\sqrt{a} - \sqrt{b})^2 = \bigl( \int_b^a t^{-1/2} \, dt \bigr)^2 \le (\ln a - \ln b) \, (a - b)$ for all $a, b > 0$.
CHAPTER 3

Additional Topics in Stochastic Calculus

In this chapter, we further expand our toolbox of stochastic analysis. Namely, we introduce
Girsanov’s theorem, which furnishes a formula for the Radon–Nikodym derivative of
the laws of two SDEs w.r.t. each other, and we discuss the time reversal of an SDE. In
order to highlight the flexibility and power of these ideas, we then study some interesting
applications, not all of which are directly relevant to log-concave sampling but nevertheless
fit within the broader themes of this book.

3.1 Quadratic Variation


We now take a more general view of the ideas that led to the construction of the Itô
integral as well as Itô’s formula (Theorem 1.1.18).

Finite variation vs. quadratic variation. As a first step towards understanding
the difficulties we faced when constructing the Itô integral, we recall that the classical
condition under which it is possible to integrate a continuous process $(\eta_t)_{t \in [0,T]}$
against another continuous process $(A_t)_{t \in [0,T]}$, i.e., when we can consider the integral
$\int_{[0,T]} \eta_t \, dA_t$, is when the process $A$ is of finite variation. This means that for any partition
$0 = t_0 < t_1 < \dots < t_n = T$ of $[0,T]$, if we define the mesh of the partition to be
\[
  \operatorname{mesh}(t_i : i = 0, 1, \dots, n) := \max_{i \in [n]} |t_i - t_{i-1}| ,
\]
then it holds that
\[
  \lim_{\operatorname{mesh}(t_i : i = 0, 1, \dots, n) \searrow 0} \; \sum_{i=1}^n |A_{t_i} - A_{t_{i-1}}| < \infty .
\]

The above limit is called the total variation of $A$ on $[0,T]$. Under this condition, there is
a signed measure $\mu_A$ such that for all $t \in [0,T]$, we have $\mu_A([0, t]) = A_t - A_0$. Moreover,
we can define a norm $\|\cdot\|_{\mathrm{TV}}$ on the space of signed measures, called the total variation
norm, for which $\|\mu_A\|_{\mathrm{TV}}$ equals the total variation of $A$ as defined above.¹ In this case,
we can simply define the integral $\int_{[0,T]} \eta_t \, dA_t := \int_{[0,T]} \eta_t \, d\mu_A(t)$.
Note that if $t \mapsto A_t$ is differentiable, then the total variation of $A$ equals $\int_{[0,T]} |\dot{A}_t| \, dt$,
and the integral becomes $\int_{[0,T]} \eta_t \, dA_t = \int_{[0,T]} \eta_t \, \dot{A}_t \, dt$.

Hence, the condition that $A$ is of finite variation is enough to develop a satisfactory
theory of integration. The drawback, however, is that Brownian motion is not of finite
variation. To see this, take $t_i := iT/n$ for $i = 0, 1, \dots, n$, so that the mesh of the partition is
$T/n$. Since $B_{t_i} - B_{t_{i-1}} \sim \operatorname{normal}(0, T/n)$, we expect (heuristically) that
\[
  \lim_{n \to \infty} \sum_{i=1}^n \underbrace{|B_{t_i} - B_{t_{i-1}}|}_{\asymp \sqrt{T/n}} \gtrsim \lim_{n \to \infty} n \cdot \sqrt{\frac{T}{n}} = \infty .
\]
On the other hand, if we change the definition slightly, then we expect (heuristically) that
\[
  \lim_{n \to \infty} \sum_{i=1}^n \underbrace{|B_{t_i} - B_{t_{i-1}}|^2}_{\asymp T/n} \lesssim \lim_{n \to \infty} n \cdot \frac{T}{n} < \infty .
\]
We say that Brownian motion has finite quadratic variation. We will show in fact that
the above limit is well-defined almost surely.
More generally, for a process of the form
\[
  X_t = X_0 + \int_0^t b_s \, ds + \int_0^t \sigma_s \, dB_s , \qquad t \in [0,T] ,
\]
the second term is a process of finite variation (provided that $\int_{[0,T]} |b_t| \, dt < \infty$ almost
surely), whereas the third term requires consideration of quadratic variation.


¹ Indeed, the notation $\|\mu - \nu\|_{\mathrm{TV}}$ for the total variation distance between $\mu$ and $\nu$ is in accordance with
this more general notion of a norm on the space of signed measures, up to a factor of 2 in the conventions.

Definition of the quadratic variation. More formally, we have the following theorem.

Theorem 3.1.1 (quadratic variation). Let $(M_t)_{t \in [0,T]}$ be a continuous local martingale;
then, there is an a.s. unique increasing process $t \mapsto \langle M, M \rangle_t$ such that $t \mapsto M_t^2 - \langle M, M \rangle_t$
is a continuous local martingale. Also, suppose that for each $n \in \mathbb{N}_+$, $(t_i : i = 0, 1, \dots, n)$
is a partition of $[0, t]$, with mesh tending to zero as $n \to \infty$. Then,
\[
  \langle M, M \rangle_t = \lim_{n \to \infty} \sum_{i=1}^n (M_{t_i} - M_{t_{i-1}})^2 \qquad \text{in probability} .
\]

Definition 3.1.2 (quadratic variation). The process h𝑀, 𝑀i of Theorem 3.1.1 is called
the quadratic variation of 𝑀.

We will not prove Theorem 3.1.1 in full generality. However, we will verify that the
quadratic variation of one-dimensional Brownian motion $(B_t)_{t \in [0,T]}$ is $\langle B, B \rangle_T = T$, which
gives an idea of the general result. By independence of the Brownian increments,
\begin{align*}
  \mathbb{E}\Bigl[ \Bigl| \sum_{i=1}^n \{ (B_{t_i} - B_{t_{i-1}})^2 - (t_i - t_{i-1}) \} \Bigr|^2 \Bigr]
  &= \sum_{i=1}^n \mathbb{E}[ |(B_{t_i} - B_{t_{i-1}})^2 - (t_i - t_{i-1})|^2 ] \\
  &\le \sum_{i=1}^n \mathbb{E}[ (B_{t_i} - B_{t_{i-1}})^4 ] = 3 \sum_{i=1}^n (t_i - t_{i-1})^2 \\
  &\le 3 \operatorname{mesh}(t_i : i = 0, 1, \dots, n) \underbrace{\sum_{i=1}^n (t_i - t_{i-1})}_{= T} \to 0 .
\end{align*}
Hence, $\sum_{i=1}^n (B_{t_i} - B_{t_{i-1}})^2 \to T$ in probability as $n \to \infty$. We also know that $t \mapsto B_t^2 - t$ is a martingale
(see, e.g., Exercise 1.6).
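
The contrast between first-order and quadratic variation can be observed in simulation. The following sketch samples Brownian increments on the uniform partition t_i = iT/n and returns both sums; the function name, the seed, and the choice of partition sizes are illustrative assumptions, and Python's standard `random` module is used for the Gaussian draws.

```python
import math
import random


def brownian_variations(T: float, n: int, seed: int = 0) -> tuple[float, float]:
    """Return (sum |dB_i|, sum dB_i^2) over the uniform partition t_i = i*T/n."""
    rng = random.Random(seed)
    step_std = math.sqrt(T / n)  # each increment has law normal(0, T/n)
    first_order = 0.0
    second_order = 0.0
    for _ in range(n):
        increment = rng.gauss(0.0, step_std)
        first_order += abs(increment)
        second_order += increment * increment
    return first_order, second_order


T = 1.0
for n in (100, 10_000, 100_000):
    tv, qv = brownian_variations(T, n)
    print(f"n = {n:>7}: sum |dB| = {tv:8.2f}, sum dB^2 = {qv:.4f}")
```

As n grows, the first column should grow like the square root of nT while the second should stabilize near T, in line with the heuristic above and with the variance bound 3 mesh T just derived.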

Semimartingales. We often consider solutions to SDEs with a non-zero drift coefficient,
which means that the resulting processes are not continuous local martingales. To
accommodate this addition, we consider the following definition.

Definition 3.1.3 (semimartingale). A process $(X_t)_{t \in [0,T]}$ is a continuous semimartingale
if we can write $X = A + M$, where $A$ is a process of finite variation with
$A_0 = 0$ and $M$ is a continuous local martingale.

The decomposition $X = A + M$ is then unique. Indeed, suppose that $X = \tilde{A} + \tilde{M}$ for
another finite variation process $\tilde{A}$ (with $\tilde{A}_0 = 0$) and a continuous local martingale $\tilde{M}$.
Then, from $\Delta := M - \tilde{M} = \tilde{A} - A$, we deduce that $\Delta$ is both a continuous local martingale
and a process of finite variation. Since $\Delta$ is continuous and of finite variation,
\[
  \sum_{i=1}^n (\Delta_{t_i} - \Delta_{t_{i-1}})^2 \le \max_{i \in [n]} |\Delta_{t_i} - \Delta_{t_{i-1}}| \; \underbrace{\sum_{i=1}^n |\Delta_{t_i} - \Delta_{t_{i-1}}|}_{\text{bounded as } n \to \infty} \to 0
\]
as the mesh size tends to zero, where the maximum vanishes by the uniform continuity of $\Delta$
on $[0,T]$. This shows that $\langle \Delta, \Delta \rangle = 0$, and thus $\Delta^2$ is a continuous
local martingale. If we knew that $\Delta^2$ were a genuine martingale, then together with $\Delta_0 = 0$
it would imply that $\Delta = 0$, establishing uniqueness of the semimartingale decomposition.
We omit the localization argument required to finish the proof.
We can also define the quadratic variation of the semimartingale $X$ as $\langle X, X \rangle := \langle M, M \rangle$.
To see why this makes sense, observe that
\begin{align*}
  \Bigl| \sum_{i=1}^n (X_{t_i} - X_{t_{i-1}})^2 - \sum_{i=1}^n (M_{t_i} - M_{t_{i-1}})^2 \Bigr|
  &= \Bigl| \sum_{i=1}^n (A_{t_i} - A_{t_{i-1}})^2 + 2 \sum_{i=1}^n (A_{t_i} - A_{t_{i-1}}) \, (M_{t_i} - M_{t_{i-1}}) \Bigr| \\
  &\le \sum_{i=1}^n (A_{t_i} - A_{t_{i-1}})^2 + 2 \sqrt{ \Bigl( \sum_{i=1}^n (A_{t_i} - A_{t_{i-1}})^2 \Bigr) \Bigl( \sum_{i=1}^n (M_{t_i} - M_{t_{i-1}})^2 \Bigr) } ,
\end{align*}
which tends to zero using the same argument as in the uniqueness of the semimartingale
decomposition: finite variation processes have zero quadratic variation.

The bracket of two semimartingales. Given two semimartingales $X$ and $Y$, we define
their bracket via polarization:
\[
  \langle X, Y \rangle := \frac{1}{2} \, \bigl( \langle X + Y, X + Y \rangle - \langle X, X \rangle - \langle Y, Y \rangle \bigr) .
\]
Equivalently, if $X = A^X + M^X$ and $Y = A^Y + M^Y$ are the respective decompositions, then
$\langle X, Y \rangle = \langle M^X, M^Y \rangle$. The following theorem gives a concrete way of computing the bracket
for processes driven by Brownian motion.

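Theorem 3.1.4 (bracket of processes driven by Brownian motion). Suppose that $X$
and $Y$ are $\mathbb{R}^d$-valued processes with
\begin{align*}
  dX_t &= b_t^X \, dt + \sigma_t^X \, dB_t , \\
  dY_t &= b_t^Y \, dt + \sigma_t^Y \, dB_t ,
\end{align*}
where we assume $\int_{[0,T]} \|b_t^X\| \, dt$, $\int_{[0,T]} \|b_t^Y\| \, dt$, $\int_{[0,T]} \|\sigma_t^X\|_{\mathrm{HS}}^2 \, dt$, and $\int_{[0,T]} \|\sigma_t^Y\|_{\mathrm{HS}}^2 \, dt$ are
all finite almost surely. Then, $X$ and $Y$ are continuous semimartingales, and
\[
  \langle X, Y \rangle_t = \int_0^t \langle \sigma_s^X, \sigma_s^Y \rangle \, ds , \qquad \text{for } t \in [0,T] .
\]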

Itô’s formula revisited. Finally, we conclude this section by revisiting Itô’s formula
(Theorem 1.1.18) using our new calculus.

Theorem 3.1.5 (Itô’s formula revisited). Let $X$ be an $\mathbb{R}^d$-valued semimartingale, and
write $X = (X^1, \dots, X^d)$. Let $f \in C^2(\mathbb{R}^d)$. Then, $f(X)$ is also a semimartingale, and
\[
  f(X_t) = f(X_0) + \sum_{i=1}^d \int_0^t \partial_i f(X_s) \, dX_s^i + \frac{1}{2} \sum_{i,j=1}^d \int_0^t \partial_{i,j} f(X_s) \, d\langle X^i, X^j \rangle_s .
\]

If we interpret $\langle X, X \rangle$ as the matrix whose $(i, j)$-entry is $\langle X^i, X^j \rangle$, then this can be
written in matrix notation as
\[
  f(X_t) = f(X_0) + \int_0^t \langle \nabla f(X_s), dX_s \rangle + \frac{1}{2} \int_0^t \langle \nabla^2 f(X_s), d\langle X, X \rangle_s \rangle .
\]
For 𝑑-dimensional standard Brownian motion (𝐵𝑡 )𝑡 ∈[0,𝑇 ] , we have h𝐵, 𝐵i𝑡 = 𝑡𝐼𝑑 , so that
dh𝐵, 𝐵i𝑡 = 𝐼𝑑 d𝑡 and we recover the original statement of Itô’s formula in Theorem 1.1.18.
The point is that the quadratic variation is a convenient way of streamlining Itô calculations,
as it formalizes the idea that only the Brownian motion part of a process contributes in
the second-order term in Itô’s formula.
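
A discrete analogue of Itô’s formula for f(x) = x^2 in one dimension makes the role of the quadratic variation concrete. The sketch below (function name and seed are illustrative assumptions) accumulates the left-endpoint sum of 2 B dB together with the sum of squared increments along a simulated Brownian path; by the telescoping identity b_i^2 − b_{i−1}^2 = 2 b_{i−1} (b_i − b_{i−1}) + (b_i − b_{i−1})^2, the two sums add up to B_T^2 exactly, and the second one approximates ⟨B, B⟩_T = T.

```python
import math
import random


def ito_square_check(T: float, n: int, seed: int = 0) -> tuple[float, float, float]:
    """Return (B_T^2, sum of 2*B_{t_{i-1}}*dB_i, sum of dB_i^2) on a uniform grid."""
    rng = random.Random(seed)
    step_std = math.sqrt(T / n)
    b = 0.0                    # current value of the Brownian path, B_0 = 0
    stochastic_integral = 0.0  # left-endpoint (Ito) Riemann sum of 2 B dB
    quadratic_variation = 0.0  # discrete approximation of <B, B>_T
    for _ in range(n):
        db = rng.gauss(0.0, step_std)
        stochastic_integral += 2.0 * b * db
        quadratic_variation += db * db
        b += db
    return b * b, stochastic_integral, quadratic_variation


b_sq, ito_sum, qv = ito_square_check(T=1.0, n=10_000)
print(f"B_T^2 = {b_sq:.6f}, Ito sum + QV = {ito_sum + qv:.6f}, QV = {qv:.4f}")
```

Evaluating the integrand at the left endpoint is what distinguishes the Itô sum here; a midpoint or right-endpoint rule would shift the second-order term.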

Bibliographical Notes
The introduction to quadratic variation in Section 3.1 is heavily inspired by the treatment
in [Le 16]. The study of functions of bounded variation and the relationship with absolute
continuity and total variation can be found in any standard graduate text on real analysis,
e.g., [Fol99]. See [Ste01, Proposition 8.6] for a simple case of Theorem 3.1.5.

Exercises
Part II

Complexity of Sampling

CHAPTER 4

Analysis of Langevin Monte Carlo

In this chapter, we will provide several analyses of the Langevin Monte Carlo (LMC)
algorithm, i.e., the iteration
\[
  X_{(k+1)h} := X_{kh} - h \, \nabla V(X_{kh}) + \sqrt{2} \, (B_{(k+1)h} - B_{kh}) . \tag{LMC}
\]
This is known as the Euler–Maruyama discretization of the Langevin diffusion.
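
For concreteness, here is a minimal sketch of the LMC iteration in plain Python. The function names, the list-based vector representation, and the test target (a standard Gaussian, V(x) = ‖x‖^2/2, so that ∇V(x) = x) are illustrative assumptions rather than anything prescribed by the text; the only modelling fact used is that √2 (B_{(k+1)h} − B_{kh}) has law normal(0, 2h I_d).

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def lmc_chain(grad_v: Callable[[Sequence[float]], Vector], x0: Sequence[float],
              h: float, n_steps: int, rng: random.Random) -> Vector:
    """Run n_steps of X <- X - h * grad V(X) + sqrt(2h) * xi with xi ~ normal(0, I)."""
    x = list(x0)
    noise_scale = math.sqrt(2.0 * h)  # standard deviation of sqrt(2) * Brownian increment
    for _ in range(n_steps):
        g = grad_v(x)
        x = [xi - h * gi + noise_scale * rng.gauss(0.0, 1.0) for xi, gi in zip(x, g)]
    return x


def grad_standard_gaussian(x: Sequence[float]) -> Vector:
    """Gradient of the hypothetical test potential V(x) = ||x||^2 / 2."""
    return list(x)


rng = random.Random(0)
d, h, n_steps, n_chains = 2, 0.1, 100, 2000
finals = [lmc_chain(grad_standard_gaussian, [0.0] * d, h, n_steps, rng)
          for _ in range(n_chains)]
mean_sq_norm = sum(sum(c * c for c in x) for x in finals) / n_chains
print(f"E||X||^2 after {n_steps} steps: {mean_sq_norm:.3f} (the target has value d = {d})")
```

Running independent chains from the same initialization, as done here, gives independent draws from the law of the final iterate; the empirical second moment should sit slightly above d, because the iteration with a fixed step size does not target π exactly.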


Although LMC does not achieve state-of-the-art complexity bounds, it is one of the
most fundamental sampling algorithms. Through the quantitative convergence analysis
of LMC, we develop techniques for discretization analysis that are broadly useful for
studying more complex algorithms.

4.1 Proof via Wasserstein Coupling


Perhaps the most straightforward analysis of LMC is based on coupling together the
discrete-time algorithm with the continuous-time diffusion, and using this coupling
to bound the discretization error in Wasserstein distance. The underlying continuous-
time result we use here is the fact that strong log-concavity implies contraction in the
Wasserstein metric for the Langevin diffusion. On one hand, this proof is robust and can
be applied to more complicated processes; on the other hand, its reliance on contractivity
means it is not applicable under weaker assumptions such as an LSI.


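Theorem 4.1.1. For $k \in \mathbb{N}$, let $\mu_{kh}$ denote the law of the $k$-th iterate of LMC with step size
$h > 0$. Assume that the target $\pi \propto \exp(-V)$ satisfies $\nabla V(0) = 0$ and $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$.
Then, provided $h \le \frac{1}{3\beta}$, for all $N \in \mathbb{N}$,
\[
  W_2^2(\mu_{Nh}, \pi) \le \exp(-\alpha N h) \, W_2^2(\mu_0, \pi) + O\Bigl( \frac{\beta^4 d h^2}{\alpha^3} + \frac{\beta^2 d h}{\alpha^2} \Bigr) . \tag{4.1.2}
\]
In particular, if we initialize at $\mu_0 = \delta_0$ and take $h \asymp \frac{\alpha \varepsilon^2}{\beta^2 d}$, then for any $\varepsilon \in [0, \sqrt{d}]$ we
obtain the guarantee $\sqrt{\alpha} \, W_2(\mu_{Nh}, \pi) \le \varepsilon$ after
\[
  N = O\Bigl( \frac{\kappa^2 d}{\varepsilon^2} \log \frac{\sqrt{d}}{\varepsilon} \Bigr) \quad \text{iterations} .
\]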

Remark 4.1.3. We pause to make a few comments about the assumptions and result.
1. Typically we assume ∇𝑉 (0) = 0 without loss of generality, i.e., that the potential 𝑉
is minimized at 0. This is because the complexity of finding the minimizer of 𝑉 via
convex optimization is typically less than the complexity of sampling.

2. It is convenient to use the metric $\sqrt{\alpha} \, W_2$ instead of $W_2$ because it is scale-invariant.
   Namely, for $\lambda > 0$, if we define the scaling map $s_\lambda : \mathbb{R}^d \to \mathbb{R}^d$ via $x \mapsto \lambda x$, then
   information divergences such as KL satisfy $\mathrm{KL}((s_\lambda)_\# \mu \,\|\, (s_\lambda)_\# \pi) = \mathrm{KL}(\mu \,\|\, \pi)$. On
   the other hand, $W_2$ is not invariant, $W_2((s_\lambda)_\# \mu, (s_\lambda)_\# \pi) = \lambda \, W_2(\mu, \pi)$, but $\sqrt{\alpha} \, W_2$ is
   (because the distribution $(s_\lambda)_\# \pi$ is $\alpha/\lambda^2$-strongly log-concave).
   Recall also that the $\mathrm{T}_2$ transport inequality, implied by $\alpha$-strong log-concavity,
   asserts that $\sqrt{\alpha} \, W_2(\cdot, \pi) \le \sqrt{2 \, \mathrm{KL}(\cdot \,\|\, \pi)}$. Therefore, $\sqrt{\alpha} \, W_2$ is a more natural metric.
3. The inequality (4.1.2) is not sharp; in Section 4.3, via a more sophisticated analysis,
   we will improve the iteration complexity to $\tilde{O}(d\kappa/\varepsilon^2)$ iterations.

4. The inequality (4.1.2) has the following interpretation: for fixed ℎ > 0, the first term
tends to zero exponentially fast, which reflects the fact that LMC converges to its
stationary distribution 𝜇∞ . However, the stationary distribution is biased, 𝜇 ∞ ≠ 𝜋,
and the second term provides an upper bound on the bias 𝑊2 (𝜇∞, 𝜋).
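
The bias discussed in item 4 can be computed exactly in a toy case. The sketch below assumes the one-dimensional Gaussian target π = normal(0, 1/α), for which the LMC iterates are Gaussian with variance following the recursion v ← (1 − αh)^2 v + 2h, and for which W_2 between centered Gaussians is the difference of standard deviations. The function names are illustrative, and the example is not claimed to be representative of general targets.

```python
import math


def lmc_stationary_variance(alpha: float, h: float) -> float:
    """Fixed point of v -> (1 - alpha*h)^2 * v + 2*h, valid for 0 < alpha*h < 2."""
    return 1.0 / (alpha * (1.0 - alpha * h / 2.0))


def iterate_variance(alpha: float, h: float, v0: float, n_steps: int) -> float:
    """Variance of the n-th LMC iterate for the target normal(0, 1/alpha), started with variance v0."""
    v = v0
    for _ in range(n_steps):
        v = (1.0 - alpha * h) ** 2 * v + 2.0 * h
    return v


def scaled_w2_bias(alpha: float, h: float) -> float:
    """sqrt(alpha) * W_2(mu_inf, pi), using W_2 = |sigma_inf - sigma_pi| for centered 1-d Gaussians."""
    sigma_inf = math.sqrt(lmc_stationary_variance(alpha, h))
    sigma_pi = math.sqrt(1.0 / alpha)
    return math.sqrt(alpha) * abs(sigma_inf - sigma_pi)


alpha = 2.0
for h in (0.1, 0.01, 0.001):
    print(f"h = {h:<6}: sqrt(alpha) * W_2(mu_inf, pi) = {scaled_w2_bias(alpha, h):.6f}")
```

In this example the scaled bias equals (1 − αh/2)^{-1/2} − 1, which is of order αh; the bias term in (4.1.2) only guarantees order √(αh) here, consistent with item 3's remark that the inequality is not sharp.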
First, let us see why the first statement of Theorem 4.1.1 implies the second.

Lemma 4.1.4. Let $\pi \propto \exp(-V)$ satisfy $\nabla^2 V \succeq \alpha I_d$ and $\nabla V(0) = 0$. Then, we have the
second moment bound $\int \|\cdot\|^2 \, d\pi \le d/\alpha$.
Moreover, if $h \le \frac{1}{\beta}$, then the LMC iterates initialized at $\mu_0 = \delta_0$ with step size $h > 0$
have uniformly bounded second moment: $\sup_{k \in \mathbb{N}} \mathbb{E}[\|X_{kh}\|^2] \lesssim d/\alpha$.

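Proof. See Exercise 4.2. □

By taking $h \asymp \frac{\alpha \varepsilon^2}{\beta^2 d}$, we can make the second term in (4.1.2) at most $\frac{\varepsilon^2}{2\alpha}$, and then for all
$N \gtrsim \frac{1}{\alpha h} \log(\alpha \, W_2^2(\delta_0, \pi)/\varepsilon^2) \asymp \frac{d \kappa^2}{\varepsilon^2} \log(\sqrt{d}/\varepsilon)$ the first term is also at most $\frac{\varepsilon^2}{2\alpha}$
(here we used $\alpha \, W_2^2(\delta_0, \pi) = \alpha \int \|\cdot\|^2 \, d\pi \le d$ from Lemma 4.1.4). This proves
the second statement of Theorem 4.1.1.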
We now prove the first statement.
Proof of Theorem 4.1.1. 1. One-step discretization bound. Suppose that the continuous-time
Langevin diffusion and the LMC algorithm are both initialized at the same measure
$\mu_0$. We will first bound the discretization error $W_2^2(\mu_h, \pi_h)$ in one step.
We couple the two processes by taking $X_0 = Z_0$ and using the same Brownian motion:
\begin{align*}
  X_h &= Z_0 - h \, \nabla V(Z_0) + \sqrt{2} \, B_h , \\
  Z_h &= Z_0 - \int_0^h \nabla V(Z_t) \, dt + \sqrt{2} \, B_h .
\end{align*}
Then,
\begin{align*}
  W_2^2(\mu_h, \pi_h) \le \mathbb{E}[\|X_h - Z_h\|^2]
  &\le \mathbb{E}\Bigl[ \Bigl\| \int_0^h \nabla V(Z_t) \, dt - h \, \nabla V(Z_0) \Bigr\|^2 \Bigr] \\
  &\le h \int_0^h \mathbb{E}[\|\nabla V(Z_t) - \nabla V(Z_0)\|^2] \, dt .
\end{align*}
Therefore, we just have to bound the movement $\|Z_t - Z_0\| = \|{-\int_0^t \nabla V(Z_s) \, ds} + \sqrt{2} \, B_t\|$
of the Langevin diffusion in time $t$. Roughly, we expect $\|\int_0^t \nabla V(Z_s) \, ds\| = O(\sqrt{d} \, t)$ if the
size of the gradient is $O(\sqrt{d})$, and $\|B_t\| = O(\sqrt{dt})$. For small $t$, it is the Brownian motion
term which is dominant, which is a common intuition for discretization proofs.
To rigorously bound this term, we appeal to stochastic calculus, see Lemma 4.1.5. For
$t \le \frac{1}{3\beta}$, it yields the bound
\[
  \mathbb{E}[\|Z_t - Z_0\|^2] \le 8 \beta^2 t^2 \, \mathbb{E}[\|Z_0\|^2] + 8 d t
\]
and hence
\[
  W_2^2(\mu_h, \pi_h) \le \beta^2 h \int_0^h \mathbb{E}[\|Z_t - Z_0\|^2] \, dt \le 3 \beta^4 h^4 \, \mathbb{E}[\|Z_0\|^2] + 4 \beta^2 d h^3 .
\]
0

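2. Multi-step discretization bound. We produce a coupling of $\mu_{(k+1)h}$ and $\pi$ as follows.
First, let $X_{kh} \sim \mu_{kh}$ and $Z_{kh} \sim \pi$ be optimally coupled. Using the same Brownian
motion, we set
\begin{align*}
  X_{(k+1)h} &:= X_{kh} - h \, \nabla V(X_{kh}) + \sqrt{2} \, (B_{(k+1)h} - B_{kh}) , \\
  Z_t &:= Z_{kh} - \int_{kh}^t \nabla V(Z_s) \, ds + \sqrt{2} \, (B_t - B_{kh}) , \qquad \text{for } t \in [kh, (k+1)h] .
\end{align*}
Clearly $X_{(k+1)h} \sim \mu_{(k+1)h}$; also, since $\pi$ is stationary for the Langevin diffusion, then
$Z_{(k+1)h} \sim \pi$. We also introduce an auxiliary process: let
\[
  \bar{X}_t := X_{kh} - \int_{kh}^t \nabla V(\bar{X}_s) \, ds + \sqrt{2} \, (B_t - B_{kh}) \qquad \text{for } t \in [kh, (k+1)h]
\]
denote the Langevin diffusion started at $X_{kh}$. We bound
\begin{align*}
  W_2^2(\mu_{(k+1)h}, \pi) &\le \mathbb{E}[\|X_{(k+1)h} - Z_{(k+1)h}\|^2] \\
  &= \mathbb{E}[\|\bar{X}_{(k+1)h} - Z_{(k+1)h}\|^2] + \mathbb{E}[\|X_{(k+1)h} - \bar{X}_{(k+1)h}\|^2] \\
  &\qquad + 2 \, \mathbb{E}\langle \bar{X}_{(k+1)h} - Z_{(k+1)h}, X_{(k+1)h} - \bar{X}_{(k+1)h} \rangle .
\end{align*}
For any $\lambda > 0$, Young’s inequality yields
\[
  W_2^2(\mu_{(k+1)h}, \pi) \le (1 + \lambda) \, \mathbb{E}[\|\bar{X}_{(k+1)h} - Z_{(k+1)h}\|^2] + \Bigl( 1 + \frac{1}{\lambda} \Bigr) \, \mathbb{E}[\|X_{(k+1)h} - \bar{X}_{(k+1)h}\|^2] .
\]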
Now we examine the two terms. In the first term, both $\bar{X}$ and $Z$ evolve via the Langevin
diffusion for an $\alpha$-strongly convex potential, so we have the following contraction (which
is established by a direct coupling argument, see Theorem 1.4.10):
\[
  \mathbb{E}[\|\bar{X}_{(k+1)h} - Z_{(k+1)h}\|^2] \le \exp(-2\alpha h) \, \mathbb{E}[\|X_{kh} - Z_{kh}\|^2] = \exp(-2\alpha h) \, W_2^2(\mu_{kh}, \pi) .
\]
For the second term, $X$ is the LMC algorithm and $\bar{X}$ is the continuous-time Langevin
diffusion, both initialized at the same distribution $\mu_{kh}$. Hence, we can apply our one-step
discretization bound from before and deduce that
\[
  \mathbb{E}[\|X_{(k+1)h} - \bar{X}_{(k+1)h}\|^2] \le 3 \beta^4 h^4 \, \mathbb{E}[\|X_{kh}\|^2] + 4 \beta^2 d h^3 \lesssim \frac{\beta^4 d h^4}{\alpha} + \beta^2 d h^3 ,
\]
where we used the bound from Lemma 4.1.4 on the second moment.
Combining everything,
\[
  W_2^2(\mu_{(k+1)h}, \pi) \le (1 + \lambda) \exp(-2\alpha h) \, W_2^2(\mu_{kh}, \pi) + O\Bigl( \Bigl( 1 + \frac{1}{\lambda} \Bigr) \Bigl( \frac{\beta^4 d h^4}{\alpha} + \beta^2 d h^3 \Bigr) \Bigr) .
\]
Next, if we take $\lambda = \alpha h$, then $(1 + \lambda) \exp(-2\alpha h) \le \exp(-\alpha h)$. Since $h \le \frac{1}{3\beta} \le \frac{1}{3\alpha}$,
\[
  W_2^2(\mu_{(k+1)h}, \pi) \le \exp(-\alpha h) \, W_2^2(\mu_{kh}, \pi) + O\Bigl( \frac{\beta^4 d h^3}{\alpha^2} + \frac{\beta^2 d h^2}{\alpha} \Bigr) .
\]
After iterating this recursion, it implies
\[
  W_2^2(\mu_{Nh}, \pi) \le \exp(-\alpha N h) \, W_2^2(\mu_0, \pi) + O\Bigl( \frac{\beta^4 d h^2}{\alpha^3} + \frac{\beta^2 d h}{\alpha^2} \Bigr) . \qquad \square
\]
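
The final iteration step can be spelled out as follows. The shorthand $w_k := W_2^2(\mu_{kh}, \pi)$ and the letter $c$ for the additive per-step error are introduced here only for this sketch and do not appear in the text.

```latex
% One-step recursion: w_{k+1} \le e^{-\alpha h} w_k + c, with
% c = O(\beta^4 d h^3/\alpha^2 + \beta^2 d h^2/\alpha).
\begin{align*}
  w_N
  &\le e^{-\alpha h} \, w_{N-1} + c
   \le \dots
   \le e^{-\alpha N h} \, w_0 + c \sum_{k=0}^{N-1} e^{-\alpha k h}
   \le e^{-\alpha N h} \, w_0 + \frac{c}{1 - e^{-\alpha h}} .
\end{align*}
% Since \alpha h \le 1/3 and 1 - e^{-x} \ge x - x^2/2 for x \ge 0,
\begin{align*}
  1 - e^{-\alpha h} \ge \alpha h \, \Bigl( 1 - \frac{\alpha h}{2} \Bigr) \ge \frac{5}{6} \, \alpha h ,
  \qquad\text{so}\qquad
  \frac{c}{1 - e^{-\alpha h}} \le \frac{6}{5} \, \frac{c}{\alpha h}
  = O\Bigl( \frac{\beta^4 d h^2}{\alpha^3} + \frac{\beta^2 d h}{\alpha^2} \Bigr) .
\end{align*}
```

In words, summing the geometric series costs a factor of order 1/(αh), which is why the accumulated bias is one power of h lower than the per-step error.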

We finish by presenting the lemma we used in the proof of Theorem 4.1.1. The following
proof is very typical of stochastic calculus arguments, so it is worth internalizing.

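Lemma 4.1.5. Let $(Z_t)_{t \ge 0}$ denote the Langevin diffusion and let $(\pi_t)_{t \ge 0}$ denote its law.
Assume that $\nabla V(0) = 0$ and $\|\nabla^2 V\|_{\mathrm{op}} \le \beta$. Then, provided that $t \le \frac{1}{3\beta}$,
\[
  \mathbb{E}[\|Z_t - Z_0\|^2] \le 8 \beta^2 t^2 \, \mathbb{E}[\|Z_0\|^2] + 8 d t .
\]

Proof. By definition,
\begin{align*}
  \mathbb{E}[\|Z_t - Z_0\|^2]
  &= \mathbb{E}\Bigl[ \Bigl\| {-\int_0^t \nabla V(Z_s) \, ds} + \sqrt{2} \, B_t \Bigr\|^2 \Bigr] \\
  &\le 2t \int_0^t \mathbb{E}[\|\nabla V(Z_s)\|^2] \, ds + 4 \, \mathbb{E}[\|B_t\|^2] .
\end{align*}
Using the assumption that $\nabla V(0) = 0$ and that $\nabla V$ is $\beta$-Lipschitz, we have the bound
$\|\nabla V(Z_s)\| \le \beta \, \|Z_s\|$. Thus,
\begin{align*}
  \mathbb{E}[\|Z_t - Z_0\|^2]
  &\le 2 \beta^2 t \int_0^t \mathbb{E}[\|Z_s\|^2] \, ds + 4 d t \\
  &\le 4 \beta^2 t \int_0^t \mathbb{E}[\|Z_s - Z_0\|^2] \, ds + 4 \beta^2 t^2 \, \mathbb{E}[\|Z_0\|^2] + 4 d t .
\end{align*}
Applying Grönwall’s inequality (Lemma 1.1.20), it implies
\[
  \mathbb{E}[\|Z_t - Z_0\|^2] \le \{ 4 \beta^2 t^2 \, \mathbb{E}[\|Z_0\|^2] + 4 d t \} \exp(4 \beta^2 t^2) .
\]
Finally, use the assumption $t \le \frac{1}{3\beta}$ to conclude. □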
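
The bound of Lemma 4.1.5 can be compared against an exactly solvable case. The sketch below assumes the one-dimensional quadratic potential V(x) = βx^2/2, for which the Langevin diffusion is an Ornstein–Uhlenbeck process with Z_t = e^{−βt} Z_0 + ((1 − e^{−2βt})/β)^{1/2} ξ, where ξ is a standard Gaussian independent of Z_0. The function names and the parameter grid are illustrative assumptions.

```python
import math


def ou_mean_sq_displacement(beta: float, t: float, second_moment_z0: float) -> float:
    """Exact E[(Z_t - Z_0)^2] for dZ = -beta * Z dt + sqrt(2) dB in one dimension."""
    decay = math.exp(-beta * t)
    return (1.0 - decay) ** 2 * second_moment_z0 + (1.0 - decay * decay) / beta


def lemma_bound(beta: float, t: float, second_moment_z0: float, d: int = 1) -> float:
    """Right-hand side of Lemma 4.1.5: 8 beta^2 t^2 E[|Z_0|^2] + 8 d t."""
    return 8.0 * beta ** 2 * t ** 2 * second_moment_z0 + 8.0 * d * t


worst_ratio = 0.0
for beta in (0.5, 1.0, 4.0):
    for fraction in (0.01, 0.1, 0.5, 1.0):
        t = fraction / (3.0 * beta)  # the lemma requires t <= 1 / (3 beta)
        for m2 in (0.0, 1.0, 10.0):
            ratio = ou_mean_sq_displacement(beta, t, m2) / lemma_bound(beta, t, m2)
            worst_ratio = max(worst_ratio, ratio)

print(f"largest exact/bound ratio on the grid: {worst_ratio:.3f}")
```

In this example the ratio stays well below 1, which suggests that the constants in the lemma are chosen for convenience rather than tightness.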

4.2 Proof via Interpolation Argument


The next proof we give is from [VW19] (slightly refined using a lemma from [Che+21a]).
Here, we mimic the continuous-time convergence proof in KL divergence by first defining a
continuous-time interpolation of the LMC iterates. Upon differentiating the KL divergence
along this interpolation, we discover two terms: the first is the Fisher information, and
the second is a discretization error term. By controlling the latter, we prove a convergence
result for LMC assuming only that 𝜋 satisfies LSI and that ∇𝑉 is Lipschitz.
The interpolation of LMC is defined as follows: for $t \in [kh, (k+1)h]$, we set

\[ X_t := X_{kh} - (t - kh)\, \nabla V(X_{kh}) + \sqrt 2\, (B_t - B_{kh}) . \tag{4.2.1} \]
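The interpolation (4.2.1) can be sketched in code as follows. This is a minimal illustration under assumptions: the function name `lmc_interpolated_step`, the sub-grid parameter `n_sub`, and the use of NumPy are choices made here, not part of the text. The gradient is frozen at $X_{kh}$ while the Brownian increment keeps evolving, and the last point of the returned path is the next LMC iterate $X_{(k+1)h}$.

```python
import numpy as np


def lmc_interpolated_step(x, grad_v, h, rng, n_sub=4):
    """Sample the interpolation (4.2.1) on the grid kh + h*j/n_sub, j = 1..n_sub.

    x       : current iterate X_{kh}, array of shape (d,)
    grad_v  : callable returning the gradient of V at a point
    returns : array of shape (n_sub, d); the last row is X_{(k+1)h}
    """
    g = grad_v(x)  # drift is frozen at the start of the step
    sub_dt = h / n_sub
    increments = rng.normal(scale=np.sqrt(sub_dt), size=(n_sub,) + x.shape)
    brownian = np.cumsum(increments, axis=0)  # B_t - B_{kh} on the sub-grid
    times = sub_dt * np.arange(1, n_sub + 1)  # t - kh on the sub-grid
    times = times.reshape((-1,) + (1,) * x.ndim)
    return x - times * g + np.sqrt(2.0) * brownian


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    path = lmc_interpolated_step(np.ones(3), lambda y: y, h=0.1, rng=rng)
    print(path[-1])  # one LMC step for the standard Gaussian target
```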

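Proposition 4.2.2. Let $(\mu_t)_{t \ge 0}$ be the law of the interpolated process (4.2.1). Then,
\[ \partial_t \mu_t = \operatorname{div}\Bigl( \mu_t\, \Bigl( \nabla \ln \frac{\mu_t}{\pi} + \mathbb{E}[\nabla V(X_{kh}) - \nabla V(X_t) \mid X_t = \cdot\,] \Bigr) \Bigr) . \]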

The proof is a little tricky to write out formally.


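Proof. Let $\mu_{t \mid \mathscr{F}_{kh}}$ denote the law of $X_t$ conditioned on the filtration $\mathscr{F}_{kh}$ at time $kh$. Then, $(\mu_{t \mid \mathscr{F}_{kh}})_{t \in [kh, (k+1)h]}$ satisfies the Fokker–Planck equation

\[ \partial_t \mu_{t \mid \mathscr{F}_{kh}} = \Delta \mu_{t \mid \mathscr{F}_{kh}} + \operatorname{div}\bigl( \mu_{t \mid \mathscr{F}_{kh}}\, \nabla V(X_{kh}) \bigr) . \]

Next, we take the expectation of the above equation; since $\mathbb{E}\, \mu_{t \mid \mathscr{F}_{kh}} = \mu_t$,

\[ \partial_t \mu_t = \Delta \mu_t + \operatorname{div} \mathbb{E}[\mu_{t \mid \mathscr{F}_{kh}}\, \nabla V(X_{kh})] . \]

Write $\mathbb{P}_{\mathscr{F}_{kh}}$ for the probability measure on $\mathscr{F}_{kh}$, and write $\mathbb{P}_{\mathscr{F}_{kh} \mid t}$ to denote the conditional measure given $X_t$. Note that $\mu_{t \mid \mathscr{F}_{kh}}(x \mid \omega)\, \mathbb{P}_{\mathscr{F}_{kh}}(\mathrm{d}\omega) = \mu_t(x)\, \mathbb{P}_{\mathscr{F}_{kh} \mid t}(\mathrm{d}\omega \mid x)$. So,

\begin{align*}
\mathbb{E}[\mu_{t \mid \mathscr{F}_{kh}}(x)\, \nabla V(X_{kh})] &= \int \mu_{t \mid \mathscr{F}_{kh}}(x \mid \omega)\, \nabla V\bigl(X_{kh}(\omega)\bigr)\, \mathbb{P}_{\mathscr{F}_{kh}}(\mathrm{d}\omega) \\
&= \mu_t(x) \int \nabla V\bigl(X_{kh}(\omega)\bigr)\, \mathbb{P}_{\mathscr{F}_{kh} \mid t}(\mathrm{d}\omega \mid x) \\
&= \mu_t(x)\, \mathbb{E}[\nabla V(X_{kh}) \mid X_t = x] .
\end{align*}

Therefore,
\begin{align*}
\partial_t \mu_t &= \Delta \mu_t + \operatorname{div}\bigl( \mu_t\, \mathbb{E}[\nabla V(X_{kh}) \mid X_t = \cdot\,] \bigr) \\
&= \operatorname{div}\Bigl( \mu_t\, \nabla \ln \frac{\mu_t}{\pi} \Bigr) + \operatorname{div}\bigl( \mu_t\, \mathbb{E}[\nabla V(X_{kh}) - \nabla V(X_t) \mid X_t = \cdot\,] \bigr) .
\end{align*}
□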

Corollary 4.2.3. Along the law $(\mu_t)_{t \ge 0}$ of the interpolated process (4.2.1),

\[ \partial_t \mathrm{KL}(\mu_t \,\|\, \pi) \le -\frac{3}{4}\, \mathrm{FI}(\mu_t \,\|\, \pi) + \mathbb{E}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2] . \]

Recall that the Fisher information is FI(𝜇 k 𝜋) B E𝜇 [k∇ ln(𝜇/𝜋)k 2 ] if 𝜇 has a smooth
density with respect to 𝜋.

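Proof. Using Proposition 4.2.2,
\begin{align*}
\partial_t \mathrm{KL}(\mu_t \,\|\, \pi) &= -\mathbb{E}_{\mu_t} \Bigl\langle \nabla \ln \frac{\mu_t}{\pi},\; \nabla \ln \frac{\mu_t}{\pi} + \mathbb{E}[\nabla V(X_{kh}) - \nabla V(X_t) \mid X_t = \cdot\,] \Bigr\rangle \\
&= -\mathrm{FI}(\mu_t \,\|\, \pi) + \mathbb{E}_{\mu_t} \Bigl\langle \nabla \ln \frac{\mu_t}{\pi},\; \mathbb{E}[\nabla V(X_t) - \nabla V(X_{kh}) \mid X_t = \cdot\,] \Bigr\rangle .
\end{align*}
Using Young's inequality,
\begin{align*}
\mathbb{E}_{\mu_t} \Bigl\langle \nabla \ln \frac{\mu_t}{\pi},\; \mathbb{E}[\nabla V(X_t) - \nabla V(X_{kh}) \mid X_t = \cdot\,] \Bigr\rangle
&\le \frac{1}{4}\, \mathrm{FI}(\mu_t \,\|\, \pi) + \mathbb{E}\bigl[ \| \mathbb{E}[\nabla V(X_t) - \nabla V(X_{kh}) \mid X_t] \|^2 \bigr] \\
&\le \frac{1}{4}\, \mathrm{FI}(\mu_t \,\|\, \pi) + \mathbb{E}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2] .
\end{align*}
□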
Before we give the convergence proof for LMC, we need one more lemma. Recall that
for Theorem 4.1.1, we needed to control E[k𝑋𝑘ℎ k 2 ], which we accomplished via strong
convexity of 𝑉 (Lemma 4.1.4). Under the weaker assumption of an LSI, it is trickier to
control the moments of the LMC iterates, but we have the following magic lemma.

Lemma 4.2.4 ([Che+21a, Lemma 16]). Suppose that $\pi \propto \exp(-V)$ where $\nabla V$ is $\beta$-Lipschitz. Then, for any probability measure $\mu$,

\[ \mathbb{E}_\mu[\|\nabla V\|^2] \le \mathrm{FI}(\mu \,\|\, \pi) + 2\beta d . \]

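Proof. For the generator $\mathscr{L}$ of the Langevin diffusion (with potential $V$), we can calculate $\mathscr{L} V = \Delta V - \|\nabla V\|^2$. Also, since $\nabla^2 V \preceq \beta I_d$, then $\Delta V \le \beta d$. Thus, using the fundamental integration by parts identity (Theorem 1.2.14),
\begin{align*}
\mathbb{E}_\mu[\|\nabla V\|^2] = \mathbb{E}_\mu[\Delta V - \mathscr{L} V]
&\le \beta d + \int (-\mathscr{L} V)\, \frac{\mathrm{d}\mu}{\mathrm{d}\pi}\, \mathrm{d}\pi
= \beta d + \int \Bigl\langle \nabla V, \nabla \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \Bigr\rangle\, \mathrm{d}\pi \\
&= \beta d + \int \Bigl\langle \nabla V, \nabla \ln \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \Bigr\rangle\, \mathrm{d}\mu \\
&\le \beta d + \frac{1}{2}\, \mathbb{E}_\mu[\|\nabla V\|^2] + \frac{1}{2}\, \mathrm{FI}(\mu \,\|\, \pi) .
\end{align*}
Rearranging the inequality yields the result. □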
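As a worked instance of Lemma 4.2.4 (an illustration added here under stated assumptions, not taken from the text), consider the Gaussian target $\pi = \mathrm{normal}(0, \beta^{-1} I_d)$, so that $V(x) = \beta \|x\|^2/2$, and the Gaussian measure $\mu = \mathrm{normal}(m, \sigma^2 I_d)$. Then $\mathbb{E}_\mu[\|\nabla V\|^2] = \beta^2 (\|m\|^2 + d\sigma^2)$, and since $\nabla \ln(\mu/\pi)(x) = (\beta - \sigma^{-2})(x - m) + \beta m$, one finds $\mathrm{FI}(\mu \,\|\, \pi) = (\beta - \sigma^{-2})^2 d \sigma^2 + \beta^2 \|m\|^2$. The slack in the lemma is then exactly $d/\sigma^2$. The function names below are hypothetical.

```python
def grad_second_moment(beta, d, m_norm_sq, sigma_sq):
    """E_mu ||grad V||^2 for V = beta ||x||^2 / 2 and mu = normal(m, sigma^2 I_d)."""
    return beta**2 * (m_norm_sq + d * sigma_sq)


def fisher_information(beta, d, m_norm_sq, sigma_sq):
    """FI(mu || pi) for mu = normal(m, sigma^2 I_d) and pi = normal(0, I_d / beta)."""
    return (beta - 1.0 / sigma_sq) ** 2 * d * sigma_sq + beta**2 * m_norm_sq


def lemma_slack(beta, d, m_norm_sq, sigma_sq):
    """FI + 2 beta d - E_mu ||grad V||^2; equals d / sigma^2 in this example."""
    return (fisher_information(beta, d, m_norm_sq, sigma_sq)
            + 2.0 * beta * d
            - grad_second_moment(beta, d, m_norm_sq, sigma_sq))


if __name__ == "__main__":
    print(lemma_slack(beta=3.0, d=4, m_norm_sq=2.0, sigma_sq=0.5))
```

The slack vanishes only as $\sigma^2 \to \infty$, which suggests that the constant $2\beta d$ cannot be improved in general for this family.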
Also, recall that an LSI implies $\mathrm{KL}(\cdot \,\|\, \pi) \le \frac{C_{\rm LSI}}{2}\, \mathrm{FI}(\cdot \,\|\, \pi)$.

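Theorem 4.2.5 ([VW19]). For $k \in \mathbb{N}$, let $\mu_{kh}$ denote the law of the $k$-th iterate of LMC with step size $h > 0$. Assume that the target $\pi \propto \exp(-V)$ satisfies LSI and that $\nabla V$ is $\beta$-Lipschitz. Then, for all $h \le \frac{1}{4\beta}$, for all $N \in \mathbb{N}$,
\[ \mathrm{KL}(\mu_{Nh} \,\|\, \pi) \le \exp\Bigl(-\frac{Nh}{C_{\rm LSI}}\Bigr)\, \mathrm{KL}(\mu_0 \,\|\, \pi) + O(C_{\rm LSI}\, \beta^2 d h + \beta^2 d h^2) . \]
In particular, for all $\varepsilon \in [0, C_{\rm LSI}\, \beta \sqrt d]$, if we take $h \asymp \frac{\varepsilon^2}{C_{\rm LSI}\, \beta^2 d}$, then we obtain the guarantee $\sqrt{\mathrm{KL}(\mu_{Nh} \,\|\, \pi)} \le \varepsilon$ after
\[ N = O\Bigl( \frac{C_{\rm LSI}^2\, \beta^2 d}{\varepsilon^2} \log \frac{\sqrt{\mathrm{KL}(\mu_0 \,\|\, \pi)}}{\varepsilon} \Bigr) \quad \text{iterations} . \]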

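Proof. In light of Corollary 4.2.3, we focus our attention on the discretization error term
\begin{align*}
\mathbb{E}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2] &\le \beta^2\, \mathbb{E}[\|X_t - X_{kh}\|^2] \\
&= \beta^2 (t - kh)^2\, \mathbb{E}[\|\nabla V(X_{kh})\|^2] + 2\beta^2\, \mathbb{E}[\|B_t - B_{kh}\|^2] .
\end{align*}

In order to apply Lemma 4.2.4, it is more convenient to have $\mathbb{E}[\|\nabla V(X_t)\|^2]$ instead of $\mathbb{E}[\|\nabla V(X_{kh})\|^2]$. So, we use

\[ \mathbb{E}[\|\nabla V(X_{kh})\|^2] \le 2\, \mathbb{E}[\|\nabla V(X_t)\|^2] + 2\, \mathbb{E}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2] . \]

If $h \le \frac{1}{2\beta}$, we can combine this inequality with the previous one and rearrange to obtain
\begin{align*}
\mathbb{E}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2] &\le 4\beta^2 (t - kh)^2\, \mathbb{E}[\|\nabla V(X_t)\|^2] + 4\beta^2\, \mathbb{E}[\|B_t - B_{kh}\|^2] \\
&\le 4\beta^2 (t - kh)^2\, \mathbb{E}[\|\nabla V(X_t)\|^2] + 4\beta^2 d\, (t - kh) .
\end{align*}
For the first term, we apply Lemma 4.2.4, yielding for $h \le \frac{1}{4\beta}$
\begin{align*}
4\beta^2 (t - kh)^2\, \mathbb{E}[\|\nabla V(X_t)\|^2] &\le 4\beta^2 h^2\, \mathrm{FI}(\mu_t \,\|\, \pi) + 8\beta^3 d\, (t - kh)^2 \\
&\le \frac{1}{4}\, \mathrm{FI}(\mu_t \,\|\, \pi) + 2\beta^2 d\, (t - kh) .
\end{align*}
Combining with our differential inequality from Corollary 4.2.3 and LSI,
\[ \partial_t \mathrm{KL}(\mu_t \,\|\, \pi) \le -\frac{1}{2}\, \mathrm{FI}(\mu_t \,\|\, \pi) + 6\beta^2 d\, (t - kh) \le -\frac{1}{C_{\rm LSI}}\, \mathrm{KL}(\mu_t \,\|\, \pi) + 6\beta^2 d\, (t - kh) . \]
This implies that
\[ \partial_t \Bigl[ \exp\Bigl(\frac{t - kh}{C_{\rm LSI}}\Bigr)\, \mathrm{KL}(\mu_t \,\|\, \pi) \Bigr] \le 6\beta^2 d\, (t - kh) \exp\Bigl(\frac{t - kh}{C_{\rm LSI}}\Bigr) \]
and upon integration,
\[ \mathrm{KL}(\mu_{(k+1)h} \,\|\, \pi) \le \exp\Bigl(-\frac{h}{C_{\rm LSI}}\Bigr)\, \mathrm{KL}(\mu_{kh} \,\|\, \pi) + 3\beta^2 d h^2 . \]
Iterating and splitting into cases based on whether or not $h \le C_{\rm LSI}$,
\[ \mathrm{KL}(\mu_{Nh} \,\|\, \pi) \le \exp\Bigl(-\frac{Nh}{C_{\rm LSI}}\Bigr)\, \mathrm{KL}(\mu_0 \,\|\, \pi) + O(\max\{C_{\rm LSI}\, \beta^2 d h,\; \beta^2 d h^2\}) . \]
□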
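The integration step and the final iteration can be made explicit with a short computation. The sketch below is illustrative only; the function names and the constants 6 and 5 in the final comparison are assumptions worked out here, not constants claimed by the text. It uses the closed form $\int_0^h s\, e^{(s-h)/C}\, \mathrm{d}s = Ch - C^2 (1 - e^{-h/C}) \le h^2/2$ for the one-step error, then iterates the resulting recursion.

```python
import math


def one_step_error(beta, d, h, c_lsi):
    """6 beta^2 d * int_0^h s exp((s - h)/C) ds, computed in closed form."""
    integral = c_lsi * h + c_lsi**2 * math.expm1(-h / c_lsi)
    return 6.0 * beta**2 * d * integral


def iterate_kl_recursion(kl0, beta, d, h, c_lsi, n_steps):
    """Iterate KL_{k+1} = exp(-h/C) KL_k + 3 beta^2 d h^2."""
    kl = kl0
    for _ in range(n_steps):
        kl = math.exp(-h / c_lsi) * kl + 3.0 * beta**2 * d * h**2
    return kl


if __name__ == "__main__":
    beta, d, h, c_lsi = 2.0, 10, 0.05, 2.0
    print(one_step_error(beta, d, h, c_lsi), 3.0 * beta**2 * d * h**2)
    print(iterate_kl_recursion(5.0, beta, d, h, c_lsi, 10_000))
```

The error floor of the recursion is $3\beta^2 d h^2 / (1 - e^{-h/C_{\rm LSI}})$, which is at most $6\, C_{\rm LSI}\, \beta^2 d h$ when $h \le C_{\rm LSI}$ and at most $5\beta^2 d h^2$ otherwise; this is the case split mentioned at the end of the proof.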
Recall from Theorem 2.2.15 that an LSI implies exponential decay in every Rényi
divergence, not just the KL divergence. Working with Rényi divergences of order 𝑞 > 1
introduces substantial new difficulties for the discretization analysis, which is why it is
remarkable that the proof above can be adapted to the Rényi case with the introduction
of some additional tricks; see Chapter 5.

4.3 Proof via Convex Optimization


Next, we turn towards an astonishing proof, due to [DMM19], which is inspired by
convex optimization. This proof also yields the state-of-the-art dependence of LMC on
the condition number 𝜅 of the target 𝜋.
Let the target be $\pi = \exp(-V)$ (for this proof, we are assuming that $V$ is normalized so that $\int \exp(-V) = 1$; this just simplifies the notation but does not change the algorithm nor the analysis). We now view sampling as the composite optimization problem of minimizing the objective
\[ \mathrm{KL}(\mu \,\|\, \pi) = \underbrace{\int V\, \mathrm{d}\mu}_{=:\, \mathcal{E}(\mu)} + \underbrace{\int \mu \ln \mu}_{=:\, \mathcal{H}(\mu)} , \]
where the two terms are the energy and the (negative) entropy. Accordingly, we break up the iterates of LMC into the steps
\begin{align*}
X_{kh}^+ &:= X_{kh} - h\, \nabla V(X_{kh}) , \\
X_{(k+1)h} &:= X_{kh}^+ + \sqrt 2\, (B_{(k+1)h} - B_{kh}) .
\end{align*}
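The two half-steps translate directly into code. The following is a minimal sketch assuming NumPy; the names `forward_step`, `flow_step`, and `lmc` are hypothetical and chosen here for illustration. The flow step uses the fact that $\sqrt 2\, (B_{(k+1)h} - B_{kh})$ is distributed as $\mathrm{normal}(0, 2h\, I_d)$.

```python
import numpy as np


def forward_step(x, grad_v, h):
    """Deterministic gradient step on V: X^+ = X - h * grad V(X)."""
    return x - h * grad_v(x)


def flow_step(x_plus, h, rng):
    """Heat flow for time h: add sqrt(2) times a Brownian increment over time h."""
    return x_plus + np.sqrt(2.0 * h) * rng.standard_normal(x_plus.shape)


def lmc(x0, grad_v, h, n_steps, rng):
    """Run LMC as alternating forward and flow steps (rows of x0 are particles)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = flow_step(forward_step(x, grad_v, h), h, rng)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Standard Gaussian target: V(x) = ||x||^2 / 2, so grad V(x) = x.
    samples = lmc(np.zeros((5000, 2)), lambda y: y, h=0.1, n_steps=200, rng=rng)
    print(samples.mean(), samples.var())
```

Keeping the two steps as separate functions mirrors the structure of the proof, which tracks the energy and the entropy across each half-step separately.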

The first step is simply a deterministic gradient descent update on the function $V$. If we write $\mu_{kh}^+$ for the law of $X_{kh}^+$, then in the space of measures one can show that $\mu_{kh}^+$ is obtained from $\mu_{kh}$ by taking a gradient step for the energy functional $\mathcal{E}$ w.r.t. the Wasserstein geometry.¹ On the other hand, the second step applies the heat flow; in the space of measures, this is a Wasserstein gradient flow for the entropy functional $\mathcal{H}$.
Since the gradient descent algorithm is sometimes known as the “forwards” method in
optimization (as opposed to a proximal step which is the “backwards” method), this has
led to LMC being dubbed the “forward-flow” algorithm (see [Wib18] for more on this
perspective).
The strategy of the proof is to show that the forward step of LMC dissipates the energy
while not increasing the entropy too much, and that the flow step of LMC dissipates the
entropy while not increasing the energy too much.
The key lemma is the following one-step inequality.

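Lemma 4.3.1. Let $\pi = \exp(-V)$ be the target and assume that $0 \preceq \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Let $(\mu_{kh})_{k \in \mathbb{N}}$ denote the iterates of LMC with step size $h \in [0, \frac{1}{\beta}]$. Then,

\[ 2h\, \mathrm{KL}(\mu_{(k+1)h} \,\|\, \pi) \le (1 - \alpha h)\, W_2^2(\mu_{kh}, \pi) - W_2^2(\mu_{(k+1)h}, \pi) + 2\beta d h^2 . \tag{4.3.2} \]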

From this, we deduce the following results.

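Theorem 4.3.3 ([DMM19]). Suppose that $\pi = \exp(-V)$ is the target distribution and that $V$ satisfies $0 \preceq \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Let $(\mu_{kh})_{k \in \mathbb{N}}$ denote the law of LMC.

1. (weakly convex case) Suppose that $\alpha = 0$. For any $\varepsilon \in [0, \sqrt d]$, if we take step size $h \asymp \frac{\varepsilon^2}{\beta d}$, then for the mixture distribution $\bar\mu_{Nh} := N^{-1} \sum_{k=1}^N \mu_{kh}$ it holds that $\sqrt{\mathrm{KL}(\bar\mu_{Nh} \,\|\, \pi)} \le \varepsilon$ after
\[ N = O\Bigl( \frac{\beta d\, W_2^2(\mu_0, \pi)}{\varepsilon^4} \Bigr) \quad \text{iterations} . \]

2. (strongly convex case) Suppose that $\alpha > 0$ and let $\kappa := \beta/\alpha$ denote the condition number. Then, for any $\varepsilon \in [0, \sqrt d]$, with step size $h \asymp \frac{\varepsilon^2}{\beta d}$ we obtain $\sqrt\alpha\, W_2(\mu_{Nh}, \pi) \le \varepsilon$ and $\sqrt{\mathrm{KL}(\bar\mu_{Nh,2Nh} \,\|\, \pi)} \le \varepsilon$ after
\[ N = O\Bigl( \frac{\kappa d}{\varepsilon^2} \log \frac{\sqrt\alpha\, W_2(\mu_0, \pi)}{\varepsilon} \Bigr) \quad \text{iterations} , \]
where $\bar\mu_{Nh,2Nh} := N^{-1} \sum_{k=N+1}^{2N} \mu_{kh}$.

¹ Technically, this is only true if the step size $h$ is chosen so that $h\, \|\nabla^2 V\|_{\rm op} \le 1$. This is because on any Riemannian manifold in which geodesics cannot be extended indefinitely, the gradient descent steps must be short enough to ensure that the iterates are still travelling along geodesics.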

Proof. We use Lemma 4.3.1.

1. By summing the inequality (4.3.2) and using the convexity of the KL divergence,
\[ \mathrm{KL}(\bar\mu_{Nh} \,\|\, \pi) \le \frac{1}{N} \sum_{k=1}^N \mathrm{KL}(\mu_{kh} \,\|\, \pi) \le \frac{W_2^2(\mu_0, \pi)}{2Nh} + \beta d h . \]
The result follows from our choice of $h$ and $N$.

2. First, we prove the $W_2$ guarantee. Using the fact that $\mathrm{KL}(\mu_{(k+1)h} \,\|\, \pi) \ge 0$ and iterating the inequality (4.3.2) we obtain
\begin{align*}
W_2^2(\mu_{Nh}, \pi) &\le (1 - \alpha h)^N\, W_2^2(\mu_0, \pi) + 2\beta d h^2 \sum_{k=0}^{N-1} (1 - \alpha h)^k \\
&\le \exp(-\alpha N h)\, W_2^2(\mu_0, \pi) + O(\kappa d h) .
\end{align*}
With our choice of $h$ and $N$, we obtain $\sqrt\alpha\, W_2(\mu_{Nh}, \pi) \le \varepsilon$.
Next, forget about the previous $N$ iterations of LMC and consider $\mu_{Nh}$ to be the new initialization to LMC. Applying the weakly convex result now yields the KL guarantee $\sqrt{\mathrm{KL}(\bar\mu_{Nh,2Nh} \,\|\, \pi)} \le \varepsilon$. □
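The strongly convex $W_2$ guarantee can be traced through numerically. This is a hedged sketch: the implied constants in $h \asymp \varepsilon^2/(\beta d)$ and in the iteration count are both taken to be 1, which is an assumption made here for illustration, and the function names are hypothetical. With these choices the recursion from (4.3.2) gives $\alpha\, W_2^2(\mu_{Nh}, \pi) \le 3\varepsilon^2$.

```python
import math


def strongly_convex_schedule(eps, alpha, beta, d, w0_sq):
    """Step size and iteration count suggested by Theorem 4.3.3 (constants set to 1)."""
    h = eps**2 / (beta * d)
    kappa = beta / alpha
    log_term = max(1.0, math.log(alpha * w0_sq / eps**2))
    n = math.ceil(kappa * d / eps**2 * log_term)
    return h, n


def iterate_w2(w0_sq, alpha, beta, d, h, n):
    """Iterate w_{k+1} = (1 - alpha h) w_k + 2 beta d h^2, dropping the KL term in (4.3.2)."""
    w = w0_sq
    for _ in range(n):
        w = (1.0 - alpha * h) * w + 2.0 * beta * d * h**2
    return w


if __name__ == "__main__":
    eps, alpha, beta, d, w0_sq = 0.5, 1.0, 10.0, 5, 100.0
    h, n = strongly_convex_schedule(eps, alpha, beta, d, w0_sq)
    w = iterate_w2(w0_sq, alpha, beta, d, h, n)
    print(h, n, alpha * w, 3.0 * eps**2)
```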


We next turn towards the proof of Lemma 4.3.1.

Proof of Lemma 4.3.1. We break the proof into three steps.

1. The forward step dissipates the energy. Let $Z \sim \pi$ be optimally coupled to $X_{kh}$. Then, applying the strong convexity and smoothness of $V$,
\begin{align*}
\mathcal{E}(\mu_{kh}^+) - \mathcal{E}(\pi) &= \mathbb{E}[V(X_{kh}^+) - V(X_{kh}) + V(X_{kh}) - V(Z)] \\
&\le \mathbb{E}\Bigl[ \langle \nabla V(X_{kh}), X_{kh}^+ - X_{kh} \rangle + \frac{\beta}{2}\, \|X_{kh}^+ - X_{kh}\|^2 + \langle \nabla V(X_{kh}), X_{kh} - Z \rangle - \frac{\alpha}{2}\, \|X_{kh} - Z\|^2 \Bigr] \\
&= \mathbb{E}\Bigl[ \langle \nabla V(X_{kh}), X_{kh}^+ - Z \rangle + \frac{\beta}{2}\, \|X_{kh}^+ - X_{kh}\|^2 - \frac{\alpha}{2}\, \|X_{kh} - Z\|^2 \Bigr] . \tag{4.3.4}
\end{align*}
Next, we have the expansion
\begin{align*}
\|X_{kh} - Z\|^2 &= \|X_{kh}^+ - X_{kh}\|^2 + \|X_{kh}^+ - Z\|^2 - 2\, \langle X_{kh}^+ - X_{kh}, X_{kh}^+ - Z \rangle \\
&= \|X_{kh}^+ - X_{kh}\|^2 + \|X_{kh}^+ - Z\|^2 + 2h\, \langle \nabla V(X_{kh}), X_{kh}^+ - Z \rangle .
\end{align*}
Substituting this into (4.3.4), it yields
\begin{align*}
\mathcal{E}(\mu_{kh}^+) - \mathcal{E}(\pi) &\le \frac{1}{2h}\, \mathbb{E}\bigl[ (1 - \alpha h)\, \|X_{kh} - Z\|^2 - \|X_{kh}^+ - Z\|^2 - (1 - \beta h)\, \|X_{kh}^+ - X_{kh}\|^2 \bigr] \\
&\le \frac{1}{2h}\, \{ (1 - \alpha h)\, W_2^2(\mu_{kh}, \pi) - W_2^2(\mu_{kh}^+, \pi) \} , \tag{4.3.5}
\end{align*}
where the last inequality uses the assumption $h \le 1/\beta$.

2. The flow step does not substantially increase the energy. Next, using the $\beta$-smoothness of $V$,
\begin{align*}
\mathcal{E}(\mu_{(k+1)h}) - \mathcal{E}(\mu_{kh}^+) &= \mathbb{E}[V(X_{(k+1)h}) - V(X_{kh}^+)] \\
&\le \mathbb{E}\Bigl[ \langle \nabla V(X_{kh}^+), X_{(k+1)h} - X_{kh}^+ \rangle + \frac{\beta}{2}\, \|X_{(k+1)h} - X_{kh}^+\|^2 \Bigr] \\
&= \mathbb{E}\bigl[ \sqrt 2\, \langle \nabla V(X_{kh}^+), B_{(k+1)h} - B_{kh} \rangle + \beta\, \|B_{(k+1)h} - B_{kh}\|^2 \bigr] \\
&= \beta d h . \tag{4.3.6}
\end{align*}

3. The flow step dissipates the entropy. Let $(Q_t)_{t \ge 0}$ denote the heat semigroup, i.e., $Q_t f(x) := \mathbb{E} f(x + \sqrt 2\, B_t)$, so that $\mu_{(k+1)h} = \mu_{kh}^+ Q_h$. Then, since the heat flow is the Wasserstein gradient flow of $\mathcal{H}$, and the Wasserstein gradient of $\mathcal{H}$ is $\nabla_{W_2} \mathcal{H}(\mu) = \nabla \ln \mu$, one can show that

\[ \partial_t W_2^2(\mu_{kh}^+ Q_t, \pi) \le 2\, \mathbb{E} \langle \nabla \ln(\mu_{kh}^+ Q_t)(X_{kh+t}^+),\; Z - X_{kh+t}^+ \rangle \]

where $X_{kh+t}^+ \sim \mu_{kh}^+ Q_t$ and $Z \sim \pi$ are optimally coupled. This follows from the formula for the gradient of the squared Wasserstein distance (Theorem 1.4.11); it may be justified more rigorously using, e.g., [AGS08, Theorem 10.2.2].
On the other hand, we showed that $\mathcal{H}$ is geodesically convex (see (1.4.3)), so
\[ \mathcal{H}(\pi) - \mathcal{H}(\mu_{kh}^+ Q_t) \ge \mathbb{E} \langle \nabla \ln(\mu_{kh}^+ Q_t)(X_{kh+t}^+),\; Z - X_{kh+t}^+ \rangle . \]
Using the fact that $t \mapsto \mathcal{H}(\mu_{kh}^+ Q_t)$ is decreasing (which also follows because $t \mapsto \mu_{kh}^+ Q_t$ is the gradient flow of $\mathcal{H}$), we then have
\[ W_2^2(\mu_{(k+1)h}, \pi) - W_2^2(\mu_{kh}^+, \pi) \le 2h\, \{ \mathcal{H}(\pi) - \mathcal{H}(\mu_{(k+1)h}) \} . \tag{4.3.7} \]
Concluding the proof. Combine (4.3.5), (4.3.6), and (4.3.7) to obtain (4.3.2). □

Non-smooth case. The proof via convex optimization can also handle the non-smooth
case in which we only assume that 𝑉 is convex and Lipschitz. As before, we deduce the
convergence result from a key one-step inequality.

Lemma 4.3.8. Let $\pi = \exp(-V)$ be the target and assume that $V$ is convex and $L$-Lipschitz. Let $(\mu_{kh})_{k \in \mathbb{N}}$ denote the iterates of LMC with step size $h > 0$. Then,

\[ 2h\, \mathrm{KL}(\mu_{(k+1)h} \,\|\, \pi) \le W_2^2(\mu_{kh}^+, \pi) - W_2^2(\mu_{(k+1)h}^+, \pi) + L^2 h^2 . \]

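Theorem 4.3.9 ([DMM19]). Suppose that $\pi = \exp(-V)$ is the target distribution and that $V$ is convex and $L$-Lipschitz. Let $(\mu_{kh})_{k \in \mathbb{N}}$ denote the law of LMC. For any $\varepsilon > 0$, if we take step size $h \asymp \frac{\varepsilon^2}{L^2}$, then for the mixture distribution $\bar\mu_{Nh} := N^{-1} \sum_{k=1}^N \mu_{kh}$ it holds that $\sqrt{\mathrm{KL}(\bar\mu_{Nh} \,\|\, \pi)} \le \varepsilon$ after
\[ N = O\Bigl( \frac{L^2\, W_2^2(\mu_0^+, \pi)}{\varepsilon^4} \Bigr) \quad \text{iterations} . \]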

Proof. This follows from Lemma 4.3.8 in exactly the same way that the weakly convex
case of Theorem 4.3.3 follows from Lemma 4.3.1. 
Proof of Lemma 4.3.8. The main task here is to obtain dissipation of the energy functional $\mathcal{E}$ under our new assumptions. Let $Z \sim \pi$ be optimally coupled to $X_{(k+1)h}$. Then, similarly to before, we have
\[ \|X_{(k+1)h} - Z\|^2 = \|X_{(k+1)h}^+ - X_{(k+1)h}\|^2 + \|X_{(k+1)h}^+ - Z\|^2 + 2h\, \langle \nabla V(X_{(k+1)h}), X_{(k+1)h}^+ - Z \rangle . \]
Rearranging and using the convexity and Lipschitzness of $V$,
\begin{align*}
\|X_{(k+1)h}^+ - Z\|^2 - \|X_{(k+1)h} - Z\|^2
&= 2h\, \langle \nabla V(X_{(k+1)h}), Z - X_{(k+1)h}^+ \rangle - \|X_{(k+1)h}^+ - X_{(k+1)h}\|^2 \\
&= 2h\, \langle \nabla V(X_{(k+1)h}), Z - X_{(k+1)h} + X_{(k+1)h} - X_{(k+1)h}^+ \rangle - \|X_{(k+1)h}^+ - X_{(k+1)h}\|^2 \\
&\le 2h\, \{V(Z) - V(X_{(k+1)h})\} + 2h\, \|\nabla V(X_{(k+1)h})\|\, \|X_{(k+1)h}^+ - X_{(k+1)h}\| - \|X_{(k+1)h}^+ - X_{(k+1)h}\|^2 \\
&\le 2h\, \{V(Z) - V(X_{(k+1)h})\} + h^2\, \|\nabla V(X_{(k+1)h})\|^2 \\
&\le 2h\, \{V(Z) - V(X_{(k+1)h})\} + L^2 h^2 .
\end{align*}

Taking expectations,

\[ 2h\, \{ \mathcal{E}(\mu_{(k+1)h}) - \mathcal{E}(\pi) \} \le W_2^2(\mu_{(k+1)h}, \pi) - W_2^2(\mu_{(k+1)h}^+, \pi) + L^2 h^2 . \tag{4.3.10} \]

On the other hand, recall from (4.3.7) that

\[ 2h\, \{ \mathcal{H}(\mu_{(k+1)h}) - \mathcal{H}(\pi) \} \le W_2^2(\mu_{kh}^+, \pi) - W_2^2(\mu_{(k+1)h}, \pi) . \]

Together with (4.3.10), this completes the proof. □

4.4 Proof via Girsanov’s Theorem


The idea behind the next proof is to control the discretization error KL(𝜇 𝑁ℎ k 𝜋 𝑁ℎ ), where
𝜇 𝑁ℎ is the law of the LMC iterate 𝑋 𝑁ℎ and 𝜋 𝑁ℎ is the law of the Langevin diffusion 𝑍 𝑁ℎ
initialized at 𝜇0 . This is accomplished using Girsanov’s theorem from stochastic calculus.
Unlike the previous proofs, which assumed strong log-concavity or LSI for 𝜋, control-
ling this discretization error will only require mild assumptions, such as smoothness of
𝑉 and sub-Gaussianity of 𝜋. The point, however, is that control of KL(𝜇 𝑁ℎ k 𝜋 𝑁ℎ ) does
not immediately yield quantitative convergence of 𝜇 𝑁ℎ → 𝜋; indeed, this also requires
quantitative convergence of 𝜋 𝑁ℎ → 𝜋, which does require some kind of assumption such
as strong log-concavity or a functional inequality.
Another remark is that the preceding proofs all had the following structure: for a
fixed step size ℎ > 0, as the number of iterations 𝑁 → ∞, the error of LMC is at most a
quantity depending on ℎ, and which can be made as small as we like by taking ℎ small. In
particular, to achieve a desired error it suffices that the step size be sufficiently small and
that the number of iterations be sufficiently large. In contrast, for the following proof we
will only be able to establish a bound on KL(𝜇 𝑁ℎ k 𝜋 𝑁ℎ ) which grows with the iteration
number 𝑁 . Consequently, in our final sampling guarantee, we will not be able to take 𝑁
too large; our guarantee will only imply that the error of LMC is small if 𝑁 lies in some
range. Conceptually, this is unsatisfying because “running the Markov chain too long”
should not be a problem, and it only arises as an artefact of the proof. Nonetheless, it is
worthwhile learning the proof because it is broadly applicable.
Historically, the Girsanov method was one of the first discretization techniques utilized
in the modern quantitative study of sampling (see [DT12]). The argument we present here
is similar in spirit to [DT12], although we have made a few refinements.

Background on Girsanov’s theorem. The law 𝜇 (𝑘+1)ℎ of the LMC iterate is obtained
from 𝜇𝑘ℎ by first applying the mapping id − ℎ ∇𝑉 and then convolving with a Gaussian.
This is a somewhat complicated recursive process, and it is not tractable to write down
and explicitly work with the density 𝜇𝑘ℎ . On the other hand, conditioned on 𝑋𝑘ℎ , the
conditional law of 𝑋 (𝑘+1)ℎ is simply a Gaussian, with an explicit mean (depending on 𝑋𝑘ℎ )
and covariance matrix. Consequently, it is easy to write down an explicit and simple
expression for the joint law of 𝑋 0, 𝑋ℎ , 𝑋 2ℎ , . . . , 𝑋𝑘ℎ .
The Langevin diffusion 𝑡 ↦→ 𝑍𝑡 is not as straightforward because it is the solution to a
stochastic differential equation, but infinitesimally we can think of it in the same way as
the LMC iteration: namely, conditioned on the past, the conditional law of the Langevin
diffusion in the next instant is a Gaussian with some mean and covariance. So, it should
be easier to consider the “joint law” of the Langevin diffusion. In fact, the content of
Girsanov’s theorem is that there is an explicit and simple formula for the “ratio of the
joint laws” of any two processes driven by the same Brownian motion.
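To make this idea rigorous, consider the SDEs
\begin{align*}
\mathrm{d}X_t &= -\nabla V(X_{kh})\, \mathrm{d}t + \sqrt 2\, \mathrm{d}B_t , \qquad \text{for } t \in [kh, (k+1)h] , \\
\mathrm{d}Z_t &= -\nabla V(Z_t)\, \mathrm{d}t + \sqrt 2\, \mathrm{d}B_t .
\end{align*}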

Then, (𝑋𝑡 )𝑡 ≥0 is the interpolation of LMC that we encountered in (4.2.1) and (𝑍𝑡 )𝑡 ≥0 is
the Langevin diffusion. Since (𝑋𝑡 )𝑡 ∈[0,𝑇 ] is a random continuous path [0,𝑇 ] → R𝑑 , its law
is a probability measure P𝑇 on the space C([0,𝑇 ]; R𝑑 ). Similarly, the law of (𝑍𝑡 )𝑡 ∈[0,𝑇 ]
is another probability measure Q𝑇 on C([0,𝑇 ]; R𝑑 ). We will interpret P𝑇 and Q𝑇 as the
“joint laws” of the processes.
The precise way to interpret Girsanov's theorem can be confusing, so we will carefully outline the logic. Start with the Langevin diffusion $(Z_t)_{t \in [0,T]}$ defined over a probability space, so that its law is $\mathbf{Q}_T$. Girsanov's theorem states that it is possible to define a new measure $\tilde{\mathbf{P}}_T$, defined explicitly through the Radon-Nikodym derivative $\frac{\mathrm{d}\tilde{\mathbf{P}}_T}{\mathrm{d}\mathbf{Q}_T}$, such that under $\tilde{\mathbf{P}}_T$ the process $(\tilde B_t)_{t \in [0,T]}$ given by $\mathrm{d}\tilde B_t := \mathrm{d}B_t + \frac{1}{\sqrt 2}\, \{\nabla V(Z_{kh}) - \nabla V(Z_t)\}\, \mathrm{d}t$ is a standard Brownian motion (where $k$ is the largest integer with $kh \le t$). But that means that under $\tilde{\mathbf{P}}_T$,

\[ \mathrm{d}Z_t = -\nabla V(Z_t)\, \mathrm{d}t + \sqrt 2\, \mathrm{d}B_t = -\nabla V(Z_{kh})\, \mathrm{d}t + \sqrt 2\, \mathrm{d}\tilde B_t . \]

Thus, under $\tilde{\mathbf{P}}_T$, the process $(Z_t)_{t \in [0,T]}$ solves the SDE corresponding to the interpolation of LMC, and hence $\tilde{\mathbf{P}}_T = \mathbf{P}_T$. So, this theorem indeed gives us a formula for $\frac{\mathrm{d}\mathbf{P}_T}{\mathrm{d}\mathbf{Q}_T}$.
Here is a statement of (a version of) Girsanov’s theorem.

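Theorem 4.4.1 (Girsanov). Suppose that we have two SDEs
\begin{align*}
\mathrm{d}X_t &= b_t^X\, \mathrm{d}t + \sqrt 2\, \mathrm{d}B_t , \\
\mathrm{d}Y_t &= b_t^Y\, \mathrm{d}t + \sqrt 2\, \mathrm{d}B_t .
\end{align*}
Let $\mathbf{P}_T$ and $\mathbf{Q}_T$ be their respective laws on $\mathcal{C}([0,T]; \mathbb{R}^d)$. Then, provided that Novikov's condition holds,
\[ \mathbb{E}^{\mathbf{Q}_T} \exp\Bigl( \frac{1}{4} \int_0^T \|b_t^X - b_t^Y\|^2\, \mathrm{d}t \Bigr) < \infty , \]
we have the formula
\[ \frac{\mathrm{d}\mathbf{P}_T}{\mathrm{d}\mathbf{Q}_T} = \exp\Bigl( \frac{1}{\sqrt 2} \int_0^T \langle b_t^X - b_t^Y, \mathrm{d}B_t \rangle - \frac{1}{4} \int_0^T \|b_t^X - b_t^Y\|^2\, \mathrm{d}t \Bigr) , \]
where $(B_t)_{t \in [0,T]}$ is a standard Brownian motion under $\mathbf{Q}_T$.

[Equivalently, $(\tilde B_t)_{t \in [0,T]}$ defined by $\mathrm{d}\tilde B_t := \mathrm{d}B_t + \frac{1}{\sqrt 2}\, (b_t^Y - b_t^X)\, \mathrm{d}t$ is a standard Brownian motion under $\mathbf{P}_T$.]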


Discretization analysis. In the following theorem, we assume that $\pi$ is strongly log-concave for simplicity.

Theorem 4.4.2. Let $\pi \propto \exp(-V)$ be the target and assume $0 \preceq \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Let $(\mu_{kh})_{k \in \mathbb{N}}$ denote the law of LMC, and let $(\pi_t)_{t \ge 0}$ denote the law of the Langevin diffusion initialized at $\mu_0$. Then, for $h \in [0, \frac{1}{\beta}]$,

\[ \mathrm{KL}(\mu_{Nh} \,\|\, \pi_{Nh}) \lesssim \beta^4 h^3 N\, \Bigl( \int \|\cdot\|^2\, \mathrm{d}\mu_0 + \frac{d}{\alpha} \Bigr) + \beta^2 d h^2 N . \]

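Proof. By the data-processing inequality (Theorem 1.5.3), $\mathrm{KL}(\mu_{Nh} \,\|\, \pi_{Nh}) \le \mathrm{KL}(\mathbf{P}_{Nh} \,\|\, \mathbf{Q}_{Nh})$, so it suffices to bound the latter. By Girsanov's theorem,
\begin{align*}
\mathrm{KL}(\mathbf{P}_{Nh} \,\|\, \mathbf{Q}_{Nh}) &= \mathbb{E}^{\mathbf{P}_{Nh}} \ln \frac{\mathrm{d}\mathbf{P}_{Nh}}{\mathrm{d}\mathbf{Q}_{Nh}} \\
&= \mathbb{E}^{\mathbf{P}_{Nh}} \sum_{k=0}^{N-1} \Bigl( \frac{1}{\sqrt 2} \int_{kh}^{(k+1)h} \langle \nabla V(X_t) - \nabla V(X_{kh}), \mathrm{d}B_t \rangle - \frac{1}{4} \int_{kh}^{(k+1)h} \|\nabla V(X_t) - \nabla V(X_{kh})\|^2\, \mathrm{d}t \Bigr) .
\end{align*}

However, we must be cautious! Here, $(B_t)_{t \in [0,Nh]}$ is a $\mathbf{Q}_{Nh}$-standard Brownian motion. As a sanity check, if $(B_t)_{t \in [0,Nh]}$ were a $\mathbf{P}_{Nh}$-standard Brownian motion, then the first term would vanish (since stochastic integrals have zero mean) and the KL divergence would be negative, which is absurd. Instead, write $\mathrm{d}\tilde B_t = \mathrm{d}B_t + \frac{1}{\sqrt 2}\, \{\nabla V(X_{kh}) - \nabla V(X_t)\}\, \mathrm{d}t$ so that $(\tilde B_t)_{t \in [0,Nh]}$ is a $\mathbf{P}_{Nh}$-Brownian motion, and hence
\begin{align*}
\mathrm{KL}(\mathbf{P}_{Nh} \,\|\, \mathbf{Q}_{Nh}) &= \frac{1}{4} \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \mathbb{E}^{\mathbf{P}_{Nh}}[\|\nabla V(X_t) - \nabla V(X_{kh})\|^2]\, \mathrm{d}t \\
&\le \frac{\beta^2}{4} \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \mathbb{E}^{\mathbf{P}_{Nh}}[\|X_t - X_{kh}\|^2]\, \mathrm{d}t \\
&= \frac{\beta^2}{4} \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \{ (t - kh)^2\, \mathbb{E}^{\mathbf{P}_{Nh}}[\|\nabla V(X_{kh})\|^2] + 2d\, (t - kh) \}\, \mathrm{d}t \\
&\le \frac{\beta^4 h^3}{12} \sum_{k=0}^{N-1} \mathbb{E}^{\mathbf{P}_{Nh}}[\|X_{kh}\|^2] + \frac{\beta^2 d h^2 N}{4} .
\end{align*}

We can use the bound on the second moment of the LMC iterates (Lemma 4.1.4) to get

\[ \mathrm{KL}(\mathbf{P}_{Nh} \,\|\, \mathbf{Q}_{Nh}) \lesssim \beta^4 h^3 N\, \Bigl( \mathbb{E}^{\mathbf{P}_{Nh}}[\|X_0\|^2] + \frac{d}{\alpha} \Bigr) + \beta^2 d h^2 N . \]
□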
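For a quadratic potential the path-space KL divergence in this proof can be evaluated exactly, which makes the linear growth in $N$ visible. The example below is an illustration added under assumptions: it takes $V(x) = \beta \|x\|^2/2$ (so $\alpha = \beta$ and $\|\nabla V(x)\|^2 = \beta^2 \|x\|^2$, making every inequality in the last display an equality), and the function names are hypothetical. The second moments obey $\mathbb{E}[\|X_{(k+1)h}\|^2] = (1 - \beta h)^2\, \mathbb{E}[\|X_{kh}\|^2] + 2dh$.

```python
def second_moments(m0, beta, d, h, n_steps):
    """E||X_{kh}||^2 for k = 0, ..., N-1 along LMC with V(x) = beta ||x||^2 / 2."""
    moments = [m0]
    for _ in range(n_steps - 1):
        moments.append((1.0 - beta * h) ** 2 * moments[-1] + 2.0 * d * h)
    return moments


def path_kl(m0, beta, d, h, n_steps):
    """(beta^4 h^3 / 12) * sum_k E||X_{kh}||^2 + beta^2 d h^2 N / 4."""
    moments = second_moments(m0, beta, d, h, n_steps)
    return beta**4 * h**3 / 12.0 * sum(moments) + beta**2 * d * h**2 * n_steps / 4.0


if __name__ == "__main__":
    beta, d, h = 2.0, 3, 0.1
    m0 = d / beta  # second moment of normal(0, I_d / beta)
    for n in (50, 100, 200):
        print(n, path_kl(m0, beta, d, h, n))
```

Doubling $N$ roughly doubles the output, consistent with the remark that this bound on $\mathrm{KL}(\mathbf{P}_{Nh} \,\|\, \mathbf{Q}_{Nh})$ grows with the iteration number.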

The preceding argument is similar to the Wasserstein coupling proof (Theorem 4.1.1),
and indeed in both proofs we used strong log-concavity. However, in the Wasserstein cou-
pling proof, the strong log-concavity assumption is crucial because it implies contraction
in the Wasserstein metric, whereas the preceding discretization argument only requires a
bound on the second moment of the LMC iterates which can be obtained in other ways.
What sampling guarantee does Theorem 4.4.2 imply? Unfortunately, neither $\mathrm{KL}$ nor $\sqrt{\mathrm{KL}}$ satisfies the triangle inequality, which poses a difficulty for bounding the distance of $\mu_{Nh}$ from the target $\pi$. One way to skirt this difficulty is to simply use the fact that $\sqrt{\mathrm{KL}} \gtrsim \|\cdot\|_{\rm TV}$ (Pinsker's inequality, Exercise 2.10) and the fact that the total variation distance satisfies the triangle inequality.

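Corollary 4.4.3. Let $\pi \propto \exp(-V)$ be the target and assume $0 \preceq \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Let $(\mu_{kh})_{k \in \mathbb{N}}$ denote the law of LMC initialized at the distribution $\mu_0 = \mathrm{normal}(0, \beta^{-1} I_d)$. Then, for all $\varepsilon \in (0, 1)$ and for $h = \tilde\Theta(\frac{\alpha \varepsilon^2}{\beta^2 d})$, we obtain the guarantee $\|\mu_{Nh} - \pi\|_{\rm TV} \le \varepsilon$ provided that
\[ N = \tilde\Theta\Bigl( \frac{\kappa^2 d}{\varepsilon^2} \Bigr) \quad \text{iterations} . \]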

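Proof. Since strong log-concavity implies an LSI, which in turn implies exponential convergence of the Langevin diffusion to its target in KL divergence (Theorem 1.2.24 and Theorem 1.2.28), we obtain $\sqrt{\mathrm{KL}(\pi_{Nh} \,\|\, \pi)} \le \frac{\varepsilon}{\sqrt 2}$ provided $Nh \gtrsim \frac{1}{\alpha} \log \frac{\mathrm{KL}(\mu_0 \,\|\, \pi)}{\varepsilon^2}$. With our choice of initialization, $\mathrm{KL}(\mu_0 \,\|\, \pi) \lesssim d \log \kappa$ and $\int \|\cdot\|^2\, \mathrm{d}\mu_0 \lesssim d/\beta \le d/\alpha$.
We now take the number of iterations to satisfy $Nh \asymp \frac{1}{\alpha} \log \frac{\mathrm{KL}(\mu_0 \,\|\, \pi)}{\varepsilon^2}$. Then, with our choice of $h$, we obtain from Theorem 4.4.2 that $\sqrt{\mathrm{KL}(\mu_{Nh} \,\|\, \pi_{Nh})} \le \frac{\varepsilon}{\sqrt 2}$. By the triangle inequality and Pinsker's inequality,
\begin{align*}
\|\mu_{Nh} - \pi\|_{\rm TV} &\le \|\mu_{Nh} - \pi_{Nh}\|_{\rm TV} + \|\pi_{Nh} - \pi\|_{\rm TV} \\
&\le \sqrt{\frac{1}{2}\, \mathrm{KL}(\mu_{Nh} \,\|\, \pi_{Nh})} + \sqrt{\frac{1}{2}\, \mathrm{KL}(\pi_{Nh} \,\|\, \pi)} \le \varepsilon .
\end{align*}
Finally, plugging in the choice of $h$ into $Nh \asymp \frac{1}{\alpha} \log \frac{\mathrm{KL}(\mu_0 \,\|\, \pi)}{\varepsilon^2}$ yields the result. □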

Although the quantitative dependence in Corollary 4.4.3 matches prior results (e.g.,
via the interpolation method in Theorem 4.2.5), the final result is unsatisfying because
we have moved to a weaker metric (TV rather than KL) for a seemingly silly reason (the
failure of the triangle inequality for the KL divergence). Indeed, we have a convergence
result for the Langevin diffusion in KL, and our discretization bound is in KL, yet our final
result is in TV. Can we remedy this?
To address this, we can introduce the Rényi divergences (defined in (2.2.14)); recall
that KL = R1 . We have a continuous-time result for the Langevin diffusion in Rényi
divergence (Theorem 2.2.15), and it turns out that with some additional tricks it is possible
to extend the Girsanov discretization argument to any Rényi divergence. Moreover, the
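Rényi divergences satisfy a weak triangle inequality: for any $q > 1$, any $\lambda \in (0, 1)$, and any probability measures $\mu$, $\nu$, $\pi$:
\[ \mathcal{R}_q(\mu \,\|\, \pi) \le \frac{q - \lambda}{q - 1}\, \mathcal{R}_{q/\lambda}(\mu \,\|\, \nu) + \mathcal{R}_{(q-\lambda)/(1-\lambda)}(\nu \,\|\, \pi) . \tag{4.4.4} \]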
This allows us to combine a continuous-time Rényi result with a Rényi discretization
argument to yield a Rényi sampling guarantee. We provide the details for this approach
in Chapter 5.
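The weak triangle inequality (4.4.4) can be checked by hand on one-dimensional Gaussians with a common variance, for which the Rényi divergence has the closed form $\mathcal{R}_q(\mathrm{normal}(m_1, \sigma^2) \,\|\, \mathrm{normal}(m_2, \sigma^2)) = q\, (m_1 - m_2)^2 / (2\sigma^2)$. The sketch below is an illustration added here, not part of the text; the function names are hypothetical. It takes $\mu$, $\nu$, $\pi$ with means $0$, $a$, $a + b$ and returns the gap between the two sides.

```python
def renyi_gaussian_shift(q, shift, sigma=1.0):
    """R_q between two Gaussians with equal variance sigma^2 and means differing by shift."""
    return q * shift**2 / (2.0 * sigma**2)


def weak_triangle_gap(q, lam, a, b):
    """Right side minus left side of (4.4.4) for means 0 (mu), a (nu), a + b (pi)."""
    lhs = renyi_gaussian_shift(q, a + b)
    rhs = ((q - lam) / (q - 1.0) * renyi_gaussian_shift(q / lam, a)
           + renyi_gaussian_shift((q - lam) / (1.0 - lam), b))
    return rhs - lhs


if __name__ == "__main__":
    print(weak_triangle_gap(q=2.0, lam=0.5, a=1.0, b=1.0))
    print(weak_triangle_gap(q=2.0, lam=0.5, a=1.0, b=2.0))
```

For this family the gap is always nonnegative and can be zero (for instance at $q = 2$, $\lambda = 1/2$, $a = 1$, $b = 2$), so within this example the inequality cannot be tightened.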

Bibliographical Notes
Historically, the LMC algorithm, which is called unadjusted because of the lack of a
Metropolis–Hastings filter, was only studied relatively recently in non-asymptotic settings.
Before the work of [DT12], it was more common to study MALA (which we introduce and
study in Chapter 7). The ideas which go into the basic 𝑊2 coupling proof for Theorem 4.1.1
were developed in a series of works on strongly log-concave sampling: [DT12; Dal17a;
Dal17b; DM17; DM19]. The Girsanov argument of Theorem 4.4.2 is also due to [DT12].
There are two other notable proof techniques that we have omitted from this chapter:
reflection coupling [Ebe11; Ebe16] and mean squared analysis [Li+19; Li+22; LZT22].
Reflection coupling uses a carefully chosen coupling of the Brownian motions rather than
just taking the two Brownian motions to be the same as we have done (the latter coupling
is called the synchronous coupling). Mean squared analysis is a general framework
which combines local errors (one-step discretization bounds) into global error bounds.
These two methods are useful for performing discretization analysis under more general
sets of assumptions, but they are limited to providing guarantees in 𝑊1 or 𝑊2 .

Exercises
Proof via Wasserstein Coupling
Exercise 4.1 (explicit computations for a Gaussian target)


Suppose that the target distribution is a Gaussian, 𝜋 = normal(0, Σ), and that LMC is
initialized at a Gaussian. Can you write down the iterates and stationary distribution of
LMC explicitly? What happens when Σ = 𝐼𝑑 ?
Perform some explicit computations for this example and compare them to the general
results for LMC that we derived in this chapter.

Exercise 4.2 (second moment bounds for LMC)

Prove the second moment bound in Lemma 4.1.4. (Recall that the generator $\mathscr{L}$ of the Langevin diffusion satisfies $\mathbb{E}_\pi \mathscr{L} f = 0$. Apply this with $f = \|\cdot\|^2/2$. For the LMC iterates, write a recursion for $\mathbb{E}[\|X_{(k+1)h}\|^2]$ in terms of $\mathbb{E}[\|X_{kh}\|^2]$. In order to prove the lemma for all step sizes $h \le \frac{1}{\beta}$, you may need to appeal to coercivity of the gradient, Lemma 6.2.3.)

Exercise 4.3 (LMC with decaying step size)

Show that by considering LMC with a decaying step size $h_k \asymp \frac{1}{\alpha k}$, one can obtain an iteration complexity which removes the logarithmic factor in Theorem 4.1.1.

Proof via Convex Optimization

Exercise 4.4 (optimization analogues of sampling proofs)

Suppose that we wish to minimize a function $V : \mathbb{R}^d \to \mathbb{R}$ satisfying the conditions $0 \preceq \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. For each of the proofs in Section 4.3, translate it into a corresponding optimization proof.
CHAPTER 5

Convergence in Rényi Divergence

In this chapter, we study convergence guarantees for the LMC algorithm in Rényi di-
vergences. Recall that the Rényi divergences are a family of information divergences,
indexed by a parameter 𝑞, such that the Rényi divergence of order 𝑞 = 1 is the KL diver-
gence, and the Rényi divergence of order 2 is related to the chi-squared divergence via
R2 = ln(1 + 𝜒 2 ). We studied the continuous-time convergence of the Langevin diffusion
in Rényi divergence under either a Poincaré inequality or a log-Sobolev inequality in
Section 2.2.4. Here, we will build upon these results in order to study the discretized
algorithm. Rényi divergence guarantees are stronger than the 𝑊2 or KL guarantees from
Chapter 4, and they have recently found use in areas such as differential privacy [Mir17].

5.1 Proof under LSI via Interpolation Argument


In this section, we follow [Che+21a], which generalizes the argument of Theorem 4.2.5 and
provides a clean Rényi convergence proof for LMC under the assumption of a log-Sobolev
inequality (LSI).
As in Section 4.2, we begin by writing a differential inequality for the Rényi diver-
gence along the interpolation (4.2.1) of LMC. The proof is a combination of the proofs
of Theorem 2.2.15 and Corollary 4.2.3, so it is left as Exercise 5.1. Throughout this section,
let 𝑞 ≥ 2 be fixed.


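Proposition 5.1.1. Along the law $(\mu_t)_{t \ge 0}$ of the interpolated process (4.2.1),

\[ \partial_t \mathcal{R}_q(\mu_t \,\|\, \pi) \le -\frac{3}{q}\, \frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)} + q\, \mathbb{E}[\psi_t(X_t)\, \|\nabla V(X_t) - \nabla V(X_{kh})\|^2] , \]
where $\rho_t := \frac{\mathrm{d}\mu_t}{\mathrm{d}\pi}$ and $\psi_t := \rho_t^{q-1} / \mathbb{E}_\pi(\rho_t^q)$.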

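In analogy with the usual Fisher information, the quantity

\[ \mathrm{FI}_q(\mu \,\|\, \pi) := \frac{4}{q}\, \frac{\mathbb{E}_\pi[\|\nabla(\rho^{q/2})\|^2]}{\mathbb{E}_\pi(\rho^q)} , \qquad \rho := \frac{\mathrm{d}\mu}{\mathrm{d}\pi} , \]

may be considered the "Rényi Fisher information". As in the proof of Theorem 2.2.15, under a log-Sobolev inequality, we have
\[ \mathrm{FI}_q(\mu \,\|\, \pi) \ge \frac{2}{q\, C_{\rm LSI}}\, \mathcal{R}_q(\mu \,\|\, \pi) , \]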

so the first term in Proposition 5.1.1 provides a decay in the Rényi divergence. In the
discretization analysis, our task is to control the second term.
Note that
\[ \mathbb{E}\, \psi_t(X_t) = \mathbb{E}_{\mu_t} \psi_t = \mathbb{E}_\pi\Bigl[ \rho_t\, \frac{\rho_t^{q-1}}{\mathbb{E}_\pi(\rho_t^q)} \Bigr] = 1 , \]
so $\psi_t(X_t)$ acts as a change of measure. The main difficulty of the proof is that whereas we know how to control the term $\|\nabla V(X_t) - \nabla V(X_{kh})\|^2$ under the original probability measure $\mathbb{P}$ (indeed, this is precisely what we accomplished in Theorem 4.2.5), it is not straightforward to control this term under the measure $\hat{\mathbb{P}}$ defined by $\frac{\mathrm{d}\hat{\mathbb{P}}}{\mathrm{d}\mathbb{P}} = \psi_t(X_t)$. Towards this end, we shall employ change of measure inequalities that allow us to relate expectations under $\hat{\mathbb{P}}$ to expectations under $\mathbb{P}$. Note that $\psi_t = 1$ when $q = 1$, which is why these difficulties can be avoided when working with the KL divergence.
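On a finite state space the change of measure is easy to compute directly. The following sketch is an illustration under assumptions (a three-point space, NumPy, and the hypothetical name `renyi_tilt`); it forms $\rho = \mu/\pi$ and the tilt $\psi = \rho^{q-1}/\mathbb{E}_\pi(\rho^q)$, checks that $\mathbb{E}_\mu \psi = 1$, and evaluates $\mathcal{R}_q(\mu \,\|\, \pi) = \frac{1}{q-1} \ln \mathbb{E}_\pi(\rho^q)$ for $q > 1$.

```python
import numpy as np


def renyi_tilt(mu, pi, q):
    """Return (psi, R_q(mu || pi)) for probability vectors mu, pi and order q > 1."""
    rho = mu / pi                       # density ratio d mu / d pi
    moment = np.sum(pi * rho**q)        # E_pi[rho^q]
    psi = rho ** (q - 1.0) / moment     # tilt, normalized so that E_mu[psi] = 1
    renyi = np.log(moment) / (q - 1.0)
    return psi, renyi


if __name__ == "__main__":
    mu = np.array([0.5, 0.3, 0.2])
    pi = np.array([0.2, 0.3, 0.5])
    psi, renyi = renyi_tilt(mu, pi, q=3.0)
    print(np.sum(mu * psi))   # normalization of the tilted measure psi * mu
    print(renyi)
```

The tilted measure $\psi \mu$ puts more weight on states where $\rho$ is large, which is why quantities that are easy to control under $\mu$ need extra care under the tilt.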
The main theorem that we wish to prove is as follows.

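Theorem 5.1.2 ([Che+21a]). For $k \in \mathbb{N}$, let $\mu_{kh}$ denote the law of the $k$-th iterate of LMC with step size $h > 0$. Assume that the target $\pi \propto \exp(-V)$ satisfies LSI and that $\nabla V$ is $\beta$-Lipschitz. Also, for simplicity, assume that $C_{\rm LSI}, \beta \ge 1$. Then, for all $h \le \frac{1}{192\, C_{\rm LSI}\, \beta^2 q^2}$, for all $N \ge N_0$, it holds that

\[ \mathcal{R}_q(\mu_{Nh} \,\|\, \pi) \le \exp\Bigl( -\frac{(N - N_0)\, h}{4 C_{\rm LSI}} \Bigr)\, \mathcal{R}_2(\mu_0 \,\|\, \pi) + \tilde O(C_{\rm LSI}\, \beta^2 d h q) , \]

where $N_0 := \lceil \frac{2 C_{\rm LSI}}{h} \ln(q - 1) \rceil$. In particular, for all $\varepsilon \in [0, \sqrt{d/q}]$, if we choose the step size $h = \tilde\Theta(\frac{\varepsilon^2}{\beta^2 d q\, C_{\rm LSI}})$, then we obtain the guarantee $\sqrt{\mathcal{R}_q(\mu_{Nh} \,\|\, \pi)} \le \varepsilon$ after

\[ N = \tilde O\Bigl( \frac{C_{\rm LSI}^2\, \beta^2 d q}{\varepsilon^2} \log \mathcal{R}_2(\mu_0 \,\|\, \pi) \Bigr) \quad \text{iterations} . \]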

For clarity of exposition, we begin with a discretization analysis that incurs a worse
dependence on 𝑞. Afterwards, we show how to improve the dependence on 𝑞 via a
hypercontractivity argument.
Proof of Theorem 5.1.2 with suboptimal dependence on 𝑞. As in the proof of Theorem 4.2.5,
our aim is to control the error term E[𝜓𝑡 (𝑋𝑡 ) k∇𝑉 (𝑋𝑡 ) − ∇𝑉 (𝑋𝑘ℎ )k 2 ], where from the
𝛽-smoothness of 𝑉 and from ℎ ≤ 3𝛽1 we have

k∇𝑉 (𝑋𝑡 ) − ∇𝑉 (𝑋𝑘ℎ )k 2 ≤ 9𝛽 2 (𝑡 − 𝑘ℎ) 2 k∇𝑉 (𝑋𝑡 )k 2 + 6𝛽 2 (𝑡 − 𝑘ℎ) k𝐵𝑡 − 𝐵𝑘ℎ k 2 .


There are two terms to control. For the first term, applying the duality lemma for the
Fisher information (Lemma 4.2.4) to the measure 𝜓𝑡 𝜇𝑡 ,
d𝜇𝑡  2 
E𝜓𝑡 𝜇𝑡 [k∇𝑉 k 2 ] ≤ FI(𝜓𝑡 𝜇𝑡 k 𝜋) + 2𝛽𝑑 = E𝜇𝑡 𝜓𝑡 ∇ ln 𝜓𝑡 + 2𝛽𝑑

d𝜋
E𝜋 [𝜌𝑡 k∇ ln(𝜌𝑡 )k 2 ] 4 E𝜋 [k∇(𝜌𝑡 )k 2 ]
𝑞 𝑞 𝑞/2
= 𝑞 + 2𝛽𝑑 = 𝑞 + 2𝛽𝑑 ,
E𝜋 (𝜌𝑡 ) E𝜋 (𝜌𝑡 )
where we used the identity

d𝜇𝑡  2  4 E𝜋 [k∇(𝜌𝑡 )k 2 ]
𝑞/2
E𝜇𝑡 𝜓𝑡 ∇ ln 𝜓𝑡 (5.1.3)

=
d𝜋 𝑞
E𝜋 (𝜌𝑡 )
which follows from the chain rule from calculus.
For the second error term, we must control the term k𝐵𝑡 − 𝐵𝑘ℎ k 2 under the measure
dP
P, where dP = 𝜓𝑡 (𝑋𝑡 ). The difficulty is that under P, 𝐵 is no longer a standard Brownian
motion, so it is difficult to control this term directly. Instead, we apply the Donsker–
Varadhan variational principle (Theorem 1.5.4) to relate the expectation under P (denoted
E) with the expectation under P. For any random variable 𝜁 , it yields
E𝜁 ≤ KL(P k P) + ln E exp 𝜁 .
166 CHAPTER 5. CONVERGENCE IN RÉNYI DIVERGENCE

Applying this to 𝜁 B 𝑐 (k𝐵𝑡 − 𝐵𝑘ℎ k − Ek𝐵𝑡 − 𝐵𝑘ℎ k) 2 for a constant 𝑐 > 0 to be chosen
later, we obtain
2
E[k𝐵𝑡 − 𝐵𝑘ℎ k 2 ] ≤ 2 E[k𝐵𝑡 − 𝐵𝑘ℎ k 2 ] + E𝜁
𝑐
2
≤ 2𝑑 (𝑡 − 𝑘ℎ) + KL(P k P) + ln E exp 𝑐 (k𝐵𝑡 − 𝐵𝑘ℎ k − Ek𝐵𝑡 − 𝐵𝑘ℎ k) 2 .

𝑐
Under P, 𝐵𝑡 − 𝐵𝑘ℎ ∼ normal(0, (𝑡 − 𝑘ℎ) 𝐼𝑑 ). Applying concentration of measure for the
1
Gaussian distribution (see, e.g., Theorem 2.4.8), if 𝑐 . 𝑡−𝑘ℎ , then

E exp 𝑐 (k𝐵𝑡 − 𝐵𝑘ℎ k − Ek𝐵𝑡 − 𝐵𝑘ℎ k) 2 ≤ 2 .




1
In fact, it suffices to take 𝑐 = 8 (𝑡−𝑘ℎ) . Next, using the LSI for 𝜋,

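\[
\begin{aligned}
\mathrm{KL}(\tilde{\mathbf{P}} \,\|\, \mathbf{P})
&= \mathbb{E}_{\psi_t\mu_t} \ln \psi_t
= \mathbb{E}_{\psi_t\mu_t} \ln \frac{\rho_t^{q-1}}{\mathbb{E}_{\mu_t}(\rho_t^{q-1})}
= \frac{q-1}{q}\,\mathbb{E}_{\psi_t\mu_t} \ln \frac{\rho_t^{q}}{\mathbb{E}_{\mu_t}(\rho_t^{q-1})^{q/(q-1)}} \\
&= \frac{q-1}{q}\,\Bigl\{\mathbb{E}_{\psi_t\mu_t} \ln \frac{\rho_t^{q}}{\mathbb{E}_{\mu_t}(\rho_t^{q-1})}
- \underbrace{\frac{1}{q-1}\,\ln \mathbb{E}_{\mu_t}(\rho_t^{q-1})}_{\ge 0}\Bigr\} \\
&\le \frac{q-1}{q}\,\mathrm{KL}(\psi_t\mu_t \,\|\, \pi) \\
&\le \frac{(q-1)\,C_{\mathrm{LSI}}}{2q}\,\mathbb{E}_{\psi_t\mu_t}\Bigl[\Bigl\|\nabla \ln\Bigl(\psi_t\,\frac{\mathrm{d}\mu_t}{\mathrm{d}\pi}\Bigr)\Bigr\|^2\Bigr]
= \frac{2\,(q-1)\,C_{\mathrm{LSI}}}{q}\,\frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)}\,,
\end{aligned}
\]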

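where we applied the identity (5.1.3). Hence,
\[
\begin{aligned}
&\mathbb{E}[\psi_t(X_t)\,\|B_t - B_{kh}\|^2] \\
&\qquad \le 2d\,(t-kh) + \frac{32\,C_{\mathrm{LSI}}\,h\,(q-1)}{q}\,\frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)} + (16 \ln 2)\,(t-kh) \\
&\qquad \le 14d\,(t-kh) + 32\,C_{\mathrm{LSI}}\,h\,\frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)} .
\end{aligned}
\]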
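The Gaussian concentration step used above, namely that the exponential moment is at most 2 when $c = \frac{1}{8\,(t-kh)}$, can be sanity-checked by simulation. The following is a minimal Python sketch under stated assumptions: the function name `centered_norm_mgf`, the sample size, and the use of an empirical mean in place of $\mathbb{E}\|B_t - B_{kh}\|$ are illustrative choices, not part of the text, and a Monte Carlo estimate is of course not a proof.

```python
import math
import random


def centered_norm_mgf(d, s, n_samples=20000, seed=0):
    """Monte Carlo estimate of E exp(c (|G| - E|G|)^2) with c = 1 / (8 s),
    where G ~ normal(0, s I_d) plays the role of B_t - B_kh and s = t - kh."""
    rng = random.Random(seed)
    sd = math.sqrt(s)
    norms = [
        math.sqrt(sum(rng.gauss(0.0, sd) ** 2 for _ in range(d)))
        for _ in range(n_samples)
    ]
    mean_norm = sum(norms) / n_samples  # empirical stand-in for E|G|
    c = 1.0 / (8.0 * s)
    return sum(math.exp(c * (r - mean_norm) ** 2) for r in norms) / n_samples
```

Calling `centered_norm_mgf(d, s)` for a few dimensions `d` and increments `s` gives estimates that can be compared against the bound 2 used in the proof.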

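All in all, applying Proposition 5.1.1 and collecting the error terms,
\[
\begin{aligned}
\partial_t \mathcal{R}_q(\mu_t \,\|\, \pi)
&\le -\frac{3}{q}\,\frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)}
+ 9\beta^2 q\,(t-kh)^2\,\Bigl\{\frac{4\,\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)} + 2\beta d\Bigr\} \\
&\qquad + 6\beta^2 q\,\Bigl\{14d\,(t-kh) + 32\,C_{\mathrm{LSI}}\,h\,\frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)}\Bigr\} .
\end{aligned}
\]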
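From $C_{\mathrm{LSI}}, \beta \ge 1$ and $h \le \frac{1}{192\,C_{\mathrm{LSI}}\,\beta^2 q^2}$, we can absorb some of the error terms into the decay term and apply the LSI for $\pi$ (see Theorem 2.2.15), yielding
\[
\begin{aligned}
\partial_t \mathcal{R}_q(\mu_t \,\|\, \pi)
&\le -\frac{1}{q}\,\frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q/2})\|^2]}{\mathbb{E}_\pi(\rho_t^q)} + O(\beta^3 d h^2 q + \beta^2 d h q) \\
&\le -\frac{1}{2q\,C_{\mathrm{LSI}}}\,\mathcal{R}_q(\mu_t \,\|\, \pi) + O(\beta^2 d h q) .
\end{aligned}
\]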

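This implies the differential inequality
\[
\partial_t \Bigl\{\exp\Bigl(\frac{t-kh}{2q\,C_{\mathrm{LSI}}}\Bigr)\,\mathcal{R}_q(\mu_t \,\|\, \pi)\Bigr\}
\lesssim \exp\Bigl(\frac{t-kh}{2q\,C_{\mathrm{LSI}}}\Bigr)\,\beta^2 d h q
\lesssim \beta^2 d h q .
\]
Integrating this over $t \in [kh, (k+1)h]$ yields
\[
\mathcal{R}_q(\mu_{(k+1)h} \,\|\, \pi) \le \exp\Bigl(-\frac{h}{2q\,C_{\mathrm{LSI}}}\Bigr)\,\mathcal{R}_q(\mu_{kh} \,\|\, \pi) + O(\beta^2 d h^2 q) .
\]
Unrolling the recursion,
\[
\mathcal{R}_q(\mu_{Nh} \,\|\, \pi) \le \exp\Bigl(-\frac{Nh}{2q\,C_{\mathrm{LSI}}}\Bigr)\,\mathcal{R}_q(\mu_0 \,\|\, \pi) + O(C_{\mathrm{LSI}}\,\beta^2 d h q^2) . \qquad \square
\]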
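The unrolling step is a geometric sum: iterating $r_{k+1} \le e^{-a}\,r_k + b$ gives $r_N \le e^{-aN}\,r_0 + b/(1 - e^{-a})$, and $1 - e^{-a} \ge a/2$ for $a \in [0,1]$ turns the last term into $O(b/a)$. The Python sketch below illustrates this with $a = h/(2q\,C_{\mathrm{LSI}})$ and $b = \beta^2 d h^2 q$; the function names and the parameter values are hypothetical, and the implied constant in the per-step error is set to 1 purely for illustration.

```python
import math


def iterate_recursion(r0, rate, err, n_steps):
    """Run r_{k+1} = exp(-rate) * r_k + err, the worst case of the one-step bound."""
    r = r0
    for _ in range(n_steps):
        r = math.exp(-rate) * r + err
    return r


def unrolled_bound(r0, rate, err, n_steps):
    """Closed-form bound exp(-rate * N) * r0 + err / (1 - exp(-rate))."""
    return math.exp(-rate * n_steps) * r0 + err / (1.0 - math.exp(-rate))


# Illustrative (hypothetical) parameter values.
q, c_lsi, beta, d, h = 2.0, 1.0, 1.0, 10, 1e-3
rate = h / (2 * q * c_lsi)        # per-step decay exponent h / (2 q C_LSI)
err = beta ** 2 * d * h ** 2 * q  # per-step error, implied constant set to 1
```

With these values, `err / rate` equals `2 * c_lsi * beta**2 * d * h * q**2`, which matches the order of the final error term in the unrolled bound.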

We pause to reflect upon the proof. As discussed above, the key steps are to use change of measure inequalities in order to relate expectations under the tilted measure $\tilde{\mathbf{P}}$ (with density $\psi_t(X_t)$ with respect to $\mathbf{P}$) to expectations under $\mathbf{P}$. This is accomplished via the Fisher information duality lemma (Lemma 4.2.4) and the Donsker–Varadhan variational principle (Theorem 1.5.4). These inequalities yield an additional error term of the form $\mathrm{FI}(\psi_t \mu_t \,\|\, \pi)$ (for the latter, this error term appears after an application of the LSI). The magical part of the calculation is that $\mathrm{FI}(\psi_t \mu_t \,\|\, \pi)$ is precisely equal to the Rényi Fisher information (up to constants), and when the step size $h$ is sufficiently small it can be absorbed into the decay term of the differential inequality in Proposition 5.1.1.
The proof above implies an iteration complexity whose dependence on $q$ scales as $N = O(q^3)$. In order to improve the dependence on $q$, we modify the differential inequality of Proposition 5.1.1 by making the parameter $q$ time-dependent, similarly to the hypercontractivity principle (Exercise 2.6). The proof is left as Exercise 5.1.

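Proposition 5.1.4 (hypercontractivity). Suppose that $\pi$ satisfies a log-Sobolev inequality. Along the law $(\mu_t)_{t \ge 0}$ of the interpolated process (4.2.1), if we define the parameter $q(t) := 1 + (q_0 - 1)\,\exp\frac{t}{2\,C_{\mathrm{LSI}}}$, then
\[
\begin{aligned}
\partial_t \Bigl(\frac{1}{q(t)}\,\ln \int \rho_t^{q(t)}\,\mathrm{d}\pi\Bigr)
&\le -\frac{2\,(q(t)-1)}{q(t)^2}\,\frac{\mathbb{E}_\pi[\|\nabla(\rho_t^{q(t)/2})\|^2]}{\mathbb{E}_\pi(\rho_t^{q(t)})} \\
&\qquad + \bigl(q(t) - 1\bigr)\,\mathbb{E}[\psi_t(X_t)\,\|\nabla V(X_t) - \nabla V(X_{kh})\|^2]\,,
\end{aligned}
\]
where $\rho_t := \frac{\mathrm{d}\mu_t}{\mathrm{d}\pi}$ and $\psi_t := \rho_t^{q(t)-1} / \mathbb{E}_\pi(\rho_t^{q(t)})$.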

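Proof of Theorem 5.1.2 with improved dependence on $q$. Let $\bar q \ge 3$.

Initial waiting phase. We apply hypercontractivity (Proposition 5.1.4) with $q_0 = 2$ and for $t \le N_0 h$, where $N_0 := \lceil \frac{2\,C_{\mathrm{LSI}}}{h}\,\ln(\bar q - 1) \rceil$. Note that $\bar q \le q(N_0 h) \le 2\bar q$. The bound on the error term from the previous proof yields
\[
\partial_t \Bigl(\frac{1}{q(t)}\,\ln \int \rho_t^{q(t)}\,\mathrm{d}\pi\Bigr) \lesssim \beta^2 d h\,q(t) .
\]
Integrating this over $t \in [kh, (k+1)h]$ yields
\[
\frac{1}{q((k+1)h)}\,\ln \int \rho_{(k+1)h}^{q((k+1)h)}\,\mathrm{d}\pi - \frac{1}{q(kh)}\,\ln \int \rho_{kh}^{q(kh)}\,\mathrm{d}\pi \lesssim \beta^2 d h^2\,\bar q .
\]
Unrolling the recursion yields
\[
\frac{1}{q(N_0 h)}\,\ln \int \rho_{N_0 h}^{q(N_0 h)}\,\mathrm{d}\pi - \frac{1}{2}\,\ln \int \rho_0^2\,\mathrm{d}\pi \lesssim \beta^2 d h^2\,\bar q\,N_0 \le \tilde O(C_{\mathrm{LSI}}\,\beta^2 d h\,\bar q) .
\]

Finishing the convergence analysis. Next, after shifting time indices and applying the previous proof of Theorem 5.1.2 with $q = 2$,
\[
\begin{aligned}
\mathcal{R}_{\bar q}(\mu_{(N+N_0)h} \,\|\, \pi)
&\le \frac{1}{q(N_0 h) - 1}\,\ln \int \rho_{(N+N_0)h}^{q(N_0 h)}\,\mathrm{d}\pi
\le \frac{3}{4}\,\ln \int \rho_{Nh}^2\,\mathrm{d}\pi + \tilde O(C_{\mathrm{LSI}}\,\beta^2 d h\,\bar q) \\
&\le \frac{3}{4}\,\exp\Bigl(-\frac{Nh}{4\,C_{\mathrm{LSI}}}\Bigr)\,\mathcal{R}_2(\mu_0 \,\|\, \pi) + \tilde O(C_{\mathrm{LSI}}\,\beta^2 d h\,\bar q) .
\end{aligned}
\]
This proves the desired result. $\square$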


The proof of Theorem 5.1.2 is rather specific to the LSI case because we use the LSI to
bound the KL term KL(𝜓𝑡 𝜇𝑡 k 𝜋) via the Rényi Fisher information, which is then absorbed
into the differential inequality of Proposition 5.1.1. However, it turns out that rather than
assuming an LSI for 𝜋, it suffices to have an LSI for 𝜇𝑡 for all 𝑡 ≥ 0 (possibly with an LSI
constant that grows with 𝑡). One situation in which this holds is when we initialize LMC
with a measure 𝜇0 that satisfies an LSI, and the potential 𝑉 is convex. Note that this
situation is not included in the case when 𝜋 satisfies an LSI, because 𝑉 may only have
linear growth at infinity (whereas from Theorem 2.4.8, if 𝜋 ∝ exp(−𝑉 ) satisfies an LSI,
then 𝑉 necessarily has quadratic growth at infinity). We explore this in Exercise 5.2.

Bibliographical Notes
Discretization of LMC in Rényi divergence was first considered in [VW19], which proved
convergence of LMC to its biased stationary distribution. However, this does not lead to a
sampling guarantee unless the size of the “Rényi bias” (the Rényi divergence between the
biased stationary distribution and the true target distribution) can be estimated.
Motivated by applications to differential privacy, [GT20] provided the first Rényi
sampling guarantees for LMC under strong log-concavity by using a technique based
on the adaptive composition lemma for Rényi divergences. Then, [EHZ22] improved
the analysis of [GT20] via a two-phase analysis that weakens the assumption of strong
log-concavity to a dissipativity assumption and obtains a sharper bound, but which
still relies on the adaptive composition lemma. Subsequently, building off the earlier
work of [Che+21b] (which essentially does a one-step discretization argument in Rényi
divergence), in [Che+21a] it was realized that the earlier arguments of [GT20; EHZ22]
can be streamlined by replacing the adaptive composition lemma entirely with Girsanov’s
theorem. The proofs of Theorem 5.1.2 and Exercise 5.2 are also from [Che+21a].

Exercises
Proof under LSI via Interpolation Argument

Exercise 5.1 (Rényi differential inequality)

Prove Proposition 5.1.1 and Proposition 5.1.4.

Exercise 5.2 (Rényi discretization bound for log-concave targets)

Suppose that $\pi \propto \exp(-V)$ is log-concave, that $\nabla V(0) = 0$, and that $\nabla V$ is $\beta$-Lipschitz. Also, suppose that we initialize LMC at $\mu_0 = \mathrm{normal}(0, \beta^{-1} I_d)$. The goal of this exercise is to prove a Rényi discretization bound for LMC under these assumptions.

1. First, show that 𝜇𝑡 satisfies an LSI for all 𝑡 ≥ 0, and write down a bound for 𝐶 LSI (𝜇𝑡 )
(the bound should grow linearly with 𝑡).
Hint: See Section 2.3.

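2. Follow the proof of Theorem 5.1.2 (the first proof which incurs a suboptimal dependence on $q$). Note that in Theorem 5.1.2, we bounded $\mathrm{KL}(\tilde{\mathbf{P}} \,\|\, \mathbf{P}) \le \frac{q-1}{q}\,\mathrm{KL}(\psi_t \mu_t \,\|\, \pi)$, where $\tilde{\mathbf{P}}$ denotes the tilted measure with density $\psi_t(X_t)$ with respect to $\mathbf{P}$, and we applied the LSI for $\pi$. This time, use $\mathrm{KL}(\tilde{\mathbf{P}} \,\|\, \mathbf{P}) = \mathrm{KL}(\psi_t \mu_t \,\|\, \mu_t)$ and apply the LSI for $\mu_t$ instead.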
Also, instead of using the decay of the Rényi divergence under a LSI, use the decay of
the Rényi divergence under a PI (Theorem 2.2.15). (Since 𝜋 is log-concave, it neces-
sarily satisfies a Poincaré inequality with some constant 𝐶 PI , see the Bibliographical
Notes to Chapter 2.)
Prove that if 𝜀 ≤ 1/𝑞 ∧ 𝐶 PI𝑑/𝛽, then with an appropriate choice of step size ℎ
√︁ √︁

and with 𝑁 = Θ(𝐶 e 2 𝛽 2𝑑 2𝑞 3 /𝜀 2 ) iterations of LMC, we obtain R𝑞 (𝜇 𝑁ℎ k 𝜋) ≤ 𝜀.


√︁
PI
(Unlike the guarantee of Theorem 5.1.2, here the guarantee does not allow 𝑁 to be
too large, due to the growing LSI constant of the iterates.)
CHAPTER 6

Faster Low-Accuracy Samplers

We now move beyond the basic LMC algorithm and consider samplers with better depen-
dence on the dimension and inverse accuracy. There are two main sources of improvement
that we explore in this chapter. The first is to use a more sophisticated discretization
method than the basic Euler–Maruyama discretization. The second is to consider a differ-
ent stochastic process, called the underdamped Langevin diffusion. By combining these
two ideas, we arrive at the state-of-the-art complexity bounds for low-accuracy samplers.

6.1 Randomized Midpoint Discretization


In this section, we study the randomized midpoint discretization, which was introduced in [SL19]. The application to the Langevin diffusion was carried out in [HBE20].
Consider the continuous-time Langevin diffusion from time $kh$ to $(k+1)h$:
\[
Z_{(k+1)h} = Z_{kh} - \int_{kh}^{(k+1)h} \nabla V(Z_t)\,\mathrm{d}t + \sqrt{2}\,(B_{(k+1)h} - B_{kh}) .
\]
In the Euler discretization, we approximate the second term via $-h\,\nabla V(Z_{kh})$. However, if we want an unbiased estimator of the integral, then we can introduce an auxiliary random variable $u_k \sim \mathrm{uniform}[0,1]$ and use
\[
Z_{(k+1)h} \approx Z_{kh} - h\,\nabla V(Z_{(k+u_k)h}) + \sqrt{2}\,(B_{(k+1)h} - B_{kh}) .
\]
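The unbiasedness claim is the elementary identity $\mathbb{E}_u[h\,f(uh)] = \int_0^h f(t)\,\mathrm{d}t$ for $u \sim \mathrm{uniform}[0,1]$. The short Python sketch below illustrates it on a deterministic integrand standing in for $t \mapsto \nabla V(Z_t)$; the function names and the test integrand are illustrative assumptions, not taken from the text.

```python
import math
import random


def left_endpoint_estimate(f, h):
    """Euler-type approximation of the integral of f over [0, h]."""
    return h * f(0.0)


def randomized_midpoint_estimate(f, h, rng):
    """h * f(u h) with u ~ uniform[0, 1]; averaging over u recovers the integral."""
    return h * f(rng.random() * h)
```

For example, with `f = lambda t: math.exp(-t)` and `h = 0.5`, averaging many calls of `randomized_midpoint_estimate` approaches the exact value `1 - math.exp(-0.5)`, whereas the left-endpoint value `0.5` carries a fixed bias.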

To compute this approximation, however, we need to know $Z_{(k+u_k)h}$. We have
\[
Z_{(k+u_k)h} = Z_{kh} - \int_{kh}^{(k+u_k)h} \nabla V(Z_t)\,\mathrm{d}t + \sqrt{2}\,(B_{(k+u_k)h} - B_{kh}) .
\]
Note that we are in the same situation as before. In particular, if we desire, we can draw another uniform random variable $u_k'$ and approximate the second term above via $-u_k h\,\nabla V(Z_{(k+u_k u_k')h})$. In principle, this procedure can be repeated indefinitely. However, we will see that just one step of this procedure suffices: further applications of this procedure do not improve the discretization error. Instead, we will simply approximate $Z_{(k+u_k)h}$ via an Euler–Maruyama step:
\[
Z_{(k+u_k)h} \approx Z_{kh} - u_k h\,\nabla V(Z_{kh}) + \sqrt{2}\,(B_{(k+u_k)h} - B_{kh}) .
\]
To summarize, the randomized midpoint discretization of the Langevin diffusion, which we will call RM-LMC, is the following update:
\[
\begin{aligned}
X_{(k+1)h} &:= X_{kh} - h\,\nabla V(X_{(k+u_k)h}) + \sqrt{2}\,(B_{(k+1)h} - B_{kh})\,, \\
X_{(k+u_k)h} &:= X_{kh} - u_k h\,\nabla V(X_{kh}) + \sqrt{2}\,(B_{(k+u_k)h} - B_{kh})\,,
\end{aligned}
\tag{RM-LMC}
\]
where $(u_k)_{k \in \mathbb{N}}$ is a sequence of i.i.d. $\mathrm{uniform}[0,1]$ random variables which are independent of $X_0$ and the Brownian motion. The algorithm uses two gradient evaluations per iteration.
Also, when implementing this recursion, it is important to note that the two Brownian increments are coupled. To sample the Brownian increments, draw two i.i.d. standard Gaussians $\xi_k$ and $\xi_k'$, and set
\[
\begin{aligned}
B_{(k+u_k)h} - B_{kh} &:= \sqrt{u_k h}\,\xi_k\,, \\
B_{(k+1)h} - B_{kh} &:= \sqrt{u_k h}\,\xi_k + \sqrt{(1-u_k)\,h}\,\xi_k' .
\end{aligned}
\]
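The update and the coupling of the two Brownian increments can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a tuned implementation: the function name `rm_lmc_step`, the list-of-floats state representation, and the gradient oracle signature `grad_v(x) -> list` are assumptions made for the example.

```python
import math
import random


def rm_lmc_step(x, grad_v, h, rng):
    """One RM-LMC update for a state x given as a list of floats."""
    d = len(x)
    u = rng.random()                                     # u_k ~ uniform[0, 1]
    xi = [rng.gauss(0.0, 1.0) for _ in range(d)]         # xi_k
    xi_prime = [rng.gauss(0.0, 1.0) for _ in range(d)]   # xi_k'
    # Coupled Brownian increments over [kh, (k+u_k)h] and [kh, (k+1)h].
    b_mid = [math.sqrt(u * h) * z for z in xi]
    b_full = [bm + math.sqrt((1.0 - u) * h) * z for bm, z in zip(b_mid, xi_prime)]
    # Euler-Maruyama step to the random intermediate time (first gradient call).
    g0 = grad_v(x)
    x_mid = [xj - u * h * gj + math.sqrt(2.0) * bj for xj, gj, bj in zip(x, g0, b_mid)]
    # Full step using the gradient at the intermediate point (second gradient call).
    g_mid = grad_v(x_mid)
    return [xj - h * gj + math.sqrt(2.0) * bj for xj, gj, bj in zip(x, g_mid, b_full)]
```

For instance, with the standard Gaussian target $V(x) = \frac{1}{2}\|x\|^2$ one can pass `grad_v = lambda x: list(x)` and iterate `x = rm_lmc_step(x, grad_v, 0.1, rng)` starting from `x = [0.0]`. Reusing `b_mid` inside `b_full` is what makes the two increments come from the same Brownian path.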
We now analyze the complexity of this algorithm following the Wasserstein coupling
proof of Section 4.1.

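Theorem 6.1.1. For $k \in \mathbb{N}$, let $\mu_{kh}$ denote the law of the $k$-th iterate of RM-LMC with step size $h > 0$. Assume that the target $\pi \propto \exp(-V)$ satisfies $\nabla V(0) = 0$ and $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Then, provided $h \lesssim \frac{1}{\beta \kappa^{1/2}}$, for all $N \in \mathbb{N}$,
\[
W_2^2(\mu_{Nh}, \pi) \le \exp(-\alpha N h)\,W_2^2(\mu_0, \pi) + O\Bigl(\frac{\beta^2 d h^2}{\alpha} + \frac{\beta^4 d h^3}{\alpha^2} + \frac{\beta^6 d h^4}{\alpha^3}\Bigr) .
\]
In particular, if we initialize at $\mu_0 = \delta_0$ and take $h \asymp \frac{\varepsilon}{\beta d^{1/2}}\,\bigl(1 \wedge \frac{d^{1/4}}{\varepsilon^{1/2} \kappa^{1/2}}\bigr)$, then for any $\varepsilon \in [0, \sqrt{d}]$ we obtain the guarantee $\sqrt{\alpha}\,W_2(\mu_{Nh}, \pi) \le \varepsilon$ after
\[
N = \tilde O\Bigl(\frac{\kappa d^{1/2}}{\varepsilon}\,\Bigl(1 \vee \frac{\varepsilon^{1/2} \kappa^{1/2}}{d^{1/4}}\Bigr)\Bigr) \quad \text{iterations} .
\]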

Proof. Recall from the proof of Theorem 4.1.1 that we started with a one-step discretization bound, and then we derived a multi-step discretization bound. In particular, for the one-step bound, we showed that if $\pi_h$ is the law of the continuous-time Langevin diffusion started at $\mu_0$, then
\[
W_2^2(\mu_h, \pi_h) \lesssim \beta^4 h^4\,\mathbb{E}[\|Z_0\|^2] + \beta^2 d h^3 . \tag{6.1.2}
\]
It turns out that (6.1.2) still holds for RM-LMC. Since the proof is almost the same as before, we leave it as an exercise (Exercise 6.1).
The benefits of the randomized midpoint discretization enter once we consider the multi-step discretization. As in Theorem 4.1.1, we let $X_{kh} \sim \mu_{kh}$ and $Z_{kh} \sim \pi$ be optimally coupled, and we let $(\bar X_t)_{t \in [kh,(k+1)h]}$ and $(Z_t)_{t \in [kh,(k+1)h]}$ denote continuous-time Langevin diffusions initialized at $X_{kh}$ and $Z_{kh}$ respectively; all of these processes are coupled by using the same Brownian motion to drive them. We bound
\[
\begin{aligned}
W_2^2(\mu_{(k+1)h}, \pi) &\le \mathbb{E}[\|X_{(k+1)h} - Z_{(k+1)h}\|^2] \\
&= \mathbb{E}[\|\bar X_{(k+1)h} - Z_{(k+1)h}\|^2] + \mathbb{E}[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] \\
&\qquad + 2\,\mathbb{E}\langle \bar X_{(k+1)h} - Z_{(k+1)h},\, X_{(k+1)h} - \bar X_{(k+1)h} \rangle .
\end{aligned}
\]
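Now observe that in the cross term, the only quantity that depends on the uniform random variable $u_k$ is $X_{(k+1)h}$. In particular, if we let $\mathscr{F}_{(k+1)h}$ denote the $\sigma$-algebra generated by $X_{kh}$ and $(B_t)_{t \in [kh,(k+1)h]}$, then for any $\lambda > 0$,
\[
\begin{aligned}
&2\,\mathbb{E}\langle \bar X_{(k+1)h} - Z_{(k+1)h},\, X_{(k+1)h} - \bar X_{(k+1)h} \rangle \\
&\qquad = 2\,\mathbb{E}\langle \bar X_{(k+1)h} - Z_{(k+1)h},\, \mathbb{E}[X_{(k+1)h} \mid \mathscr{F}_{(k+1)h}] - \bar X_{(k+1)h} \rangle \\
&\qquad \le \lambda\,\mathbb{E}[\|\bar X_{(k+1)h} - Z_{(k+1)h}\|^2] + \frac{1}{\lambda}\,\mathbb{E}\bigl[\|\mathbb{E}[X_{(k+1)h} \mid \mathscr{F}_{(k+1)h}] - \bar X_{(k+1)h}\|^2\bigr]
\end{aligned}
\]
and plugging this in,
\[
\begin{aligned}
W_2^2(\mu_{(k+1)h}, \pi) &\le (1+\lambda)\,\mathbb{E}[\|\bar X_{(k+1)h} - Z_{(k+1)h}\|^2] + \mathbb{E}[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] \\
&\qquad + \frac{1}{\lambda}\,\mathbb{E}\bigl[\|\mathbb{E}[X_{(k+1)h} \mid \mathscr{F}_{(k+1)h}] - \bar X_{(k+1)h}\|^2\bigr] .
\end{aligned}
\]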
Before jumping into the calculations, let us see what we have gained, focusing on the dependence on the step size. As before, we need to take $\lambda \asymp h$. Previously, in Theorem 4.1.1, the error term was $O(\frac{1}{h})\,\mathbb{E}[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] = O(h^2)$. In this calculation, we have split the error term into $\mathbb{E}[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] = O(h^3)$ as well as another error term, which has a factor of $O(\frac{1}{h})$ but also has the expectation over the uniform random variable inside the norm. If we can show that the expectation over the uniform random variable makes the error of smaller order than before (the interpretation being that the randomized midpoint reduces the bias), then we obtain a smaller discretization error.
The one-step discretization bound (6.1.2) yields
\[
\mathbb{E}[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2] \lesssim \beta^4 h^4\,\mathbb{E}[\|X_{kh}\|^2] + \beta^2 d h^3 .
\]

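Next,
\[
\begin{aligned}
\mathbb{E}[X_{(k+1)h} \mid \mathscr{F}_{(k+1)h}]
&= X_{kh} - h\,\mathbb{E}[\nabla V(X_{(k+u_k)h}) \mid \mathscr{F}_{(k+1)h}] + \sqrt{2}\,(B_{(k+1)h} - B_{kh}) \\
&= X_{kh} - \int_{kh}^{(k+1)h} \nabla V(X_t)\,\mathrm{d}t + \sqrt{2}\,(B_{(k+1)h} - B_{kh}) .
\end{aligned}
\]
Hence,
\[
\begin{aligned}
\mathbb{E}\bigl[\|\mathbb{E}[X_{(k+1)h} \mid \mathscr{F}_{(k+1)h}] - \bar X_{(k+1)h}\|^2\bigr]
&= \mathbb{E}\Bigl[\Bigl\|\int_{kh}^{(k+1)h} \{\nabla V(X_t) - \nabla V(\bar X_t)\}\,\mathrm{d}t\Bigr\|^2\Bigr] \\
&\le h \int_{kh}^{(k+1)h} \mathbb{E}[\|\nabla V(X_t) - \nabla V(\bar X_t)\|^2]\,\mathrm{d}t \\
&\le \beta^2 h \int_{kh}^{(k+1)h} \mathbb{E}[\|X_t - \bar X_t\|^2]\,\mathrm{d}t .
\end{aligned}
\]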

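By definition, $X_t = X_{kh} - (t-kh)\,\nabla V(X_{kh}) + \sqrt{2}\,(B_t - B_{kh})$, so
\[
\begin{aligned}
&\mathbb{E}\bigl[\|\mathbb{E}[X_{(k+1)h} \mid \mathscr{F}_{(k+1)h}] - \bar X_{(k+1)h}\|^2\bigr] \\
&\qquad \le \beta^2 h \int_{kh}^{(k+1)h} \mathbb{E}\Bigl[\Bigl\|\int_{kh}^{t} \{\nabla V(\bar X_{kh}) - \nabla V(\bar X_s)\}\,\mathrm{d}s\Bigr\|^2\Bigr]\,\mathrm{d}t \\
&\qquad \le \beta^2 h^2 \int_{kh}^{(k+1)h} \int_{kh}^{t} \mathbb{E}[\|\nabla V(\bar X_{kh}) - \nabla V(\bar X_s)\|^2]\,\mathrm{d}s\,\mathrm{d}t \\
&\qquad \le \beta^4 h^2 \int_{kh}^{(k+1)h} \int_{kh}^{t} \mathbb{E}[\|\bar X_{kh} - \bar X_s\|^2]\,\mathrm{d}s\,\mathrm{d}t
\lesssim \beta^4 h^4\,\{\beta^2 h^2\,\mathbb{E}[\|X_{kh}\|^2] + dh\}
\end{aligned}
\]
where we used $\bar X_{kh} = X_{kh}$, and the last inequality uses the movement bound for the Langevin process that we proved in Lemma 4.1.5.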

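Next, recall that by contraction of the Langevin diffusion under strong log-concavity (Theorem 1.4.10), $\mathbb{E}[\|\bar X_{(k+1)h} - Z_{(k+1)h}\|^2] \le \exp(-2\alpha h)\,W_2^2(\mu_{kh}, \pi)$. Choosing $\lambda = \frac{\alpha h}{2}$,
\[
\begin{aligned}
W_2^2(\mu_{(k+1)h}, \pi) &\le \exp\Bigl(-\frac{3\alpha h}{2}\Bigr)\,W_2^2(\mu_{kh}, \pi) \\
&\qquad + O\Bigl(\beta^4 h^4\,\mathbb{E}[\|X_{kh}\|^2] + \beta^2 d h^3 + \frac{\beta^6 h^5}{\alpha}\,\mathbb{E}[\|X_{kh}\|^2] + \frac{\beta^4 d h^4}{\alpha}\Bigr) .
\end{aligned}
\]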

At this point, we could bound $\mathbb{E}[\|X_{kh}\|^2]$ recursively, similarly to Lemma 4.1.4, but instead we will use a trick:
\[
\mathbb{E}[\|X_{kh}\|^2] = W_2^2(\mu_{kh}, \delta_0) \lesssim W_2^2(\mu_{kh}, \pi) + W_2^2(\pi, \delta_0) \lesssim W_2^2(\mu_{kh}, \pi) + \frac{d}{\alpha} .
\]
It implies that if we take $h \lesssim \frac{1}{\beta \kappa^{1/2}}$, then
\[
W_2^2(\mu_{(k+1)h}, \pi) \le \exp(-\alpha h)\,W_2^2(\mu_{kh}, \pi) + O\Bigl(\beta^2 d h^3 + \frac{\beta^4 d h^4}{\alpha} + \frac{\beta^6 d h^5}{\alpha^2}\Bigr) .
\]
Unrolling the recursion,
\[
W_2^2(\mu_{Nh}, \pi) \le \exp(-\alpha N h)\,W_2^2(\mu_0, \pi) + O\Bigl(\frac{\beta^2 d h^2}{\alpha} + \frac{\beta^4 d h^3}{\alpha^2} + \frac{\beta^6 d h^4}{\alpha^3}\Bigr) .
\]
From the analysis, it can be seen that the bottleneck term (at least for dimension dependence) is not from the term involving $\mathbb{E}[X_{(k+1)h} \mid \mathscr{F}_{(k+1)h}]$. This justifies our earlier comment that one step of the randomized midpoint procedure already suffices. $\square$

The complexity guarantee for RM-LMC is considerably better than that for LMC.
In fact, it is known that the randomized midpoint method is essentially an optimal
discretization method (which is not the same as saying that RM-LMC is an optimal
sampling algorithm); see [CLW21].
One notable downside of the randomized midpoint discretization is that the analysis
seems specific to the Wasserstein coupling approach. In particular, it is currently not
known how to obtain matching guarantees in KL divergence.
In the above result, we proved a slightly weaker complexity bound in order to streamline the proof. It is possible to improve the second term of $\kappa^{3/2} d^{1/4} / \varepsilon^{1/2}$ in the guarantee of Theorem 6.1.1; see Exercise 6.2.

6.2 Hamiltonian Monte Carlo


The next algorithm we introduce, known as Hamiltonian Monte Carlo (HMC), was
popularized in the context of sampling by Neal [Nea11]. As the name suggests, it is
inspired by Hamiltonian mechanics. Although this algorithm is usually combined with a
Metropolis–Hastings filter, we defer a discussion of this until Chapter 7. In this section,
we instead focus on an analysis of the ideal (i.e., continuous-time) dynamics.

6.2.1 Introduction to Ideal HMC


First, we augment the target distribution $\pi$ to add a momentum variable $p$. Specifically, define the distribution $\boldsymbol{\pi}$ on phase space $\mathbb{R}^d \times \mathbb{R}^d$ via
\[
\boldsymbol{\pi}(x, p) \propto \exp\Bigl(-V(x) - \frac{1}{2}\,\|p\|^2\Bigr) .
\]
The first marginal of $\boldsymbol{\pi}$ is $\pi \propto \exp(-V)$, so if we obtain a sample from $\boldsymbol{\pi}$ then upon projecting to the first coordinate we obtain a sample from $\pi$.
The augmented target can also be written as $\boldsymbol{\pi} \propto \exp(-H)$, where $H$ is the Hamiltonian $H(x, p) := V(x) + \frac{1}{2}\,\|p\|^2$. In Hamiltonian mechanics, which is a reformulation of classical mechanics, the laws of motion are governed by Hamilton's equations, a system of coupled first-order ODEs:¹
\[
\begin{aligned}
\dot x_t &= \nabla_p H(x_t, p_t) = p_t\,, \\
\dot p_t &= -\nabla_x H(x_t, p_t) = -\nabla V(x_t) .
\end{aligned}
\]
Introducing the antisymmetric matrix
\[
\boldsymbol{J} := \begin{bmatrix} 0 & I_d \\ -I_d & 0 \end{bmatrix},
\]
Hamilton's equations can be written succinctly as
\[
(\dot x_t, \dot p_t) = \boldsymbol{J}\,\nabla H(x_t, p_t) .
\]
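Let $F_t : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$ denote the flow map, i.e., $F_t(x_0, p_0)$ is the solution $(x_t, p_t)$ to Hamilton's equations started from $(x_0, p_0)$. Then, we show that $F_t$ leaves the augmented target $\boldsymbol{\pi}$ invariant: $(F_t)_\# \boldsymbol{\pi} = \boldsymbol{\pi}$. Indeed, if $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a function on phase space and $(x_t, p_t)_{t \ge 0}$ evolve via Hamilton's equations started at $(x_0, p_0) \sim \boldsymbol{\pi}$,
\[
\begin{aligned}
\partial_t \big|_{t=0}\,\mathbb{E} f(x_t, p_t)
&= \mathbb{E}\langle \nabla f(x_0, p_0), (\dot x_0, \dot p_0) \rangle
= \int \langle \nabla f(x_0, p_0), (\dot x_0, \dot p_0) \rangle\,\mathrm{d}\boldsymbol{\pi}(x_0, p_0) \\
&= -\int f\,\operatorname{div}\Bigl(\boldsymbol{\pi} \begin{bmatrix} p \\ -\nabla V \end{bmatrix}\Bigr) \\
&= -\int f\,\Bigl\{\operatorname{div} \begin{bmatrix} p \\ -\nabla V \end{bmatrix} + \Bigl\langle \begin{bmatrix} p \\ -\nabla V \end{bmatrix}, \begin{bmatrix} \nabla_x \ln \boldsymbol{\pi} \\ \nabla_p \ln \boldsymbol{\pi} \end{bmatrix} \Bigr\rangle\Bigr\}\,\mathrm{d}\boldsymbol{\pi} = 0 .
\end{aligned}
\]

¹ In contrast, Newton's law $\ddot x_t = -\nabla V(x_t)$ is a second-order ODE.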

Further properties of the Hamiltonian dynamics are explored in Exercise 6.4.


However, simply running Hamilton’s equations does not yield a convergent sampling
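algorithm. For example, suppose that $V(x) = \frac{1}{2}\,\|x\|^2$; then, each flow map $F_t$ is actually
a diffeomorphism. This implies, for example, that $\mathrm{KL}((F_t)_\# \boldsymbol{\mu} \,\|\, (F_t)_\# \boldsymbol{\pi}) = \mathrm{KL}(\boldsymbol{\mu} \,\|\, \boldsymbol{\pi})$ for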
any initial distribution 𝝁 on phase space. To get around this issue, we can “refresh” the
momentum periodically. More specifically, we pick an integration time 𝑇 > 0, and every
𝑇 units of time we draw a new momentum vector from the standard Gaussian distribution
(which is the distribution of the momentum under 𝝅).
Ideal HMC: Pick an integration time $T > 0$ and draw $(X_0, P_0) \sim \boldsymbol{\mu}_0$. For each iteration $k = 0, 1, 2, \dotsc$:

1. Refresh the velocity by drawing $P_{kT}' \sim \mathrm{normal}(0, I_d)$.

2. Integrate Hamilton's equations: set $(X_{(k+1)T}, P_{(k+1)T}) := F_T(X_{kT}, P_{kT}')$.

Since both steps of each iteration preserve 𝝅, the entire algorithm preserves 𝝅. At
this stage, though, this algorithm is still idealized because it assumes the ability to exactly
integrate Hamilton’s equations. This may be possible for very special cases, but it is not
in general, and certainly not within our oracle model. Nevertheless, it is instructive to
first analyze the ideal algorithm.
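One of the special cases in which the flow map is available in closed form is the one-dimensional standard Gaussian target $V(x) = \frac{1}{2}\,x^2$, for which Hamilton's equations describe a harmonic oscillator and $F_t$ is a rotation of phase space. The Python sketch below runs ideal HMC in that special case only; the function names are illustrative assumptions, and for a general potential the call to the exact flow would have to be replaced by a numerical integrator.

```python
import math
import random


def hamiltonian_flow_gaussian(x, p, t):
    """Exact flow map F_t for V(x) = x^2 / 2: a rotation by angle t in the (x, p) plane."""
    c, s = math.cos(t), math.sin(t)
    return x * c + p * s, -x * s + p * c


def ideal_hmc_gaussian(x0, integration_time, n_iters, rng):
    """Ideal HMC for the 1-D standard Gaussian target; returns the final position."""
    x = x0
    for _ in range(n_iters):
        p = rng.gauss(0.0, 1.0)  # step 1: refresh the momentum
        x, _ = hamiltonian_flow_gaussian(x, p, integration_time)  # step 2: integrate
    return x
```

The rotation preserves $H(x,p) = \frac{1}{2}(x^2 + p^2)$ exactly, and running many independent chains with, say, `integration_time = 0.5` produces positions whose empirical mean and variance approach 0 and 1.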

6.2.2 Analysis of Ideal HMC

Theorem 6.2.1 (ideal HMC, [CV19]). Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. For $k \in \mathbb{N}$, let $\mu_{kT}$ denote the law of the $k$-th iterate $X_{kT}$ of ideal HMC with integration time $T > 0$. Then, if we set $T = \frac{1}{2\sqrt{\beta}}$, we obtain
\[
W_2^2(\mu_{NT}, \pi) \le \exp\Bigl(-\frac{N}{16\kappa}\Bigr)\,W_2^2(\mu_0, \pi) .
\]

It is known that the convergence rate in this theorem is optimal, see [CV19]. We now follow the proof, which is a purely deterministic analysis of Hamilton's equations. First, we need two lemmas.

Lemma 6.2.2 (a priori bound). Let $(x_t, p_t)_{t \ge 0}$ and $(x_t', p_t')_{t \ge 0}$ denote two solutions to Hamilton's equations of motion with $p_0 = p_0'$ and a potential $V$ satisfying $\nabla^2 V \preceq \beta I_d$. Then, for all $t \in [0, \frac{1}{2\sqrt{\beta}}]$, it holds that
\[
\frac{1}{2}\,\|x_0 - x_0'\|^2 \le \|x_t - x_t'\|^2 \le 2\,\|x_0 - x_0'\|^2 .
\]

Proof. First, note that $\partial_t \|p_t - p_t'\| \le \|\nabla V(x_t) - \nabla V(x_t')\| \le \beta\,\|x_t - x_t'\|$. It follows that $|\partial_t \|x_t - x_t'\|| \le \|p_t - p_t'\| \le \beta \int_0^t \|x_s - x_s'\|\,\mathrm{d}s$ and hence
\[
\|x_t - x_t'\| \le \|x_0 - x_0'\| + \beta \int_0^t \int_0^s \|x_r - x_r'\|\,\mathrm{d}r\,\mathrm{d}s .
\]
Applying an ODE comparison lemma, one may deduce that $\|x_t - x_t'\| \le \sqrt{2}\,\|x_0 - x_0'\|$. The lower bound is similar. $\square$

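Lemma 6.2.3 (coercivity). Suppose $f : \mathbb{R}^d \to \mathbb{R}$ satisfies $0 \preceq \nabla^2 f \preceq \beta I_d$. Then, for all $x, y \in \mathbb{R}^d$, it holds that
\[
\beta\,\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \|\nabla f(x) - \nabla f(y)\|^2 .
\]

Proof. See Exercise 6.5. $\square$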


We now prove the contraction result for the Hamiltonian dynamics.

Proposition 6.2.4 (contraction of Hamilton's equations). Consider any two solutions $(x_t, p_t)_{t \ge 0}$ and $(x_t', p_t')_{t \ge 0}$ to Hamilton's equations of motion with $p_0 = p_0'$ and a potential $V$ satisfying $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Then, for all $t \in [0, \frac{1}{2\sqrt{\beta}}]$, it holds that
\[
\|x_t - x_t'\|^2 \le \exp\Bigl(-\frac{\alpha t^2}{4}\Bigr)\,\|x_0 - x_0'\|^2 .
\]

Proof. We compute
\[
\begin{aligned}
\frac{1}{2}\,\partial_t \|x_t - x_t'\|^2 &= \langle x_t - x_t', p_t - p_t' \rangle\,, \\
\frac{1}{2}\,\partial_t^2 \|x_t - x_t'\|^2 &= \|p_t - p_t'\|^2 - \langle x_t - x_t', \nabla V(x_t) - \nabla V(x_t') \rangle = -\rho_t\,\|x_t - x_t'\|^2 + \|p_t - p_t'\|^2\,,
\end{aligned}
\]
where we define
\[
\rho_t := \frac{\langle \nabla V(x_t) - \nabla V(x_t'), x_t - x_t' \rangle}{\|x_t - x_t'\|^2} .
\]
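To bound $\|p_t - p_t'\|^2$, we use $|\partial_t \|p_t - p_t'\|| \le \|\nabla V(x_t) - \nabla V(x_t')\|$. Also, by the coercivity lemma (Lemma 6.2.3),
\[
\|\nabla V(x_t) - \nabla V(x_t')\|^2 \le \beta\,\langle \nabla V(x_t) - \nabla V(x_t'), x_t - x_t' \rangle = \beta \rho_t\,\|x_t - x_t'\|^2 \le 2\beta \rho_t\,\|x_0 - x_0'\|^2
\]
where we used Lemma 6.2.2. Hence, by the Cauchy–Schwarz inequality,
\[
\begin{aligned}
\|p_t - p_t'\|^2 &\le \Bigl(\int_0^t \bigl|\partial_s \|p_s - p_s'\|\bigr|\,\mathrm{d}s\Bigr)^2 \le \Bigl(\int_0^t \sqrt{2\beta \rho_s}\,\|x_0 - x_0'\|\,\mathrm{d}s\Bigr)^2 \\
&\le 2\beta t\,\|x_0 - x_0'\|^2 \int_0^t \rho_s\,\mathrm{d}s .
\end{aligned}
\]
From this and Lemma 6.2.2, we deduce
\[
\partial_t^2 \|x_t - x_t'\|^2 \le -\Bigl(\rho_t - 4\beta t \int_0^t \rho_s\,\mathrm{d}s\Bigr)\,\|x_0 - x_0'\|^2 .
\]
Integrating and using $\rho \ge 0$,
\[
\begin{aligned}
\partial_t \|x_t - x_t'\|^2 &\le -\Bigl(\int_0^t \rho_s\,\mathrm{d}s - 4\beta \int_0^t s \int_0^s \rho_r\,\mathrm{d}r\,\mathrm{d}s\Bigr)\,\|x_0 - x_0'\|^2 \\
&\le -\Bigl(\int_0^t \rho_s\,\mathrm{d}s - 2\beta t^2 \int_0^t \rho_s\,\mathrm{d}s\Bigr)\,\|x_0 - x_0'\|^2 \\
&= -(1 - 2\beta t^2)\,\Bigl(\int_0^t \rho_s\,\mathrm{d}s\Bigr)\,\|x_0 - x_0'\|^2 \le -\frac{\alpha t}{2}\,\|x_0 - x_0'\|^2 .
\end{aligned}
\]
Integrating again then yields
\[
\|x_t - x_t'\|^2 \le \|x_0 - x_0'\|^2 - \frac{\alpha t^2}{4}\,\|x_0 - x_0'\|^2 \le \exp\Bigl(-\frac{\alpha t^2}{4}\Bigr)\,\|x_0 - x_0'\|^2 . \qquad \square
\]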
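As a quick consistency check of Lemma 6.2.2 and Proposition 6.2.4, consider the one-dimensional quadratic potential $V(x) = \frac{\lambda}{2}\,x^2$ with $\alpha \le \lambda \le \beta$. With equal initial momenta, the difference of two solutions satisfies $x_t - x_t' = \cos(\sqrt{\lambda}\,t)\,(x_0 - x_0')$, so both statements reduce to inequalities for $\cos^2$. The Python sketch below checks them on a grid; the function names and the grid resolution are illustrative assumptions, and this special case does not replace the proof.

```python
import math


def distance_ratio_quadratic(lam, t):
    """|x_t - x_t'|^2 / |x_0 - x_0'|^2 for V(x) = lam * x^2 / 2 and p_0 = p_0'."""
    return math.cos(math.sqrt(lam) * t) ** 2


def check_contraction(alpha, beta, n_grid=50):
    """Check the a priori bound and the contraction bound for t in [0, 1/(2 sqrt(beta))]."""
    t_max = 1.0 / (2.0 * math.sqrt(beta))
    for i in range(n_grid + 1):
        lam = alpha + (beta - alpha) * i / n_grid
        for j in range(n_grid + 1):
            t = t_max * j / n_grid
            ratio = distance_ratio_quadratic(lam, t)
            if ratio > 2.0 or 0.5 > ratio:  # Lemma 6.2.2
                return False
            if ratio > math.exp(-alpha * t * t / 4.0) + 1e-12:  # Proposition 6.2.4
                return False
    return True
```

For example, `check_contraction(1.0, 10.0)` scans curvatures between $\alpha = 1$ and $\beta = 10$.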
If we choose $T = \frac{1}{2\sqrt{\beta}}$, then the contraction factor is $\exp(-\frac{1}{16\kappa}) \le 1 - 1/(32\kappa)$. In particular, let $P$ denote the transition kernel of one step of ideal HMC with integration time $T$. We have the $W_1$ contraction
\[
W_1\bigl(P((x,p), \cdot),\, P((x',p'), \cdot)\bigr) \le \Bigl(1 - \frac{1}{32\kappa}\Bigr)\,\|(x,p) - (x',p')\|\,,
\]
which, in the language of Section 2.7, says that the coarse Ricci curvature of $P$ is bounded below by $1/(32\kappa)$. Hence, by Theorem 2.7.5, we immediately obtain the following corollary.

Corollary 6.2.5. Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Let $P$ be the Markov kernel for ideal HMC with integration time $T = \frac{1}{2\sqrt{\beta}}$. Then, $P$ satisfies a Poincaré inequality with constant at most $32\kappa$, where $\kappa := \beta/\alpha$.

We conclude this section by observing that, since Hamilton’s equations are a deter-
ministic system of ODEs, we can approximately integrate them using any ODE solver;
unlike for the Langevin diffusion, there is no need to consider any SDE discretization here.
By following this approach, [CV19] also provide the following sampling guarantee.

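Theorem 6.2.6 (unadjusted HMC, [CV19]). Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Then, there is a sampling algorithm based on a discretization of ideal HMC which outputs $\mu$ satisfying $\sqrt{\alpha}\,W_2(\mu, \pi) \le \varepsilon$ using
\[
\tilde O\Bigl(\frac{\kappa^{3/2} d^{1/2}}{\varepsilon}\Bigr) \quad \text{gradient queries} .
\]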

6.3 The Underdamped Langevin Diffusion


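In ideal HMC, we refresh the momentum periodically. We now consider a variant in which the momentum is refreshed continuously. The underdamped Langevin diffusion is the solution to the SDE
\[
\begin{aligned}
\mathrm{d}X_t &= P_t\,\mathrm{d}t\,, \\
\mathrm{d}P_t &= -\nabla V(X_t)\,\mathrm{d}t - \gamma P_t\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_t .
\end{aligned}
\]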

Here, 𝛾 > 0 is a parameter known as the friction parameter; as the name suggests, the
physical interpretation is that the Hamiltonian system is damped by friction.
Unlike the Langevin diffusion, the underdamped Langevin diffusion is not a reversible
Markov process. Moreover, it is an example of hypocoercive dynamics, which means that
the Markov semigroup approach based on Poincaré and log-Sobolev inequalities fails,
necessitating the use of more sophisticated PDE analysis.

6.3.1 Continuous-Time Considerations


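It is illuminating to write the dynamics in the space of measures. By computing the generator of this Markov process and writing down the corresponding Fokker–Planck equation, we arrive at the PDE
\[
\partial_t \boldsymbol{\pi}_t = \gamma\,\Delta_p \boldsymbol{\pi}_t + \operatorname{div}\Bigl(\boldsymbol{\pi}_t \begin{bmatrix} -p \\ \nabla V + \gamma p \end{bmatrix}\Bigr)
\]
for the evolution of the law $(\boldsymbol{\pi}_t)_{t \ge 0}$ of the underdamped Langevin diffusion. We can write this as the continuity equation
\[
\partial_t \boldsymbol{\pi}_t = \operatorname{div}\Bigl(\boldsymbol{\pi}_t \begin{bmatrix} \nabla_p \ln \boldsymbol{\pi} \\ -\nabla_x \ln \boldsymbol{\pi} - \gamma\,\nabla_p \ln \boldsymbol{\pi} + \gamma\,\nabla_p \ln \boldsymbol{\pi}_t \end{bmatrix}\Bigr)
\]
where $\boldsymbol{\pi}(x, p) \propto \exp(-H(x,p)) = \exp(-V(x) - \frac{1}{2}\,\|p\|^2)$. However, taking advantage of the fact that
\[
\operatorname{div}\Bigl(\boldsymbol{\pi}_t \begin{bmatrix} -\nabla_p \ln \boldsymbol{\pi}_t \\ \nabla_x \ln \boldsymbol{\pi}_t \end{bmatrix}\Bigr) = 0\,,
\]
we have the more interpretable expression
\[
\partial_t \boldsymbol{\pi}_t = \operatorname{div}\Bigl(\boldsymbol{\pi}_t\,\boldsymbol{J}_\gamma \begin{bmatrix} \nabla_x \ln(\boldsymbol{\pi}_t / \boldsymbol{\pi}) \\ \nabla_p \ln(\boldsymbol{\pi}_t / \boldsymbol{\pi}) \end{bmatrix}\Bigr)\,, \qquad \boldsymbol{J}_\gamma := \begin{bmatrix} 0 & -1 \\ 1 & \gamma \end{bmatrix},
\]
or
\[
\partial_t \boldsymbol{\pi}_t = \operatorname{div}\bigl(\boldsymbol{\pi}_t\,\boldsymbol{J}_\gamma\,[\nabla_{W_2} \mathrm{KL}(\cdot \,\|\, \boldsymbol{\pi})](\boldsymbol{\pi}_t)\bigr) . \tag{6.3.1}
\]
Here the sign pattern of $\boldsymbol{J}_\gamma$ is the one obtained by adding the divergence-free field above to the continuity equation. This shows that the underdamped Langevin diffusion is not interpreted as a gradient flow of the KL divergence, but rather a "damped Hamiltonian flow" for the KL divergence.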
We begin with a contraction result for the continuous-time process based on [Che+18].

Theorem 6.3.2. Let $(X_t^0, P_t^0)_{t \ge 0}$ and $(X_t^1, P_t^1)_{t \ge 0}$ be two copies of the underdamped Langevin diffusion, driven by the same Brownian motion. Assume that the potential satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Then, defining the modified norm
\[
|\!|\!|(x, p)|\!|\!|^2 := \Bigl\|x + \sqrt{\frac{2}{\beta}}\,p\Bigr\|^2 + \|x\|^2
\]
and setting $\gamma = \sqrt{2\beta}$, we obtain the contraction
\[
|\!|\!|(X_t^1, P_t^1) - (X_t^0, P_t^0)|\!|\!| \le \exp\Bigl(-\frac{\alpha t}{\sqrt{2\beta}}\Bigr)\,|\!|\!|(X_0^1, P_0^1) - (X_0^0, P_0^0)|\!|\!| .
\]

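Proof. Write $\delta X_t := X_t^1 - X_t^0$ and $\delta P_t := P_t^1 - P_t^0$. Then, by Itô's formula (Theorem 1.1.18),
\[
\begin{aligned}
\mathrm{d}(\delta X_t + \eta\,\delta P_t)
&= \bigl\{\delta P_t - \eta\,\{\nabla V(X_t^1) - \nabla V(X_t^0)\} - \gamma \eta\,\delta P_t\bigr\}\,\mathrm{d}t \\
&= \Bigl[-(\gamma \eta - 1)\,\delta P_t - \eta\,\underbrace{\int_0^1 \nabla^2 V\bigl((1-s)\,X_t^0 + s\,X_t^1\bigr)\,\mathrm{d}s}_{=:\,H_t}\;\delta X_t\Bigr]\,\mathrm{d}t \\
&= \Bigl[-\Bigl(\gamma - \frac{1}{\eta}\Bigr)\,(\delta X_t + \eta\,\delta P_t) + \Bigl(\gamma - \frac{1}{\eta} - \eta H_t\Bigr)\,\delta X_t\Bigr]\,\mathrm{d}t
\end{aligned}
\]
as well as
\[
\mathrm{d}(\delta X_t) = \delta P_t\,\mathrm{d}t = \Bigl\{\frac{1}{\eta}\,(\delta X_t + \eta\,\delta P_t) - \frac{1}{\eta}\,\delta X_t\Bigr\}\,\mathrm{d}t
\]
so that
\[
\frac{1}{2}\,\partial_t \bigl\{\|\delta X_t + \eta\,\delta P_t\|^2 + \|\delta X_t\|^2\bigr\}
= -\Bigl\langle \begin{bmatrix} \delta X_t + \eta\,\delta P_t \\ \delta X_t \end{bmatrix}, \begin{bmatrix} \gamma - \frac{1}{\eta} & \frac{1}{2}\,(\eta H_t - \gamma) \\ \frac{1}{2}\,(\eta H_t - \gamma) & \frac{1}{\eta} \end{bmatrix} \begin{bmatrix} \delta X_t + \eta\,\delta P_t \\ \delta X_t \end{bmatrix} \Bigr\rangle .
\]
We now check that if $\gamma = \frac{2}{\eta}$ and $\eta = \sqrt{\frac{2}{\beta}}$, then the eigenvalues of the matrix above are lower bounded by $\alpha \eta / 2 = \alpha / \sqrt{2\beta}$.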
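The eigenvalue claim can be verified by a short computation, sketched here in LaTeX. The only ingredients are the choices $\gamma = 2/\eta$ and $\eta^2 = 2/\beta$ from the proof and the bounds $\alpha I_d \preceq H_t \preceq \beta I_d$; the symbol $\lambda$ for an eigenvalue of $H_t$ is introduced for this sketch.

```latex
% With \gamma = 2/\eta we have \gamma - 1/\eta = 1/\eta, so the matrix is
%   M = [ 1/\eta ,  (\eta H_t - 2/\eta)/2 ;  (\eta H_t - 2/\eta)/2 ,  1/\eta ].
% All blocks are functions of H_t, so M diagonalizes along the eigenvectors of H_t.
% For an eigenvalue \lambda \in [\alpha, \beta] of H_t, the corresponding 2x2 block has eigenvalues
\[
  \frac{1}{\eta} \pm \frac{1}{2}\,\Bigl|\eta \lambda - \frac{2}{\eta}\Bigr| .
\]
% Since \eta^2 = 2/\beta, we have 2/\eta = \eta \beta, hence
% \eta \lambda - 2/\eta = -\eta\,(\beta - \lambda) \le 0 and 1/\eta = \eta \beta / 2. Therefore
\[
  \frac{1}{\eta} - \frac{\eta}{2}\,(\beta - \lambda)
  = \frac{\eta \beta}{2} - \frac{\eta \beta}{2} + \frac{\eta \lambda}{2}
  = \frac{\eta \lambda}{2}
  \ge \frac{\alpha \eta}{2}
  = \frac{\alpha}{\sqrt{2\beta}} .
\]
```

Combined with the displayed identity for $\frac{1}{2}\,\partial_t\{\|\delta X_t + \eta\,\delta P_t\|^2 + \|\delta X_t\|^2\}$, this lower bound gives the exponential decay of the modified norm by Grönwall's inequality.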

We check that the new norm we defined is equivalent to the Euclidean norm.

Lemma 6.3.3. For all $x, p \in \mathbb{R}^d$,
\[
\frac{1}{3}\,\Bigl\{\|x\|^2 + \frac{2}{\beta}\,\|p\|^2\Bigr\} \le |\!|\!|(x, p)|\!|\!|^2 \le 3\,\Bigl\{\|x\|^2 + \frac{2}{\beta}\,\|p\|^2\Bigr\} .
\]

Proof. The upper bound follows from
\[
|\!|\!|(x, p)|\!|\!|^2 = \Bigl\|x + \sqrt{\frac{2}{\beta}}\,p\Bigr\|^2 + \|x\|^2 \le 2\,\Bigl\{\|x\|^2 + \frac{2}{\beta}\,\|p\|^2\Bigr\} + \|x\|^2 .
\]
The lower bound follows from
\[
\frac{2}{\beta}\,\|p\|^2 \le 2\,\Bigl\|x + \sqrt{\frac{2}{\beta}}\,p\Bigr\|^2 + 2\,\|x\|^2 .
\]

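Consequently, the contraction result in Theorem 6.3.2 implies
\[
\|X_t^1 - X_t^0\|^2 + \frac{2}{\beta}\,\|P_t^1 - P_t^0\|^2 \le 9\,\exp\Bigl(-\frac{\sqrt{2}\,\alpha t}{\sqrt{\beta}}\Bigr)\,\Bigl\{\|X_0^1 - X_0^0\|^2 + \frac{2}{\beta}\,\|P_0^1 - P_0^0\|^2\Bigr\} .
\]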

6.3.2 Wasserstein Coupling Argument


We now discretize the underdamped Langevin diffusion. Of course, we could apply a simple Euler–Maruyama discretization to the SDE, but there is a slightly better discretization here. We observe that if we fix the value of the gradient term at time $kh$, then the rest of the SDE is a linear SDE, and can be integrated exactly. Namely, consider
\[
\begin{aligned}
\mathrm{d}X_t &= P_t\,\mathrm{d}t\,, \\
\mathrm{d}P_t &= -\nabla V(X_{kh})\,\mathrm{d}t - \gamma P_t\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_t\,,
\end{aligned}
\qquad \text{for } t \in [kh, (k+1)h] . \tag{ULMC}
\]
Then, the solution to the SDE is given explicitly in the following lemma.

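Lemma 6.3.4. Conditioned on $(X_{kh}, P_{kh})$, the law of $(X_{(k+1)h}, P_{(k+1)h})$ is explicitly given as $\mathrm{normal}(M_{(k+1)h}, \Sigma)$ where
\[
M_{(k+1)h} = \begin{bmatrix} X_{kh} + \gamma^{-1}\,(1 - \exp(-\gamma h))\,P_{kh} - \gamma^{-1}\,\bigl(h - \gamma^{-1}\,(1 - \exp(-\gamma h))\bigr)\,\nabla V(X_{kh}) \\ P_{kh}\,\exp(-\gamma h) - \gamma^{-1}\,(1 - \exp(-\gamma h))\,\nabla V(X_{kh}) \end{bmatrix}
\]
and
\[
\Sigma = \begin{bmatrix} \frac{2}{\gamma}\,\bigl\{h - \frac{2}{\gamma}\,(1 - \exp(-\gamma h)) + \frac{1}{2\gamma}\,(1 - \exp(-2\gamma h))\bigr\} & * \\ \frac{2}{\gamma}\,\bigl\{\frac{1}{2} - \exp(-\gamma h) + \frac{1}{2}\,\exp(-2\gamma h)\bigr\} & 1 - \exp(-2\gamma h) \end{bmatrix} \otimes I_d .
\]
The $*$ indicates that the entry is determined by symmetry. (The off-diagonal entry equals $\gamma^{-1}\,(1 - \exp(-\gamma h))^2$.)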

The lemma is an exercise in stochastic calculus (Exercise 6.7). The point is that the
discretization given as ULMC is implementable.
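To make the implementability concrete, here is a minimal Python sketch of one ULMC step that samples from the Gaussian law in Lemma 6.3.4. Since the covariance is a $2 \times 2$ block tensored with $I_d$, each coordinate pair can be sampled independently through a Cholesky factor of the block. The function name `ulmc_step`, the list-based state, and the oracle signature `grad_v(x) -> list` are assumptions for the example, and the off-diagonal covariance entry is taken to be $\gamma^{-1}(1 - \exp(-\gamma h))^2$.

```python
import math
import random


def ulmc_step(x, p, grad_v, h, gamma, rng):
    """Sample (X_{(k+1)h}, P_{(k+1)h}) given (X_kh, P_kh) = (x, p), following Lemma 6.3.4."""
    e1 = math.exp(-gamma * h)
    e2 = math.exp(-2.0 * gamma * h)
    a = (1.0 - e1) / gamma  # gamma^{-1} (1 - exp(-gamma h))
    # Entries of the 2x2 covariance block shared by every coordinate.
    var_x = (2.0 / gamma) * (h - 2.0 * a + (1.0 - e2) / (2.0 * gamma))
    cov_xp = (2.0 / gamma) * (0.5 - e1 + 0.5 * e2)
    var_p = 1.0 - e2
    # Cholesky factor of the block, used to draw correlated Gaussian noise.
    l11 = math.sqrt(var_x)
    l21 = cov_xp / l11
    l22 = math.sqrt(max(var_p - l21 * l21, 0.0))
    g = grad_v(x)  # the only gradient evaluation in the step
    new_x, new_p = [], []
    for xj, pj, gj in zip(x, p, g):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        new_x.append(xj + a * pj - (h - a) / gamma * gj + l11 * z1)
        new_p.append(pj * e1 - a * gj + l21 * z1 + l22 * z2)
    return new_x, new_p
```

For the standard Gaussian target one can use `grad_v = lambda x: list(x)` with `gamma = math.sqrt(2.0)` and a small step such as `h = 0.1`. For very small `gamma * h` the expression for `var_x` suffers from cancellation, so a series expansion would be preferable in that regime.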
We now proceed to a discretization analysis based on [Che+18].

Theorem 6.3.5. For $k \in \mathbb{N}$, let $\boldsymbol{\mu}_{kh}$ denote the law of the $k$-th iterate of ULMC with appropriately tuned step size $h > 0$ and friction parameter $\gamma > 0$. Also, let $\mu_{kh}$ denote the law of $X_{kh}$. Assume that the target $\pi \propto \exp(-V)$ satisfies $\nabla V(0) = 0$ and $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Then, we obtain the guarantee $\sqrt{\alpha}\,W_2(\mu_{Nh}, \pi) \le \varepsilon$ after
\[
N = \tilde O\Bigl(\frac{\kappa^2 d^{1/2}}{\varepsilon}\Bigr) \quad \text{iterations} .
\]

Remark 6.3.6. This result is not the best possible. Indeed, a refined analysis by [DR20] obtains the iteration complexity $\tilde O(\kappa^{3/2} d^{1/2} / \varepsilon)$ by improving the continuous-time contraction result of Theorem 6.3.2.
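Proof. One-step discretization bound. As in Theorem 4.1.1, we start with a one-step bound. Let $(X_t, P_t)_{t \ge 0}$ denote ULMC and let $(\bar X_t, \bar P_t)_{t \ge 0}$ denote the continuous-time underdamped Langevin diffusion, both driven by the same Brownian motion and started at the same random pair. We want to bound the distance $\mathbb{E}[|\!|\!|(X_h, P_h) - (\bar X_h, \bar P_h)|\!|\!|^2]$. According to Lemma 6.3.3, it suffices to bound $\mathbb{E}[\|X_h - \bar X_h\|^2]$ and $\mathbb{E}[\|P_h - \bar P_h\|^2]$ separately.
First,
\[
\mathbb{E}[\|X_h - \bar X_h\|^2] = \mathbb{E}\Bigl[\Bigl\|\int_0^h \{P_t - \bar P_t\}\,\mathrm{d}t\Bigr\|^2\Bigr] \le h \int_0^h \mathbb{E}[\|P_t - \bar P_t\|^2]\,\mathrm{d}t .
\]
Next,
\[
\begin{aligned}
\mathbb{E}[\|P_t - \bar P_t\|^2] &= \mathbb{E}\Bigl[\Bigl\|\int_0^t \{-\nabla V(\bar X_0) + \nabla V(\bar X_s) - \gamma\,(P_s - \bar P_s)\}\,\mathrm{d}s\Bigr\|^2\Bigr] \\
&\lesssim h \int_0^t \bigl\{\mathbb{E}[\|\nabla V(\bar X_s) - \nabla V(\bar X_0)\|^2] + \gamma^2\,\mathbb{E}[\|P_s - \bar P_s\|^2]\bigr\}\,\mathrm{d}s .
\end{aligned}
\]
By Grönwall's inequality, if $h \le \frac{1}{\gamma} = \frac{1}{\sqrt{2\beta}}$, then
\[
\mathbb{E}[\|P_t - \bar P_t\|^2] \lesssim h \int_0^t \mathbb{E}[\|\nabla V(\bar X_s) - \nabla V(\bar X_0)\|^2]\,\mathrm{d}s \le \beta^2 h \int_0^t \mathbb{E}[\|\bar X_s - \bar X_0\|^2]\,\mathrm{d}s .
\]
Again, we need a movement bound for the underdamped Langevin diffusion, which is done in Lemma 6.3.7. Substituting this in and assuming $h \lesssim \frac{1}{\beta}$,
\[
\begin{aligned}
\mathbb{E}[|\!|\!|(X_h, P_h) - (\bar X_h, \bar P_h)|\!|\!|^2] &\lesssim \mathbb{E}[\|X_h - \bar X_h\|^2] + \frac{1}{\beta}\,\mathbb{E}[\|P_h - \bar P_h\|^2] \\
&\lesssim \beta^2 h^4\,\mathbb{E}[|\!|\!|(X_0, P_0)|\!|\!|^2] + \beta^{3/2} d h^5 .
\end{aligned}
\]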

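Multi-step discretization bound. Let $\mathcal{W}^2$ denote the coupling cost for the norm $|\!|\!|\cdot|\!|\!|^2$. Let $\hat{\boldsymbol{\mu}}_{(k+1)h}$ denote the law of the continuous-time underdamped Langevin diffusion started at $\boldsymbol{\mu}_{kh}$. Then, from Theorem 6.3.2 and the one-step discretization bound,
\[
\begin{aligned}
\mathcal{W}(\boldsymbol{\mu}_{(k+1)h}, \boldsymbol{\pi}) &\le \mathcal{W}(\hat{\boldsymbol{\mu}}_{(k+1)h}, \boldsymbol{\pi}) + \mathcal{W}(\boldsymbol{\mu}_{(k+1)h}, \hat{\boldsymbol{\mu}}_{(k+1)h}) \\
&\le \exp\Bigl(-\frac{\alpha h}{\sqrt{2\beta}}\Bigr)\,\mathcal{W}(\boldsymbol{\mu}_{kh}, \boldsymbol{\pi}) + O\bigl(\beta h^2\,\mathcal{W}(\boldsymbol{\mu}_{kh}, \boldsymbol{\delta}_0) + \beta^{3/4} d^{1/2} h^{5/2}\bigr) .
\end{aligned}
\]
Also, $\mathcal{W}(\boldsymbol{\mu}_{kh}, \boldsymbol{\delta}_0) \le \mathcal{W}(\boldsymbol{\mu}_{kh}, \boldsymbol{\pi}) + \mathcal{W}(\boldsymbol{\pi}, \boldsymbol{\delta}_0) \lesssim \mathcal{W}(\boldsymbol{\mu}_{kh}, \boldsymbol{\pi}) + \sqrt{d/\alpha}$, where we used the moment bound in Lemma 4.1.4. If $h \lesssim \frac{1}{\beta^{1/2} \kappa}$, then we can absorb the $\mathcal{W}(\boldsymbol{\mu}_{kh}, \boldsymbol{\pi})$ term into the contraction rate and deduce
\[
\mathcal{W}(\boldsymbol{\mu}_{(k+1)h}, \boldsymbol{\pi}) \le \exp\Bigl(-\frac{\alpha h}{2\sqrt{\beta}}\Bigr)\,\mathcal{W}(\boldsymbol{\mu}_{kh}, \boldsymbol{\pi}) + O\Bigl(\frac{\beta d^{1/2} h^2}{\alpha^{1/2}} + \beta^{3/4} d^{1/2} h^{5/2}\Bigr) .
\]
Iterating,
\[
\mathcal{W}(\boldsymbol{\mu}_{Nh}, \boldsymbol{\pi}) \le \exp\Bigl(-\frac{\alpha N h}{2\sqrt{\beta}}\Bigr)\,\mathcal{W}(\boldsymbol{\mu}_0, \boldsymbol{\pi}) + O\Bigl(\frac{\beta^{3/2} d^{1/2} h}{\alpha^{3/2}} + \frac{\beta^{5/4} d^{1/2} h^{3/2}}{\alpha}\Bigr) .
\]
Choosing the step size appropriately yields the result. $\square$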
The next lemma provides the movement bound (Exercise 6.8).

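Lemma 6.3.7. Let $(\bar X_t, \bar P_t)_{t \ge 0}$ denote the underdamped Langevin diffusion with potential $V$ satisfying $\nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. If $t \le \frac{1}{\gamma} \wedge \frac{1}{\sqrt{\beta}}$, then
\[
\mathbb{E}[\|\bar X_t - \bar X_0\|^2] \lesssim t^2\,\mathbb{E}[\|\bar P_0\|^2] + \gamma d t^3 + \beta^2 t^4\,\mathbb{E}[\|\bar X_0\|^2] .
\]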

6.3.3 Randomized Midpoint Discretization


The randomized midpoint method of Section 6.1 can be applied to the underdamped
Langevin diffusion to yield an even better sampling guarantee. This was carried out
in [SL19], and we state the final result here.

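Theorem 6.3.8. Assume that the target $\pi \propto \exp(-V)$ satisfies $\alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Then, the randomized midpoint discretization of the underdamped Langevin diffusion outputs $\mu$ such that $\sqrt{\alpha}\,W_2(\mu, \pi) \le \varepsilon$ using
\[
\tilde O\Bigl(\frac{\kappa d^{1/3}}{\varepsilon^{2/3}}\,\Bigl(1 \vee \frac{\varepsilon^{1/3} \kappa^{1/6}}{d^{1/6}}\Bigr)\Bigr) \quad \text{gradient queries} .
\]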

With respect to the dimension dependence, this is the current state-of-the-art guaran-
tee for sampling from strongly log-concave distributions.
TODO: Provide a proof.

Bibliographical Notes
In the analysis of the randomized midpoint discretization of the Langevin diffusion
(Theorem 6.1.1), we have simplified the original proof of [HBE20] at the cost of proving a
slightly weaker result. The sharper argument of [HBE20] is outlined in Exercise 6.2.
Besides the randomized midpoint method, the shifted ODE discretization [FLO21] also achieves a state-of-the-art iteration complexity of $\widetilde{O}(d^{1/3}/\varepsilon^{2/3})$ when applied to the underdamped Langevin diffusion.
HMC and its variants are some of the most popular algorithms employed in practice,
especially the no-U-turn sampler (NUTS) [HG14] which adaptively sets the integration
time. In terms of complexity analysis, the paper [CV19] (whose proof we followed
in Theorem 6.2.1) provided the tight analysis of ideal HMC. Other complexity results
obtained for HMC under various assumptions include [MV18; MS19; BEZ20].
The underdamped Langevin diffusion has been studied quantitatively in [Che+18;
EGZ19; DR20; Ma+21].
TODO: Literature on hypocoercivity.

Exercises
Randomized Midpoint Discretization

Exercise 6.1 (one-step discretization bound for RM-LMC)


Prove the one-step discretization bound (6.1.2) for RM-LMC.

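Exercise 6.2 (sharper rate for RM-LMC)

In this exercise we show how to obtain a slightly sharper guarantee for RM-LMC than the one in Theorem 6.1.1. Note that Theorem 6.1.1 provides the rate
\[
N = \widetilde{O}\Bigl(\frac{\kappa d^{1/2}}{\varepsilon} \vee \frac{\kappa^{3/2} d^{1/4}}{\varepsilon^{1/2}}\Bigr)
= \begin{cases}
\kappa d^{1/2}/\varepsilon\,, & \kappa \le \sqrt{d}/\varepsilon\,, \\
\kappa^{3/2} d^{1/4}/\varepsilon^{1/2}\,, & \kappa \ge \sqrt{d}/\varepsilon\,,
\end{cases}
\tag{6.E.1}
\]
whereas the rate we show in this exercise is
\[
N = \widetilde{O}\Bigl(\frac{\kappa d^{1/2}}{\varepsilon} \vee \frac{\kappa^{4/3} d^{1/3}}{\varepsilon^{2/3}}\Bigr)
= \begin{cases}
\kappa d^{1/2}/\varepsilon\,, & \kappa \le \sqrt{d}/\varepsilon\,, \\
\kappa^{4/3} d^{1/3}/\varepsilon^{2/3}\,, & \kappa \ge \sqrt{d}/\varepsilon\,.
\end{cases}
\tag{6.E.2}
\]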

1. Check that the rate (6.E.2) is indeed better than (6.E.1).



The main idea behind the improved rate is that throughout the proof of Theorem 6.1.1, we used the inequality
\[
\mathbb{E}[\|\nabla V(X_{kh})\|^2] \le \beta^2\, \mathbb{E}[\|X_{kh}\|^2]\,, \tag{6.E.3}
\]
which is wasteful. Instead, we will show that
\[
\frac{1}{N} \sum_{k=0}^{N-1} \mathbb{E}[\|\nabla V(X_{kh})\|^2] \lesssim \beta d\,. \tag{6.E.4}
\]
Note that since we expect $\mathbb{E}[\|X_{kh}\|^2] \asymp d/\alpha$, the new bound (6.E.4) is an improvement by a factor of $\kappa$.
2. Rewrite the proof of Theorem 6.1.1, avoiding the use of the inequality (6.E.3), and leaving the error bound in terms of $\sum_{k=0}^{N-1} \mathbb{E}[\|\nabla V(X_{kh})\|^2]$. As a sanity check, the result should imply the rate (6.E.2) once the key inequality (6.E.4) is proven.
3. Applying Itô's formula (Theorem 1.1.18) to $V(\bar X_t)$, write down an expression for $\mathbb{E}[V(\bar X_{(k+1)h}) - V(X_{kh})]$. By bounding the error terms carefully and assuming that $h \lesssim \frac{1}{\beta}$, prove that
\[
\mathbb{E}[V(X_{(k+1)h}) - V(\bar X_{(k+1)h})] \ge \mathbb{E}[V(X_{(k+1)h}) - V(X_{kh})] + \frac{h}{4}\, \mathbb{E}[\|\nabla V(X_{kh})\|^2] - O(\beta d h)\,.
\]
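4. Using the smoothness inequality,
\[
\begin{aligned}
\mathbb{E}[V(X_{(k+1)h}) \mid \mathcal{F}_{(k+1)h}]
&\le V(\bar X_{(k+1)h}) + \langle \nabla V(\bar X_{(k+1)h}),\, \mathbb{E}[X_{(k+1)h} \mid \mathcal{F}_{(k+1)h}] - \bar X_{(k+1)h} \rangle \\
&\qquad + \frac{\beta}{2}\, \mathbb{E}[\|X_{(k+1)h} - \bar X_{(k+1)h}\|^2 \mid \mathcal{F}_{(k+1)h}]\,.
\end{aligned}
\]
Applying the Cauchy–Schwarz and Young inequalities to the middle term,
\[
\langle \nabla V(\bar X_{(k+1)h}),\, \mathbb{E}[X_{(k+1)h} \mid \mathcal{F}_{(k+1)h}] - \bar X_{(k+1)h} \rangle
\le \lambda\, \|\nabla V(\bar X_{(k+1)h})\|^2 + \frac{1}{\lambda}\, \|\mathbb{E}[X_{(k+1)h} \mid \mathcal{F}_{(k+1)h}] - \bar X_{(k+1)h}\|^2
\]
for an appropriate choice of $\lambda > 0$. Use this to show that
\[
\mathbb{E}[V(X_{(k+1)h}) - V(\bar X_{(k+1)h})] \lesssim \beta h^2\, \mathbb{E}[\|\nabla V(X_{kh})\|^2] + \beta^3 d h^3\,.
\]

5. Combining these inequalities, assuming that $\beta h$ is sufficiently small and that we initialize with $X_0 = \arg\min V$, prove the key inequality (6.E.4).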

Hamiltonian Monte Carlo

Exercise 6.3 (Gaussian calculations for HMC)

Suppose that the target $\pi$ is a standard Gaussian distribution. Compute the flow map $F_t$ for the Hamiltonian dynamics. Also, if we start at the initial distribution $\mu_0 = \mathrm{normal}(0, \sigma^2 I_d)$, show that the distribution $\boldsymbol{\mu}_t$ over phase space at time $t$ of ideal HMC is a Gaussian distribution, $\mathrm{normal}(0, \Sigma_t)$, and compute $\Sigma_t \in \mathbb{R}^{2d \times 2d}$.

Exercise 6.4 (basic properties of Hamiltonian dynamics)

In this exercise, we explore some fundamental properties of the Hamiltonian dynamics.

1. (conservation of energy) Along the Hamiltonian dynamics $(x_t, p_t)_{t \ge 0}$, show that $H(x_t, p_t) = H(x_0, p_0)$. In fact, for any function $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ whose Poisson bracket with $H$ vanishes, i.e.,
\[
\{f, H\} := \langle \nabla_x f, \nabla_p H \rangle - \langle \nabla_p f, \nabla_x H \rangle = 0\,,
\]
it holds that $f(x_t, p_t) = f(x_0, p_0)$.

2. (conservation of volume) By differentiating $t \mapsto \det \nabla F_t(x, p)$ and using the flow map equation $\partial_t F_t(x, p) = \boldsymbol{J}\, \nabla H(F_t(x, p))$, prove that $\det \nabla F_t(x, p) = 1$ for all $t \ge 0$ and $x, p \in \mathbb{R}^d$. This shows that $F_t : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$ is a volume-preserving map.

3. (time reversibility) Suppose that $(x_t, p_t)_{t \in [0,T]}$ solve Hamilton's equations. Show that $(x_{T-t}, -p_{T-t})_{t \in [0,T]}$ also solve Hamilton's equations. In other words, if $R$ is the momentum reversal operator, i.e.,
\[
R = \begin{pmatrix} I_d & 0 \\ 0 & -I_d \end{pmatrix},
\]
then $F_T^{-1} = R \circ F_T \circ R$.

Exercise 6.5 (coercivity)

Prove Lemma 6.2.3.
Hint: Let $z := y - \frac{1}{\beta}\, \{\nabla f(y) - \nabla f(x)\}$. Apply the convexity inequality to $f(x) - f(z)$, and the smoothness inequality to $f(z) - f(y)$, in order to upper bound $f(x) - f(y)$. Combine this with the symmetric inequality for $f(y) - f(x)$.

The Underdamped Langevin Diffusion

Exercise 6.6 (Fokker–Planck equation for the underdamped Langevin diffusion)

Compute the generator $\mathcal{L}$ of the underdamped Langevin diffusion and show that it can be written as $\mathcal{L} = \gamma \mathcal{L}_{\rm OU} + \mathcal{L}_{\rm Ham}$, where $\mathcal{L}_{\rm OU}$ is the generator of the Ornstein–Uhlenbeck process (Exercise 1.5) acting on the momentum coordinate,
\[
\mathcal{L}_{\rm OU} f := \Delta_p f - \langle p, \nabla_p f \rangle\,,
\]
and $\mathcal{L}_{\rm Ham}$ captures the Hamiltonian part of the dynamics,
\[
\mathcal{L}_{\rm Ham} f := \langle p, \nabla_x f \rangle - \langle \nabla V, \nabla_p f \rangle\,.
\]
Then, check the various calculations leading up to the Fokker–Planck equation (6.3.1).

Exercise 6.7 (derivation of the ULMC updates)


Solve the SDE for ULMC to prove Lemma 6.3.4.

Exercise 6.8 (movement bound for the underdamped Langevin diffusion)


Prove the movement bound for underdamped Langevin (Lemma 6.3.7).
CHAPTER 7

High-Accuracy Samplers

So far, we have focused on discretizations of diffusions. Discretization of a continuous-time Markov process yields a discrete-time Markov chain whose stationary distribution is
no longer equal to the target 𝜋; the algorithm is biased. Nevertheless, we showed that the
size of the bias can be made smaller than any desired accuracy 𝜀 by choosing a small step
size ℎ, which then leads to quantitative sampling guarantees.
However, the number of iterations of the algorithm is proportional to the inverse
step size 1/ℎ, and consequently the complexity of the algorithms scaled as poly(1/𝜀). In
this section, we address the problem of designing high-accuracy samplers, i.e., samplers
whose complexity scales as polylog(1/𝜀). To accomplish this, we must fix the bias of the
sampling algorithm, which is accomplished via the Metropolis–Hastings filter.

7.1 Rejection Sampling


Before introducing the Metropolis–Hastings filter, we begin with a warm up and introduce
the concept of rejection via the rejection sampling algorithm.
Rejection Sampling: Let $\pi$ be the target distribution and let $\tilde\pi$ be an unnormalized version of $\pi$, i.e., $\tilde\pi \propto \pi$. Suppose we can sample from a distribution $\mu$ and that an unnormalized version $\tilde\mu$ of $\mu$ satisfies $\tilde\mu \ge \tilde\pi$. Then, repeat until acceptance:
1. Draw $X \sim \mu$.
2. Accept $X$ with probability $\tilde\pi(X)/\tilde\mu(X)$.


Unlike the other sampling algorithms we have considered thus far, rejection sampling
always terminates with an exact sample from 𝜋.

Theorem 7.1.1. The output of rejection sampling is a sample drawn exactly from $\pi$. Also, the number of samples drawn from $\mu$ until a sample is accepted follows a geometric distribution with mean $Z_\mu/Z_\pi$, where $Z_\mu := \int \tilde\mu$ and $Z_\pi := \int \tilde\pi$.

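Proof. The probability of acceptance is
\[
\mathbb{P}(\text{acceptance}) = \int \frac{\tilde\pi}{\tilde\mu}\, \mathrm{d}\mu = \frac{Z_\pi}{Z_\mu} \int \frac{\pi}{\mu}\, \mathrm{d}\mu = \frac{Z_\pi}{Z_\mu}
\]
and clearly the number of samples drawn until acceptance is geometrically distributed. To show that the output $X$ of rejection sampling is drawn exactly according to $\pi$, let $(U_i)_{i=1}^\infty \overset{\text{i.i.d.}}{\sim} \mathrm{uniform}[0,1]$ and $(X_i)_{i=1}^\infty \overset{\text{i.i.d.}}{\sim} \mu$ be independent. Then, for any event $A$,
\[
\begin{aligned}
\mathbb{P}(X \in A)
&= \sum_{n=0}^\infty \mathbb{P}\Bigl(X_{n+1} \in A,\; U_i > \frac{\tilde\pi(X_i)}{\tilde\mu(X_i)} \;\forall i \in [n],\; U_{n+1} \le \frac{\tilde\pi(X_{n+1})}{\tilde\mu(X_{n+1})}\Bigr) \\
&= \sum_{n=0}^\infty \mathbb{P}\Bigl(X_{n+1} \in A,\; U_{n+1} \le \frac{\tilde\pi(X_{n+1})}{\tilde\mu(X_{n+1})}\Bigr)\, \mathbb{P}\Bigl(U_1 > \frac{\tilde\pi(X_1)}{\tilde\mu(X_1)}\Bigr)^n \\
&= \sum_{n=0}^\infty \Bigl(\int_A \frac{\tilde\pi}{\tilde\mu}\, \mathrm{d}\mu\Bigr) \Bigl(1 - \int \frac{\tilde\pi}{\tilde\mu}\, \mathrm{d}\mu\Bigr)^n
= \pi(A)\, \frac{Z_\pi}{Z_\mu} \sum_{n=0}^\infty \Bigl(1 - \frac{Z_\pi}{Z_\mu}\Bigr)^n = \pi(A)\,. \qquad \square
\end{aligned}
\]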

The rejection sampling algorithm requires the construction of the upper envelope $\tilde\mu$. We now demonstrate how to construct this envelope for our usual class of distributions. Namely, suppose that $\pi \propto \exp(-V)$ satisfies $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$, and that $\nabla V(0) = 0$. We can assume that our unnormalized version $\tilde\pi = \exp(-V)$ of $\pi$ satisfies $V(0) = 0$ (if not, replace $V$ by $V - V(0)$). Then, by strong convexity of $V$, we see that $\tilde\pi \le \exp(-\frac{\alpha}{2}\, \|\cdot\|^2)$, and we can take $\tilde\mu := \exp(-\frac{\alpha}{2}\, \|\cdot\|^2)$, which means that the normalized distribution is $\mu = \mathrm{normal}(0, \alpha^{-1} I_d)$. To understand the efficiency of rejection sampling, we need to bound the ratio $Z_\mu/Z_\pi$ of normalizing constants. By smoothness of $V$,
\[
\frac{Z_\mu}{Z_\pi} = \frac{(2\pi/\alpha)^{d/2}}{\int \exp(-V)} \le \frac{(2\pi/\alpha)^{d/2}}{\int \exp(-\frac{\beta}{2}\, \|\cdot\|^2)} = \frac{(2\pi/\alpha)^{d/2}}{(2\pi/\beta)^{d/2}} = \kappa^{d/2}\,,
\]
with $\kappa := \beta/\alpha$. We summarize this result in the following proposition.



Proposition 7.1.2. Let the target $\pi \propto \exp(-V)$ on $\mathbb{R}^d$ satisfy $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$, $V(0) = 0$, and $\nabla V(0) = 0$. Then, rejection sampling with the envelope $\tilde\mu := \exp(-\frac{\alpha}{2}\, \|\cdot\|^2)$ returns an exact sample from $\pi$ with a number of iterations that is a geometric random variable with mean at most $\kappa^{d/2}$, where $\kappa := \beta/\alpha$.
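
As a concrete illustration of Proposition 7.1.2, here is a minimal Python sketch of rejection sampling with the Gaussian envelope. The function names `rejection_sample` and `make_quadratic`, the `max_iters` safeguard, and the diagonal quadratic test potential are illustrative assumptions and do not come from the text.

```python
import math
import random


def rejection_sample(V, alpha, d, rng, max_iters=1_000_000):
    """Draw one exact sample from pi ∝ exp(-V) with envelope exp(-alpha/2 |x|^2).

    Assumes V(0) = 0, grad V(0) = 0, and that V is alpha-strongly convex, so
    exp(-V(x)) <= exp(-alpha/2 |x|^2). Returns (sample, number of proposals).
    """
    sigma = 1.0 / math.sqrt(alpha)  # proposal mu = normal(0, alpha^{-1} I_d)
    for n in range(1, max_iters + 1):
        x = [rng.gauss(0.0, sigma) for _ in range(d)]
        sq_norm = sum(xi * xi for xi in x)
        # log of pi_tilde(x) / mu_tilde(x); non-positive by strong convexity
        log_accept = -V(x) + 0.5 * alpha * sq_norm
        # accept when U <= pi_tilde / mu_tilde, with U uniform on (0, 1]
        if math.log(1.0 - rng.random()) <= log_accept:
            return x, n
    raise RuntimeError("no acceptance within max_iters proposals")


def make_quadratic(coeffs):
    """Hypothetical test potential V(x) = sum_i c_i x_i^2 / 2."""
    return lambda x: 0.5 * sum(c * xi * xi for c, xi in zip(coeffs, x))
```

For the potential with coefficients (1, 4), we have $\alpha = 1$, $\beta = 4$, and $Z_\mu/Z_\pi = 2$, which lies below the worst-case bound $\kappa^{d/2} = 4$; about two proposals per accepted sample are therefore expected on average.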

The rejection sampling guarantee can be formulated in one of two ways. We can think of the algorithm as returning an exact sample from $\pi$, with a random number of iterations (the number of iterations is geometrically distributed). Alternatively, if we place an upper bound $N$ on the number of iterations of the algorithm and output "FAIL" if we have not terminated by iteration $N$, then the probability of "FAIL" is at most $\varepsilon := (1 - 1/\kappa^{d/2})^N$, and if $\mu_N$ denotes the law of the output of the algorithm, then $\|\mu_N - \pi\|_{\rm TV} \le \varepsilon$. If we flip this around and fix the target accuracy $\varepsilon$, we see that the number of iterations required to achieve this guarantee is $N \ge \kappa^{d/2} \ln(1/\varepsilon)$.
Although this result is acceptable in low dimension, the complexity of this approach
quickly becomes intractable even for moderately high-dimensional problems. In the next
section, we will see that by combining the idea of rejection with local proposals, we can
obtain tractable sampling algorithms in high dimension.
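
To make the dimension dependence tangible, the following small Python sketch evaluates the failure bound $(1 - \kappa^{-d/2})^N$ and the sufficient iteration count $\kappa^{d/2} \ln(1/\varepsilon)$ discussed above. The function names are hypothetical, and the rounding up to an integer is an added convenience.

```python
import math


def failure_probability(kappa: float, d: int, n_iters: int) -> float:
    """Upper bound (1 - kappa^{-d/2})^N on the probability of outputting FAIL."""
    return (1.0 - kappa ** (-d / 2.0)) ** n_iters


def sufficient_iterations(kappa: float, d: int, eps: float) -> int:
    """N = ceil(kappa^{d/2} ln(1/eps)), which makes the failure bound <= eps."""
    return math.ceil(kappa ** (d / 2.0) * math.log(1.0 / eps))
```

With $\kappa = 4$ and $\varepsilon = 0.01$, nineteen iterations suffice in dimension $d = 2$, whereas $d = 20$ already requires several million.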

7.2 The Metropolis–Hastings Filter


A Metropolis–Hastings algorithm consists of proposing moves from a proposal kernel
𝑄, and then accepting or rejecting each move with a carefully chosen probability which
ensures that the resulting Markov chain has the desired stationary distribution 𝜋.
In more detail, let 𝑄 be a kernel on R𝑑 × R𝑑 , that is: for each 𝑥 ∈ R𝑑 , 𝑄 (𝑥, ·) is a
probability measure on R𝑑 . We will mostly consider proposals such that each 𝑄 (𝑥, ·) has
a density with respect to Lebesgue measure, and via an abuse of notation we will write
𝑄 (𝑥, 𝑦) for this density evaluated at 𝑦 (an exception is when we consider MHMC below).
Starting from 𝑋 ∈ R𝑑 , we tentatively propose a new point 𝑌 ∼ 𝑄 (𝑋, ·). We then accept
the point 𝑌 with probability 𝐴(𝑋, 𝑌 ) (called the acceptance probability); otherwise, we
stay at the old point 𝑋 . Iterate this process until convergence.
There are different possible choices for the acceptance probability $A$, but the choice we consider here is the Metropolis–Hastings filter
\[
A(x, y) := 1 \wedge \frac{\pi(y)\, Q(y, x)}{\pi(x)\, Q(x, y)}\,. \tag{7.2.1}
\]

The overall algorithm is summarized as follows.



Metropolis–Hastings algorithm (with proposal $Q$): initialize at a point $X_0 \in \mathbb{R}^d$. Then, iterate the following steps for $k = 1, 2, 3, \dotsc$:
1. Propose a new point $Y_k \sim Q(X_{k-1}, \cdot)$.
2. With probability $A(X_{k-1}, Y_k)$, set $X_k := Y_k$; otherwise, set $X_k := X_{k-1}$. Here, $A$ is the acceptance probability defined via (7.2.1).
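
The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `metropolis_hastings` and the callback interface (`log_pi`, `propose`, `log_q`) are hypothetical choices made for illustration, and the filter (7.2.1) is evaluated on the log scale for numerical stability.

```python
import math
import random


def metropolis_hastings(log_pi, propose, log_q, x0, n_steps, rng):
    """Run n_steps of Metropolis-Hastings and return (chain, acceptance rate).

    log_pi(x):       log of the target density, up to an additive constant
    propose(x, rng): draws y ~ Q(x, .)
    log_q(x, y):     log of the proposal density Q(x, y), up to a constant
                     that does not depend on x or y
    """
    x = x0
    chain = [x]
    accepted = 0
    for _ in range(n_steps):
        y = propose(x, rng)
        # log of pi(y) Q(y, x) / (pi(x) Q(x, y)); normalizing constants cancel
        log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
        # accept with probability A(x, y) = min(1, ratio)
        if math.log(1.0 - rng.random()) <= min(0.0, log_ratio):
            x = y
            accepted += 1
        chain.append(x)
    return chain, accepted / n_steps
```

Note that only differences of `log_pi` appear, which reflects the fact that the normalization constant of $\pi$ is never needed.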
This algorithm defines a discrete-time Markov chain whose transition kernel $T$ can be written explicitly as
\[
T(x, \mathrm{d}y) = Q(x, \mathrm{d}y)\, A(x, y) + \underbrace{\Bigl(1 - \int Q(x, \mathrm{d}y')\, A(x, y')\Bigr)}_{\text{rejection probability}}\, \delta_x(\mathrm{d}y)\,. \tag{7.2.2}
\]

A discrete-time Markov chain with transition kernel $P$ is called reversible with respect to $\pi$ if it holds that $\pi(\mathrm{d}x)\, P(x, \mathrm{d}y) = \pi(\mathrm{d}y)\, P(y, \mathrm{d}x)$. Similarly to our discussion in Section 1.2, discrete-time reversible Markov chains can be studied via spectral theory.

Theorem 7.2.3. The Metropolis–Hastings algorithm with proposal $Q$ is reversible with respect to $\pi$.

Proof. We want to check that $\pi(x)\, T(x, y) = \pi(y)\, T(y, x)$ for all $x, y \in \mathbb{R}^d$ with $x \ne y$. We can write
\[
\begin{aligned}
\pi(x)\, T(x, y) = \pi(x)\, Q(x, y)\, A(x, y)
&= \pi(x)\, Q(x, y) \min\Bigl\{1, \frac{\pi(y)\, Q(y, x)}{\pi(x)\, Q(x, y)}\Bigr\} \\
&= \min\{\pi(x)\, Q(x, y),\, \pi(y)\, Q(y, x)\}
\end{aligned}
\]
and this expression is symmetric in $x$ and $y$. $\square$


Take note of the flexibility of the Metropolis–Hastings algorithm! Regardless of the
choice of proposal kernel 𝑄, the filter always makes the algorithm unbiased. Of course, the
choice of 𝑄 will be crucial later in order to guarantee rapid convergence to stationarity.

Implementability of the Metropolis–Hastings algorithm. To implement the algorithm, the proposal $Q$ must be simple enough that (1) we can sample from $Q(x, \cdot)$ easily, and (2) we can compute the density $Q(x, y)$ easily (which is required to compute the acceptance probability). Note that although the target density $\pi$ appears in the expression (7.2.1) for the acceptance probability, it only appears as a ratio, and in particular we do not need to know the normalization constant of $\pi$. Hence, the Metropolis–Hastings filter can be implemented using queries to the density of $\pi$ up to normalization, which are "zeroth-order queries" (unlike, e.g., LMC, which uses first-order information through queries to the gradient $\nabla V$).

Metropolis–Hastings as a projection. There is a nice geometric interpretation of the Metropolis–Hastings filter as a projection, due to [BD01]. Given a proposal kernel $Q$, let $T(Q)$ denote the Metropolis–Hastings kernel obtained from $Q$ (see (7.2.2)). Then, the mapping $Q \mapsto T(Q)$ is a projection of the proposal kernel $Q$ onto the space of reversible Markov chains with stationary distribution $\pi$ with respect to an $L^1$ notion of distance. The distance is defined as follows:
\[
\mathsf{d}(T, T') := \int_{(\mathbb{R}^d \times \mathbb{R}^d) \setminus \mathrm{diag}} |T(x, y) - T'(x, y)|\, \pi(\mathrm{d}x)\, \mathrm{d}y \tag{7.2.4}
\]
where $\mathrm{diag} := \{(x, x) \mid x \in \mathbb{R}^d\}$ is the diagonal in $\mathbb{R}^d \times \mathbb{R}^d$.

Theorem 7.2.5 ([BD01]). Let $\mathcal{R}(\pi)$ denote the space of kernels $T$ which are reversible with respect to $\pi$, and such that for each $x \in \mathbb{R}^d$, $T(x, \cdot)$ admits a density with respect to Lebesgue measure (except possibly having an atom at $x$). Then,
\[
T(Q) \in \operatorname*{arg\,min}_{T \in \mathcal{R}(\pi)} \mathsf{d}(Q, T)\,.
\]

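Proof. Let $T \in \mathcal{R}(\pi)$, and let $S := \{(x, y) \in \mathbb{R}^d \times \mathbb{R}^d \mid \pi(x)\, Q(x, y) > \pi(y)\, Q(y, x)\}$. Then,
\[
\begin{aligned}
\mathsf{d}(Q, T)
&= \int_{(\mathbb{R}^d \times \mathbb{R}^d) \setminus \mathrm{diag}} |Q(x, y) - T(x, y)|\, \pi(\mathrm{d}x)\, \mathrm{d}y \\
&= \int_S |Q(x, y) - T(x, y)|\, \pi(\mathrm{d}x)\, \mathrm{d}y + \int_{(\mathbb{R}^d \times \mathbb{R}^d) \setminus (S \cup \mathrm{diag})} |Q(x, y) - T(x, y)|\, \pi(\mathrm{d}x)\, \mathrm{d}y \\
&= \int_S |Q(x, y) - T(x, y)|\, \pi(\mathrm{d}x)\, \mathrm{d}y + \int_S |Q(y, x) - T(y, x)|\, \pi(\mathrm{d}y)\, \mathrm{d}x\,.
\end{aligned}
\]
Using reversibility of $T$, i.e., $\pi(y)\, T(y, x) = \pi(x)\, T(x, y)$, together with the triangle inequality, the second term is
\[
\begin{aligned}
\int_S |Q(y, x) - T(y, x)|\, \pi(\mathrm{d}y)\, \mathrm{d}x
&= \int_S |\pi(y)\, Q(y, x) - \pi(x)\, T(x, y)|\, \mathrm{d}x\, \mathrm{d}y \\
&\ge \int_S |\pi(x)\, Q(x, y) - \pi(y)\, Q(y, x)|\, \mathrm{d}x\, \mathrm{d}y - \int_S |Q(x, y) - T(x, y)|\, \pi(\mathrm{d}x)\, \mathrm{d}y\,.
\end{aligned}
\tag{7.2.6}
\]
Putting this together, $\mathsf{d}(Q, T) \ge \int_S |\pi(x)\, Q(x, y) - \pi(y)\, Q(y, x)|\, \mathrm{d}x\, \mathrm{d}y$, so we have obtained a lower bound which does not depend on $T$. On the other hand, we can check that the only inequality (7.2.6) that we used is an equality for $T = T(Q)$. $\square$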

7.3 An Overview of High-Accuracy Samplers


As we already discussed, the Metropolis–Hastings framework is quite flexible: by instan-
tiating it with different choices for the proposal 𝑄, we obtain several different algorithms.

Metropolized random walk (MRW). Perhaps the simplest proposal is to take $Q(x, \cdot) = \mathrm{normal}(x, h I_d)$, which yields the Metropolized random walk (MRW) algorithm. This corresponds to taking a random walk around the state space, where some steps are occasionally rejected. Note that since the proposal is independent of the target $\pi$, the overall algorithm only uses queries to the density of $\pi$ up to normalization (to implement the filter); thus, it is the only algorithm we have discussed so far (besides rejection sampling) which uses only a zeroth-order oracle for the potential $V$.

Metropolis-adjusted Langevin algorithm (MALA). A better choice of proposal is
\[
Q(x, \cdot) = \mathrm{normal}\bigl(x - h\, \nabla V(x),\, 2h I_d\bigr)
\]
which is simply one step of the LMC algorithm; this yields the Metropolis-adjusted Langevin algorithm (MALA). We will carefully study the convergence guarantees for MALA in this chapter.
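
For concreteness, a single MALA step can be sketched in Python as below. This is a minimal sketch assuming the potential `V` and its gradient `grad_V` are supplied as callables on lists of floats; the name `mala_step` is hypothetical.

```python
import math
import random


def mala_step(x, V, grad_V, h, rng):
    """One MALA step: an LMC proposal followed by the Metropolis-Hastings filter.

    Returns (next point, whether the proposal was accepted).
    """
    def log_q(a, b, grad_a):
        # log density, up to a constant, of normal(a - h grad V(a), 2h I) at b
        return -sum((bi - ai + h * gi) ** 2
                    for ai, bi, gi in zip(a, b, grad_a)) / (4.0 * h)

    gx = grad_V(x)
    y = [xi - h * gi + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
         for xi, gi in zip(x, gx)]
    gy = grad_V(y)
    # log of pi(y) Q(y, x) / (pi(x) Q(x, y)) with pi ∝ exp(-V)
    log_ratio = (-V(y) + log_q(y, x, gy)) - (-V(x) + log_q(x, y, gx))
    if math.log(1.0 - rng.random()) <= min(0.0, log_ratio):
        return y, True
    return x, False
```

Each step uses two gradient evaluations and two potential evaluations here; a more careful implementation would cache the values at the current point.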

Metropolized Hamiltonian Monte Carlo (MHMC). Recall the Hamiltonian Monte Carlo (HMC) algorithm that we introduced in Section 6.2. The ideal HMC algorithm is not implementable because it requires the ability to exactly integrate Hamilton's equations, and this is generally not possible outside of a few special cases.
We now consider approximately implementing Hamilton's equations through the use of a numerical integrator. Although several choices are available, for Hamilton's equations it is preferable to use a symplectic integrator.¹ We will focus on the simplest and most well-known such integrator, called the leapfrog integrator.
¹ When placed within the framework of geometry, Hamiltonian mechanics is encoded via symplectic geometry, which is the study of manifolds equipped with a symplectic 2-form. The flow map for Hamilton's equations preserves this symplectic form, and is therefore known as a symplectomorphism. Symplectic integrators are special integrators which also preserve the symplectic form. This property leads to stability, especially for long integration times.

Leapfrog Integrator: Pick a step size $h > 0$ and a total number of iterations $K$, corresponding to the total integration time via $T = Kh$. Let $(x_0, p_0)$ be the initial point. For $k = 0, 1, 2, \dotsc, K - 1$:

1. Set $p_{(k+\frac12)h} := p_{kh} - \frac{h}{2}\, \nabla V(x_{kh})$.

2. Set $x_{(k+1)h} := x_{kh} + h\, p_{(k+\frac12)h}$.

3. Set $p_{(k+1)h} := p_{(k+\frac12)h} - \frac{h}{2}\, \nabla V(x_{(k+1)h})$.
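
The three leapfrog updates can be sketched in Python as follows. The helper names (`leapfrog`, `hamiltonian`) and the test potential $V(x) = \sum_i [x_i^2/2 + \ln\cosh x_i]$ are assumptions made purely for illustration.

```python
import math


def leapfrog(x, p, grad_V, h, K):
    """Run K leapfrog steps of size h from (x, p); returns the final (x, p)."""
    x, p = list(x), list(p)
    for _ in range(K):
        g = grad_V(x)
        p = [pi - 0.5 * h * gi for pi, gi in zip(p, g)]  # half step in momentum
        x = [xi + h * pi for xi, pi in zip(x, p)]        # full step in position
        g = grad_V(x)
        p = [pi - 0.5 * h * gi for pi, gi in zip(p, g)]  # half step in momentum
    return x, p


def hamiltonian(x, p, V):
    """H(x, p) = V(x) + |p|^2 / 2."""
    return V(x) + 0.5 * sum(pi * pi for pi in p)


def example_V(x):
    """Hypothetical smooth, strongly convex test potential."""
    return sum(0.5 * xi * xi + math.log(math.cosh(xi)) for xi in x)


def example_grad_V(x):
    return [xi + math.tanh(xi) for xi in x]
```

Running the integrator, flipping the sign of the momentum, and running it again returns to the starting position with the momentum negated, up to floating-point error. The energy is only approximately conserved, with an error that shrinks as $h$ decreases.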

Once we apply the leapfrog integrator to HMC, we obtain a discrete-time sampling algorithm which is once again biased. We then correct the bias through the use of the Metropolis–Hastings filter. Specifically, for an integration time $T = Kh$, let
\[
F_{\rm leap}(x, p) = \text{output } (x_T, p_T) \text{ of the leapfrog integrator with } K \text{ steps, started at } (x, p)\,.
\]

Remarkably, the acceptance probability can be computed in closed form, and this relies on
specific properties of the leapfrog integrator. The full algorithm is summarized as follows.
Metropolized Hamiltonian Monte Carlo (MHMC): Initialize at $X_0 \sim \mu_0$. For iterations $k = 0, 1, 2, \dotsc$:

1. Refresh the momentum: draw $P_k \sim \mathrm{normal}(0, I_d)$.

2. Propose a trajectory: let $(X_k', P_k') := F_{\rm leap}(X_k, P_k)$.

3. Accept the trajectory with probability $1 \wedge \exp\{H(X_k, P_k) - H(X_k', P_k')\}$. If the trajectory is accepted, set $X_{k+1} := X_k'$; otherwise, set $X_{k+1} := X_k$.
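
The three steps of MHMC can be sketched in Python as follows, with the leapfrog integrator inlined so that the block is self-contained. The function name `mhmc_step` and the callable interface for `V` and `grad_V` are hypothetical, and the Hamiltonian is taken to be $H(x, p) = V(x) + \frac12\, \|p\|^2$.

```python
import math
import random


def mhmc_step(x, V, grad_V, h, K, rng):
    """One MHMC iteration; returns (next position, whether it was accepted)."""
    # 1. Refresh the momentum.
    p = [rng.gauss(0.0, 1.0) for _ in x]
    # 2. Propose a trajectory with K leapfrog steps of size h.
    x_new, p_new = list(x), list(p)
    for _ in range(K):
        g = grad_V(x_new)
        p_new = [pi - 0.5 * h * gi for pi, gi in zip(p_new, g)]
        x_new = [xi + h * pi for xi, pi in zip(x_new, p_new)]
        g = grad_V(x_new)
        p_new = [pi - 0.5 * h * gi for pi, gi in zip(p_new, g)]
    # 3. Accept with probability min(1, exp(H(x, p) - H(x_new, p_new))).
    energy_old = V(x) + 0.5 * sum(pi * pi for pi in p)
    energy_new = V(x_new) + 0.5 * sum(pi * pi for pi in p_new)
    if math.log(1.0 - rng.random()) <= min(0.0, energy_old - energy_new):
        return x_new, True
    return x, False
```

Only the position is returned because the momentum is discarded and redrawn at the start of the next iteration.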

It turns out that when 𝐾 = 1, the MHMC algorithm reduces to MALA (Exercise 7.1).
We next justify why the MHMC algorithm leaves 𝝅 invariant. Actually, although we
have written down the MHMC algorithm in the form which is easiest to implement, it
obscures the underlying structure of the algorithm. The proof of the next theorem will
clarify this point.

Theorem 7.3.1. The augmented target distribution 𝝅 ∝ exp(−𝐻 ) is invariant for the
MHMC algorithm.

Proof. First, we note that the step of refreshing the momentum leaves 𝝅 invariant, so it
suffices to study the proposal and acceptance steps.

For the moment, let us pretend that the proposal actually uses the true flow map $F_T$ (which exactly integrates Hamilton's equations for time $T$) rather than the leapfrog integrator $F_{\rm leap}$. Then, the proposal kernel is deterministic, $Q((x, p), \cdot) = \delta_{F_T(x, p)}$. Up until now, we have been assuming that the proposal kernel admits a density w.r.t. Lebesgue measure, which certainly does not hold here, but we will brush over this technicality as it is not the key point here.
If we naïvely apply the Metropolis–Hastings filter, the probability of accepting a proposal $(x', p')$ starting from $(x, p)$ involves a ratio $Q((x', p'), (x, p))/Q((x, p), (x', p'))$, but this ratio is ill-defined in our setting. The problem is that if $(x', p') = F_T(x, p)$, then it is not the case that $(x, p) = F_T(x', p')$; the proposal is not reversible. Hence, we would be led to reject every single trajectory.
To fix this, recall from Exercise 6.4 that we have the following time reversibility property: if $R$ denotes the momentum flip operator $(x, p) \mapsto (x, -p)$, then it holds that $F_T^{-1} = R \circ F_T \circ R$. It implies that $F_T \circ R = R \circ F_T^{-1} = (F_T \circ R)^{-1}$, so $F_T \circ R$ is an involution. In other words, if we use the proposal $F_T \circ R$ (i.e., first flip the momentum before integrating Hamilton's equations), then the proposal would be reversible and the above issue does not arise, as the ratio $Q((x', p'), (x, p))/Q((x, p), (x', p'))$ would equal 1. Observe also that using $F_T \circ R$ instead of $F_T$ does not change the algorithm since we refresh the momentum at each step (and if $P_k \sim \mathrm{normal}(0, I_d)$, then $-P_k \sim \mathrm{normal}(0, I_d)$ as well).
Once we use the proposal $(x', p') = (F_T \circ R)(x, p) = F_T(x, -p)$, the Metropolis–Hastings acceptance probability is calculated to be
\[
1 \wedge \frac{\boldsymbol{\pi}(x', p')}{\boldsymbol{\pi}(x, p)} = 1 \wedge \exp\{H(x, p) - H(x', p')\}\,. \tag{7.3.2}
\]
When we use the exact flow map $F_T$, then the Hamiltonian is conserved (Exercise 6.4) so the above probability is one; every trajectory is accepted. However, the above expression is indeed meaningful if we instead use the leapfrog integrator $F_{\rm leap}$.
So far, we have motivated the expression (7.3.2) based on the exact flow map $F_T$, but clearly the above argument holds just as well for the leapfrog integrator $F_{\rm leap}$ as soon as we verify the property $F_{\rm leap}^{-1} = R \circ F_{\rm leap} \circ R$, and this is where we use the specific form of the leapfrog integrator. We leave the verification as Exercise 7.2. $\square$

Remark 7.3.3. The proof shows that the proposal of MHMC should really be thought of as $F_{\rm leap} \circ R$, instead of $F_{\rm leap}$. In fact, if we did not refresh the momentum, then repeatedly applying the involution $F_{\rm leap} \circ R$ would just cause the algorithm to jump back and forth between two points $(x, p)$ and $(x', p')$, which is silly; hence one should also apply another momentum flip after the filter. In symbols, if MH denotes the Metropolis–Hastings filter step, and Refresh denotes the momentum refreshment step, we should think of MHMC as the composition
\[
\mathrm{MHMC} = R \circ \mathrm{MH}(F_{\rm leap} \circ R) \circ \mathrm{Refresh}\,.
\]
This is simplified to
\[
\mathrm{MHMC} = \mathrm{MH}(F_{\rm leap} \circ R) \circ \mathrm{Refresh}
\]
because $\mathrm{Refresh} \circ R = \mathrm{Refresh}$.

Lazy chains. Technically, many of the convergence results actually hold for lazy versions of the Markov chain. Specifically, for $\ell \in [0, 1]$, the $\ell$-lazy version of a Markov chain replaces its transition kernel $T$ with the modified kernel $T_\ell$ given by
\[
T_\ell(x, \mathrm{d}y) = (1 - \ell)\, T(x, \mathrm{d}y) + \ell\, \delta_x(\mathrm{d}y)\,.
\]
The laziness condition is familiar from the study of discrete-time Markov chains on discrete state spaces, in which laziness is useful for avoiding periodic behavior. For the remainder of this section, we will generally be considering $\frac12$-lazy versions of the Metropolis–Hastings chains without explicitly mentioning this. In any case, this modification only multiplies the mixing time by a factor of 2, so it does not significantly alter the results.
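
The role of laziness in removing periodicity can be seen on a small discrete example. The sketch below, in Python, uses the simple random walk on a 4-cycle, which is an illustrative choice and not an example from the text; the helper names are hypothetical.

```python
def cycle_walk(n):
    """Simple random walk on the n-cycle: jump to a uniformly chosen neighbour."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][(i + 1) % n] += 0.5
        P[i][(i - 1) % n] += 0.5
    return P


def lazy_kernel(P, ell=0.5):
    """The ell-lazy kernel (1 - ell) P + ell * identity."""
    n = len(P)
    return [[(1.0 - ell) * P[i][j] + (ell if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def evolve(mu, P, n_steps):
    """Law of the chain after n_steps, started from the distribution mu."""
    n = len(mu)
    for _ in range(n_steps):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu
```

Started from a single state, the non-lazy walk alternates between the even and odd states forever and never approaches the uniform distribution, whereas its $\frac12$-lazy version converges to uniform geometrically fast.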

Feasible start vs. warm start. When discussing Metropolis–Hastings algorithms, we must distinguish between convergence rates when initialized at a feasible start vs. a warm start. These terms are not precisely defined, but loosely speaking a feasible start refers to an easily computable distribution which works well uniformly over the class of target distributions under consideration. In this section, a feasible start usually refers to the $\mathrm{normal}(0, \beta^{-1} I_d)$ distribution, where $\beta$ is the smoothness of $V$ and we assume that the minimizer of $V$ is $0$. On the other hand, a warm start is a distribution which is already somewhat close to the target $\pi$; for this section, it can be taken to mean a distribution $\mu_0$ such that $\chi^2(\mu_0 \,\|\, \pi) = O(1)$. Unsurprisingly, the rates are faster with a warm start.
The situation at hand is similar to the discussion in Section 1.5. Basically, the simplest way to study a Metropolis–Hastings chain is via spectral theory, which is related to Poincaré inequalities and hence to the chi-squared divergence at initialization. We also know that Poincaré inequalities tend to yield poor convergence guarantees in continuous time, which can be remedied via stronger inequalities (such as a log-Sobolev inequality). To an extent, this is also possible for Metropolis–Hastings algorithms. However, it is a fairly recent² finding that for MALA there is an intrinsic and substantial difference in convergence rates for the feasible and warm start cases, even under the assumption of strong log-concavity. This is unlike the case of LMC; e.g., the guarantee of Theorem 4.2.5 is not significantly improved by assuming $\mathrm{KL}(\mu_0 \,\|\, \pi) = O(1)$.
² The phenomenon described here is anticipated, at least qualitatively, from older work on Markov chains. The recent part of this story is the quantitative study of this effect in the context of MALA.

State-of-the-art results. We now give the current state-of-the-art convergence guarantees for the Metropolis–Hastings algorithms that we have introduced.

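Theorem 7.3.4 (feasible start case, [Dwi+19; Che+20a; LST20]). Suppose that the target $\pi \propto \exp(-V)$ satisfies $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ and $\nabla V(0) = 0$. Consider the following Metropolis–Hastings algorithms initialized at $\mathrm{normal}(0, \beta^{-1} I_d)$ and with an appropriately tuned choice of parameters.

1. MRW outputs a measure $\mu_N$ satisfying $\sqrt{\chi^2(\mu_N \,\|\, \pi)} \le \varepsilon$ after
\[
N = \widetilde{O}\Bigl(\kappa^2 d\, \operatorname{polylog} \frac{1}{\varepsilon}\Bigr) \quad \text{iterations.}
\]

2. MALA outputs a measure $\mu_N$ satisfying $\sqrt{\chi^2(\mu_N \,\|\, \pi)} \le \varepsilon$ after
\[
N = \widetilde{O}\Bigl(\kappa d\, \operatorname{polylog} \frac{1}{\varepsilon}\Bigr) \quad \text{iterations.}
\]

3. Assume in addition that $\nabla^3 V$ is bounded and that $\kappa \ll d$. Then, MHMC outputs a measure $\mu_N$ satisfying $\sqrt{\chi^2(\mu_N \,\|\, \pi)} \le \varepsilon$ after
\[
N = \widetilde{O}\Bigl(\kappa^{3/4} d\, \operatorname{polylog} \frac{1}{\varepsilon}\Bigr) \quad \text{gradient queries.}
\]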

Note that the result for MHMC is not directly comparable because it makes a stronger
second-order smoothness assumption.
Next, we present the results under a warm start.

Theorem 7.3.5 (warm start case, [Che+21b; WSC21]). Suppose that $\pi \propto \exp(-V)$ satisfies $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$. Consider MALA initialized at a distribution satisfying $\chi^2(\mu_0 \,\|\, \pi) = O(1)$. Then, MALA outputs a measure $\mu_N$ satisfying $\sqrt{\mathrm{KL}(\mu_N \,\|\, \pi)} \le \varepsilon$ (or $\sqrt{\mathcal{R}_q(\mu_N \,\|\, \pi)} \le \varepsilon$ for any $1 \le q < 2$) after
\[
N = O\Bigl(\kappa d^{1/2} \operatorname{polylog} \frac{1}{\varepsilon}\Bigr) \quad \text{iterations.}
\]

Moreover, it is known that the results for MALA in both the feasible and warm start
cases are sharp in a suitable sense. The goal for the rest of the chapter is to prove these
MALA convergence results (up to some technical details).

7.4 Markov Chains in Discrete Time


As discussed in the introduction to this chapter, the key advantage of Metropolis–Hastings
algorithms is that they are unbiased and hence lead to high-accuracy algorithms. In order
to prove complexity bounds that scale as polylog(1/𝜀), where 𝜀 is the target accuracy,
it is important that we do not simply bound the distance between MALA and, e.g., the
continuous-time Langevin diffusion, as we did in Chapter 4. This is not to say that tools
from Chapter 4 are completely irrelevant, only that we must first develop some new
techniques for studying discrete-time Markov chains.

7.4.1 Markov Semigroup Theory


Let 𝑃 be a Markov kernel. It generates a discrete-time semigroup (𝑃 𝑘 )𝑘∈N , and some of the
ideas from Markov semigroup theory (Section 1.2) can be adapted to the present context.

Generator. We define the generator of the semigroup to be the operator $\mathcal{L} := P - \mathrm{id}$, acting on $L^2(\pi)$ via $Pf(x) := \int f(y)\, P(x, \mathrm{d}y)$ (where $\pi$ is the stationary distribution for $P$). Note that since the operator norm of $P$ is at most 1, $P - \mathrm{id}$ is always a negative operator (similarly to the infinitesimal generator $\mathcal{L}$ from Section 1.2).

Reversibility. We defined reversibility in Section 7.2 and showed that Metropolis–Hastings algorithms are reversible w.r.t. the target distribution $\pi$. For the rest of the section, we will focus on reversible Markov chains.

Spectral gap. The spectral gap of $P$ is the largest $\lambda > 0$ such that for all $f \in L^2(\pi)$ with $\mathbb{E}_\pi f = 0$,
\[
\langle f, (-\mathcal{L}) f \rangle_{L^2(\pi)} \ge \lambda\, \|f\|_{L^2(\pi)}^2\,.
\]
Equivalently, if $(X_0, X_1)$ are two successive iterates of the chain started at stationarity, we require
\[
2\lambda \operatorname{var} f(X_0) \le \mathbb{E}[|f(X_1) - f(X_0)|^2]\,. \tag{7.4.1}
\]

In analogy with Section 1.2, we also say that 𝑃 satisfies a Poincaré inequality with
constant 1/𝜆. We already saw in Section 2.7 that a Poincaré inequality is implied by a
lower bound on the coarse Ricci curvature.
The right-hand side of (7.4.1) can be interpreted as a Dirichlet energy,
\[
\mathcal{E}(f, f) := \langle f, (-\mathcal{L}) f \rangle_{L^2(\pi)}\,,
\]

and the Markov chain can be viewed as an 𝐿 2 (𝜋) gradient descent on the Dirichlet energy;
see Exercise 7.3. We have the following convergence result.

Theorem 7.4.2. Suppose that the spectral gap of $P$ is $\lambda > 0$. Then, for the law $(\mu_k)_{k \in \mathbb{N}}$ of the iterates of the $\frac12$-lazy version of $P$, we have
\[
\chi^2(\mu_N \,\|\, \pi) \le \exp(-\lambda N)\, \chi^2(\mu_0 \,\|\, \pi)\,.
\]
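
As a sanity check of Theorem 7.4.2, the decay bound can be verified numerically on a two-state chain, where the spectral gap is available in closed form. The example chain, the helper names, and the choice of transition probabilities below are illustrative assumptions in Python; they are not taken from the text.

```python
import math


def chi_square(mu, pi):
    """Chi-squared divergence between distributions on a finite state space."""
    return sum((m - p) ** 2 / p for m, p in zip(mu, pi))


def two_state_chain(a, b):
    """Kernel, stationary distribution, and spectral gap of a two-state chain.

    The chain jumps 0 -> 1 with probability a and 1 -> 0 with probability b.
    """
    P = [[1.0 - a, a], [b, 1.0 - b]]
    pi = [b / (a + b), a / (a + b)]
    gap = a + b  # id - P multiplies every mean-zero function by (a + b)
    return P, pi, gap


def lazy_step(mu, P):
    """One step of the 1/2-lazy chain: mu -> mu (id + P) / 2."""
    n = len(mu)
    moved = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return [0.5 * (m + mv) for m, mv in zip(mu, moved)]
```

Here the lazy chain contracts the chi-squared divergence by exactly $(1 - \lambda/2)^2$ per step, which is at most $\exp(-\lambda)$, in agreement with the theorem.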

Modified log-Sobolev inequality. We say that $P$ satisfies a modified log-Sobolev inequality (MLSI) with constant $C_{\rm MLSI}$ if for all $f \in L^2(\pi)$ with $f \ge 0$,
\[
\operatorname{ent}_\pi f \le \frac{C_{\rm MLSI}}{2}\, \mathcal{E}(f, \ln f)\,.
\]
We have already encountered this inequality as Definition 1.2.23, although there we simply
called it the log-Sobolev inequality. In the context of discrete Markov processes, however,
since the chain rule fails and the different variants of the log-Sobolev inequality are no
longer equivalent, it is worth being careful about the terminology.
It is trickier to deduce entropy decay from the MLSI in discrete time, and to avoid this issue we shall work in continuous time instead. The Markov kernel $P$ gives rise to the generator $\mathcal{L} := P - \mathrm{id}$, which in turn generates a continuous-time semigroup $(P_t)_{t \ge 0}$ via $P_t := \exp(t\mathcal{L})$. Note that the generator of $(P_t)_{t \ge 0}$ is also $\mathcal{L}$ and hence the Dirichlet energy for $(P_t)_{t \ge 0}$ coincides with the Dirichlet energy for $(P^k)_{k \in \mathbb{N}}$. Now, if we apply the calculation (1.2.22) to the semigroup $(P_t)_{t \ge 0}$, we find that under an MLSI,
\[
\mathrm{KL}(\mu P_t \,\|\, \pi) \le \exp\Bigl(-\frac{2t}{C_{\rm MLSI}}\Bigr)\, \mathrm{KL}(\mu \,\|\, \pi)\,,
\]
see Theorem 1.2.24.


Moreover, the continuous-time semigroup $(P_t)_{t \ge 0}$ can be simulated. Namely, let $(\tau_k)_{k \in \mathbb{N}^+} \overset{\text{i.i.d.}}{\sim} \mathrm{exponential}(1)$, $T_k := \sum_{j=1}^k \tau_j$, and consider the following algorithm. Initialize at $X_0 \sim \mu_0$, and for $k = 0, 1, 2, \dotsc$, let $X_{T_{k+1}} \sim P(X_{T_k}, \cdot)$, so that $(X_{T_k})_{k \in \mathbb{N}}$ are the iterates of the discrete-time Markov chain with kernel $P$. Also, for $t \ge 0$, if $T_k \le t < T_{k+1}$, then set $X_t := X_{T_k}$. This yields a continuous-time Markov process $(X_t)_{t \ge 0}$, and one can check that the associated Markov semigroup is exactly $(P_t)_{t \ge 0}$. Moreover, by concentration of i.i.d. sums, it holds that $T_k \approx k$, so that if the semigroup $(P_t)_{t \ge 0}$ requires time $T_{\rm mix}$ in order to mix to a desired level of accuracy, then the algorithm which simulates $(X_t)_{t \ge 0}$ requires $\approx T_{\rm mix}$ iterations to reach the same level of mixing.
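
The simulation of $(X_t)_{t \ge 0}$ described above can be sketched in Python as follows. The function name `simulate_continuized` and the `step` callback (which draws one transition from $P$) are hypothetical names introduced for illustration.

```python
import random


def simulate_continuized(step, x0, t_final, rng):
    """Simulate the process with semigroup exp(t (P - id)) up to time t_final.

    step(x, rng) draws a sample from P(x, .). Returns (X at time t_final,
    number of transitions of the discrete-time chain that were used).
    """
    x, t, jumps = x0, 0.0, 0
    while True:
        t += rng.expovariate(1.0)  # holding time tau_k ~ exponential(1)
        if t > t_final:
            return x, jumps        # the next jump falls after t_final
        x = step(x, rng)           # apply one step of the kernel P
        jumps += 1
```

The number of transitions used up to time $t$ is a Poisson random variable with mean $t$, which is the source of the Poisson concentration argument mentioned in the text.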
This argument can even be made rigorous, using concentration inequalities for the
Poisson random variable, in order to argue that an MLSI for 𝑃 implies a mixing time bound
(in total variation distance, say) for the discrete-time chain (𝑃 𝑘 )𝑘∈N . We omit the details
and content ourselves with the knowledge that an MLSI for 𝑃 at least leads to the existence
of an implementable algorithm (simulating (𝑋𝑡 )𝑡 ≥0 ) with good mixing.
We leave the converse implication (that entropy decay for the discrete-time Markov
chain generated by 𝑃 implies an MLSI for 𝑃) as Exercise 7.4.

7.4.2 Conductance
Unfortunately, it is usually quite challenging to prove either a Poincaré inequality or a
modified log-Sobolev inequality for discrete-time Markov chains, which motivates the
use of conductance.
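The conductance of $P$ is the greatest number $\mathfrak{c} > 0$ such that for all events $A \subseteq \mathbb{R}^d$,
\[
  \int_A P(x, A^{\mathsf{c}}) \, \pi(\mathrm{d}x) \ge \mathfrak{c} \, \pi(A) \, \pi(A^{\mathsf{c}}) \,.
\]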

A small conductance implies the presence of bottlenecks in the space: subsets 𝐴 of the
state space from which it is difficult for the Markov chain to exit. On the other hand, it is
a remarkable fact that once the presence of these bottlenecks is eliminated, then there is a
positive spectral gap. This is the content of a celebrated result of Cheeger.

Theorem 7.4.3 (Cheeger's inequality, [LS88]). The conductance $\mathfrak{c}$ and the spectral gap
$\lambda$ satisfy the inequalities
\[
  \frac{1}{8} \, \mathfrak{c}^2 \le \lambda \le \mathfrak{c} \,.
\]
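As a sanity check on Theorem 7.4.3, the following sketch computes the conductance of a small finite chain by brute force and compares it with the spectral gap. It rests on several assumptions that are not part of the text: a finite state space stands in for $\mathbb{R}^d$, the spectral gap is taken to be one minus the second-largest eigenvalue of $P$, and the eigenvalues $\frac{1}{2}(1 + \cos(2\pi k/n))$ of the lazy walk on the $n$-cycle are used as a standard fact. The helper name `conductance` is hypothetical.

```python
import itertools
import math


def conductance(P, pi):
    """Brute-force conductance: minimum over proper subsets A of
    (sum_{x in A} pi(x) P(x, A^c)) / (pi(A) pi(A^c))."""
    n = len(pi)
    best = math.inf
    for size in range(1, n):
        for A in itertools.combinations(range(n), size):
            members = set(A)
            flow = sum(pi[x] * P[x][y] for x in A for y in range(n) if y not in members)
            mass = sum(pi[x] for x in A)
            best = min(best, flow / (mass * (1.0 - mass)))
    return best


# Lazy random walk on the cycle Z/nZ; it is reversible w.r.t. the uniform law.
n = 8
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    P[x][x] = 0.5
    P[x][(x + 1) % n] = 0.25
    P[x][(x - 1) % n] = 0.25
pi = [1.0 / n] * n

c = conductance(P, pi)
# Spectral gap from the known eigenvalues (1 + cos(2 pi k / n)) / 2 of this chain.
lam = 0.5 * (1.0 - math.cos(2.0 * math.pi / n))
print(f"conductance = {c:.4f}, spectral gap = {lam:.4f}")
print("c^2 / 8 <= lambda <= c:", c**2 / 8 <= lam <= c)
```

For this chain the minimum is attained by arcs of length $n/2$, which are exactly the bottleneck sets described above.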

Both inequalities are sharp up to constants. The upper bound on 𝜆 is fairly immediate
(see Exercise 7.3), so we focus on the lower bound. We begin by reformulating the
conductance as a functional inequality.

Lemma 7.4.4. Let the conductance of the chain be $\mathfrak{c} > 0$. Then, for all $f \in L^1(\pi)$,
\[
  \mathbb{E}_\pi |f - \mathbb{E}_\pi f| \le \frac{1}{\mathfrak{c}} \, \mathbb{E}|f(X_1) - f(X_0)| \,, \tag{7.4.5}
\]
where $(X_0, X_1)$ are two successive iterates of the chain started at stationarity.

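Proof. Let $X_0'$ be an independent copy of $X_0$. Then, by Jensen's inequality,
\[
  \mathbb{E}_\pi |f - \mathbb{E}_\pi f| = \mathbb{E}|f(X_0') - \mathbb{E} f(X_0)| \le \mathbb{E}|f(X_0') - f(X_0)| \,.
\]
On the other hand, by reversibility,
\begin{align*}
  \mathbb{E}|f(X_1) - f(X_0)|
  &= \iint |f(x_1) - f(x_0)| \, P(x_0, \mathrm{d}x_1) \, \pi(\mathrm{d}x_0) \\
  &= 2 \iint \mathbf{1}\{f(x_1) > f(x_0)\} \, [f(x_1) - f(x_0)] \, P(x_0, \mathrm{d}x_1) \, \pi(\mathrm{d}x_0) \\
  &= 2 \iiint \mathbf{1}\{f(x_1) > t \ge f(x_0)\} \, P(x_0, \mathrm{d}x_1) \, \pi(\mathrm{d}x_0) \, \mathrm{d}t \\
  &= 2 \int \Bigl( \int_{\{f \le t\}} P(x_0, \{f \le t\}^{\mathsf{c}}) \, \pi(\mathrm{d}x_0) \Bigr) \, \mathrm{d}t \\
  &\ge 2 \mathfrak{c} \int \pi(\{f \le t\}) \, \pi(\{f > t\}) \, \mathrm{d}t \\
  &= 2 \mathfrak{c} \iiint \mathbf{1}\{f(x_0') > t \ge f(x_0)\} \, \pi(\mathrm{d}x_0) \, \pi(\mathrm{d}x_0') \, \mathrm{d}t \\
  &= 2 \mathfrak{c} \iint \mathbf{1}\{f(x_0') > f(x_0)\} \, [f(x_0') - f(x_0)] \, \pi(\mathrm{d}x_0) \, \pi(\mathrm{d}x_0') \\
  &= \mathfrak{c} \iint |f(x_0') - f(x_0)| \, \pi(\mathrm{d}x_0) \, \pi(\mathrm{d}x_0') = \mathfrak{c} \, \mathbb{E}|f(X_0') - f(X_0)| \,. \qquad \square
\end{align*}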

Compare this with the relationship between the Cheeger isoperimetric inequality and
the 𝐿 1 –𝐿 1 Poincaré inequality in Theorem 2.5.14. Indeed, the trick above of passing to the
level sets of 𝑓 is the discrete version of the coarea inequality (Theorem 2.5.12).
Recall also that an $L^1$–$L^1$ Poincaré inequality implies an $L^2$–$L^2$ Poincaré inequality
with $C_{2,2} \lesssim C_{1,1}$, see Proposition 2.5.17. On the other hand, $C_{2,2}$ is the square root of the
usual Poincaré constant, $C_{2,2} = 1/\sqrt{\lambda}$, where $\lambda$ is the spectral gap. To prove Cheeger's
inequality, we are going to follow the same principle in discrete time. This is exactly the
source of the square in the lower bound $\lambda \gtrsim \mathfrak{c}^2$ of Cheeger's inequality.
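Proof of Cheeger's inequality (Theorem 7.4.3). We will prove the lower bound on the spectral gap with a worse constant than $\frac{1}{8}$ in order to make the proof more straightforward; see [LS88] for a proof with the constant $\frac{1}{8}$.
Let $f : \mathbb{R}^d \to \mathbb{R}$ have $0$ as a median and let $g := f^2 \operatorname{sgn} f$, so that $0$ is a median of $g$ as well. Assume that the chain has conductance $\mathfrak{c} > 0$. Then, recalling the equivalence between the mean and the median (Lemma 2.4.4) and using Lemma 7.4.4 on $g$,
\begin{align*}
  \mathbb{E}_\pi[|f - \mathbb{E}_\pi f|^2]
  &\asymp \mathbb{E}_\pi[|f - \operatorname{med}_\pi f|^2] = \mathbb{E}_\pi |g - \operatorname{med}_\pi g| \asymp \mathbb{E}_\pi |g - \mathbb{E}_\pi g| \\
  &\le \frac{1}{\mathfrak{c}} \, \mathbb{E}|g(X_1) - g(X_0)|
  \le \frac{1}{\mathfrak{c}} \, \mathbb{E}\bigl[|f(X_1) - f(X_0)| \, \bigl(|f(X_0)| + |f(X_1)|\bigr)\bigr] \\
  &\lesssim \frac{1}{\mathfrak{c}} \sqrt{\mathbb{E}[|f(X_1) - f(X_0)|^2] \, \mathbb{E}_\pi[f^2]} \\
  &= \frac{1}{\mathfrak{c}} \sqrt{\mathbb{E}[|f(X_1) - f(X_0)|^2] \, \mathbb{E}_\pi[|f - \operatorname{med}_\pi f|^2]} \\
  &\asymp \frac{1}{\mathfrak{c}} \sqrt{\mathbb{E}[|f(X_1) - f(X_0)|^2] \, \mathbb{E}_\pi[|f - \mathbb{E}_\pi f|^2]} \,,
\end{align*}
where the second line uses the elementary inequality $|a^2 \operatorname{sgn} a - b^2 \operatorname{sgn} b| \le |a - b| \, (|a| + |b|)$ for $a, b \in \mathbb{R}$ (with equality when $a$ and $b$ have the same sign), and the third line is the Cauchy–Schwarz inequality. Rearranging this inequality proves the result. $\square$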

A lower bound on the conductance via overlaps. At this stage, it may not seem that
we have gained anything by moving from the spectral gap to the conductance. We now
introduce a key lemma, which provides a tractable lower bound on the conductance in
terms of two geometric quantities: a Cheeger isoperimetric inequality for target 𝜋 (we
introduced this inequality in Section 2.5.2), and overlap bounds on the Markov chain.
Recall that an $\alpha$-strongly log-concave measure $\pi$ satisfies the Cheeger isoperimetric
inequality with $\mathrm{Ch} \lesssim 1/\sqrt{\alpha}$ (Corollary 2.5.19).

Lemma 7.4.6. Assume the following:

1. The target 𝜋 satisfies a Cheeger isoperimetric inequality with constant Ch > 0.

2. There exists $r \in [0, \mathrm{Ch}]$ such that for any points $x, y \in \mathbb{R}^d$ with $\|x - y\| \le r$, it
holds that $\|P(x, \cdot) - P(y, \cdot)\|_{\mathrm{TV}} \le \frac{1}{2}$.

Then, $\mathfrak{c} \ge r/(64 \, \mathrm{Ch})$.

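Proof. Let $A_0 \subseteq \mathbb{R}^d$; for symmetry of notation, write $A_1 := A_0^{\mathsf{c}}$. By reversibility,
\[
  \int_{A_0} P(x, A_1) \, \pi(\mathrm{d}x) = \int_{A_1} P(y, A_0) \, \pi(\mathrm{d}y)
  = \frac{1}{2} \Bigl( \int_{A_0} P(x, A_1) \, \pi(\mathrm{d}x) + \int_{A_1} P(y, A_0) \, \pi(\mathrm{d}y) \Bigr) \,.
\]
We want to lower bound this by a constant times $\pi(A_0) \, \pi(A_1)$.

Define bad sets and a good set:
\begin{align*}
  B_0 &:= \bigl\{ x \in A_0 \bigm| P(x, A_1) < \tfrac{1}{4} \bigr\} \,, \\
  B_1 &:= \bigl\{ y \in A_1 \bigm| P(y, A_0) < \tfrac{1}{4} \bigr\} \,, \\
  G &:= \mathbb{R}^d \setminus (B_0 \cup B_1) \,.
\end{align*}
We can assume that $\pi(B_0) \ge \pi(A_0)/2$ and $\pi(B_1) \ge \pi(A_1)/2$. Indeed, if we have, e.g., $\pi(B_0) \le \pi(A_0)/2$, then $\pi(A_0 \setminus B_0) \ge \pi(A_0)/2$, and
\[
  \int_{A_0} P(x, A_1) \, \pi(\mathrm{d}x) \ge \int_{A_0 \setminus B_0} P(x, A_1) \, \pi(\mathrm{d}x) \ge \frac{1}{4} \, \pi(A_0 \setminus B_0) \ge \frac{1}{8} \, \pi(A_0) \,,
\]
which already implies the claim because $\pi(A_1) \le 1$ and $r \le \mathrm{Ch}$.

Next, suppose that $x \in B_0$ and $y \in B_1$. Then, $P(x, A_0) \ge \frac{3}{4}$, whereas $P(y, A_0) < \frac{1}{4}$. It follows that $\|P(x, \cdot) - P(y, \cdot)\|_{\mathrm{TV}} > \frac{1}{2}$. By our second assumption, $\|x - y\| > r$. This shows that $B_1 \subseteq (B_0^r)^{\mathsf{c}}$, or $B_1^{\mathsf{c}} \supseteq B_0^r$, where $B_0^r$ denotes the $r$-neighborhood of $B_0$. On the other hand, $G = B_0^{\mathsf{c}} \cap B_1^{\mathsf{c}} \supseteq B_0^r \setminus B_0$. The integral form of the isoperimetric inequality in (2.5.11) shows that
\[
  \pi(G) \ge \pi(B_0^r) - \pi(B_0) \ge \frac{r}{2 \, \mathrm{Ch}} \, \pi(B_0) \, \pi(B_1) \ge \frac{r}{8 \, \mathrm{Ch}} \, \pi(A_0) \, \pi(A_1) \,.
\]
Hence,
\begin{align*}
  \frac{1}{2} \Bigl( \int_{A_0} P(x, A_1) \, \pi(\mathrm{d}x) + \int_{A_1} P(y, A_0) \, \pi(\mathrm{d}y) \Bigr)
  &\ge \frac{1}{2} \Bigl( \int_{A_0 \cap G} P(x, A_1) \, \pi(\mathrm{d}x) + \int_{A_1 \cap G} P(y, A_0) \, \pi(\mathrm{d}y) \Bigr) \\
  &\ge \frac{1}{8} \, \bigl( \pi(A_0 \cap G) + \pi(A_1 \cap G) \bigr) = \frac{1}{8} \, \pi(G) \ge \frac{r}{64 \, \mathrm{Ch}} \, \pi(A_0) \, \pi(A_1) \,. \qquad \square
\end{align*}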

From conductance to 𝑠-conductance. Unfortunately, the framework that we have


developed so far is not flexible enough to study MALA. In particular, requiring that the
second condition in Lemma 7.4.6 hold for all pairs of points 𝑥, 𝑦 ∈ R𝑑 is rather restrictive,
especially because there are many points which we are unlikely to ever visit in the course
of running the sampling algorithm. To address these issues, many variants of conductance
have been proposed in the literature. Here we will introduce only one other variant, the
𝑠-conductance, which seems reasonably flexible.
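For $s \in [0, 1]$, the $s$-conductance of $P$ is the largest $\mathfrak{c}_s > 0$ such that for all events
$A \subseteq \mathbb{R}^d$, it holds that
\[
  \int_A P(x, A^{\mathsf{c}}) \, \pi(\mathrm{d}x) \ge \mathfrak{c}_s \, \bigl( \pi(A) - s \bigr) \, \bigl( \pi(A^{\mathsf{c}}) - s \bigr) \,.
\]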

Observe that if 𝜋 (𝐴) ≤ 𝑠, then the above inequality holds trivially. Hence, this definition
allows us to restrict attention to events which are reasonably probable under 𝜋.
For the conductance, we had Cheeger’s inequality which relates conductance to the
spectral gap and ultimately to convergence. For the 𝑠-conductance, the following theorem
is an appropriate substitute.

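Theorem 7.4.7 ([LS93, Corollary 1.6]). For any $0 < s \le \frac{1}{2}$, let
\[
  \Delta_s := \sup\{ |\mu_0(A) - \pi(A)| : A \subseteq \mathbb{R}^d, \; \pi(A) \le s \} \,.
\]
Then, the law $\mu_N$ of the $N$-th iterate of a Markov chain with $s$-conductance $\mathfrak{c}_s$ and
initialized at $\mu_0$ satisfies
\[
  \|\mu_N - \pi\|_{\mathrm{TV}} \le \Delta_s + \frac{\Delta_s}{s} \exp\Bigl( -\frac{\mathfrak{c}_s^2 N}{2} \Bigr) \,.
\]
In particular,
\[
  \|\mu_N - \pi\|_{\mathrm{TV}} \le \sqrt{s \, \chi^2(\mu_0 \,\|\, \pi)} + \sqrt{\frac{\chi^2(\mu_0 \,\|\, \pi)}{s}} \exp\Bigl( -\frac{\mathfrak{c}_s^2 N}{2} \Bigr) \,.
\]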

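Proof. The first statement is from [LS93, Corollary 1.6]; we omit its proof, which is not particularly straightforward.
The second statement follows from the first: indeed, for $A \subseteq \mathbb{R}^d$ with $\pi(A) \le s$, the Cauchy–Schwarz inequality yields
\[
  |\mu_0(A) - \pi(A)| = \Bigl| \int \mathbf{1}_A \, \mathrm{d}(\mu_0 - \pi) \Bigr| = \Bigl| \int \mathbf{1}_A \, \Bigl( \frac{\mu_0}{\pi} - 1 \Bigr) \, \mathrm{d}\pi \Bigr| \le \sqrt{\pi(A) \, \chi^2(\mu_0 \,\|\, \pi)} \,,
\]
so that $\Delta_s \le \sqrt{s \, \chi^2(\mu_0 \,\|\, \pi)}$. $\square$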


This result says that if $s = \varepsilon^2 / (4 \, \chi^2(\mu_0 \,\|\, \pi))$, then we obtain $\|\mu_N - \pi\|_{\mathrm{TV}} \le \varepsilon$ after
\[
  N = O\Bigl( \frac{1}{\mathfrak{c}_s^2} \log \frac{\chi^2(\mu_0 \,\|\, \pi)}{\varepsilon^2} \Bigr) \quad \text{iterations} \,.
\]
Unfortunately, if $\chi^2(\mu_0 \,\|\, \pi) = \exp O(d)$, then the logarithmic term incurs additional
dimension dependence, which is why we can expect better mixing time bounds under the
warm start condition $\chi^2(\mu_0 \,\|\, \pi) = O(1)$.
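For concreteness, the iteration count can be read off from the second bound in Theorem 7.4.7 as follows. The explicit constants below are just one convenient choice, and the computation implicitly assumes $\varepsilon^2 \le 2 \, \chi^2(\mu_0 \,\|\, \pi)$ so that $s \le \frac{1}{2}$.

```latex
\begin{align*}
  s = \frac{\varepsilon^2}{4 \, \chi^2(\mu_0 \,\|\, \pi)}
  &\implies
  \sqrt{s \, \chi^2(\mu_0 \,\|\, \pi)} = \frac{\varepsilon}{2} \,,
  \qquad
  \sqrt{\frac{\chi^2(\mu_0 \,\|\, \pi)}{s}} = \frac{2 \, \chi^2(\mu_0 \,\|\, \pi)}{\varepsilon} \,, \\
  \frac{2 \, \chi^2(\mu_0 \,\|\, \pi)}{\varepsilon} \exp\Bigl( -\frac{\mathfrak{c}_s^2 N}{2} \Bigr) \le \frac{\varepsilon}{2}
  &\iff
  N \ge \frac{2}{\mathfrak{c}_s^2} \log \frac{4 \, \chi^2(\mu_0 \,\|\, \pi)}{\varepsilon^2} \,.
\end{align*}
```

With this choice of $N$, each of the two terms in the bound is at most $\varepsilon/2$.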
The key advantage of the 𝑠-conductance is that it allows for a version of the key lemma
with weaker assumptions; the proof is left as Exercise 7.5.

Lemma 7.4.8. Assume the following:

1. The target 𝜋 satisfies a Cheeger isoperimetric inequality with constant Ch > 0.

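2. There exists $r \in [0, \mathrm{Ch}]$ and an event $E \subseteq \mathbb{R}^d$ with probability $\pi(E) \ge 1 - \frac{rs}{16 \, \mathrm{Ch}}$
such that
\[
  \forall x, y \in E, \qquad \|x - y\| \le r \implies \|P(x, \cdot) - P(y, \cdot)\|_{\mathrm{TV}} \le \frac{1}{2} \,.
\]

Then, $\mathfrak{c}_s \gtrsim r/\mathrm{Ch}$.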

7.5 Analysis of MALA for a Feasible Start


Using the tools we have developed, we now proceed to analyze the mixing time of MALA
under the assumptions of Theorem 7.3.4. However, we will not prove the full strength
of the result in Theorem 7.3.4; at the end of this section, we will indicate the extra steps
needed to reach Theorem 7.3.4.

Basic decomposition. The overall plan is to lower bound the 𝑠-conductance using the
key lemma (Lemma 7.4.8), which then upper bounds the mixing time via Theorem 7.4.7.
By strong log-concavity of 𝜋, the first hypothesis of Lemma 7.4.8 is verified, so it remains
to bound the overlaps. For a kernel $T$, we use the shorthand $T_x := T(x, \cdot)$.
By the triangle inequality, we have the decomposition
\[
  \|T_x - T_y\|_{\mathrm{TV}} \le \|Q_x - T_x\|_{\mathrm{TV}} + \|Q_x - Q_y\|_{\mathrm{TV}} + \|Q_y - T_y\|_{\mathrm{TV}} \,. \tag{7.5.1}
\]
The middle term $\|Q_x - Q_y\|_{\mathrm{TV}}$ measures the overlap for the proposal kernel, and we will
shortly see that this term is easy to bound. Then, controlling the first and third terms
essentially amounts to lower bounding the acceptance probability of MALA, since the
only difference between $Q$ and $T$ is the Metropolis–Hastings filter.

Overlap of the proposal kernel.

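Lemma 7.5.2. For $x \in \mathbb{R}^d$, let $Q_x := \mathrm{normal}(x - h \, \nabla V(x), 2h \, I_d)$, and assume that
$\|\nabla^2 V\|_{\mathrm{op}} \le \beta$. Then, provided $h \le \frac{1}{\beta}$, we have
\[
  \|Q_x - Q_y\|_{\mathrm{TV}} \le \frac{\|x - y\|}{\sqrt{2h}} \,.
\]

Proof. By Pinsker's inequality (Exercise 2.10),
\[
  \|Q_x - Q_y\|_{\mathrm{TV}}^2 \le \frac{1}{2} \, \mathrm{KL}(Q_x \,\|\, Q_y) = \frac{\|x - h \, \nabla V(x) - y + h \, \nabla V(y)\|^2}{8h} \le \frac{\|x - y\|^2}{2h} \,,
\]
where the last inequality uses the fact that $\mathrm{id} - h \, \nabla V$ is $2$-Lipschitz. $\square$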
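The bound of Lemma 7.5.2 can be checked numerically, since the total variation distance between two Gaussians with the same covariance $\sigma^2 I_d$ and means $m_1, m_2$ has the closed form $\operatorname{erf}(\|m_1 - m_2\| / (2\sqrt{2} \, \sigma))$. The sketch below is illustrative only: the potential $V(x) = \sum_i [x_i^2/2 + \log\cosh x_i]$ (for which $\beta = 2$) and all function names are assumptions made for the example.

```python
import math

BETA = 2.0  # smoothness constant of the example potential below


def grad_V(x):
    """Gradient of V(x) = sum_i [x_i^2 / 2 + log cosh(x_i)]; the Hessian lies in [1, 2]."""
    return [xi + math.tanh(xi) for xi in x]


def proposal_mean(x, h):
    """Mean x - h grad V(x) of the proposal Q_x."""
    return [xi - h * gi for xi, gi in zip(x, grad_V(x))]


def tv_between_proposals(x, y, h):
    """Exact TV distance between Q_x and Q_y (Gaussians with common covariance 2h I)."""
    gap = math.dist(proposal_mean(x, h), proposal_mean(y, h))
    return math.erf(gap / (4.0 * math.sqrt(h)))


def lemma_bound(x, y, h):
    """Right-hand side ||x - y|| / sqrt(2h) of Lemma 7.5.2."""
    return math.dist(x, y) / math.sqrt(2.0 * h)


h = 0.5 / BETA
x = [0.3, -1.2, 2.0]
y = [0.4, -1.0, 1.7]
print(f"TV = {tv_between_proposals(x, y, h):.4f}, bound = {lemma_bound(x, y, h):.4f}")
```

In this example the exact distance is noticeably smaller than the bound, which is expected since Pinsker's inequality and the Lipschitz constant $2$ are both lossy.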

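Control of the acceptance probability. Next, consider the term $\|Q_x - T_x\|_{\mathrm{TV}}$. Computing this term is slightly tricky because $T_x$ has an atom at $x$, but in the end we obtain
\begin{align*}
  \|Q_x - T_x\|_{\mathrm{TV}}
  &= \frac{1}{2} \, \Bigl[ \underbrace{1 - \int Q(x, \mathrm{d}y) \, A(x, y)}_{\text{from the atom of } T_x} + \int_{\mathbb{R}^d \setminus \{x\}} |Q(x, y) - T(x, y)| \, \mathrm{d}y \Bigr] \\
  &= \frac{1}{2} \, \Bigl[ 1 - \int Q(x, \mathrm{d}y) \, A(x, y) + \int Q(x, y) \, \{1 - A(x, y)\} \, \mathrm{d}y \Bigr] \\
  &= 1 - \int Q(x, \mathrm{d}y) \, A(x, y) \,. \tag{7.5.3}
\end{align*}
This has a very clear interpretation: it is the probability that the proposed move starting
at $x$ is rejected. If we let $\xi \sim \mathrm{normal}(0, I_d)$ and $Y := x - h \, \nabla V(x) + \sqrt{2h} \, \xi$, we want a lower
bound on the quantity $\mathbb{E} A(x, Y)$, which comes from Markov's inequality:
\[
  \mathbb{E} A(x, Y) = \mathbb{E} \min\Bigl\{ 1, \frac{\pi(Y) \, Q(Y, x)}{\pi(x) \, Q(x, Y)} \Bigr\} \ge \lambda \, \mathbb{P}\Bigl\{ \frac{\pi(Y) \, Q(Y, x)}{\pi(x) \, Q(x, Y)} \ge \lambda \Bigr\} \qquad \text{for all } 0 < \lambda < 1 \,.
\]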
The approach now is to write out the ratio more explicitly, and then carefully group
together and bound the terms. (Unfortunately, this is not the most enlightening.)
Explicitly, we have
\[
  \frac{\pi(Y) \, Q(Y, x)}{\pi(x) \, Q(x, Y)} = \exp\Bigl( -V(Y) - \frac{\|x - Y + h \, \nabla V(Y)\|^2}{4h} + V(x) + \frac{\|Y - x + h \, \nabla V(x)\|^2}{4h} \Bigr) \,.
\]
After some careful algebra,
\begin{align*}
  4 \ln \frac{\pi(Y) \, Q(Y, x)}{\pi(x) \, Q(x, Y)}
  &= h \, \{ \|\nabla V(x)\|^2 - \|\nabla V(Y)\|^2 \} \tag{7.5.4} \\
  &\qquad - 2 \, \{ V(Y) - V(x) - \langle \nabla V(x), Y - x \rangle \} \tag{7.5.5} \\
  &\qquad + 2 \, \{ V(x) - V(Y) - \langle \nabla V(Y), x - Y \rangle \} \,. \tag{7.5.6}
\end{align*}
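The algebra behind (7.5.4)–(7.5.6) is easy to double-check numerically. The following sketch evaluates the log acceptance ratio both from the explicit formula and from the grouped form, and also shows how the ratio enters a single MALA step. It is a minimal illustration under assumptions: the potential $V(x) = \sum_i [x_i^2/2 + \log\cosh x_i]$ and every function name here are hypothetical and chosen only for the example.

```python
import math
import random


def V(x):
    """Example potential: sum_i [x_i^2 / 2 + log cosh(x_i)], written in a stable form."""
    return sum(0.5 * t * t + abs(t) + math.log1p(math.exp(-2.0 * abs(t))) - math.log(2.0) for t in x)


def grad_V(x):
    return [t + math.tanh(t) for t in x]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_ratio_direct(x, y, h):
    """ln[pi(y) Q(y, x) / (pi(x) Q(x, y))] from the explicit formula."""
    gx, gy = grad_V(x), grad_V(y)
    back = [a - b + h * g for a, b, g in zip(x, y, gy)]  # x - y + h grad V(y)
    fwd = [b - a + h * g for a, b, g in zip(x, y, gx)]   # y - x + h grad V(x)
    return -V(y) - dot(back, back) / (4.0 * h) + V(x) + dot(fwd, fwd) / (4.0 * h)


def log_ratio_grouped(x, y, h):
    """The same quantity via the three grouped terms (7.5.4), (7.5.5), (7.5.6)."""
    gx, gy = grad_V(x), grad_V(y)
    y_minus_x = [b - a for a, b in zip(x, y)]
    x_minus_y = [-t for t in y_minus_x]
    term_754 = h * (dot(gx, gx) - dot(gy, gy))
    term_755 = -2.0 * (V(y) - V(x) - dot(gx, y_minus_x))
    term_756 = 2.0 * (V(x) - V(y) - dot(gy, x_minus_y))
    return (term_754 + term_755 + term_756) / 4.0


def mala_step(x, h, rng):
    """One MALA iteration: Langevin proposal followed by the Metropolis-Hastings filter."""
    y = [a - h * g + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0) for a, g in zip(x, grad_V(x))]
    accept_prob = math.exp(min(0.0, log_ratio_direct(x, y, h)))
    return y if rng.random() < accept_prob else x


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(5)]
y0 = [rng.gauss(0.0, 1.0) for _ in range(5)]
print(log_ratio_direct(x0, y0, 0.1), log_ratio_grouped(x0, y0, 0.1))
```

The two printed values should agree up to floating-point error, which confirms the grouping of terms for this particular potential.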

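Note that the terms are grouped to more easily apply the strong convexity and smoothness
of $V$. It yields
\[
  (7.5.5) \ge -\beta \, \|x - Y\|^2 \qquad \text{and} \qquad (7.5.6) \ge \alpha \, \|x - Y\|^2 \ge 0 \,.
\]
Also, for $h \le \frac{1}{\beta}$,
\begin{align*}
  (7.5.4) &= h \, \langle \nabla V(x) - \nabla V(Y), \nabla V(x) + \nabla V(Y) \rangle \\
  &\ge -h \, \|\nabla V(x) - \nabla V(Y)\| \, \|\nabla V(x) + \nabla V(Y)\| \\
  &\ge -\beta h \, \|x - Y\| \, (2 \, \|\nabla V(x)\| + \beta \, \|x - Y\|) \ge -\beta h^2 \, \|\nabla V(x)\|^2 - 2\beta \, \|x - Y\|^2 \,.
\end{align*}
Therefore,
\[
  \ln \frac{\pi(Y) \, Q(Y, x)}{\pi(x) \, Q(x, Y)} \gtrsim -\beta h^2 \, \|\nabla V(x)\|^2 - \beta \, \|x - Y\|^2 \gtrsim -\beta h^2 \, \|\nabla V(x)\|^2 - \beta h \, \|\xi\|^2 \,.
\]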
At this stage, observe that we cannot lower bound this quantity (with high probability)
uniformly over 𝑥, since k∇𝑉 (𝑥)k → ∞ as k𝑥 k → ∞. This is why it is helpful to restrict to
𝑥 belonging to some high-probability event 𝐸, which is ultimately achieved by working
with 𝑠-conductance rather than conductance.
By standard concentration bounds, $\|\xi\|^2 \le 4d$ with probability at least $1 - \exp(-d/2)$.
Also, let $E_R := \{ x \in \mathbb{R}^d : \|\nabla V(x)\| \le \sqrt{\beta} \, R \}$. It follows that for all $x \in E_R$, if we take
$h \lesssim \frac{1}{\beta \, (d \vee R)}$ with a sufficiently small constant, then
\[
  \mathbb{E} A(x, Y) \ge \frac{11}{12} \, \mathbb{P}\Bigl\{ \frac{\pi(Y) \, Q(Y, x)}{\pi(x) \, Q(x, Y)} \ge \frac{11}{12} \Bigr\} \ge \frac{11}{12} \, \Bigl( 1 - \exp\Bigl( -\frac{d}{2} \Bigr) \Bigr) \ge \frac{5}{6} \,,
\]
for sufficiently large $d$. Hence, for $x \in E_R$, we have $\|Q_x - T_x\|_{\mathrm{TV}} \le \frac{1}{6}$.

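Completing the analysis. We have shown: if the step size is $h \lesssim \frac{1}{\beta \, (d \vee R)}$, then for all
$x, y \in E_R$ with $\|x - y\| \le r$,
\[
  \|T_x - T_y\|_{\mathrm{TV}} \le \|Q_x - T_x\|_{\mathrm{TV}} + \|Q_x - Q_y\|_{\mathrm{TV}} + \|Q_y - T_y\|_{\mathrm{TV}} \le \frac{1}{6} + \frac{r}{\sqrt{2h}} + \frac{1}{6} \le \frac{1}{2} \,,
\]
provided we take $r = \sqrt{2h}/6$. Applying Lemma 7.4.8 (assuming $h \lesssim \frac{1}{\alpha}$), we deduce that
$\mathfrak{c}_s \gtrsim \sqrt{\alpha h}$ provided $\pi(E_R) \ge 1 - c_0 s \sqrt{\alpha h}$, where $c_0 > 0$ is a universal constant. Since we
want the step size $h$ to be as large as possible, we take $h \asymp \frac{1}{\beta \, (d \vee R)}$, where $R$ is chosen to
satisfy $\pi(E_R^{\mathsf{c}}) \lesssim s \sqrt{\frac{1}{\kappa \, (d \vee R)}}$ and $s \asymp \varepsilon^2 / \chi^2(\mu_0 \,\|\, \pi)$. The final mixing time bound implied
by Theorem 7.4.7 is then $O(\kappa \, (d \vee R) \log(\chi^2(\mu_0 \,\|\, \pi)/\varepsilon^2))$ iterations.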
Up until this point, the analysis is largely similar to [Dwi+19].

Gradient concentration. The bound involves the parameter $R$. By definition, $R$ is
such that the norm $\|\nabla V\|$ of the gradient under $\pi$ is typically of size $\sqrt{\beta} \, R$. Recall from,
e.g., Lemma 4.2.4 that $\mathbb{E}_\pi \|\nabla V\| \lesssim \sqrt{\beta d}$, which suggests that we can take $R \lesssim \sqrt{d}$. However,
we need a high-probability bound on $\|\nabla V\|$, not a bound in expectation. Unfortunately,
a naïve application of the fact that $\|\nabla V\|$ is $\beta$-Lipschitz, together with sub-Gaussian
concentration of Lipschitz functions (Theorem 2.4.8), only shows that the fluctuations
of $\|\nabla V\|$ around its expectation are of size $\sqrt{\beta \kappa}$ (exercise!). When $\kappa \gg d$, this does
not recover the promised rate of $\widetilde{O}(\kappa d)$ (ignoring the dependence on initialization and
accuracy). To resolve this issue, [LST20] introduced a new concentration inequality for
$\|\nabla V\|$ via the Brascamp–Lieb inequality (Theorem 2.2.8).

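Lemma 7.5.7 ([LST20]). Suppose that $\pi \propto \exp(-V)$ and that $0 \preceq \nabla^2 V \preceq \beta I_d$. Then,
for all $t \ge 0$,
\[
  \pi\{ \|\nabla V\| \ge \mathbb{E}_\pi \|\nabla V\| + t \} \le 3 \exp\Bigl( -\frac{t}{\sqrt{\beta}} \Bigr) \,.
\]

This shows that the fluctuations of $\|\nabla V\|$ around its expectation are only of size $\sqrt{\beta}$.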

From warm start to feasible start. The factor of $\log \chi^2(\mu_0 \,\|\, \pi)$ in the bound incurs
additional dimension dependence under a feasible start, since with a Gaussian initialization
we can only show $\chi^2(\mu_0 \,\|\, \pi) \le \kappa^{d/2}$. The problem is that the conductance-based analysis
relies upon Poincaré-type inequalities, instead of log-Sobolev inequalities. To address
this issue, we can replace the assumption of a Cheeger isoperimetric inequality with a
Gaussian isoperimetric inequality (see Section 2.5.4). The essential difference is that under
a Cheeger isoperimetric inequality, as $p := \pi(A) \searrow 0$ we have $\pi^+(A) \gtrsim p$, whereas
under a Gaussian isoperimetric inequality we have $\pi^+(A) \gtrsim p \sqrt{\log \frac{1}{p}}$. Using this stronger
assumption, [Che+20a] show that the dependence on the initialization can be improved to
$\log \log \chi^2(\mu_0 \,\|\, \pi)$. A similar effect can be achieved via the blocking conductance [KLM06],
which was used in [LST20]. We omit the details.

Lower bound. Finally, the analysis of MALA in Theorem 7.3.4 is tight, as shown in the
following lower bound.

Theorem 7.5.8 ([LST21a]). For every choice of step size $h > 0$, there exists a target
distribution $\pi \propto \exp(-V)$ on $\mathbb{R}^d$ with $I_d \preceq \nabla^2 V \preceq \kappa I_d$, as well as an initialization $\mu_0$
with $\chi^2(\mu_0 \,\|\, \pi) \le \exp d$, such that the number of iterations required for MALA to reach
total variation at most $\frac{1}{4}$ from $\pi$ is at least $\widetilde{\Omega}(\kappa d)$.

This theorem is a lower bound in the sense that all of the known proofs for MALA do
not use any property of the initialization 𝜇0 except through 𝜒 2 (𝜇 0 k 𝜋). Thus, in order to
improve the analysis of MALA under a feasible start, one must use more specific properties
of the initialization, or use some other modification that bypasses the lower bound (e.g.,
random step sizes).

7.6 Analysis of MALA for a Warm Start


We next turn towards the warm start case (Theorem 7.3.5). The improvement under a warm
start was first shown in [Che+21b], which obtained a rate of $\widetilde{O}(\kappa^{3/2} d^{1/2} \operatorname{polylog}(1/\varepsilon))$. This
result was improved in [WSC21] which obtained the sharp rate of $\widetilde{O}(\kappa d^{1/2} \operatorname{polylog}(1/\varepsilon))$
via completely different techniques. In this section, we follow [Che+21b] because the
proof is more conceptual. (Anyway, we will see in Chapter 8 how to boost the condition
number dependence to $\kappa$ using the proximal sampler.)
We still follow the $s$-conductance framework of the previous section, including the
basic decomposition (7.5.1). The main difference lies in the control of $\|Q_x - T_x\|_{\mathrm{TV}}$, which
was previously accomplished by lower bounding the acceptance probability. Surprisingly,
the following proof never works directly with the acceptance probability, despite the fact
that $\|Q_x - T_x\|_{\mathrm{TV}}$ is precisely the rejection probability at $x$ (see (7.5.3)).

Using the projection property. The key insight is to use the projection characterization
of the Metropolis–Hastings filter (Theorem 7.2.5): the MALA kernel $T$ is the closest kernel
to the proposal $Q$ (in an appropriate $L^1$ distance) among all reversible Markov chains with
stationary distribution $\pi$. Concretely, for any other kernel $\bar{Q}$ which is reversible w.r.t. $\pi$,
\[
  \iint_{(\mathbb{R}^d \times \mathbb{R}^d) \setminus \mathrm{diag}} |Q(x, y) - T(x, y)| \, \pi(\mathrm{d}x) \, \mathrm{d}y \le \iint_{(\mathbb{R}^d \times \mathbb{R}^d) \setminus \mathrm{diag}} |Q(x, y) - \bar{Q}(x, y)| \, \pi(\mathrm{d}x) \, \mathrm{d}y \,.
\]
Now supposing that $\bar{Q}$ has no atoms, this inequality is the same as
\[
  \int \|Q_x - T_x\|_{\mathrm{TV}} \, \pi(\mathrm{d}x) \le 2 \int \|Q_x - \bar{Q}_x\|_{\mathrm{TV}} \, \pi(\mathrm{d}x) \,. \tag{7.6.1}
\]
Thus, we can indirectly bound $\|Q_x - T_x\|_{\mathrm{TV}}$, at least on average. Moreover, there is a
very natural choice of $\bar{Q}$ here: since $Q$ is obtained from a discretization of the Langevin
diffusion, we can take $\bar{Q}$ to be the continuous-time Langevin diffusion run for time $h$,
which is indeed reversible with respect to $\pi$. The right-hand side of the above expression
then simply measures the discretization error, which we have already studied in detail.

Pointwise projection property. The projection property is not enough for our purposes,
however, since it only bounds $\|Q_x - T_x\|_{\mathrm{TV}}$ on average, whereas we really need
high-probability bounds. Thankfully, we can extend the projection property.

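Theorem 7.6.2 (pointwise projection property, [Che+21b, Theorem 6]). Let $Q$ be an
atomless proposal kernel and let $T$ be the corresponding Metropolis–Hastings kernel with
target $\pi$. Then, for any atomless kernel $\bar{Q}$ which is reversible with respect to $\pi$, and for
every $x \in \mathbb{R}^d$,
\[
  \|Q_x - T_x\|_{\mathrm{TV}} \le 2 \, \|Q_x - \bar{Q}_x\|_{\mathrm{TV}} + \int \frac{\pi(y) \, \bar{Q}(y, x)}{\pi(x)} \, \Bigl| \frac{Q(y, x)}{\bar{Q}(y, x)} - 1 \Bigr| \, \mathrm{d}y \,.
\]
Consequently, for any convex increasing function $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$,
\begin{align*}
  \int \Phi(\|Q_x - T_x\|_{\mathrm{TV}}) \, \pi(\mathrm{d}x)
  &\le \frac{1}{2} \int \Phi(4 \, \|Q_x - \bar{Q}_x\|_{\mathrm{TV}}) \, \pi(\mathrm{d}x) \\
  &\qquad + \frac{1}{2} \iint \Phi\Bigl( 2 \, \Bigl| \frac{Q(x, y)}{\bar{Q}(x, y)} - 1 \Bigr| \Bigr) \, \bar{Q}(x, \mathrm{d}y) \, \pi(\mathrm{d}x) \,. \tag{7.6.3}
\end{align*}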

We will not need the inequality (7.6.3), so the proof is left as Exercise 7.8. We include
(7.6.3) in the theorem because it makes clear why we can expect the
pointwise projection property to imply high-probability bounds for $\|Q_x - T_x\|_{\mathrm{TV}}$. Note
that when we integrate the projection property w.r.t. $\pi(\mathrm{d}x)$, we recover (7.6.1) with a
factor of 4 on the right-hand side instead of 2.

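Proof. We can write
\begin{align*}
  \|Q_x - T_x\|_{\mathrm{TV}}
  &= 1 - \int Q(x, \mathrm{d}y) \, A(x, y) = \int \Bigl[ 1 - \Bigl( 1 \wedge \frac{\pi(y) \, Q(y, x)}{\pi(x) \, Q(x, y)} \Bigr) \Bigr] \, Q(x, \mathrm{d}y) \\
  &\le \int \Bigl| 1 - \frac{\pi(y) \, Q(y, x)}{\pi(x) \, Q(x, y)} \Bigr| \, Q(x, \mathrm{d}y) \\
  &\le \int \Bigl| 1 - \frac{\pi(y) \, \bar{Q}(y, x)}{\pi(x) \, Q(x, y)} \Bigr| \, Q(x, \mathrm{d}y) + \int \frac{\pi(y) \, \bar{Q}(y, x)}{\pi(x)} \, \Bigl| \frac{Q(y, x)}{\bar{Q}(y, x)} - 1 \Bigr| \, \mathrm{d}y \,.
\end{align*}
Using reversibility of $\bar{Q}$, the first term is
\[
  \int \Bigl| 1 - \frac{\pi(y) \, \bar{Q}(y, x)}{\pi(x) \, Q(x, y)} \Bigr| \, Q(x, \mathrm{d}y) = \int |Q(x, y) - \bar{Q}(x, y)| \, \mathrm{d}y = 2 \, \|Q_x - \bar{Q}_x\|_{\mathrm{TV}} \,,
\]
which completes the proof. $\square$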

Applying the pointwise projection property. Our goal is to bound $\|Q_x - T_x\|_{\mathrm{TV}}$ for
all $x$ which lie in an event $E$ of very high probability under $\pi$. We proceed by controlling
the two terms in the pointwise projection property separately.
We will omit many of the calculations from this point forwards. The calculations are
actually fairly straightforward (once one has some familiarity with stochastic calculus),
but are somewhat tedious. Moreover, the best way to learn these particular calculations
is to try them for oneself. We refer to [Che+21b] for details. Also, to simplify the
exposition, we will only focus on the dependence on $d$ and $h$.
The first term, $\|Q_x - \bar{Q}_x\|_{\mathrm{TV}}$, is more straightforward. It is helpful to apply Pinsker's
inequality, leaving us to control $\mathrm{KL}(\bar{Q}_x \,\|\, Q_x)$. This is precisely the kind of discretization
error that we controlled via Girsanov's theorem in Section 4.4. In particular, it is possible
to show that $\|Q_x - \bar{Q}_x\|_{\mathrm{TV}} \lesssim h \sqrt{d + \|x\|^2}$. Since this is Lipschitz in $x$, we can then apply
sub-Gaussian concentration under $\pi$ (Theorem 2.4.8) to obtain a high-probability bound
for this term under $\pi$. In particular, we expect a step size of $h \lesssim \frac{1}{\sqrt{d}}$ to control this term.
To control the second term with high probability, it suffices to control the moments of
this quantity under $\pi$: for $p \ge 1$,
\[
  \int \Bigl( \int \frac{\pi(y) \, \bar{Q}(y, x)}{\pi(x)} \, \Bigl| \frac{Q(y, x)}{\bar{Q}(y, x)} - 1 \Bigr| \, \mathrm{d}y \Bigr)^p \, \pi(\mathrm{d}x) \le \iint \Bigl| \frac{Q(y, x)}{\bar{Q}(y, x)} - 1 \Bigr|^p \, \bar{Q}(y, \mathrm{d}x) \, \pi(\mathrm{d}y) \,.
\]
Let $\bar{\boldsymbol{Q}}_x$ denote the measure on path space $\mathcal{C}([0, h]; \mathbb{R}^d)$ of the Langevin diffusion started at
$x$. Similarly, let $\boldsymbol{Q}_x$ denote the same for the interpolation of LMC. By the data-processing
inequality, we obtain
\[
  \iint \Bigl| \frac{Q(y, x)}{\bar{Q}(y, x)} - 1 \Bigr|^p \, \bar{Q}(y, \mathrm{d}x) \, \pi(\mathrm{d}y) \le \int \mathbb{E}_{\bar{\boldsymbol{Q}}_x} \Bigl[ \Bigl| \frac{\mathrm{d}\boldsymbol{Q}_x}{\mathrm{d}\bar{\boldsymbol{Q}}_x} - 1 \Bigr|^p \Bigr] \, \pi(\mathrm{d}x) \,.
\]
We have a formula for the Radon–Nikodym derivative $\frac{\mathrm{d}\boldsymbol{Q}_x}{\mathrm{d}\bar{\boldsymbol{Q}}_x}$ thanks to Girsanov's theorem,
so again we can approach this via stochastic calculus. Controlling this term is slightly
more involved than controlling the KL divergence (essentially we are controlling a Rényi
divergence instead) but nevertheless we can bound the term when $h \lesssim \frac{1}{\sqrt{d}}$.
With these high-probability bounds, we can then return to the $s$-conductance analysis,
which implies that the mixing time is of order $\frac{1}{h}$. Hence, under a warm start, the mixing
time improves from $d$ to $\sqrt{d}$.
TODO: Flesh out the calculations in this section.

Lower bound. Under a warm start, [Che+21b] showed a lower bound of roughly $\widetilde{\Omega}(\sqrt{d})$,
which was improved in [WSC21] to $\widetilde{\Omega}(\kappa \sqrt{d})$. We state the result here.

Theorem 7.6.4 ([WSC21]). For every choice of step size $h > 0$, there exists a target
distribution $\pi \propto \exp(-V)$ on $\mathbb{R}^d$ with $I_d \preceq \nabla^2 V \preceq \kappa I_d$, as well as an initialization $\mu_0$
with $\chi^2(\mu_0 \,\|\, \pi) \lesssim 1$, such that the number of iterations required for MALA to reach
total variation $\varepsilon$ from $\pi$ is at least $\widetilde{\Omega}(\kappa \sqrt{d} \log(1/\varepsilon))$.

Bibliographical Notes
TODO: Fill in.

Exercises
An Overview of High-Accuracy Samplers

Exercise 7.1 (MALA is a special case of MHMC)
Show that when $K = 1$, the MHMC algorithm reduces to MALA.

Exercise 7.2 (MH filter for the leapfrog integrator)
For the leapfrog integrator $F_{\mathrm{leap}}$, verify that $F_{\mathrm{leap}}^{-1} = R \circ F_{\mathrm{leap}} \circ R$.
Hint: First show that it suffices to consider $T = h$ (i.e., $K = 1$).

Markov Chains in Discrete Time

Exercise 7.3 (reversible Markov chains as gradient descent on the Dirichlet energy)
Consider the setting of Section 7.4.

1. Show that $\mathcal{E}(f, f) = \frac{1}{2} \, \mathbb{E}[|f(X_1) - f(X_0)|^2]$, where $(X_0, X_1)$ are two successive
iterates of the Markov chain started at stationarity. We also write $\mathcal{E}(f) := \mathcal{E}(f, f)$
as a useful shorthand.

2. Show that if $(\mu_k)_{k \in \mathbb{N}}$ are the laws of the iterates of the $\ell$-lazy version of $P$, then
the relative densities $(\frac{\mu_k}{\pi})_{k \in \mathbb{N}}$ are the iterates of gradient descent on the Dirichlet
energy $\mathcal{E}$ in $L^2(\pi)$. How does the laziness parameter $\ell$ relate to the step size of the
gradient descent?

3. Observe that $\mathcal{E}$ is a convex quadratic functional; show that $0 \preceq \nabla^2_{L^2(\pi)} \mathcal{E} \preceq 2$.
What does the theory of convex optimization suggest for the value of the laziness
parameter $\ell$?

4. Next, prove a generalization of Theorem 7.4.2 for any value of the laziness parameter
$\ell \in [\frac{1}{2}, 1]$ by showing that the spectral gap condition is equivalent to strong
convexity of $\mathcal{E}$. Why do we want $\ell \ge \frac{1}{2}$ here?

5. Show that the conductance of the chain can also be described as the largest $\mathfrak{c} > 0$
such that for all events $A \subseteq \mathbb{R}^d$, it holds that $\mathcal{E}(\mathbf{1}_A) \ge \mathfrak{c} \, \|\mathbf{1}_A - \pi(A)\|_{L^2(\pi)}^2$. Hence,
conductance can be viewed as a restricted strong convexity condition (restricting
the space of functions to indicators of events). In particular, show the bound $\lambda \le \mathfrak{c}$
in Cheeger's inequality (Theorem 7.4.3).

Exercise 7.4 (entropy decay implies MLSI)
Suppose that $P$ satisfies the following entropy decay condition: there exists $c \in (0, 1)$ such
that for all probability measures $\mu$,
\[
  \mathrm{KL}(\mu P \,\|\, \pi) \le (1 - c) \, \mathrm{KL}(\mu \,\|\, \pi) \,.
\]
Prove that $P$ satisfies an MLSI with constant $C_{\mathrm{MLSI}} \le 2/c$.

Exercise 7.5 ($s$-conductance lemma)
Prove the $s$-conductance lemma (Lemma 7.4.8).

Analysis of MALA for a Feasible Start

Exercise 7.6 (analysis of MRW)
Follow the analysis in this section and adapt it to the Metropolized random walk (MRW)
algorithm. What mixing time bound can you prove?

Exercise 7.7 (mixing time for a Gaussian target)
Adapt the analysis in this section to the case when the target distribution is the standard
Gaussian. Here, it is possible to do a much more refined analysis; see if you can show
that the mixing time of MALA is $\widetilde{O}(d^{1/3} \operatorname{polylog}(1/\varepsilon))$ from a warm start. See [Che+21b,
Appendix C] for hints.

Analysis of MALA for a Warm Start

Exercise 7.8 (pointwise projection property)
Prove (7.6.3) from the pointwise projection property.
CHAPTER 8

The Proximal Sampler

In this chapter, we discuss the proximal sampler, which was introduced in [LST21c]. The
applications of the proximal sampler include improving the condition number dependence
of high-accuracy samplers and providing new state-of-the-art sampling guarantees for
various classes of target distributions. Besides these applications, the proximal sampler is
interesting in its own right due to its remarkable convergence analysis and its connections
with the proximal point method in optimization.

8.1 Introduction to the Proximal Sampler


Let $\pi \propto \exp(-V)$ denote the target distribution. We fix $h > 0$ and define the augmented
target distribution
\[
  \boldsymbol{\pi}(x, y) \propto \exp\Bigl( -V(x) - \frac{\|y - x\|^2}{2h} \Bigr) \,.
\]
To avoid confusion, we will explicitly write $\pi^X = \pi$ for the $X$-marginal, and $\pi^Y$ for the
$Y$-marginal. Similarly, $\pi^{X \mid Y}$ and $\pi^{Y \mid X}$ denote the conditional distributions.
The proximal sampler applies Gibbs sampling to the augmented target. Explicitly, the
updates of the proximal sampler are as follows.
Proximal Sampler: Initialize $X_0 \sim \mu_0$. For $k = 0, 1, 2, \dotsc$:
1. Draw $Y_k \sim \pi^{Y \mid X}(\cdot \mid X_k) = \mathrm{normal}(X_k, h I_d)$.
2. Draw $X_{k+1} \sim \pi^{X \mid Y}(\cdot \mid Y_k)$.
Since Gibbs sampling always forms a reversible Markov chain with respect to the
target distribution, we conclude that the proximal sampler is unbiased: its stationary
distribution is $\boldsymbol{\pi}$. As written, however, the proximal sampler is
an idealized algorithm because it is not yet clear how to implement the second step of
sampling from $\pi^{X \mid Y}$. Note that
\[
  \pi^{X \mid Y}(x \mid y) \propto_x \exp\Bigl( -V(x) - \frac{\|y - x\|^2}{2h} \Bigr) \,.
\]
Also, recall that in optimization, we wish to minimize the function $V$, whereas in sampling
we want to sample from $\pi \propto \exp(-V)$. Via this correspondence, we see that the
optimization analogue of sampling from $\pi^{X \mid Y}$ is computing the proximal map
\[
  \operatorname{prox}_{hV}(y) := \operatorname*{arg\,min}_{x \in \mathbb{R}^d} \Bigl\{ V(x) + \frac{\|y - x\|^2}{2h} \Bigr\} \,.
\]

The distribution 𝜋 𝑋 |𝑌 is known as the restricted Gaussian oracle (RGO), and the
proximal sampler can be viewed as the analogue of the proximal point method for sampling.
See Exercise 8.1 for another connection between the proximal sampler and the proximal
point method from optimization.

Implementability of the RGO. In order to obtain an actual algorithm from the proxi-
mal sampler, an implementation of the RGO must be provided. As we will see in Section 8.6,
the RGO can be implemented by using an auxiliary high-accuracy sampler such as MALA.
Although this may seem circular (if we need to use an auxiliary sampler to implement the
RGO, then why not use the auxiliary sampler in the first place without bothering with
the proximal sampler?), we will see there are benefits to the overall scheme. Namely, the
proximal sampler can boost the condition number dependence of the auxiliary sampler,
and it can be used to sample from a larger class of distributions.
For now, we will consider a simple implementation of the RGO based on rejection
sampling, which we studied in Section 7.1. Suppose that the potential $V$ is $\beta$-smooth.
Then, for $V_y(x) := V(x) + \frac{1}{2h} \, \|y - x\|^2$ we have $(\frac{1}{h} - \beta) \, I_d \preceq \nabla^2 V_y \preceq (\frac{1}{h} + \beta) \, I_d$. In particular,
if $h < \frac{1}{\beta}$, then the RGO is strongly log-concave. Note that the condition number of $V_y$
is $\kappa = (\frac{1}{h} + \beta)/(\frac{1}{h} - \beta)$. If we now choose $h = \frac{1}{\beta d}$, we can check that $\kappa \le \exp(4/d)$ for
$d \ge 2$. By Proposition 7.1.2, if we have access to the minimizer of $V_y$ (which is equivalent
to being able to compute the proximal operator for $hV$), we can construct an upper
envelope for which the average number of iterations of rejection sampling is bounded by
$\kappa^{d/2} \le \exp(2) \le 8$. We summarize this discussion as follows.

Implementing the RGO via Rejection Sampling: To sample from $\pi^{X \mid Y}(\cdot \mid y)$,
where $V$ is $\beta$-smooth, we proceed via the following steps.
1. Compute the minimizer $x_y^\star$ of $V_y$ defined via $V_y(x) := V(x) + \frac{1}{2h} \, \|y - x\|^2$, and
compute the minimum value $V_y^\star$. This can be done exactly if we assume access
to the proximal mapping of $V$ (which is a natural assumption when designing a
proximal algorithm for sampling); otherwise, if $h < \frac{1}{\beta}$, then this is a strongly convex
optimization problem and can be implemented using standard algorithms.
2. Let $\widetilde{\pi}^{X \mid Y}(\cdot \mid y) := \exp\{-(V_y - V_y^\star)\}$ and $\widetilde{\mu}_y := \exp(-\frac{1/h - \beta}{2} \, \|\cdot - x_y^\star\|^2)$. Use rejection
sampling to sample from $\pi^{X \mid Y}$ with the envelope $\widetilde{\mu}_y$.

Each iteration of rejection sampling requires one call to an evaluation oracle for $V$ (in
order to compute the acceptance probability). We summarize the guarantees for this
implementation of the RGO in the following theorem.

Theorem 8.1.1. Assume that $V$ is $\beta$-smooth. Then, if $h \le \frac{1}{\beta d}$, rejection sampling
implements the RGO for $\pi^X$ exactly using one computation of the proximal map for $V$
and $O(1)$ expected calls to an evaluation oracle for $V$.
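The two steps of the rejection sampling implementation can be sketched as follows. This is a minimal illustration under assumptions rather than a reference implementation: the potential $V(x) = \sum_i [x_i^2/2 + \log\cosh x_i]$ (for which $\beta = 2$), the coordinate-wise bisection used to compute the minimizer of $V_y$, and all function names are hypothetical choices made for the example.

```python
import math
import random

BETA = 2.0  # smoothness constant of the example potential


def log_cosh(t):
    a = abs(t)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def V(x):
    """Example potential: sum_i [x_i^2 / 2 + log cosh(x_i)]."""
    return sum(0.5 * t * t + log_cosh(t) for t in x)


def V_y(x, y, h):
    """Regularized potential V(x) + ||y - x||^2 / (2h)."""
    return V(x) + sum((b - a) ** 2 for a, b in zip(x, y)) / (2.0 * h)


def prox(y, h, iters=80):
    """Minimizer of V_y, found coordinate by coordinate via bisection on the derivative."""
    out = []
    for b in y:
        lo, hi = min(0.0, b), max(0.0, b)  # the root lies between 0 and y_i
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            deriv = mid + math.tanh(mid) - (b - mid) / h
            if deriv > 0.0:
                hi = mid
            else:
                lo = mid
        out.append(0.5 * (lo + hi))
    return out


def sample_rgo(y, h, rng):
    """Draw from pi^{X|Y}(. | y) by rejection sampling; returns (sample, number of proposals)."""
    x_star = prox(y, h)
    v_star = V_y(x_star, y, h)
    curvature = 1.0 / h - BETA  # envelope is exp(-(curvature / 2) ||x - x_star||^2)
    sd = 1.0 / math.sqrt(curvature)
    trials = 0
    while True:
        trials += 1
        x = [m + sd * rng.gauss(0.0, 1.0) for m in x_star]
        dist2 = sum((a - m) ** 2 for a, m in zip(x, x_star))
        # log of (target / envelope); clipped at 0 to guard against rounding error
        log_accept = min(0.0, -(V_y(x, y, h) - v_star) + 0.5 * curvature * dist2)
        if math.log(1.0 - rng.random()) < log_accept:
            return x, trials


d = 2
h = 1.0 / (BETA * d)
rng = random.Random(0)
draws = [sample_rgo([0.7, -0.4], h, rng) for _ in range(2000)]
mean_trials = sum(t for _, t in draws) / len(draws)
print(f"average number of proposals per RGO sample: {mean_trials:.2f}")
```

Each proposal costs one evaluation of $V$, so the average number of proposals reported above is the empirical counterpart of the $O(1)$ expected oracle calls in Theorem 8.1.1.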

Notation. We write $\mu_k^X$ for the law of $X_k$ and $\mu_k^Y$ for the law of $Y_k$ for the iterates of
the proximal sampler. Observe that if $(Q_t)_{t \ge 0}$ denotes the standard heat semigroup, i.e.
$\mu Q_t = \mu * \mathrm{normal}(0, t I_d)$, then $\mu_k^Y = \mu_k^X Q_h$ and $\pi^Y = \pi^X Q_h$.
We also abbreviate $\pi^{X \mid Y}(\cdot \mid y)$ as $\pi^{X \mid Y = y}$.

8.2 Convergence under Strong Log-Concavity


One of the most remarkable features of the proximal sampler is that its convergence
analysis closely mirrors the continuous-time theory for the Langevin diffusion. In this
section, we initiate this study starting with the strongly log-concave case.
Recall that under strong log-concavity, we have contraction of the Langevin diffusion
(Theorem 1.4.10). We prove the analogue of this fact for the proximal sampler.

Theorem 8.2.1. Assume that the target $\pi^X$ is $\alpha$-strongly log-concave. Also, let $(\mu_k^X)_{k \in \mathbb{N}}$
and $(\bar{\mu}_k^X)_{k \in \mathbb{N}}$ denote two runs of the proximal sampler with target $\pi^X$. Then,
\[
  W_2(\mu_k^X, \bar{\mu}_k^X) \le \frac{W_2(\mu_0^X, \bar{\mu}_0^X)}{(1 + \alpha h)^k} \,.
\]

The contraction factor matches the contraction for the proximal point method in
optimization, see Exercise 8.2. Since 𝜋 𝑋 is left invariant by the proximal sampler, the
contraction result also implies a convergence result in 𝑊2 .
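For a one-dimensional Gaussian target the proximal sampler can be tracked in closed form, which gives a quick check of the contraction in Theorem 8.2.1. The sketch below assumes $V(x) = \alpha x^2/2$, so that the RGO is $\mathrm{normal}(\frac{y}{1 + \alpha h}, \frac{h}{1 + \alpha h})$, and uses the standard formula for $W_2$ between one-dimensional Gaussians; the function names and the two initial laws are hypothetical.

```python
import math


def proximal_step_gaussian(mean, var, alpha, h):
    """Push the law normal(mean, var) through one proximal-sampler iteration
    when the target is normal(0, 1/alpha) in one dimension."""
    var_y = var + h                    # Y_k = X_k + normal(0, h)
    shrink = 1.0 / (1.0 + alpha * h)   # RGO: X | Y = y ~ normal(shrink * y, shrink * h)
    return shrink * mean, shrink**2 * var_y + shrink * h


def w2_gaussians(m1, v1, m2, v2):
    """W_2 distance between normal(m1, v1) and normal(m2, v2) on the real line."""
    return math.sqrt((m1 - m2) ** 2 + (math.sqrt(v1) - math.sqrt(v2)) ** 2)


alpha, h = 1.0, 0.5
law_a = (3.0, 0.0)    # point mass at 3
law_b = (-1.0, 4.0)   # normal(-1, 4)
w0 = w2_gaussians(*law_a, *law_b)

history = []
for k in range(1, 11):
    law_a = proximal_step_gaussian(*law_a, alpha, h)
    law_b = proximal_step_gaussian(*law_b, alpha, h)
    history.append((k, w2_gaussians(*law_a, *law_b), w0 / (1.0 + alpha * h) ** k))

for k, dist, bound in history:
    print(f"k = {k:2d}: W2 = {dist:.5f} <= {bound:.5f}")
```

In this example the means contract by exactly the factor $\frac{1}{1 + \alpha h}$ per iteration, while the standard deviations approach each other faster, so the bound is essentially attained by the mean component.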
We will give two proofs of this theorem. First, note that it suffices to consider
one iteration and to prove $W_2(\mu_1^X, \bar{\mu}_1^X) \le \frac{1}{1 + \alpha h} \, W_2(\mu_0^X, \bar{\mu}_0^X)$. Next, since the heat flow
is a Wasserstein contraction (which follows from (1.4.9) but can also be proven by a
straightforward coupling), it holds that $W_2(\mu_0^Y, \bar{\mu}_0^Y) \le W_2(\mu_0^X, \bar{\mu}_0^X)$, so it suffices to show
$W_2(\mu_1^X, \bar{\mu}_1^X) \le \frac{1}{1 + \alpha h} \, W_2(\mu_0^Y, \bar{\mu}_0^Y)$.
We will use the following coupling lemma.

Lemma 8.2.2. Suppose that for all $y, \bar y \in \mathbb R^d$, we have

\[ W_2(\pi^{X \mid Y=y}, \pi^{X \mid Y=\bar y}) \le C\, \|y - \bar y\|\,. \tag{8.2.3} \]

Then, $W_2(\mu_1^X, \bar\mu_1^X) \le C\, W_2(\mu_0^Y, \bar\mu_0^Y)$.

The intuition is that since 𝜇 1𝑋 and 𝜇¯1𝑋 are obtained from 𝜇𝑌0 and 𝜇¯𝑌0 by sampling from
the RGO 𝜋 𝑋 |𝑌 , the contraction statement in (8.2.3) can be used to bound 𝑊2 (𝜇1𝑋 , 𝜇¯1𝑋 ). The
proof of the lemma is relatively straightforward and good practice for working with
couplings, so it is left as Exercise 8.3.
The first proof we present is from [LST21b].
Proof of Theorem 8.2.1 via functional inequalities. To prove (8.2.3), we note that $\pi^{X \mid Y}(\cdot \mid \bar y)$
is $(\alpha + \frac{1}{h})$-strongly log-concave. Recall that by the Bakry–Émery theorem (Theorem 1.2.28)
and the Otto–Villani theorem (Exercise 1.16) this implies the log-Sobolev inequality (1.4.7)
and Talagrand’s T2 inequality (1.4.8). Applying these inequalities,
\[ W_2^2(\pi^{X \mid Y=y}, \pi^{X \mid Y=\bar y}) \le \frac{2}{\alpha + \frac{1}{h}}\, \mathrm{KL}(\pi^{X \mid Y=y} \,\|\, \pi^{X \mid Y=\bar y}) \le \frac{1}{(\alpha + \frac{1}{h})^2}\, \mathrm{FI}(\pi^{X \mid Y=y} \,\|\, \pi^{X \mid Y=\bar y})\,. \]
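We can compute the Fisher information explicitly. Indeed,

\[ \nabla \ln \frac{\pi^{X \mid Y=y}}{\pi^{X \mid Y=\bar y}} = \nabla \Bigl( \frac{\|\bar y - \cdot\|^2}{2h} - \frac{\|y - \cdot\|^2}{2h} \Bigr) = \frac{y - \bar y}{h} \]

so that
\[ \mathrm{FI}(\pi^{X \mid Y=y} \,\|\, \pi^{X \mid Y=\bar y}) = \mathbb E_{\pi^{X \mid Y=y}}\Bigl[ \Bigl\| \nabla \ln \frac{\pi^{X \mid Y=y}}{\pi^{X \mid Y=\bar y}} \Bigr\|^2 \Bigr] = \frac{\|y - \bar y\|^2}{h^2}\,. \]
Hence,
\[ W_2^2(\pi^{X \mid Y=y}, \pi^{X \mid Y=\bar y}) \le \frac{1}{(\alpha + \frac{1}{h})^2}\, \frac{\|y - \bar y\|^2}{h^2} = \frac{1}{(1 + \alpha h)^2}\, \|y - \bar y\|^2\,. \qquad \square \]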
The next proof, from [Che+22a], directly uses strong convexity in Wasserstein space.
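Proof of Theorem 8.2.1 via Wasserstein calculus. This proof rests on the following interpretation
of the RGO. Let $\mathcal F(\mu) := \mathrm{KL}(\mu \,\|\, \pi^X)$. Then, by Exercise 8.1,
\[ \pi^{X \mid Y=y} = \arg\min_{\mu \in \mathcal P_2(\mathbb R^d)} \Bigl\{ \mathcal F(\mu) + \frac{1}{2h}\, W_2^2(\mu, \delta_y) \Bigr\} =: \mathrm{prox}_{h\mathcal F}(\delta_y)\,. \]
The first-order optimality condition on Wasserstein space [AGS08, Lemma 10.1.2] reads
\[ 0 \in \partial \mathcal F(\pi^{X \mid Y=y}) + \frac{1}{h}\,(\mathrm{id} - y)\,, \qquad \pi^{X \mid Y=y}\text{-a.s.}, \]

where $\partial \mathcal F$ is the subdifferential of $\mathcal F$ on Wasserstein space.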
Using this, we obtain
id ∈ 𝑦 − ℎ 𝜕F(𝜋 𝑋 |𝑌 =𝑦 ) , 𝜋 𝑋 |𝑌 =𝑦 -a.s.
id ∈ 𝑦¯ − ℎ 𝜕F(𝜋 𝑋 |𝑌 =𝑦¯ ) , 𝜋 𝑋 |𝑌 =𝑦¯ -a.s.
Let 𝑇 be the optimal transport map from 𝜋 𝑋 |𝑌 =𝑦 to 𝜋 𝑋 |𝑌 =𝑦¯ . The second condition above
can then be rewritten as
𝑇 ∈ 𝑦¯ − ℎ 𝜕F(𝜋 𝑋 |𝑌 =𝑦¯ ) ◦ 𝑇 , 𝜋 𝑋 |𝑌 =𝑦 -a.s.
We now abuse notation and write 𝜕F(𝜋 𝑋 |𝑌 =𝑦 ) for a particular element of the subdif-
ferential and similarly for 𝜕F(𝜋 𝑋 |𝑌 =𝑦¯ ). Then, 𝜋 𝑋 |𝑌 =𝑦 -a.s.,
k𝑇 − idk 2 = k𝑦¯ − 𝑦 k 2 − 2ℎ h𝜕F(𝜋 𝑋 |𝑌 =𝑦¯ ) ◦ 𝑇 − 𝜕F(𝜋 𝑋 |𝑌 =𝑦 ),𝑇 − idi
− ℎ 2 k𝜕F(𝜋 𝑋 |𝑌 =𝑦¯ ) ◦ 𝑇 − 𝜕F(𝜋 𝑋 |𝑌 =𝑦 )k 2 .
We now integrate w.r.t. 𝜋 𝑋 |𝑌 =𝑦 and apply strong convexity of F in Wasserstein space:
𝑊22 (𝜋 𝑋 |𝑌 =𝑦 , 𝜋 𝑋 |𝑌 =𝑦¯ ) ≤ k𝑦 − 𝑦¯ k 2 − 2𝛼ℎ 𝑊22 (𝜋 𝑋 |𝑌 =𝑦 , 𝜋 𝑋 |𝑌 =𝑦¯ ) − 𝛼 2ℎ 2 𝑊22 (𝜋 𝑋 |𝑌 =𝑦 , 𝜋 𝑋 |𝑌 =𝑦¯ )
and hence
\[ W_2^2(\pi^{X \mid Y=y}, \pi^{X \mid Y=\bar y}) \le \frac{1}{(1 + \alpha h)^2}\, \|y - \bar y\|^2\,. \qquad \square \]

The point of the second proof is that, although it uses some heavy machinery, it is just
a translation of a Euclidean optimization proof into the language of Wasserstein space
(see Exercise 8.2).
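For comparison, the Euclidean computation being translated can be sketched as follows. This is a worked version under the assumption that $V$ is differentiable and $\alpha$-strongly convex; the symbols $x$ and $\bar x$ for the two proximal points are introduced here for illustration.

```latex
\begin{align*}
  &x := \mathrm{prox}_{hV}(y)\,, \quad \bar x := \mathrm{prox}_{hV}(\bar y)
  \quad\Longrightarrow\quad
  x = y - h\,\nabla V(x)\,, \quad \bar x = \bar y - h\,\nabla V(\bar x)\,, \\
  &\|\bar x - x\|^2
  = \|\bar y - y\|^2
    - 2h\,\langle \nabla V(\bar x) - \nabla V(x), \bar x - x \rangle
    - h^2\,\|\nabla V(\bar x) - \nabla V(x)\|^2 \\
  &\phantom{\|\bar x - x\|^2}
  \le \|\bar y - y\|^2 - 2\alpha h\,\|\bar x - x\|^2 - \alpha^2 h^2\,\|\bar x - x\|^2\,, \\
  &\text{so that}\quad
  \|\bar x - x\| \le \frac{1}{1 + \alpha h}\,\|\bar y - y\|\,.
\end{align*}
```

The inequality uses the two consequences of strong convexity, $\langle \nabla V(\bar x) - \nabla V(x), \bar x - x\rangle \ge \alpha\,\|\bar x - x\|^2$ and $\|\nabla V(\bar x) - \nabla V(x)\| \ge \alpha\,\|\bar x - x\|$, which play the role of the strong convexity of $\mathcal F$ in Wasserstein space in the proof above.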

8.3 Simultaneous Heat Flow and Time Reversal


We now introduce two new techniques in order to further analyze the proximal sampler.

Simultaneous heat flow. The first technique is based on the observation that in going
from 𝜇𝑘𝑋 to 𝜇𝑘𝑌 , and from 𝜋 𝑋 to 𝜋 𝑌 , we are applying the heat flow. Given any 𝑓 -divergence
D 𝑓 (· k ·), we will compute its time derivative when both arguments undergo simultaneous
heat flow. Remarkably, the result will be almost the same as the time derivative of the
𝑓 -divergence to the target along the continuous-time Langevin diffusion, in a sense to be
made precise. The upshot is that the analysis of the proximal sampler closely resembles
the analysis of the continuous-time Langevin diffusion.
The simultaneous heat flow calculation is inspired by [VW19], and was carried out at
this level of generality in [Che+22a].
Let 𝑓 : R+ → R+ be a convex function with 𝑓 (1) = 0, and let D 𝑓 be the associated
𝑓 -divergence (see Section 1.5). We begin with a quick computation of the time derivative
of the 𝑓 -divergence along the Langevin diffusion.

Theorem 8.3.1. Let (𝜋𝑡 )𝑡 ≥0 denote the law of the continuous-time Langevin diffusion
with target 𝜋. Then, for any 𝑓 -divergence D 𝑓 , it holds that

𝜕𝑡 D 𝑓 (𝜋𝑡 k 𝜋) = −J𝑓 (𝜋𝑡 k 𝜋) ,

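where
\[ \mathcal J_f(\mu \,\|\, \pi) := \mathbb E_\mu \Bigl\langle \nabla \Bigl( f' \circ \frac{\mu}{\pi} \Bigr), \nabla \ln \frac{\mu}{\pi} \Bigr\rangle\,. \tag{8.3.2} \]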

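Proof. Using the Fokker–Planck equation,
\begin{align*}
\partial_t \mathcal D_f(\pi_t \,\|\, \pi)
&= \partial_t \int f\Bigl(\frac{\pi_t}{\pi}\Bigr)\, \mathrm d\pi
= \int f'\Bigl(\frac{\pi_t}{\pi}\Bigr)\, \partial_t \pi_t
= \int f'\Bigl(\frac{\pi_t}{\pi}\Bigr)\, \operatorname{div}\Bigl(\pi_t \nabla \ln \frac{\pi_t}{\pi}\Bigr) \\
&= -\int \Bigl\langle \nabla \Bigl( f' \circ \frac{\pi_t}{\pi} \Bigr), \nabla \ln \frac{\pi_t}{\pi} \Bigr\rangle\, \mathrm d\pi_t\,. \qquad \square
\end{align*}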
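As a worked illustration of the definition (8.3.2), the two $f$-divergences used later in the chapter give the following dissipation functionals. The particular normalization $f(x) = x \ln x - x + 1$ for the KL divergence is one convenient choice made here; adding a multiple of $x - 1$ to $f$ changes neither $\mathcal D_f$ nor $\mathcal J_f$.

```latex
\begin{align*}
  \text{KL divergence:}\quad
  & f(x) = x \ln x - x + 1\,, \quad f'(x) = \ln x\,, \\
  & \mathcal J_f(\mu \,\|\, \pi)
    = \mathbb E_\mu\Bigl\langle \nabla \ln\frac{\mu}{\pi}, \nabla \ln\frac{\mu}{\pi} \Bigr\rangle
    = \mathrm{FI}(\mu \,\|\, \pi)\,, \\[4pt]
  \text{chi-squared divergence:}\quad
  & f(x) = (x-1)^2\,, \quad f'(x) = 2\,(x-1)\,, \\
  & \mathcal J_f(\mu \,\|\, \pi)
    = \int \Bigl\langle 2\,\nabla\frac{\mu}{\pi}, \frac{\pi}{\mu}\,\nabla\frac{\mu}{\pi} \Bigr\rangle\, \mathrm d\mu
    = 2\, \mathbb E_\pi\Bigl[ \Bigl\| \nabla\frac{\mu}{\pi} \Bigr\|^2 \Bigr]\,.
\end{align*}
```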
Next, we compute the time derivative of the 𝑓 -divergence when both arguments
simultaneously evolve according to the heat flow.

Theorem 8.3.3. Let $(Q_t)_{t \ge 0}$ denote the standard heat semigroup. Then,
\[ \partial_t \mathcal D_f(\mu Q_t \,\|\, \pi Q_t) = -\frac{1}{2}\, \mathcal J_f(\mu Q_t \,\|\, \pi Q_t)\,, \]
where $\mathcal J_f$ is defined in (8.3.2).

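Proof. For brevity, write $\mu_t := \mu Q_t$ and $\pi_t := \pi Q_t$. Since $\partial_t \mu_t = \frac{1}{2} \Delta \mu_t = \frac{1}{2} \operatorname{div}(\mu_t \nabla \ln \mu_t)$
and similarly for $\partial_t \pi_t$, we compute
\begin{align*}
2\, \partial_t \mathcal D_f(\mu_t \,\|\, \pi_t)
&= 2\, \partial_t \int f\Bigl(\frac{\mu_t}{\pi_t}\Bigr)\, \mathrm d\pi_t
= 2 \int f'\Bigl(\frac{\mu_t}{\pi_t}\Bigr)\, \Bigl( \partial_t \mu_t - \frac{\mu_t}{\pi_t}\, \partial_t \pi_t \Bigr) + 2 \int f\Bigl(\frac{\mu_t}{\pi_t}\Bigr)\, \partial_t \pi_t \\
&= \int f'\Bigl(\frac{\mu_t}{\pi_t}\Bigr)\, \Bigl( \operatorname{div}(\mu_t \nabla \ln \mu_t) - \frac{\mu_t}{\pi_t} \operatorname{div}(\pi_t \nabla \ln \pi_t) \Bigr) + \int f\Bigl(\frac{\mu_t}{\pi_t}\Bigr) \operatorname{div}(\pi_t \nabla \ln \pi_t) \\
&= -\int \Bigl\langle \nabla \Bigl( f' \circ \frac{\mu_t}{\pi_t} \Bigr), \nabla \ln \mu_t \Bigr\rangle\, \mathrm d\mu_t + \int \Bigl\langle \nabla \Bigl( f'\Bigl(\frac{\mu_t}{\pi_t}\Bigr)\, \frac{\mu_t}{\pi_t} \Bigr), \nabla \ln \pi_t \Bigr\rangle\, \mathrm d\pi_t \\
&\qquad - \int \Bigl\langle \nabla \Bigl( f \circ \frac{\mu_t}{\pi_t} \Bigr), \nabla \ln \pi_t \Bigr\rangle\, \mathrm d\pi_t \\
&= -\int \Bigl\langle \nabla \Bigl( f' \circ \frac{\mu_t}{\pi_t} \Bigr), \nabla \ln \frac{\mu_t}{\pi_t} \Bigr\rangle\, \mathrm d\mu_t + \int \Bigl\langle \nabla \frac{\mu_t}{\pi_t}, \nabla \ln \pi_t \Bigr\rangle\, f'\Bigl(\frac{\mu_t}{\pi_t}\Bigr)\, \mathrm d\pi_t \\
&\qquad - \int \Bigl\langle \nabla \frac{\mu_t}{\pi_t}, \nabla \ln \pi_t \Bigr\rangle\, f'\Bigl(\frac{\mu_t}{\pi_t}\Bigr)\, \mathrm d\pi_t \\
&= -\mathcal J_f(\mu_t \,\|\, \pi_t)\,. \qquad \square
\end{align*}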
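The identity can be sanity-checked numerically for the KL divergence in one dimension, where both sides have closed forms. The sketch below assumes $\mu = \mathrm{normal}(m, s^2)$ and $\pi = \mathrm{normal}(0, \sigma^2)$, so that $\mu Q_t = \mathrm{normal}(m, s^2 + t)$ and $\pi Q_t = \mathrm{normal}(0, \sigma^2 + t)$; the helper names are illustrative.

```python
import math


def kl_gauss(m, a, b):
    """KL(N(m, a) || N(0, b)) in one dimension, where a and b are variances."""
    return 0.5 * (math.log(b / a) + a / b + m * m / b - 1.0)


def fisher_gauss(m, a, b):
    """FI(N(m, a) || N(0, b)) = E_mu |(ln(mu/pi))'|^2.

    The score difference is x (1/b - 1/a) + m/a, which under x ~ N(m, a)
    has mean m/b and variance a (1/b - 1/a)^2.
    """
    return m * m / (b * b) + a * (1.0 / b - 1.0 / a) ** 2


def kl_time_derivative(m, s2, sigma2, t, eps=1e-5):
    """Central finite difference of t -> KL(mu Q_t || pi Q_t)."""
    up = kl_gauss(m, s2 + t + eps, sigma2 + t + eps)
    down = kl_gauss(m, s2 + t - eps, sigma2 + t - eps)
    return (up - down) / (2.0 * eps)
```

For instance, with $m = 1.5$, $s^2 = 0.2$, $\sigma^2 = 1$, and $t = 0.4$, the value of `kl_time_derivative` should agree with `-0.5 * fisher_gauss(1.5, 0.6, 1.4)` up to finite-difference error.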

Although this theorem is already enough to prove new convergence results for the
proximal sampler, the rates will be slightly suboptimal. The reason is that we
have only considered one step of the proximal sampler, in which the algorithm goes from
$\mu_k^X$ to $\mu_k^Y$ (and the target goes from $\pi^X$ to $\pi^Y$). In order to obtain the sharp convergence
rates, we also need to consider the second step, in which we go from $\mu_k^Y$ to $\mu_{k+1}^X$ (and the
target returns from $\pi^Y$ to $\pi^X$). For reasons that will become clear shortly, we refer to these
steps as the “forwards step” and the “backwards step” respectively.
First, consider the evolution of the target along the heat semigroup 𝑡 ↦→ 𝜋 𝑋 𝑄𝑡 , so
that at time ℎ we arrive at 𝜋 𝑌 . The stochastic process representation of this evolution is
d𝑍𝑡 = d𝐵𝑡 , with 𝑍 0 ∼ 𝜋 𝑋 and 𝑍ℎ ∼ 𝜋 𝑌 , thus describing the forward step. So far so good,
but how should we think about the backwards step? By definition, $\pi^X$ is obtained from
$\pi^Y$ by the relation $\pi^X = \int \pi^{X \mid Y=y}\, \mathrm d\pi^Y(y)$, but this is not as helpful because we lose the
stochastic process view which allows us to apply calculus. Instead, we will think of 𝜋 𝑋 as
being obtained from 𝜋 𝑌 by the time reversal of the diffusion (𝑍𝑡 )𝑡 ∈[0,ℎ] .

Time reversal. To emphasize the underlying principles, we will consider a more general
diffusion d𝑍𝑡 = 𝑏𝑡 (𝑍𝑡 ) d𝑡 + d𝐵𝑡 on the time interval [0,𝑇 ]. The discussion can also be
extended to include non-isotropic diffusion matrices, but we do not consider this for
clarity of exposition. Also, let 𝜋𝑡 denote the law of 𝑍𝑡 .
The time reversal of the diffusion is $(Z_t^\leftarrow)_{t \in [0,T]} := (Z_{T-t})_{t \in [0,T]}$. Then, $(Z_t^\leftarrow)_{t \in [0,T]}$ is
also a Markov process, and our aim is to describe its evolution via a stochastic differential
equation.$^1$ First, note that since $(Z_t)_{t \in [0,T]}$ is not time-homogeneous, we need a family of
generators $(\mathscr L_t)_{t \in [0,T]}$, where $\mathscr L_t f := \frac{1}{2} \Delta f + \langle b_t, \nabla f \rangle$. It suffices to compute the generators
$(\mathscr L_t^\leftarrow)_{t \in [0,T]}$ of the reversed process.

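Theorem 8.3.4. The generator of the reversed process at time $t$ is

\[ \mathscr L_t^\leftarrow f = \frac{1}{2} \Delta f + \langle -b_{T-t} + \nabla \ln \pi_{T-t}, \nabla f \rangle\,. \]

This shows that the reversed process satisfies

\[ \mathrm d Z_t^\leftarrow = \{ -b_{T-t}(Z_t^\leftarrow) + \nabla \ln \pi_{T-t}(Z_t^\leftarrow) \}\, \mathrm dt + \mathrm dB_t\,. \tag{8.3.5} \]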
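Proof. The generator of the reversed process by definition satisfies, for any test function
$f$ and any $0 \le s < t \le T$,
\[ \partial_t\, \mathbb E[f(Z_t^\leftarrow) \mid Z_s^\leftarrow] = \mathbb E[\mathscr L_t^\leftarrow f(Z_t^\leftarrow) \mid Z_s^\leftarrow]\,. \]
Equivalently, for any test function $g$,
\[ \partial_t\, \mathbb E[f(Z_t^\leftarrow)\, g(Z_s^\leftarrow)] = \mathbb E[\mathscr L_t^\leftarrow f(Z_t^\leftarrow)\, g(Z_s^\leftarrow)]\,. \]
Switching from $Z^\leftarrow$ to $Z$ and using the change of notation $s \leftarrow T - t$ and $t \leftarrow T - s$, so
that $s \le t$, we want
\[ \partial_s\, \mathbb E[f(Z_s)\, g(Z_t)] = -\mathbb E[\mathscr L_{T-s}^\leftarrow f(Z_s)\, g(Z_t)]\,. \tag{8.3.6} \]
Write $P_{s,t}\, g(x) := \mathbb E[g(Z_t) \mid Z_s = x]$. Then, in this context, Kolmogorov’s equation reads
$\partial_s P_{s,t}\, g = -\mathscr L_s P_{s,t}\, g$ (this is admittedly a confusing calculation because we are differentiating
w.r.t. $s$, not $t$, but it can be checked via Taylor expansion). Thus,
\begin{align*}
\partial_s\, \mathbb E[f(Z_s)\, g(Z_t)]
&= \partial_s\, \mathbb E[f(Z_s)\, P_{s,t}\, g(Z_s)] = \partial_s \int f\, P_{s,t}\, g\, \mathrm d\pi_s \\
&= -\int f\, \mathscr L_s P_{s,t}\, g\; \pi_s + \int f\, P_{s,t}\, g\; \mathscr L_s^* \pi_s\,,
\end{align*}
where $\mathscr L_s^*$ denotes the Lebesgue adjoint of $\mathscr L_s$. A direct computation using the definition
of $\mathscr L_s$ shows that $\mathscr L_s^*(f \pi_s) = f\, \mathscr L_s^* \pi_s + \{ \frac{1}{2} \Delta f + \langle -b_s + \nabla \ln \pi_s, \nabla f \rangle \}\, \pi_s$, so
\begin{align*}
\partial_s\, \mathbb E[f(Z_s)\, g(Z_t)]
&= -\int \{ \mathscr L_s^*(f \pi_s) - f\, \mathscr L_s^* \pi_s \}\, P_{s,t}\, g \\
&= -\int \Bigl\{ \frac{1}{2} \Delta f + \langle -b_s + \nabla \ln \pi_s, \nabla f \rangle \Bigr\}\, P_{s,t}\, g\, \mathrm d\pi_s \\
&= -\mathbb E\Bigl[ \Bigl\{ \frac{1}{2} \Delta f(Z_s) + \langle -b_s(Z_s) + \nabla \ln \pi_s(Z_s), \nabla f(Z_s) \rangle \Bigr\}\, g(Z_t) \Bigr]\,.
\end{align*}
Comparing this with (8.3.6), we obtain the result. $\square$

$^1$ Why can the reversed process be described by a stochastic differential equation? An elegant approach
to this question was proposed by Föllmer in [Föl85]: via Girsanov’s theorem it suffices to show that the law
of the time reversal on path space has finite KL divergence w.r.t. the Wiener measure. This well-known
paper of Föllmer is often cited by works on the stochastic process (the time reversal of Brownian motion)
which now bears his name.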

Applying the time reversal to the RGO. Now let (𝑍𝑡 )𝑡 ∈[0,ℎ] be standard Brownian
motion with 𝑍 0 ∼ 𝜋 𝑋 , so that 𝑍ℎ ∼ 𝜋 𝑌 . Then, (𝑍 0, 𝑍ℎ ) ∼ 𝝅, and in particular the law of
𝑍 0 conditioned on 𝑍ℎ = 𝑦 is 𝜋 𝑋 |𝑌 =𝑦 . On the other hand, this is the same as the law of 𝑍ℎ←
conditioned on 𝑍 0← = 𝑦, where (𝑍𝑡← )𝑡 ∈[0,ℎ] is the time reversal of (𝑍𝑡 )𝑡 ∈[0,ℎ] .
Using the explicit form of the reversed process in (8.3.5), we know that

d𝑍𝑡← = ∇ ln(𝜋 𝑋 𝑄ℎ−𝑡 )(𝑍𝑡← ) d𝑡 + d𝐵𝑡 . (8.3.7)

If we initialize the process at $Z_0^\leftarrow \sim \pi^Y$, then $Z_h^\leftarrow \sim \pi^X$. On the other hand, if we initialize
the process at $Z_0^\leftarrow \sim \mu_k^Y$, then the law of $Z_h^\leftarrow$ is $\int \pi^{X \mid Y=y}\, \mathrm d\mu_k^Y(y) = \mu_{k+1}^X$. Thus, we have
successfully exhibited a stochastic process representation which takes us from $\mu_k^Y$ to $\mu_{k+1}^X$.

For any measure $\mu$, write $\mu Q_t^\leftarrow$ for the law of $Z_t^\leftarrow$ initialized at $Z_0^\leftarrow \sim \mu$.
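To make the reversed diffusion concrete, here is a small simulation sketch in one dimension. It assumes a Gaussian target $\pi^X = \mathrm{normal}(0, \sigma^2)$, for which $\pi^X Q_{h-t} = \mathrm{normal}(0, \sigma^2 + h - t)$ and the drift in (8.3.7) is $-z/(\sigma^2 + h - t)$, and it discretizes the SDE with the Euler–Maruyama scheme (a choice made here purely for illustration). Started at $Z_0^\leftarrow = y$, the output should be approximately distributed as the RGO $\pi^{X \mid Y=y} = \mathrm{normal}(\frac{\sigma^2 y}{\sigma^2 + h}, \frac{\sigma^2 h}{\sigma^2 + h})$.

```python
import math
import random


def reverse_heat_flow_sample(y, sigma2, h, n_steps, rng):
    """Euler-Maruyama discretization of
    dZ = grad ln(pi^X Q_{h-t})(Z) dt + dB on [0, h], started at Z_0 = y,
    for the Gaussian target pi^X = N(0, sigma2)."""
    z = y
    dt = h / n_steps
    for k in range(n_steps):
        t = k * dt
        drift = -z / (sigma2 + h - t)   # score of N(0, sigma2 + h - t) at z
        z += drift * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return z


def rgo_moments(y, sigma2, h):
    """Exact mean and variance of the RGO for the Gaussian target."""
    return sigma2 * y / (sigma2 + h), sigma2 * h / (sigma2 + h)
```

Comparing the empirical mean and variance of many independent calls to `reverse_heat_flow_sample` against `rgo_moments` gives a quick check that the time reversal indeed implements the backwards step.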

Simultaneous backwards heat flow. Next, we will show that the time derivative of
the 𝑓 -divergence along the simultaneous backwards heat flow also behaves the same way
as the simultaneous forwards heat flow. This leads to a pleasing symmetry between the
forwards and backwards steps of the proximal sampler.

Theorem 8.3.8. Let $(Q_t^\leftarrow)_{t \in [0,h]}$ denote the construction described above by reversing
the heat flow started at $\pi^X$. Then,
\[ \partial_t \mathcal D_f(\mu Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow) = -\frac{1}{2}\, \mathcal J_f(\mu Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow)\,, \]
where $\mathcal J_f$ is defined in (8.3.2).

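Proof. For brevity, write $\mu_t^\leftarrow := \mu Q_t^\leftarrow$ and $\pi_t^\leftarrow := \pi^Y Q_t^\leftarrow$. By construction of the reversed
process, $\pi_t^\leftarrow = \pi^X Q_{h-t}$. Then, by the Fokker–Planck equation,
\begin{align*}
\partial_t \pi_t^\leftarrow &= -\operatorname{div}(\pi_t^\leftarrow \nabla \ln \pi_t^\leftarrow) + \frac{1}{2} \Delta \pi_t^\leftarrow = -\frac{1}{2} \Delta \pi_t^\leftarrow\,, \\
\partial_t \mu_t^\leftarrow &= -\operatorname{div}(\mu_t^\leftarrow \nabla \ln \pi_t^\leftarrow) + \frac{1}{2} \Delta \mu_t^\leftarrow = \operatorname{div}\Bigl( \mu_t^\leftarrow \nabla \ln \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow} \Bigr) - \frac{1}{2} \Delta \mu_t^\leftarrow\,.
\end{align*}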
Note that the fact that (𝜋𝑡← )𝑡 ∈[0,ℎ] satisfies the backwards heat equation is completely
natural in light of our construction via the reversed process.
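Hence, we compute
\begin{align*}
2\, \partial_t \mathcal D_f(\mu_t^\leftarrow \,\|\, \pi_t^\leftarrow)
&= 2 \int f'\Bigl(\frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\Bigr)\, \Bigl( \partial_t \mu_t^\leftarrow - \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\, \partial_t \pi_t^\leftarrow \Bigr) + 2 \int f\Bigl(\frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\Bigr)\, \partial_t \pi_t^\leftarrow \\
&= \int f'\Bigl(\frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\Bigr)\, \Bigl( 2 \operatorname{div}\Bigl( \mu_t^\leftarrow \nabla \ln \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow} \Bigr) - \Delta \mu_t^\leftarrow + \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\, \Delta \pi_t^\leftarrow \Bigr) - \int f\Bigl(\frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\Bigr)\, \Delta \pi_t^\leftarrow \\
&= 2 \int f'\Bigl(\frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\Bigr) \operatorname{div}\Bigl( \mu_t^\leftarrow \nabla \ln \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow} \Bigr) \\
&\qquad - \underbrace{\Bigl[ \int f'\Bigl(\frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\Bigr)\, \Bigl( \Delta \mu_t^\leftarrow - \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\, \Delta \pi_t^\leftarrow \Bigr) + \int f\Bigl(\frac{\mu_t^\leftarrow}{\pi_t^\leftarrow}\Bigr)\, \Delta \pi_t^\leftarrow \Bigr]}_{(\star)}\,.
\end{align*}

The term $(\star)$ is exactly the same kind of term we encountered in the proof of Theorem 8.3.3,
and by the same calculations it equals $-\mathcal J_f(\mu_t^\leftarrow \,\|\, \pi_t^\leftarrow)$. Therefore,
\begin{align*}
2\, \partial_t \mathcal D_f(\mu_t^\leftarrow \,\|\, \pi_t^\leftarrow)
&= -2 \int \Bigl\langle \nabla \Bigl( f' \circ \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow} \Bigr), \nabla \ln \frac{\mu_t^\leftarrow}{\pi_t^\leftarrow} \Bigr\rangle\, \mathrm d\mu_t^\leftarrow + \mathcal J_f(\mu_t^\leftarrow \,\|\, \pi_t^\leftarrow) \\
&= -\mathcal J_f(\mu_t^\leftarrow \,\|\, \pi_t^\leftarrow)\,. \qquad \square
\end{align*}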

8.4 Convergence under Log-Concavity


Next, we present a convergence proof for the proximal sampler under log-concavity,
following [Che+22a]. The proof can be compared to the 1/𝑡 convergence rate for the
Langevin diffusion under log-concavity (1.4.12), which was obtained via a Lyapunov
function argument.

Theorem 8.4.1. Assume that the target $\pi^X$ is log-concave. Then, for the law $\mu_k^X$ of the
$k$-th iterate of the proximal sampler,

\[ \mathrm{KL}(\mu_k^X \,\|\, \pi^X) \le \frac{W_2^2(\mu_0^X, \pi^X)}{kh}\,. \]

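Proof. Forwards step. Along the simultaneous heat flow, Theorem 8.3.3 shows that

\[ \partial_t\, \mathrm{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t) = -\frac{1}{2}\, \mathrm{FI}(\mu_0^X Q_t \,\|\, \pi^X Q_t) \]
so we need to lower bound the Fisher information. Also, log-concavity is preserved by
convolution (a consequence of the Prékopa–Leindler inequality), so $\pi^X Q_t$ is log-concave. Hence, by convexity of
the KL divergence to a log-concave target along Wasserstein geodesics (Theorem 1.4.5),
\[ 0 = \mathrm{KL}(\pi^X Q_t \,\|\, \pi^X Q_t) \ge \mathrm{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t) + \mathbb E_{\mu_0^X Q_t} \Bigl\langle \nabla \ln \frac{\mu_0^X Q_t}{\pi^X Q_t},\; T_{\mu_0^X Q_t \to \pi^X Q_t} - \mathrm{id} \Bigr\rangle\,. \]
Rearranging this and using the Cauchy–Schwarz inequality,
\[ \underbrace{\mathbb E_{\mu_0^X Q_t}\Bigl[ \Bigl\| \nabla \ln \frac{\mu_0^X Q_t}{\pi^X Q_t} \Bigr\|^2 \Bigr]}_{\mathrm{FI}(\mu_0^X Q_t \,\|\, \pi^X Q_t)}\; W_2^2(\mu_0^X Q_t, \pi^X Q_t) \ge \mathrm{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t)^2\,. \]

Combining this with the fact that the Wasserstein distance is decreasing along the simultaneous
heat flow,
\[ \partial_t\, \mathrm{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t) \le -\frac{1}{2}\, \frac{\mathrm{KL}(\mu_0^X Q_t \,\|\, \pi^X Q_t)^2}{W_2^2(\mu_0^X, \pi^X)}\,. \]

Solving this differential inequality,

\[ \frac{1}{\mathrm{KL}(\mu_0^Y \,\|\, \pi^Y)} = \frac{1}{\mathrm{KL}(\mu_0^X Q_h \,\|\, \pi^X Q_h)} \ge \frac{1}{\mathrm{KL}(\mu_0^X \,\|\, \pi^X)} + \frac{h}{2\, W_2^2(\mu_0^X, \pi^X)}\,. \]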

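Backwards step. Along the simultaneous backwards heat flow, Theorem 8.3.8 gives

\[ \partial_t\, \mathrm{KL}(\mu_0^Y Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow) = -\frac{1}{2}\, \mathrm{FI}(\mu_0^Y Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow)\,. \]

Since $\pi^Y Q_t^\leftarrow = \pi^X Q_{h-t}$ is log-concave and $t \mapsto W_2^2(\mu_0^Y Q_t^\leftarrow, \pi^Y Q_t^\leftarrow)$ is decreasing (which
is checked via a coupling argument using the diffusion (8.3.7)), a similar calculation as the
forwards step leads to the inequality
\[ \frac{1}{\mathrm{KL}(\mu_1^X \,\|\, \pi^X)} = \frac{1}{\mathrm{KL}(\mu_0^Y Q_h^\leftarrow \,\|\, \pi^Y Q_h^\leftarrow)} \ge \frac{1}{\mathrm{KL}(\mu_0^Y \,\|\, \pi^Y)} + \frac{h}{2\, W_2^2(\mu_0^X, \pi^X)}\,. \]
We iterate these inequalities, using the fact that $W_2(\mu_k^X, \pi^X) \le W_2(\mu_0^X, \pi^X)$ for all
$k \in \mathbb N$ (which follows from Theorem 8.2.1) to obtain
\[ \frac{1}{\mathrm{KL}(\mu_k^X \,\|\, \pi^X)} \ge \frac{1}{\mathrm{KL}(\mu_0^X \,\|\, \pi^X)} + \frac{kh}{W_2^2(\mu_0^X, \pi^X)} \]
or
\[ \mathrm{KL}(\mu_k^X \,\|\, \pi^X) \le \frac{\mathrm{KL}(\mu_0^X \,\|\, \pi^X)}{1 + kh\, \mathrm{KL}(\mu_0^X \,\|\, \pi^X) / W_2^2(\mu_0^X, \pi^X)} \le \frac{W_2^2(\mu_0^X, \pi^X)}{kh}\,. \qquad \square \]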

8.5 Convergence under Functional Inequalities


We now prove convergence guarantees for the proximal sampler when the target satisfies
either a Poincaré inequality or a log-Sobolev inequality, following [Che+22a].

Theorem 8.5.1. Suppose that the target $\pi^X$ satisfies a Poincaré inequality with constant
$C_{\mathsf{PI}}$. Then, for the law $\mu_k^X$ of the $k$-th iterate of the proximal sampler,

\[ \chi^2(\mu_k^X \,\|\, \pi^X) \le \frac{\chi^2(\mu_0^X \,\|\, \pi^X)}{(1 + h/C_{\mathsf{PI}})^{2k}}\,. \]

Proof. Forwards step. For the chi-squared divergence, we can check that the dissipation
functional is given by $\mathcal J(\mu \,\|\, \pi) = 2\, \mathbb E_\pi[\|\nabla(\mu/\pi)\|^2]$. Along the simultaneous heat flow,
by Theorem 8.3.3,
\[ \partial_t\, \chi^2(\mu_0^X Q_t \,\|\, \pi^X Q_t) = -\frac{1}{2}\, \mathcal J(\mu_0^X Q_t \,\|\, \pi^X Q_t) = -\mathbb E_{\pi^X Q_t}\Bigl[ \Bigl\| \nabla \frac{\mu_0^X Q_t}{\pi^X Q_t} \Bigr\|^2 \Bigr]\,. \]
Since $\pi^X$ satisfies a Poincaré inequality with constant $C_{\mathsf{PI}}$, by subadditivity of the Poincaré
constant under convolution (Proposition 2.3.7), $\pi^X Q_t$ satisfies a Poincaré inequality with
constant at most $C_{\mathsf{PI}} + t$. This yields

\[ \partial_t\, \chi^2(\mu_0^X Q_t \,\|\, \pi^X Q_t) \le -\frac{1}{C_{\mathsf{PI}} + t}\, \operatorname{var}_{\pi^X Q_t} \frac{\mu_0^X Q_t}{\pi^X Q_t} = -\frac{1}{C_{\mathsf{PI}} + t}\, \chi^2(\mu_0^X Q_t \,\|\, \pi^X Q_t) \]

and hence

\[ \chi^2(\mu_0^Y \,\|\, \pi^Y) = \chi^2(\mu_0^X Q_h \,\|\, \pi^X Q_h) \le \exp\Bigl( -\int_0^h \frac{1}{C_{\mathsf{PI}} + t}\, \mathrm dt \Bigr)\, \chi^2(\mu_0^X \,\|\, \pi^X) = \frac{\chi^2(\mu_0^X \,\|\, \pi^X)}{1 + h/C_{\mathsf{PI}}}\,. \]

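Backwards step. Along the simultaneous backwards heat flow, Theorem 8.3.8 yields

\[ \partial_t\, \chi^2(\mu_0^Y Q_t^\leftarrow \,\|\, \pi^Y Q_t^\leftarrow) = -\mathbb E_{\pi^Y Q_t^\leftarrow}\Bigl[ \Bigl\| \nabla \frac{\mu_0^Y Q_t^\leftarrow}{\pi^Y Q_t^\leftarrow} \Bigr\|^2 \Bigr]\,. \]

Using the fact that $\pi^Y Q_t^\leftarrow = \pi^X Q_{h-t}$ satisfies the Poincaré inequality with constant at
most $C_{\mathsf{PI}} + h - t$, we deduce similarly that

\[ \chi^2(\mu_1^X \,\|\, \pi^X) = \chi^2(\mu_0^Y Q_h^\leftarrow \,\|\, \pi^Y Q_h^\leftarrow) \le \frac{\chi^2(\mu_0^Y \,\|\, \pi^Y)}{1 + h/C_{\mathsf{PI}}}\,. \]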

Iterating this pair of inequalities yields the result. 
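For completeness, the two integrals behind the per-step factor $\frac{1}{1 + h/C_{\mathsf{PI}}}$ can be evaluated explicitly; this is a routine computation included here as a worked check.

```latex
\begin{align*}
  \text{forwards:}\quad
  & \int_0^h \frac{\mathrm dt}{C_{\mathsf{PI}} + t}
    = \ln \frac{C_{\mathsf{PI}} + h}{C_{\mathsf{PI}}}
    = \ln\Bigl(1 + \frac{h}{C_{\mathsf{PI}}}\Bigr)\,, \\
  \text{backwards:}\quad
  & \int_0^h \frac{\mathrm dt}{C_{\mathsf{PI}} + h - t}
    = \int_0^h \frac{\mathrm ds}{C_{\mathsf{PI}} + s}
    = \ln\Bigl(1 + \frac{h}{C_{\mathsf{PI}}}\Bigr)\,, \\
  \text{so in both cases}\quad
  & \exp\Bigl(-\ln\Bigl(1 + \frac{h}{C_{\mathsf{PI}}}\Bigr)\Bigr) = \frac{1}{1 + h/C_{\mathsf{PI}}}\,.
\end{align*}
```

One full iteration therefore contracts the chi-squared divergence by $\frac{1}{(1 + h/C_{\mathsf{PI}})^2}$, which gives the exponent $2k$ after $k$ iterations.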

A similar result holds for the log-Sobolev inequality; since the proof is entirely analo-
gous, we leave it as Exercise 8.5.

Theorem 8.5.2. Suppose that the target $\pi^X$ satisfies a log-Sobolev inequality with
constant $C_{\mathsf{LSI}}$. Then, for the law $\mu_k^X$ of the $k$-th iterate of the proximal sampler,

\[ \mathrm{KL}(\mu_k^X \,\|\, \pi^X) \le \frac{\mathrm{KL}(\mu_0^X \,\|\, \pi^X)}{(1 + h/C_{\mathsf{LSI}})^{2k}}\,. \]

Recall that if $\pi^X$ is $\alpha$-strongly log-concave, then it satisfies a log-Sobolev inequality
with constant $C_{\mathsf{LSI}} \le 1/\alpha$ (Theorem 1.2.28). Thus, the contraction factor of $\frac{1}{(1+\alpha h)^2}$ in KL
divergence matches the contraction factor in $W_2^2$ distance (Theorem 8.2.1). To get this
sharp result, it is necessary to utilize the backwards step.
Similarly to Theorem 2.2.15, it is also possible to obtain guarantees for Rényi divergences,
see Exercise 8.6.
Remark 8.5.3. It is a curious observation that in the $W_2$ guarantee of Theorem 8.2.1, the
contraction factor of $\frac{1}{(1+\alpha h)^2}$ occurs solely in the backwards step, whereas in Theorem 8.5.2
the forwards and backwards steps each contribute a contraction factor of $\frac{1}{1+\alpha h}$.

8.6 Applications
The original application of the proximal sampler was for sampling from certain families of
structured log-concave distributions [LST21c]. Since then, the proximal sampler has been
used to provide new guarantees for non-smooth and weakly smooth potentials [GLL22;
LC22a; LC22b]. We will restrict ourselves to applications which are more or less immediate
corollaries of our present analysis.

New guarantees for sampling from smooth potentials. When the potential 𝑉 is
𝛽-smooth, as discussed in Section 8.1, the RGO can be implemented via rejection sampling.
We obtain the following corollaries.

Corollary 8.6.1. Let $\pi^X \propto \exp(-V)$, where $V$ is $\beta$-smooth. Take $h = \frac{1}{\beta d}$ and assume we
have an oracle to $V$ which evaluates $V$ and the proximal operator for $V$. Let $\mu_N^X$ denote
the law of the $N$-th iterate of the proximal sampler, in which the RGO is implemented
via rejection sampling.

1. (Theorem 8.2.1) If in addition $V$ is $\alpha$-strongly convex with $\alpha > 0$, then writing
$\kappa := \beta/\alpha$ we obtain $\sqrt{\alpha}\, W_2(\mu_N^X, \pi^X) \le \varepsilon$ using $O\bigl(\kappa d \log \frac{W_2(\mu_0^X, \pi^X)}{\varepsilon}\bigr)$ queries to the
oracle in expectation.

2. (Theorem 8.4.1) If $V$ is convex, we obtain the guarantee $\sqrt{\mathrm{KL}(\mu_N^X \,\|\, \pi^X)} \le \varepsilon$ using
$O\bigl(\frac{\beta d\, W_2^2(\mu_0^X, \pi^X)}{\varepsilon^2}\bigr)$ queries to the oracle in expectation.

3. (Theorem 8.5.1) If $\pi^X$ satisfies a Poincaré inequality with constant $C_{\mathsf{PI}}$, we obtain the
guarantee $\sqrt{\chi^2(\mu_N^X \,\|\, \pi^X)} \le \varepsilon$ using $O\bigl(C_{\mathsf{PI}}\, \beta d \log \frac{\chi^2(\mu_0^X \,\|\, \pi^X)}{\varepsilon^2}\bigr)$ queries to the oracle
in expectation.

4. (Theorem 8.5.2) If $\pi^X$ satisfies a log-Sobolev inequality with constant $C_{\mathsf{LSI}}$, we obtain
the guarantee $\sqrt{\mathrm{KL}(\mu_N^X \,\|\, \pi^X)} \le \varepsilon$ using $O\bigl(C_{\mathsf{LSI}}\, \beta d \log \frac{\mathrm{KL}(\mu_0^X \,\|\, \pi^X)}{\varepsilon^2}\bigr)$ queries to the
oracle in expectation.

Note that for the strongly log-concave case, these results are competitive with the
state-of-the-art results for MALA under a feasible start (Theorem 7.3.4)!
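The iteration counts behind these corollaries follow directly from the convergence theorems. The sketch below computes, for the step size $h = \frac{1}{\beta d}$, the smallest number of iterations $N$ that each theorem guarantees to be sufficient; the function names and the convention of passing the initial and target values explicitly are choices made here for illustration. The total query complexity is then $N$ times the $O(1)$ expected queries per RGO call.

```python
import math


def step_size(beta, d):
    # Step size used in Corollary 8.6.1.
    return 1.0 / (beta * d)


def iters_w2(alpha, beta, d, w2_init, w2_target):
    """Smallest N with w2_init / (1 + alpha h)^N <= w2_target (Theorem 8.2.1)."""
    h = step_size(beta, d)
    n = math.log(w2_init / w2_target) / math.log1p(alpha * h)
    return max(0, math.ceil(n))


def iters_kl_log_concave(beta, d, w2_init, kl_target):
    """Smallest N with w2_init^2 / (N h) <= kl_target (Theorem 8.4.1)."""
    h = step_size(beta, d)
    return math.ceil(w2_init ** 2 / (h * kl_target))


def iters_chi2_poincare(c_pi, beta, d, chi2_init, chi2_target):
    """Smallest N with chi2_init / (1 + h / C_PI)^(2N) <= chi2_target (Theorem 8.5.1)."""
    h = step_size(beta, d)
    n = math.log(chi2_init / chi2_target) / (2.0 * math.log1p(h / c_pi))
    return max(0, math.ceil(n))
```

Since $\ln(1 + \alpha h) \approx \alpha h = \frac{1}{\kappa d}$ for small $\alpha h$, the first count scales like $\kappa d$ times a logarithm, matching item 1 of the corollary; the same approximation with $h / C_{\mathsf{PI}}$ recovers item 3.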

Improving the condition number dependence of high-accuracy samplers. The
next application we present is the original use of the proximal sampler in [LST21c]. Namely,
suppose that $\pi^X \propto \exp(-V)$ is such that $0 \prec \alpha I_d \preceq \nabla^2 V \preceq \beta I_d$ with condition number
$\kappa := \beta/\alpha$. Suppose we have a high-accuracy sampler which, given any target satisfying
these assumptions, outputs a sample from a probability measure $\mu$ with $\|\mu - \pi^X\|_{\mathrm{TV}} \le \varepsilon$
using $\widetilde O(f(\kappa)\, d^c \operatorname{polylog}(1/\varepsilon))$ queries, where $f : \mathbb R_+ \to \mathbb R_+$ is some increasing function.
Then, by combining this high-accuracy sampler with the proximal sampler, we can obtain
a new sampler whose complexity is only $\widetilde O(\kappa d^c \operatorname{polylog}(\kappa/\varepsilon))$, i.e., we have improved the
dependence on the condition number to near linear.
To see how this works, observe that if we choose the step size $h = \frac{1}{\beta}$ for the proximal
sampler, then the RGO $\pi^{X \mid Y=y}$ has condition number $O(1)$. Thus, the high-accuracy
sampler can obtain a $\delta$-approximate sample from the RGO using $\widetilde O(d^c \operatorname{polylog}(1/\delta))$
queries. On the other hand, with this choice of step size, we know from Theorem 8.5.2
that with a perfect implementation of the RGO the number of iterations required for the
proximal sampler to output $\mu_N^X$ with $\|\mu_N^X - \pi^X\|_{\mathrm{TV}} \le \varepsilon$ is $\widetilde O(\kappa \log(d/\varepsilon^2))$. To complete the
analysis, we need to analyze how the error propagates due to the imperfect implementation
of the RGO. This is handled via a coupling argument (Exercise 8.8).

Lemma 8.6.2. Let 𝜇 𝑋𝑁 denote the law of the 𝑁 -th iterate of the proximal sampler with
perfect implementation of the RGO. Suppose that instead, in each step of the proximal
sampler, we use a sample from a distribution which is 𝛿-close to the RGO in total variation
distance; let 𝜇ˆ𝑋𝑁 denote the law of the 𝑁 -th iterate of the proximal sampler with imperfect
implementation of the RGO. Then,

k 𝜇ˆ𝑋𝑁 − 𝜇 𝑋𝑁 k TV ≤ 𝑁 𝛿 .

Since $N = \widetilde O(\kappa \log(d/\varepsilon^2))$, we can take $\delta \asymp \varepsilon/N$. The total complexity of the proximal
sampler (the number of iterations $N$ of the proximal sampler multiplied by the
cost of approximately implementing the RGO with the high-accuracy sampler) is then
$\widetilde O(\kappa d^c \operatorname{polylog}(1/\varepsilon))$ as claimed.
In particular, applying this to the Metropolized random walk (MRW) algorithm (Theorem 7.3.4)
improves the complexity from $\widetilde O(\kappa^2 d \operatorname{polylog}(1/\varepsilon))$ to $\widetilde O(\kappa d \operatorname{polylog}(1/\varepsilon))$.

Zeroth-order algorithms for sampling. The example above shows that boosting
the MRW algorithm with the proximal sampler leads to an algorithm whose complexity
is competitive with that of MALA. Moreover, unlike MALA, the algorithm based on
MRW only uses zeroth-order information, which is crucial for certain applications such
as Bayesian inverse problems in which gradient information is prohibitively expensive.
Similarly, implementing the RGO using rejection sampling only uses zeroth-order
information, except possibly for computing the minimizer of the potential $V_y$.

Lack of discretization analysis. Finally, we mention that the results in Corollary 8.6.1
are state-of-the-art under the various assumptions. A key reason why the proximal
sampler yields powerful complexity guarantees is that there is no “discretization
analysis”. For example, consider sampling from a target distribution satisfying a
Poincaré inequality. Since a Poincaré inequality implies convergence in chi-squared
divergence, it is natural to perform a 𝜒 2 analysis of LMC, but this leads to substantial
new technical hurdles (see Chapter 5). Moreover, under a Poincaré inequality it becomes
non-trivial even to prove moment bounds for the LMC iterates. All of this is handled via
a careful analysis in [Che+21a], but the results there have worse dependence on 𝑑, 𝐶 PI 𝛽,
and 𝜀 −1 . In contrast, Corollary 8.6.1 bypasses all of these difficulties because the proximal
sampler reduces the task of sampling from distributions satisfying a Poincaré inequality
to the task of sampling from strongly log-concave distributions for the implementation of
the RGO, and even this is made straightforward via rejection sampling provided that we
take a small enough step size for the proximal sampler.

Bibliographical Notes
The reader is encouraged to read the original paper [LST21b] on the proximal sampler,
which contains applications to sampling from composite densities $\pi \propto \exp(-(f + g))$,
where $f$ is well-conditioned and $g$ admits an implementable RGO, as well as to sampling
from log-concave finite sums $\pi \propto \exp(-F)$ where $F := n^{-1} \sum_{i=1}^n f_i$ is well-conditioned
and the complexity is measured via the number of oracle calls to the individual functions
$(f_i)_{i \in [n]}$. The proximal sampler has also been used to sample from weakly smooth and non-
smooth potentials [LC22a; LC22b], and it has been applied to the problem of differentially
private convex optimization [GLL22].
The optimization results in Exercise 8.2 obtained in analogy with the proximal sampler
are given in [Che+22a].

Exercises
Introduction to the Proximal Sampler

D Exercise 8.1 RGO as a proximal operator on the Wasserstein space


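Given a functional $\mathcal F : \mathcal P_2(\mathbb R^d) \to \mathbb R \cup \{\infty\}$, the proximal operator for $\mathcal F$ on the Wasserstein
space is defined via
\[ \mathrm{prox}_{\mathcal F}(\mu) := \arg\min_{\mu' \in \mathcal P_2(\mathbb R^d)} \Bigl\{ \mathcal F(\mu') + \frac{1}{2}\, W_2^2(\mu, \mu') \Bigr\}\,. \]
The proximal operator was used in the seminal work [JKO98] in order to rigorously make
sense of gradient flows on the Wasserstein space. Prove that the RGO satisfies
\[ \pi^{X \mid Y=y} = \mathrm{prox}_{h\, \mathrm{KL}(\cdot \,\|\, \pi^X)}(\delta_y)\,. \]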
Hence, the assumption that we can implement the RGO is the same as assuming that we
can evaluate the proximal operator for the KL divergence on any Dirac measure.

Convergence under Strong Log-Concavity

D Exercise 8.2 comparison with optimization results


This exercise compares the results for the proximal sampler with the proximal point
method in optimization.
1. Suppose that $V$ is $\alpha$-strongly convex. Prove that $\mathrm{prox}_{hV}$ is $\frac{1}{1+\alpha h}$-Lipschitz.
Hint: Show that $\mathrm{prox}_{hV} = (\mathrm{id} + h \nabla V)^{-1}$. Argue via convex duality by considering
the convex conjugate $(\frac{\|\cdot\|^2}{2} + hV)^*$.
2. Suppose that $V$ is $\alpha$-strongly convex. Translate both of the proofs in Section 8.2 to
Euclidean optimization.
3. Suppose that $V$ satisfies the gradient domination condition
\[ \|\nabla V(x)\|^2 \ge 2\alpha\, \{V(x) - \inf V\}\,, \qquad \text{for all } x \in \mathbb R^d\,. \]
Also, let $x' := \mathrm{prox}_{hV}(x)$. Inspired by Theorem 8.5.2, we can ask whether or not it
holds that
\[ V(x') - \inf V \le \frac{1}{(1 + \alpha h)^2}\, \{V(x) - \inf V\}\,. \]
Prove that this is indeed the case.
Hint: Define $V_{t,x}(z) := V(z) + \frac{1}{2t}\, \|z - x\|^2$ and let $x_t := \arg\min V_{t,x}$; then, differentiate
$t \mapsto V_{t,x}(x_t)$.

D Exercise 8.3 first coupling lemma


Prove Lemma 8.2.2.

Simultaneous Heat Flow and Time Reversal

D Exercise 8.4 non-negativity of the dissipation functional


By the data-processing inequality, the 𝑓 -divergence to the target is always decreasing
along the Langevin diffusion and hence the functional J𝑓 defined in (8.3.2) is always
non-negative. Prove this more directly from the expression for J𝑓 .

D Exercise 8.5 convergence under LSI


Verify that the result under LSI (Theorem 8.5.2) holds.

D Exercise 8.6 convergence in Rényi divergence


In [VW19], Vempala and Wibisono showed convergence of the Langevin diffusion in Rényi
divergence under a Poincaré or log-Sobolev inequality (see Theorem 2.2.15). Similarly,
extend Theorem 8.5.1 and Theorem 8.5.2 to provide Rényi divergence guarantees.

D Exercise 8.7 Gaussian case


1. Suppose that $\pi^X = \mathrm{normal}(0, I_d)$ and $\mu_0^X = \mathrm{normal}(0, \sigma_0^2 I_d)$. Show that the iterates
of the proximal sampler all have Gaussian distributions, and explicitly compute the
variances. Use this to show that the contraction factors in Theorem 8.2.1 and
Theorem 8.5.2 are sharp.

2. Next, suppose that $\pi^X = \mathrm{normal}(0, \Sigma)$ and that $\mu_0^X = \mathrm{normal}(m_0, \Sigma_0)$. Show that
the next iterate of the proximal sampler is $\mu_1^X = \mathrm{normal}(m_1, \Sigma_1)$, where the mean
satisfies $m_1 = \mathrm{prox}_{hV}(m_0)$ and $V(x) := \frac{1}{2} \langle x, \Sigma^{-1} x \rangle$. In other words, the mean of
the iterate of the proximal sampler evolves according to the proximal point method.

Applications

D Exercise 8.8 second coupling lemma


Prove Lemma 8.6.2.
CHAPTER 9

Lower Bounds for Sampling

In order to determine if our sampling guarantees are optimal, we need to pair them with
lower bounds. However, the problem of establishing query complexity lower bounds
for sampling is challenging and the work on this topic is nascent. Here, we will give an
overview of the current progress in this direction.

9.1 A Query Complexity Result in One Dimension


In this section, we follow [Che+22b], which established a sharp query complexity result
for sampling strongly log-concave distributions in one dimension. Define the class
\[ \Pi_\kappa := \{ \pi \in \mathcal P_{\mathrm{ac}}(\mathbb R) \mid \pi \propto \exp(-V),\; 1 \le V'' \le \kappa,\; V'(0) = 0 \}\,. \]
In applications of sampling, one may first need to use an optimization algorithm to find
the minimizer of $V$ before applying the sampling algorithm. In our definition of $\Pi_\kappa$,
however, we have enforced the requirement $V'(0) = 0$ in order to cleanly separate out
the complexity of optimization (finding the minimizer of $V$) from the intrinsic complexity
of sampling. Our goal is to understand the minimum number of queries required by an
algorithm to output an approximate sample from any target 𝜋 ∈ Π𝜅 .

Theorem 9.1.1 ([Che+22b]). The query complexity of outputting a sample which is $\frac{1}{64}$
close in total variation distance to the target $\pi$, uniformly over the choice of $\pi \in \Pi_\kappa$, is
$\Theta(\log \log \kappa)$.

In what follows, we will make this theorem more precise and give a proof.

Lower bound. The lower bound will hold for any local oracle. Loosely speaking, a local
oracle accepts as an input a point 𝑥 ∈ R and outputs some information about the target 𝜋
such that if $\hat\pi$ is another possible target and $\pi \propto \hat\pi$ in some neighborhood of $x$, then the
output of the oracle is the same for both $\pi$ and $\hat\pi$. This just formalizes the idea that the
oracle only outputs information about $\pi$ “near the point $x$”. To simplify the discussion,
however, we will suppose for concreteness that we have access to a second-order oracle:
given $x \in \mathbb R$, it outputs the triple $(V(x), V'(x), V''(x))$, where we recall that $V$ is only
specified up to an additive constant. (If this is confusing, you may instead suppose that
the oracle outputs the triple $(V(x) - V(0), V'(x), V''(x))$ where $\pi = \exp(-V)$.)
The lower bound will proceed in two stages.

1. First, we reduce the sampling problem to a statistical testing problem. Namely,
we will construct a family $\pi_1, \dots, \pi_m \in \Pi_\kappa$, and suppose that $\boldsymbol i \sim \mathrm{uniform}([m])$ is
drawn randomly. The statistical testing problem is defined as follows: given query
access to $\pi_{\boldsymbol i}$ (through the oracle), guess the value of $\boldsymbol i$.
We will show that an algorithm to sample from $\pi_{\boldsymbol i}$ can be used to solve the statistical
testing problem; thus, “sampling is harder than testing”.

2. Next, we will prove a lower bound on the number of queries required to solve the
statistical testing problem: “testing is hard”. This relies on standard information-
theoretic techniques for proving minimax lower bounds for statistical problems.
The main difference between this problem and the usual statistical setting is that
rather than having i.i.d. samples from some data distribution, we instead have query
access and the algorithm is allowed to be adaptive.

Combining the two steps then yields our query complexity lower bound for sampling. We
begin with the construction of $\pi_1, \dots, \pi_m$, which is slightly tricky.
Let $m$ be the largest integer such that $\exp(-\frac{2^{2m-2}}{2\kappa}) \ge \frac{1}{2}$ (and note that $m = \Theta(\log \kappa)$).
We define two auxiliary functions
\[
\phi(x) := \begin{cases} \kappa\,, & \frac{1}{2} \le x < 1\,, \\ 1\,, & 1 \le x < 2\,, \\ \kappa\,, & 2 \le x < \frac{5}{2}\,, \\ 0\,, & \text{otherwise}\,, \end{cases}
\qquad
\psi(x) := \begin{cases} 1\,, & \frac{5}{2} \le x < 4\,, \\ \kappa\,, & 4 \le x < 5\,, \\ 0\,, & \text{otherwise}\,. \end{cases}
\]


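We define a family $(V_i)_{i \in [m]}$ of 1-strongly convex and $\kappa$-smooth potentials as follows. We
require that $V_i(0) = V_i'(0) = 0$ and that $V_i$ be an even function, so it suffices to specify $V_i''$
on $\mathbb R_+$. The second derivative is given by
\[ V_i''(x) := \mathbb 1\{x \le \kappa^{-1/2}\, 2^{i-1}\} + \phi\Bigl( \frac{x}{\kappa^{-1/2}\, 2^i} \Bigr) + \sum_{j=i}^{m-1} \psi\Bigl( \frac{x}{\kappa^{-1/2}\, 2^j} \Bigr) + \mathbb 1\{x \ge 5 \kappa^{-1/2}\, 2^{m-1}\}\,, \qquad x \ge 0\,. \]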

Observe that all of the terms in the above summation have disjoint supports. Although
the construction seems complicated, the basic idea is to make 𝑉𝑖00 oscillate between its
minimum and maximum allowable values 1 and 𝜅; see Figure 9.1 for a visual.
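The construction is easy to get wrong by an index, so it can help to code it up. The sketch below implements $\phi$, $\psi$, the choice of $m$, and $V_i''$ literally from the definitions, and includes a helper that samples random points to test whether $V_i''$ and $V_{i+1}''$ coincide outside the window $\kappa^{-1/2} 2^{i-1} \le |x| \le \frac{5}{4} \kappa^{-1/2} 2^{i+1}$. The function names and the sampling-based check are illustrative choices; note that at the single point $x = \kappa^{-1/2} 2^{i-1}$ the first indicator and $\phi$ overlap, which is immaterial since it is a set of measure zero.

```python
import math
import random


def phi(x, kappa):
    if 0.5 <= x < 1.0:
        return kappa
    if 1.0 <= x < 2.0:
        return 1.0
    if 2.0 <= x < 2.5:
        return kappa
    return 0.0


def psi(x, kappa):
    if 2.5 <= x < 4.0:
        return 1.0
    if 4.0 <= x < 5.0:
        return kappa
    return 0.0


def largest_m(kappa):
    """Largest integer m with exp(-2^(2m-2) / (2 kappa)) >= 1/2."""
    m = 1
    while math.exp(-(2 ** (2 * (m + 1) - 2)) / (2.0 * kappa)) >= 0.5:
        m += 1
    return m


def second_derivative(i, x, kappa, m):
    """V_i''(x) for 1 <= i <= m, extended evenly to negative x."""
    x = abs(x)
    s = kappa ** -0.5
    value = 1.0 if x <= s * 2 ** (i - 1) else 0.0
    value += phi(x / (s * 2 ** i), kappa)
    value += sum(psi(x / (s * 2 ** j), kappa) for j in range(i, m))
    value += 1.0 if x >= 5.0 * s * 2 ** (m - 1) else 0.0
    return value


def agree_outside_window(i, kappa, m, n_points=2000, seed=0):
    """Randomized check that V_i'' = V_{i+1}'' outside
    [kappa^(-1/2) 2^(i-1), (5/4) kappa^(-1/2) 2^(i+1)]."""
    rng = random.Random(seed)
    s = kappa ** -0.5
    lo, hi = s * 2 ** (i - 1), 1.25 * s * 2 ** (i + 1)
    x_max = 10.0 * s * 2 ** (m - 1)
    for _ in range(n_points):
        x = rng.uniform(0.0, x_max)
        if lo <= x <= hi:
            continue
        if second_derivative(i, x, kappa, m) != second_derivative(i + 1, x, kappa, m):
            return False
    return True
```

Away from the breakpoints, `second_derivative` takes only the two values $1$ and $\kappa$, reflecting the disjoint supports of the terms in the sum.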

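[Figure: plot of $V_i''(x/\sqrt{\kappa})$ against $x$, with ticks on the horizontal axis at $2^{i-1}$, $2^i$, $2^{i+1}$, $\frac{5}{4} 2^{i+1}$, $2^{i+2}$, $\frac{5}{4} 2^{i+2}$, ...]

Figure 9.1: The dashed lines correspond to $\phi$ and the dotted lines correspond to $\psi$. Here,
the horizontal axis is distorted for clarity.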

There are two key properties of this construction. First, we will show in Lemma 9.1.2
that each $\pi_i$ places a substantial amount of mass on the interval $(\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}]$. This
implies that if we can sample from $\pi_i$, it is likely that the sample will land in this interval,
which is used to reduce the sampling task to the statistical testing task. Then, we will
show in Lemma 9.1.4 that $V_i$ and $V_{i+1}$ agree exactly outside of a small interval which is
approximately located at $\kappa^{-1/2}\, 2^i$. This implies that for any given value $x \in \mathbb R$, there are
only $O(1)$ possible values of $(V_i(x), V_i'(x), V_i''(x))$ as $i$ ranges in $[m]$, which in turn will
be used to show that the oracle is not very informative (and hence prove a lower bound
for the statistical testing task).
The intuition behind the following lemma is that at $\kappa^{-1/2}\, 2^{i-1}$, $V_i'' = \kappa$ for the first time
and so the density $\pi_i$ drops off rapidly after this point.
240 CHAPTER 9. LOWER BOUNDS FOR SAMPLING

Lemma 9.1.2. For each $i \in [m]$,
\[
\pi_i\bigl( (\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}] \bigr) \ge \frac{1}{32} .
\]

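Proof. According to the definition of $\pi_i$, we have
\[
\pi_i\bigl( (\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}] \bigr) = \frac{1}{Z_{\pi_i}} \int_{\kappa^{-1/2} 2^{i-2}}^{\kappa^{-1/2} 2^{i-1}} \exp(-x^2/2) \, \mathrm{d}x , \qquad Z_{\pi_i} := \int \exp(-V_i) .
\]
Recalling that $m$ is chosen so that $\exp(-x^2/2) \ge 1/2$ whenever $|x| \le \kappa^{-1/2}\, 2^{m-1}$,
\[
\int_{\kappa^{-1/2} 2^{i-2}}^{\kappa^{-1/2} 2^{i-1}} \exp\Bigl(-\frac{x^2}{2}\Bigr) \, \mathrm{d}x \ge \frac{1}{2} \, \kappa^{-1/2}\, 2^{i-2} .
\]
For the normalizing constant, observe that
\[
\int_0^\infty \exp(-V_i) = \int_0^{\kappa^{-1/2} 2^{i}} \exp(-V_i) + \int_{\kappa^{-1/2} 2^{i}}^\infty \exp(-V_i) \le \kappa^{-1/2}\, 2^{i} + \int_{\kappa^{-1/2} 2^{i}}^\infty \exp(-V_i) .
\]
Since $V_i'' = \kappa$ on $[\kappa^{-1/2}\, 2^{i-1}, \kappa^{-1/2}\, 2^{i}]$, it follows that $V_i'(\kappa^{-1/2}\, 2^{i}) \ge \kappa^{1/2}\, 2^{i-1}$, and so
\[
V_i(x) \ge \kappa^{1/2}\, 2^{i-1} \, (x - \kappa^{-1/2}\, 2^{i}) + \frac{(x - \kappa^{-1/2}\, 2^{i})^2}{2} , \qquad x \ge \kappa^{-1/2}\, 2^{i} .
\]
Therefore,
\[
\int_{\kappa^{-1/2} 2^{i}}^\infty \exp(-V_i) \le \int_{\kappa^{-1/2} 2^{i}}^\infty \exp\Bigl( -\kappa^{1/2}\, 2^{i-1} \, (x - \kappa^{-1/2}\, 2^{i}) - \frac{(x - \kappa^{-1/2}\, 2^{i})^2}{2} \Bigr) \, \mathrm{d}x \le \frac{1}{\kappa^{1/2}\, 2^{i-1}} \le \frac{1}{\sqrt{\kappa}} ,
\]
where we applied a standard tail estimate for Gaussian densities (Lemma 9.1.3). Since $V_i$ is even, $Z_{\pi_i} \le 2\kappa^{-1/2} \, (2^{i} + 1)$. Then,
\[
\pi_i\bigl( (\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}] \bigr) \ge \frac{2^{i-3}}{2 \, (2^{i} + 1)} \ge \frac{1}{32} ,
\]
which proves the result. □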
In the above proof, we used the following lemma (see Exercise 9.1).

Lemma 9.1.3. Let $a, x_0 > 0$. Then,
\[
\int_{x_0}^\infty \exp\Bigl( -a \, (x - x_0) - \frac{1}{2} \, (x - x_0)^2 \Bigr) \, \mathrm{d}x \le \frac{1}{a} .
\]

The next lemma is the main reason why we used an oscillating construction for $V_i''$.

Lemma 9.1.4. We have the equalities
\[
V_i = V_{i+1} , \qquad V_i' = V_{i+1}' , \qquad V_i'' = V_{i+1}'' ,
\]
outside of the set $\{x \in \mathbb{R} : \kappa^{-1/2}\, 2^{i-1} \le |x| \le \frac{5}{4} \kappa^{-1/2}\, 2^{i+1}\}$.

[Plot of the two second derivatives against $x$; tick marks at $2^{i-1}$, $2^{i}$, $2^{i+1}$, $\frac{5}{4} 2^{i+1}$, $2^{i+2}$, $\frac{5}{4} 2^{i+2}$.]

Figure 9.2: We plot $V_i''(x)$ (in blue) and $V_{i+1}''(x)$ (in orange). In this figure, we do not distort the horizontal axis lengths to make it easier to visually compare the relative lengths of intervals on which the second derivatives are constant.

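Proof. Refer to Figure 9.2 for a visual aid for the proof.
Clearly the potentials and derivatives match when $|x| \le \kappa^{-1/2}\, 2^{i-1}$. Since the second derivatives match when $|x| \ge \frac{5}{4} \kappa^{-1/2}\, 2^{i+1}$, it suffices to show that
\[
V_i'\bigl( \tfrac{5}{4} \kappa^{-1/2}\, 2^{i+1} \bigr) = V_{i+1}'\bigl( \tfrac{5}{4} \kappa^{-1/2}\, 2^{i+1} \bigr) \qquad\text{and}\qquad V_i\bigl( \tfrac{5}{4} \kappa^{-1/2}\, 2^{i+1} \bigr) = V_{i+1}\bigl( \tfrac{5}{4} \kappa^{-1/2}\, 2^{i+1} \bigr) .
\]
To that end, note that for $x \ge 0$,
\begin{align*}
V_{i+1}''(x) - V_i''(x)
&= 1\{\kappa^{-1/2}\, 2^{i-1} < x \le \kappa^{-1/2}\, 2^{i}\} - \phi\Bigl(\frac{x}{\kappa^{-1/2}\, 2^{i}}\Bigr) + \phi\Bigl(\frac{x}{\kappa^{-1/2}\, 2^{i+1}}\Bigr) - \psi\Bigl(\frac{x}{\kappa^{-1/2}\, 2^{i}}\Bigr) \\
&= \begin{cases}
-(\kappa - 1) , & \kappa^{-1/2}\, 2^{i-1} \le x \le \kappa^{-1/2}\, 2^{i} , \\
+(\kappa - 1) , & \kappa^{-1/2}\, 2^{i} \le x \le \kappa^{-1/2}\, 2^{i+1} , \\
-(\kappa - 1) , & \kappa^{-1/2}\, 2^{i+1} \le x \le \frac{5}{4} \kappa^{-1/2}\, 2^{i+1} , \\
0 , & \text{otherwise} .
\end{cases}
\end{align*}
A little algebra shows that the above expression integrates to zero, hence we deduce the equality $V_i'(\frac{5}{4} \kappa^{-1/2}\, 2^{i+1}) = V_{i+1}'(\frac{5}{4} \kappa^{-1/2}\, 2^{i+1})$. Also, by integrating this expression twice,
\begin{align*}
V_{i+1}\bigl( \tfrac{5}{4} \kappa^{-1/2}\, 2^{i+1} \bigr) - V_i\bigl( \tfrac{5}{4} \kappa^{-1/2}\, 2^{i+1} \bigr)
&= \underbrace{-\frac{\kappa - 1}{2} \, (\kappa^{-1/2}\, 2^{i-1})^2}_{\text{integral on } [\kappa^{-1/2} 2^{i-1}, \, \kappa^{-1/2} 2^{i}]} \\
&\qquad \underbrace{{} - (\kappa - 1) \, \kappa^{-1/2}\, 2^{i-1} \, \kappa^{-1/2}\, 2^{i} + \frac{\kappa - 1}{2} \, (\kappa^{-1/2}\, 2^{i})^2}_{\text{integral on } [\kappa^{-1/2} 2^{i}, \, \kappa^{-1/2} 2^{i+1}]} \\
&\qquad \underbrace{{} + (\kappa - 1) \, \kappa^{-1/2}\, 2^{i-1} \, \tfrac{1}{4} \kappa^{-1/2}\, 2^{i+1} - \frac{\kappa - 1}{2} \, \bigl( \tfrac{1}{4} \kappa^{-1/2}\, 2^{i+1} \bigr)^2}_{\text{integral on } [\kappa^{-1/2} 2^{i+1}, \, \frac{5}{4} \kappa^{-1/2} 2^{i+1}]} \\
&= \frac{\kappa - 1}{\kappa} \, \{ -2^{2i-3} - 2^{2i-1} + 2^{2i-1} + 2^{2i-2} - 2^{2i-3} \} \\
&= 0 . \qquad □
\end{align*}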

We need one final ingredient: Fano’s inequality, which is the standard tool for
establishing information-theoretic lower bounds.

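Theorem 9.1.5 (Fano’s inequality). Let $\boldsymbol{i} \sim \mathrm{uniform}([m])$. Then, for any estimator $\hat{\boldsymbol{i}}$ of $\boldsymbol{i}$, where $\hat{\boldsymbol{i}}$ is measurable with respect to some data $Y$,
\[
\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \ge 1 - \frac{\mathsf{I}(\boldsymbol{i}; Y) + \ln 2}{\ln m} ,
\]
where $\mathsf{I}$ is the mutual information $\mathsf{I}(\boldsymbol{i}; Y) := \mathsf{KL}(\mathrm{law}(\boldsymbol{i}, Y) \,\|\, \mathrm{law}(\boldsymbol{i}) \otimes \mathrm{law}(Y))$.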

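Proof. Let $\mathsf{H}(\cdot)$ denote the entropy of a discrete random variable, i.e., if $X$ has law $p$ on a discrete alphabet $\mathcal{X}$, then $\mathsf{H}(X) = \sum_{x \in \mathcal{X}} p(x) \ln(1/p(x))$. We refer to [CT06, Chapter 2] for the basic properties of entropy (and related quantities).
Let $E := 1\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\}$ denote the indicator of an error. Using the chain rule for entropy in two different ways,
\begin{align*}
\mathsf{H}(\boldsymbol{i}, E \mid \hat{\boldsymbol{i}})
&= \mathsf{H}(\boldsymbol{i} \mid \hat{\boldsymbol{i}}) + \underbrace{\mathsf{H}(E \mid \boldsymbol{i}, \hat{\boldsymbol{i}})}_{= 0} \\
&= \mathsf{H}(E \mid \hat{\boldsymbol{i}}) + \mathsf{H}(\boldsymbol{i} \mid E, \hat{\boldsymbol{i}}) .
\end{align*}
Since conditioning reduces entropy, $\mathsf{H}(E \mid \hat{\boldsymbol{i}}) \le \mathsf{H}(E) \le \ln 2$. Also,
\[
\mathsf{H}(\boldsymbol{i} \mid E, \hat{\boldsymbol{i}}) = \mathbb{P}\{\hat{\boldsymbol{i}} = \boldsymbol{i}\} \, \underbrace{\mathsf{H}(\boldsymbol{i} \mid \hat{\boldsymbol{i}}, E = 0)}_{= 0} + \mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \, \mathsf{H}(\boldsymbol{i} \mid \hat{\boldsymbol{i}}, E = 1) \le \mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \ln m .
\]
Hence,
\[
\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \ln m + \ln 2 \ge \mathsf{H}(\boldsymbol{i} \mid \hat{\boldsymbol{i}}) = \mathsf{H}(\boldsymbol{i}) - \mathsf{I}(\boldsymbol{i}; \hat{\boldsymbol{i}}) \ge \ln m - \mathsf{I}(\boldsymbol{i}; Y) ,
\]
where the last inequality is the data-processing inequality. Rearranging the inequality completes the proof of Fano’s inequality. □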
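Proof of Theorem 9.1.1, lower bound. We follow the general outline described above.
1. Reduction to statistical testing. Let $\boldsymbol{i} \sim \mathrm{uniform}([m])$ and suppose that for each $i \in [m]$, $\hat{\pi}_i$ is a distribution with $\|\hat{\pi}_i - \pi_i\|_{\mathrm{TV}} \le \frac{1}{64}$. Suppose that we have a sample $X \sim \hat{\pi}_{\boldsymbol{i}}$ (more precisely, this means that conditioned on $\boldsymbol{i} = i$, we have $X \sim \hat{\pi}_i$). In light of Lemma 9.1.2, a good candidate estimator $\hat{\boldsymbol{i}}$ for $\boldsymbol{i}$ is
\[
\hat{\boldsymbol{i}} := i \in \mathbb{N} \text{ such that } X \in (\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}] \text{ if such an } i \text{ exists} .
\]
The probability that the estimator is correct satisfies
\begin{align*}
\mathbb{P}\{\hat{\boldsymbol{i}} = \boldsymbol{i}\}
&= \frac{1}{m} \sum_{i=1}^m \mathbb{P}\{\hat{\boldsymbol{i}} = i \mid \boldsymbol{i} = i\}
= \frac{1}{m} \sum_{i=1}^m \mathbb{P}\bigl\{ X \in (\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}] \bigm| \boldsymbol{i} = i \bigr\} \\
&= \frac{1}{m} \sum_{i=1}^m \hat{\pi}_i\bigl( (\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}] \bigr)
\ge \frac{1}{m} \sum_{i=1}^m \Bigl\{ \pi_i\bigl( (\kappa^{-1/2}\, 2^{i-2}, \kappa^{-1/2}\, 2^{i-1}] \bigr) - \frac{1}{64} \Bigr\}
\ge \frac{1}{64} . \tag{9.1.6}
\end{align*}

Hence, a sampling algorithm can be used to solve the statistical testing problem.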


2. A lower bound for the statistical testing problem. Next, we want to show that for any algorithm which uses $n$ queries to the oracle for $\pi_{\boldsymbol{i}}$ and outputs an estimator $\hat{\boldsymbol{i}}$ of $\boldsymbol{i}$, there is a lower bound for the probability of error $\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\}$.

First, suppose that the algorithm is deterministic, i.e., we assume that each query point $x_j$ of the algorithm is a deterministic function of the previous query points and query values. Let $\mathcal{O}_i(x) := (V_i(x), V_i'(x), V_i''(x))$ denote the output of the oracle on input $x$ when the target is $\pi_i$. Since the estimator $\hat{\boldsymbol{i}}$ is a function of $\{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j \in [n]}$, Fano’s inequality (Theorem 9.1.5) yields
\[
\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \ge 1 - \frac{\mathsf{I}(\boldsymbol{i}; \{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j \in [n]}) + \ln 2}{\ln m} .
\]
By the chain rule for mutual information,
\[
\mathsf{I}\bigl( \boldsymbol{i}; \{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j \in [n]} \bigr) = \sum_{j=1}^n \mathsf{I}\bigl( \boldsymbol{i}; x_j, \mathcal{O}_{\boldsymbol{i}}(x_j) \bigm| \{x_{j'}, \mathcal{O}_{\boldsymbol{i}}(x_{j'})\}_{j' \in [j-1]} \bigr) . \tag{9.1.7}
\]
By our assumption, conditioned on $\{x_{j'}, \mathcal{O}_{\boldsymbol{i}}(x_{j'})\}_{j' \in [j-1]}$, the query point $x_j$ is deterministic. Also, for a fixed point $x_j$, Lemma 9.1.4 implies that $\mathcal{O}_i(x_j)$ can only take on a constant number of possible values as $i$ ranges over $[m]$ (the careful reader can check that the number of possible values for $\mathcal{O}_i(x_j)$ is at most 5). Since a random variable taking at most 5 values has entropy at most $\ln 5$, each term in (9.1.7) is at most $\ln 5$, and so
\[
\mathsf{I}\bigl( \boldsymbol{i}; \{x_j, \mathcal{O}_{\boldsymbol{i}}(x_j)\}_{j \in [n]} \bigr) \le n \ln 5 .
\]
Fano’s inequality then yields
\[
\mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \ge 1 - \frac{n \ln 5 + \ln 2}{\ln m} . \tag{9.1.8}
\]
In general, for a possibly randomized algorithm, we can still deduce (9.1.8) by applying the previous argument conditioned on the random seed of the algorithm (which is independent of $\boldsymbol{i}$).
3. Finishing the argument. By combining together (9.1.6) and (9.1.8), and recalling that $m = \Theta(\log \kappa)$, we have shown that $n \gtrsim \log \log \kappa$. □
The argument above makes rigorous the following intuition: since there are $m$ distributions in our lower bound construction, there are $\log_2 m$ bits of information to learn. On the other hand, Lemma 9.1.4 implies that each oracle query only reveals $O(1)$ bits of information. Hence, the number of queries required is at least $\Omega(\log m) = \Omega(\log \log \kappa)$.
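For concreteness, the final step can be spelled out with explicit constants. The following rearrangement is a routine consequence of (9.1.6) and (9.1.8) rather than a statement taken from the text: if a sampler with total variation accuracy $\frac{1}{64}$ uses $n$ queries, then the estimator built from its output errs with probability at most $\frac{63}{64}$, so

\[
\frac{63}{64} \;\ge\; \mathbb{P}\{\hat{\boldsymbol{i}} \ne \boldsymbol{i}\} \;\ge\; 1 - \frac{n \ln 5 + \ln 2}{\ln m}
\qquad\Longrightarrow\qquad
n \;\ge\; \frac{\frac{1}{64} \ln m - \ln 2}{\ln 5} .
\]

Since $m = \Theta(\log \kappa)$, the right-hand side is of order $\log \log \kappa$ once $\kappa$ is sufficiently large.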

Upper bound. To show that the lower bound is tight, we exhibit an algorithm, based on rejection sampling, which matches the lower bound. As per our discussion in Section 7.1, to implement rejection sampling we must specify the construction of an upper envelope $\tilde{\mu} \ge \tilde{\pi}$, where $\pi \in \Pi_\kappa$.

Without loss of generality, we assume that 𝑉 (0) = 0 (if not, replace the output 𝑉 (𝑥) of
an oracle query with 𝑉 (𝑥)−𝑉 (0)). The upper bound algorithm only requires a zeroth-order
oracle, and it is as follows.

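1. Find the first index $i_- \in \{0, 1, \dots, \lceil \frac{1}{2} \log_2 \kappa \rceil\}$ such that $V(-2^{i_-}/\sqrt{\kappa}) \ge \frac{1}{2}$.

2. Find the first index $i_+ \in \{0, 1, \dots, \lceil \frac{1}{2} \log_2 \kappa \rceil\}$ such that $V(+2^{i_+}/\sqrt{\kappa}) \ge \frac{1}{2}$.

3. Set $x_- := -2^{i_-}/\sqrt{\kappa}$ and $x_+ := +2^{i_+}/\sqrt{\kappa}$; then, set
\[
\tilde{\mu}(x) := \begin{cases}
\exp\Bigl( -\dfrac{x - x_-}{2x_-} - \dfrac{(x - x_-)^2}{2} \Bigr) , & x \le x_- , \\[1ex]
1 , & x_- \le x \le x_+ , \\[1ex]
\exp\Bigl( -\dfrac{x - x_+}{2x_+} - \dfrac{(x - x_+)^2}{2} \Bigr) , & x \ge x_+ .
\end{cases}
\]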

To see why $i_-$ and $i_+$ exist, from $V'' \ge 1$ and $V(0) = V'(0) = 0$ we have $V(x) \ge x^2/2$. Hence, if $|x| = 2^{i}/\sqrt{\kappa}$ where $i \ge \frac{1}{2} \log_2 \kappa$, then $|x| \ge 1$ and we have $V(x) \ge 1/2$.

Since $V$ is decreasing (resp. increasing) on $\mathbb{R}_-$ (resp. $\mathbb{R}_+$), the first two steps can be implemented by running binary search over arrays of size $O(\log \kappa)$, which therefore only requires $O(\log \log \kappa)$ queries. We will prove that $\tilde{\mu}$ is a valid upper envelope for the unnormalized target $\tilde{\pi} := \exp(-V)$, and that $Z_\mu / Z_\pi \lesssim 1$. In turn, Theorem 7.1.1 shows that once $\tilde{\mu}$ is constructed, an exact sample can be drawn from $\pi$ using $O(1)$ additional queries in expectation.
Alternatively, if we require that the algorithm use a fixed (non-random) number of iterations, then note that in order to make the failure probability (the probability that rejection sampling fails to terminate within the allotted number of iterations) at most $\varepsilon$, it suffices to run rejection sampling for $O(\log(1/\varepsilon))$ steps. Combining this with the cost of constructing $\tilde{\mu}$, we conclude that we can output a sample whose law is $\varepsilon$-close to $\pi$ in total variation distance using $O(\log \log \kappa + \log(1/\varepsilon))$ queries.
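The three steps and the rejection loop can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the author's implementation: the function names (`first_index`, `build_envelope`, `envelope`, `tail_mass`, `sample_tail`, `rejection_sample`) are hypothetical; the potential is assumed to satisfy 𝑉(0) = 0 with minimizer at 0; the tails of the envelope are sampled with an exponential proposal, which is a choice made here for simplicity; and `tail_mass` uses a closed form that can overflow for very large 𝜅, so the sketch is only meant for moderate condition numbers.

```python
import math
import random

def first_index(V, kappa, sign):
    """Binary search for the first i in {0, ..., K} with V(sign * 2^i / sqrt(kappa)) >= 1/2."""
    lo, hi = 0, math.ceil(0.5 * math.log2(kappa))  # the predicate holds at hi by strong convexity
    while lo < hi:
        mid = (lo + hi) // 2
        if V(sign * 2 ** mid / math.sqrt(kappa)) >= 0.5:
            hi = mid
        else:
            lo = mid + 1
    return lo

def build_envelope(V, kappa):
    """Steps 1-3: return the breakpoints (x_minus, x_plus) of the envelope."""
    x_minus = -(2 ** first_index(V, kappa, -1)) / math.sqrt(kappa)
    x_plus = 2 ** first_index(V, kappa, +1) / math.sqrt(kappa)
    return x_minus, x_plus

def envelope(x, x_minus, x_plus):
    """Unnormalized envelope: flat in the middle, linear-plus-quadratic decay in the tails."""
    if x < x_minus:
        t = x_minus - x
        return math.exp(-t / (2 * abs(x_minus)) - t * t / 2)
    if x > x_plus:
        t = x - x_plus
        return math.exp(-t / (2 * x_plus) - t * t / 2)
    return 1.0

def tail_mass(a):
    """Integral over t >= 0 of exp(-a t - t^2 / 2); may overflow for very large a."""
    return math.sqrt(math.pi / 2) * math.exp(a * a / 2) * math.erfc(a / math.sqrt(2))

def sample_tail(a, rng):
    """Draw t >= 0 with density proportional to exp(-a t - t^2 / 2)."""
    while True:
        t = rng.expovariate(a)                  # proposal: Exp(a)
        if rng.random() < math.exp(-t * t / 2):  # accept with probability exp(-t^2 / 2)
            return t

def rejection_sample(V, x_minus, x_plus, rng):
    """Draw one exact sample from the density proportional to exp(-V)."""
    a_minus, a_plus = 1 / (2 * abs(x_minus)), 1 / (2 * x_plus)
    masses = [tail_mass(a_minus), x_plus - x_minus, tail_mass(a_plus)]
    while True:
        piece = rng.choices([0, 1, 2], weights=masses)[0]
        if piece == 0:
            x = x_minus - sample_tail(a_minus, rng)
        elif piece == 1:
            x = rng.uniform(x_minus, x_plus)
        else:
            x = x_plus + sample_tail(a_plus, rng)
        if rng.random() * envelope(x, x_minus, x_plus) <= math.exp(-V(x)):
            return x

kappa = 100.0
V = lambda x: 0.5 * x * x * (kappa if x > 0 else 1.0)  # curvature kappa on the right, 1 on the left
x_minus, x_plus = build_envelope(V, kappa)
rng = random.Random(0)
print(x_minus, x_plus, rejection_sample(V, x_minus, x_plus, rng))
```

Each call to `first_index` makes one query to 𝑉 per binary search iteration, so the envelope is built with a number of queries logarithmic in the number of candidate indices; the rejection loop then uses one query per proposal.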
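Proof of Theorem 9.1.1, upper bound. First, we prove that $\tilde{\mu}$ is a valid upper envelope. Since $\tilde{\pi}$ is decreasing on $\mathbb{R}_+$ with $\tilde{\pi}(0) = \exp(-V(0)) = 1$, then $\tilde{\pi} \le 1 \le \tilde{\mu}$ on $[0, x_+]$. Next, since $V(x_+) \ge 1/2$ (by the definition of $x_+$), convexity of $V$ yields
\[
V'(x_+) \ge \frac{V(x_+) - V(0)}{x_+} \ge \frac{1}{2x_+} .
\]
Thus, for $x \ge x_+$,
\[
V(x) \ge V(x_+) + V'(x_+) \, (x - x_+) + \frac{1}{2} \, (x - x_+)^2 \ge \frac{1}{2x_+} \, (x - x_+) + \frac{1}{2} \, (x - x_+)^2 ,
\]
which shows that $\tilde{\pi}(x) \le \tilde{\mu}(x)$. By a symmetric argument on $\mathbb{R}_-$, we conclude that $\tilde{\pi} \le \tilde{\mu}$.
By Theorem 7.1.1, it suffices to bound $Z_\mu / Z_\pi$. First, we claim that $\int_0^{x_+} \tilde{\pi} \gtrsim x_+$. When $i_+ = 0$, this holds because $x_+ = 1/\sqrt{\kappa}$ and $V(x) \le \kappa x^2/2$ by smoothness:
\[
\int_0^{x_+} \tilde{\pi} = \int_0^{1/\sqrt{\kappa}} \exp(-V) \ge \int_0^{1/\sqrt{\kappa}} \exp\Bigl( -\frac{\kappa x^2}{2} \Bigr) \, \mathrm{d}x \ge \frac{1}{3\sqrt{\kappa}} = \frac{x_+}{3} .
\]
When $i_+ > 0$, then by the definition of $i_+$ we have $V(x_+/2) \le 1/2$, so
\[
\int_0^{x_+} \tilde{\pi} \ge \int_0^{x_+/2} \exp(-V) \ge \frac{x_+}{4} .
\]
On the other hand, by Lemma 9.1.3,
\[
\int_{\mathbb{R}_+} \tilde{\mu} = \int_0^{x_+} \tilde{\mu} + \int_{x_+}^\infty \tilde{\mu} \le x_+ + \int_{x_+}^\infty \exp\Bigl( -\frac{1}{2x_+} \, (x - x_+) - \frac{1}{2} \, (x - x_+)^2 \Bigr) \, \mathrm{d}x \le 3x_+ .
\]
Hence, $\int_{\mathbb{R}_+} \tilde{\mu} \le 3x_+ \le 12 \int_{\mathbb{R}_+} \tilde{\pi}$, and similarly $\int_{\mathbb{R}_-} \tilde{\mu} \le 12 \int_{\mathbb{R}_-} \tilde{\pi}$. Therefore, $Z_\mu / Z_\pi \le 12$. □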
Discussion. Although this query complexity result only pertains to one-dimensional targets, there are still some useful takeaways. For instance, the lower bound proof shows that information-theoretic arguments can indeed be adapted to the context of sampling, and it may serve as a template for further results in this direction.
The obtained complexity Θ(log log 𝜅) is surprisingly small; in particular, the upper
bound uses a tailor-made algorithm based on rejection sampling, rather than any of the
other existing algorithms (such as those based on Langevin dynamics). This is perhaps the
best case scenario for a lower bound: it helps us to determine if our existing algorithms
are optimal, and if not, it gives guidance on how to design a better one. On the other hand,
the specific complexity is likely due to the one-dimensional structure; in high dimension,
it is conjectured that the dependence on the condition number is polynomial.

9.2 Other Approaches


In this section, we discuss alternative approaches and partial progress towards obtaining
lower bounds for sampling.

Lower bounds for particular algorithms. As already discussed in Section 7.3, the
works [Che+21b; LST21a; WSC21] obtain lower bounds for the complexity of MALA,

culminating in a precise understanding of the runtime of MALA both from feasible and
warm start initializations.
The paper [CLW21] provides an approach to proving lower bounds for discretization
schemes. In their setup, there is a stochastic process (𝑍𝑡 )𝑡 ≥0 , driven by some underlying
Brownian motion (𝐵𝑡 )𝑡 ≥0 ; for example, the process (𝑍𝑡 )𝑡 ≥0 could be the Langevin diffusion
or the underdamped Langevin diffusion (Section 6.3). The algorithm is allowed to make
queries to the potential 𝑉 , as well as certain queries to the driving Brownian motion
$(B_t)_{t \ge 0}$, and the goal of the algorithm is to output a point $\hat{Z}_T$ which is close to $Z_T$ in mean squared error: $\mathbb{E}[\|Z_T - \hat{Z}_T\|^2] \le \varepsilon^2$. Within this framework, they prove that the
randomized midpoint discretization (introduced in Section 6.1) is optimal for simulating
the underdamped Langevin dynamics (see Theorem 6.3.8 for the upper bound).

Estimating the normalizing constant. In [RV08], the authors consider the number
of membership queries needed to estimate the volume of a convex body 𝐾 ⊆ R𝑑 such that
$B(0, 1) \subseteq K \subseteq B(0, O(d^8))$ to within a small multiplicative constant; their lower bound for this problem is $\tilde{\Omega}(d^2)$. In comparison, the state-of-the-art upper bound for volume computation is $\tilde{O}(d^3)$ (see [CV18; Jia+21]).
In [GLL20], the authors consider the problem of estimating the normalizing constant $Z_\pi := \int \tilde{\pi}$ from queries to the unnormalized density $\tilde{\pi}$. Based on a multilevel Monte Carlo scheme, they show that sampling algorithms can be turned into approximation algorithms for the normalizing constant, with the cost of an extra $O(d)$ dimension dependence in the reduction. By combining this with the randomized midpoint discretization of the underdamped Langevin diffusion (Theorem 6.3.8), they show that a $1 \pm \varepsilon$ multiplicative approximation to $Z_\pi$ can be obtained using $\tilde{O}((d^{4/3} \kappa + d^{7/6} \kappa^{7/6})/\varepsilon^2)$ queries (in the strongly log-concave case).
They then prove that $\Omega(d^{1-o(1)}/\varepsilon^{2-o(1)})$ queries are necessary to obtain a $1 \pm \varepsilon$ multiplicative approximation to $Z_\pi$. Unfortunately, due to the $O(d)$ loss in the reduction from estimating the normalizing constant to sampling, this does not imply a non-trivial lower bound for the task of sampling.

Lower bound for a stochastic oracle. In [CBL22], the authors obtain a lower bound
on the complexity of sampling using a stochastic oracle. Namely, in order to output an
𝜀-approximate sample (in TV distance) from an 𝛼-strongly log-concave and 𝛽-log-smooth
distribution whose mean lies in the ball $B(0, 1/\alpha)$, with an oracle that given $x \in \mathbb{R}^d$ outputs $\nabla V(x) + \xi$ with $\xi \sim \mathrm{normal}(0, \Sigma)$ and $\operatorname{tr} \Sigma \le \sigma^2 d$, the number of queries required is at least $\Omega(\sigma^2 d/\varepsilon^2)$. On the other hand, when $\alpha, \sigma \asymp 1$, this complexity is achieved via stochastic gradient Langevin Monte Carlo.

Bibliographical Notes
Since the theory of lower bounds for sampling is still early in its development, there are
not too many works yet in this direction. Section 9.2 contains a brief survey.
Recently, the paper [GLL22] obtains a query complexity bound for the following class
of target distributions:

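\[
\Pi_{\alpha,L} := \Bigl\{ \pi \in \mathcal{P}_{\mathrm{ac}}(B(0, 1)) \Bigm| \pi \propto \exp\Bigl( -\sum_{i=1}^{n} f_i - \frac{\alpha}{2} \, \|\cdot\|^2 \Bigr) , \; f_i : B(0, 1) \to \mathbb{R} \text{ is } L\text{-Lipschitz} \Bigr\} .
\]

They show that the minimum number of queries to the individual functions $(f_i)_{i=1}^{n}$ required to obtain a sample which is $\varepsilon$-close to a target $\pi \in \Pi_{\alpha,L}$ is, in the regime 𝑑  $L^2/\alpha$, of the order $\tilde{\Theta}(L^2/\alpha)$. The upper bound is based on the proximal sampler (Section 8.1), whereas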
the lower bound, which in this context reduces sampling to an optimization task, relies
on information-theoretic arguments.

Exercises
A Query Complexity Result in One Dimension

Exercise 9.1 (Gaussian tail bound)

For $x > 0$, show that $\int_x^\infty \exp(-t^2/2) \, \mathrm{d}t \le x^{-1} \exp(-x^2/2)$. Use this to prove Lemma 9.1.3.
CHAPTER 10

Structured Sampling

So far, we have only considered sampling within the black-box model, in which we only
have access to oracle queries to the potential and its gradient. We will now consider
several new sampling algorithms which go beyond the black-box model.

10.1 Coordinate Langevin


10.2 Mirror Langevin
The mirror descent method in optimization changes the geometry of the algorithm via the use of a mirror map $\phi : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$. Here, $\phi$ is a convex function and we denote $\mathcal{X} := \operatorname{int} \operatorname{dom} \phi$; we assume that $\mathcal{X} \ne \emptyset$, that $\phi$ is strictly convex and differentiable on $\mathcal{X}$, and that $\phi$ is a barrier for $\mathcal{X}$ in the sense that $\|\nabla\phi(x_k)\| \to \infty$ whenever $(x_k)_{k \in \mathbb{N}} \subseteq \mathcal{X}$ converges to a point on $\partial\mathcal{X}$. Then, rather than following the gradient descent iteration
\[
x_{k+1} := x_k - h \, \nabla V(x_k) , \qquad k = 0, 1, 2, \dots \tag{10.2.1}
\]
we can instead consider the mirror descent iteration
\[
\nabla\phi(x_{k+1}) := \nabla\phi(x_k) - h \, \nabla V(x_k) , \qquad k = 0, 1, 2, \dots \tag{10.2.2}
\]
The assumptions on the mirror map $\phi$ ensure that the iteration (10.2.2) is well-defined. When $\phi = \frac{\|\cdot\|^2}{2}$, then mirror descent coincides with gradient descent.
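As a small illustration of (10.2.2), here is a sketch of mirror descent on the positive orthant. The choice of mirror map is an assumption made for this example only: the entropic map 𝜙(𝑥) = Σ (𝑥ᵢ ln 𝑥ᵢ − 𝑥ᵢ), for which ∇𝜙 is the coordinatewise logarithm and its inverse is the coordinatewise exponential. The function name `mirror_descent` and the test objective are likewise hypothetical.

```python
import math

def mirror_descent(grad_V, x0, h, steps):
    """Mirror descent with the entropic mirror map on the positive orthant.

    Each step applies grad phi = log, takes a gradient step in the dual space,
    and maps back with the inverse of grad phi, which is exp.
    """
    x = list(x0)
    for _ in range(steps):
        g = grad_V(x)
        x = [math.exp(math.log(xi) - h * gi) for xi, gi in zip(x, g)]
    return x

# Example objective: V(x) = sum_i (x_i - c_i)^2 / 2 with minimizer c in the orthant.
c = [0.5, 2.0]
grad_V = lambda x: [xi - ci for xi, ci in zip(x, c)]
print(mirror_descent(grad_V, [1.0, 1.0], h=0.1, steps=500))
```

Because the update is multiplicative in the primal variable, the iterates remain strictly positive for any step size, which is the automatic constraint enforcement that the barrier property of the mirror map provides.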


Historically, mirror descent was introduced by Nemirovsky and Yudin [NY83] with the
following intuition. Suppose we are optimizing a function 𝑉 which is not defined over the
Euclidean space R𝑑 , but rather over a Banach space B. Then, the gradient of 𝑉 is not an
element of B but rather of the dual space B∗ , and so the gradient descent iteration (10.2.1)
does not even make sense. On the other hand, the mirror descent iteration (10.2.2) works
because the primal point 𝑥𝑘 ∈ B is first mapped to the dual space B∗ via the mapping ∇𝜙.
This reasoning is not so esoteric as it may seem, because even for a function 𝑉 defined
over R𝑑 its natural geometry may correspond to a different norm (e.g., the ℓ1 norm), in
which case R𝑑 is better viewed as a Banach space.
Our aim is to understand the sampling analogue of mirror descent, known as mirror
Langevin. We will keep in mind the key example of constrained sampling. Here, the
potential $V : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ has domain $\mathcal{X} \subsetneq \mathbb{R}^d$. In this case, the standard Langevin algorithm leaves the constraint set $\mathcal{X}$, which is undesirable; in particular, it is not possible
to obtain guarantees in metrics such as KL divergence because the law of the iterate of the
algorithm is not absolutely continuous with respect to the target. Projecting the iterates
onto X does not solve this issue because the law of the iterate will then have positive mass
on the boundary 𝜕X. Besides, projection may not adapt well to the shape of the constraint
set X. Instead, the use of a mirror map 𝜙 which is a barrier for X can automatically enforce
the constraint.

10.2.1 Continuous-Time Considerations


In continuous time, the mirror Langevin diffusion $(Z_t)_{t \ge 0}$ is the solution to the stochastic differential equation
\[
Z_t^* = \nabla\phi(Z_t) , \qquad \mathrm{d}Z_t^* = -\nabla V(Z_t) \, \mathrm{d}t + \sqrt{2} \, [\nabla^2\phi(Z_t)]^{1/2} \, \mathrm{d}B_t . \tag{10.2.3}
\]

Here, the diffusion term is no longer an isotropic Brownian motion but rather involves the matrix $[\nabla^2\phi(Z_t)]^{1/2}$; this is necessary in order to ensure that the stationary distribution is $\pi$. Also, we have given the SDE in the dual space. Using Itô’s formula (Theorem 1.1.18), one can write down an SDE for $(Z_t)_{t \ge 0}$ in the primal space, but it is more complicated, involving the third derivative tensor of $\phi$ (see Exercise 10.1), and as such we prefer to work with the representation (10.2.3).
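Using (10.2.3), we can compute the generator $\mathscr{L}$, the carré du champ $\Gamma$, and the Dirichlet energy $\mathscr{E}$ of the mirror Langevin diffusion (see Section 1.2 and Exercise 10.2):
\[
\Gamma(f, g) = \langle \nabla f, [\nabla^2\phi]^{-1} \, \nabla g \rangle , \qquad \mathscr{E}(f, g) = \int \langle \nabla f, [\nabla^2\phi]^{-1} \, \nabla g \rangle \, \mathrm{d}\pi . \tag{10.2.4}
\]

The expression shows that the mirror Langevin diffusion is reversible with respect to $\pi$. Also, if $\pi_t$ denotes the law of $Z_t$, then
\begin{align*}
\int f \, \partial_t \pi_t
&= \partial_t \int f \, \mathrm{d}\pi_t = \int \mathscr{L} f \, \frac{\pi_t}{\pi} \, \mathrm{d}\pi = -\int \Bigl\langle \nabla f, [\nabla^2\phi]^{-1} \, \nabla \frac{\pi_t}{\pi} \Bigr\rangle \, \mathrm{d}\pi \tag{10.2.5} \\
&= -\int \Bigl\langle \nabla f, [\nabla^2\phi]^{-1} \, \nabla \ln \frac{\pi_t}{\pi} \Bigr\rangle \, \mathrm{d}\pi_t = \int f \operatorname{div}\Bigl( \pi_t \, [\nabla^2\phi]^{-1} \, \nabla \ln \frac{\pi_t}{\pi} \Bigr) \tag{10.2.6}
\end{align*}
from which we deduce the Fokker–Planck equation
\[
\partial_t \pi_t = \operatorname{div}\Bigl( \pi_t \, [\nabla^2\phi]^{-1} \, \nabla \ln \frac{\pi_t}{\pi} \Bigr) .
\]
From the interpretation as a continuity equation (see Theorem 1.3.17), we deduce that $(\pi_t)_{t \ge 0}$ describes the evolution of a particle which travels according to the family of vector fields $t \mapsto -[\nabla^2\phi]^{-1} \, \nabla \ln(\pi_t/\pi)$. Recalling that $\nabla \ln(\pi_t/\pi)$ is the Wasserstein gradient of $\mathsf{KL}(\cdot \,\|\, \pi)$ at $\pi_t$, we can interpret the mirror Langevin diffusion as a “mirror flow” of the KL divergence in Wasserstein space.
Alternatively, we can equip $\mathcal{X}$ with the Riemannian metric induced by $\nabla^2\phi$, i.e., we set $\langle u, v \rangle_x := \langle u, \nabla^2\phi(x) \, v \rangle$. Then, the mirror Langevin diffusion becomes the Wasserstein gradient flow of the KL divergence over the Riemannian manifold $\mathcal{X}$ (see Section 2.6.1).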

The Newton Langevin diffusion. In the special case when the mirror map 𝜙 is chosen
to be the same as the potential 𝑉 , we arrive at a sampling analogue of Newton’s algorithm,
and hence we call it the Newton Langevin diffusion. The equation for the Newton
Langevin diffusion can be written (in the dual space) as

\[
\mathrm{d}Z_t^* = -Z_t^* \, \mathrm{d}t + \sqrt{2} \, [\nabla^2 V^*(Z_t^*)]^{-1/2} \, \mathrm{d}B_t , \tag{10.2.7}
\]

see Exercise 10.3.

Convergence in continuous time. In optimization, Newton’s algorithm has many favorable properties. For example, at least locally, it is known that Newton’s algorithm converges quadratically rather than linearly, which means that the error at iteration $k$ scales as $\exp(-c_1 \exp(c_2 k))$ for constants $c_1, c_2 > 0$. Also, Newton’s algorithm is affine-invariant, meaning that if $A$ is any invertible matrix and we instead apply Newton’s algorithm to the function $\tilde{V}(x) := V(Ax)$, then the iterates $(\tilde{x}_k)_{k \in \mathbb{N}}$ are related to the iterates $(x_k)_{k \in \mathbb{N}}$ of Newton’s algorithm on the original function $V$ via the transformation $\tilde{x}_k = A^{-1} x_k$ (Exercise 10.4). Consequently, the convergence speed of Newton’s algorithm should not be badly affected by poor conditioning of $V$.
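Affine invariance is easy to observe numerically. The sketch below is a one-dimensional illustration only, so the invertible matrix reduces to a nonzero scalar `a`; the test function 𝑉(𝑥) = ln cosh 𝑥 + 𝑥²/2 and all names (`newton`, `dV`, `d2V`, `dV_tilde`, `d2V_tilde`) are choices made for this example and do not come from the text.

```python
import math

def newton(grad, hess, x0, steps):
    """Run Newton's method in one dimension and return the list of iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - grad(x) / hess(x))
    return xs

# V(x) = log cosh(x) + x^2 / 2, a smooth and strongly convex test function.
def dV(x):
    return math.tanh(x) + x

def d2V(x):
    return 2.0 - math.tanh(x) ** 2

a = 1000.0  # a badly scaled change of variables: V_tilde(x) = V(a x)

def dV_tilde(x):
    return a * dV(a * x)        # chain rule

def d2V_tilde(x):
    return a * a * d2V(a * x)   # chain rule, applied twice

xs = newton(dV, d2V, 1.5, 6)
xs_tilde = newton(dV_tilde, d2V_tilde, 1.5 / a, 6)
# The rescaled iterates a * x_tilde_k track x_k up to floating-point rounding.
print([abs(a * xt - x) for x, xt in zip(xs, xs_tilde)])
```

The same experiment with gradient descent in place of Newton's method would require shrinking the step size by a factor of order `a` squared, which is the sensitivity to conditioning that affine invariance avoids.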

Can we expect similar properties to hold for the Newton Langevin diffusion? At
least for the property of affine invariance, we have the Brascamp–Lieb inequality
(Theorem 2.2.8): if 𝜋 ∝ exp(−𝑉 ) is strictly log-concave, then for all 𝑓 : R𝑑 → R,

\[
\operatorname{var}_\pi(f) \le \mathbb{E}_\pi \langle \nabla f, [\nabla^2 V]^{-1} \, \nabla f \rangle .
\]

Below, we will also give an alternative proof of the Bregman transport inequality (Theorem 2.2.10) based on Wasserstein calculus, which implies the Brascamp–Lieb inequality.
Note that in the strongly convex case $\nabla^2 V \succeq \alpha I_d$, it implies a Poincaré inequality (in the sense of Example 1.2.21) for $\pi$ with constant $1/\alpha$. However, in our present context with Dirichlet energy given by (10.2.4), we instead interpret the Brascamp–Lieb inequality as a Poincaré inequality (in the sense of Definition 1.2.18) for the Newton Langevin diffusion. Then, the Poincaré constant is 1, independent of the strong convexity of $V$.
We also obtain a Poincaré inequality for the mirror Langevin diffusion under the
condition of relative strong convexity.

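Definition 10.2.8. Let $\phi, V : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be convex functions, and assume that $\mathcal{X} = \operatorname{int} \operatorname{dom} \phi = \operatorname{int} \operatorname{dom} V$. Then:

1. $V$ is $\alpha$-relatively convex (w.r.t. $\phi$) if for all $x \in \mathcal{X}$,
\[
\nabla^2 V(x) \succeq \alpha \, \nabla^2\phi(x) .
\]

2. $V$ is $\beta$-relatively smooth (w.r.t. $\phi$) if for all $x \in \mathcal{X}$,
\[
\nabla^2 V(x) \preceq \beta \, \nabla^2\phi(x) .
\]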

Observe that when $\phi = \frac{\|\cdot\|^2}{2}$, these definitions reduce to the usual definitions of strong convexity and smoothness. Recall from Definition 2.2.9 that the Bregman divergence $D_\phi$ associated with $\phi$ is the mapping $D_\phi(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by
\[
D_\phi(x, y) := \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle .
\]

The Bregman divergence plays an important role in the analysis of mirror Langevin because it is the correct substitute for the Euclidean distance $(x, y) \mapsto \frac{1}{2} \, \|x - y\|^2$ in this context. Note the following observations: (1) $D_\phi$ is non-negative due to convexity of $\phi$, and if $\phi$ is strictly convex then it equals 0 if and only if its two arguments are equal; (2) since $D_\phi(x, y)$ is defined by subtracting the first-order Taylor expansion of $\phi$ at $y$, it behaves infinitesimally like a squared distance; in particular,
\[
D_\phi(x, y) \sim \frac{1}{2} \, \langle y - x, \nabla^2\phi(x) \, (y - x) \rangle \qquad\text{as } y \to x ;
\]
(3) when $\phi = \frac{\|\cdot\|^2}{2}$, then $D_\phi$ is precisely one-half times the squared Euclidean distance.
Using this definition, we have the following reformulations of relative convexity and relative smoothness (Exercise 10.5).

Lemma 10.2.9. 𝑉 is 𝛼-relatively convex w.r.t. 𝜙 if and only if

𝐷𝑉 ≥ 𝛼 𝐷𝜙 .

Similarly, 𝑉 is 𝛽-relatively smooth w.r.t. 𝜙 if and only if

𝐷𝑉 ≤ 𝛽 𝐷𝜙 .

Returning to the mirror Langevin diffusion, the following corollary is an immediate consequence of the Brascamp–Lieb inequality and the definition of relative convexity.

Corollary 10.2.10 (mirror Poincaré inequality, [Che+20b]). Suppose that the potential $V$ is $\alpha$-relatively convex w.r.t. $\phi$. Then, the mirror Langevin diffusion satisfies the following Poincaré inequality: for all $f : \mathbb{R}^d \to \mathbb{R}$,
\[
\operatorname{var}_\pi f \le \frac{1}{\alpha} \, \mathbb{E}_\pi \langle \nabla f, [\nabla^2\phi]^{-1} \, \nabla f \rangle = \frac{1}{\alpha} \, \mathscr{E}(f, f) .
\]

So far so good: we have defined relative convexity, which is a natural generalization of strong convexity and well-studied in the optimization literature; and we have shown that it implies a Poincaré inequality for the mirror Langevin diffusion.
However, the analogies with the standard Langevin diffusion stop here. There are counterexamples which show that relative convexity does not imply a log-Sobolev inequality for the mirror Langevin diffusion. This may seem to contradict the Bakry–Émery theorem (Theorem 1.2.28), which holds for any Markov diffusion. The issue here is that the assumption of relative convexity is not a curvature-dimension condition. Indeed, in order to properly formulate a curvature-dimension condition for the Hessian manifold $\mathcal{X}$ equipped with the Riemannian metric $\mathfrak{g}$ induced by $\nabla^2\phi$, one must check the $\mathrm{CD}(\alpha, \infty)$ condition $\nabla^2 \tilde{V} + \mathrm{Ric} \succeq \alpha$. Here, $\nabla^2$ denotes the Riemannian Hessian and $\tilde{V}$ is the potential corresponding to the relative density of $\pi$ w.r.t. the Riemannian volume measure on $(\mathcal{X}, \mathfrak{g})$; see Section 2.6. Such a calculation was performed in, e.g., [Kol14], which implies that if $(\nabla\phi)_\# \pi$ is log-concave, then the Newton Langevin diffusion satisfies $\mathrm{CD}(\frac{1}{2}, \infty)$. However, it is not clear under what conditions $(\nabla\phi)_\# \pi$ is log-concave.
Here is another consequence of the fact that relative convexity is not a curvature-dimension condition: relative convexity does not seem to imply contraction properties for the mirror Langevin diffusion with respect to an appropriately defined Wasserstein metric.
To summarize: either we can assume the curvature-dimension condition $\mathrm{CD}(\alpha, \infty)$, which imposes complicated conditions on $\phi$ and $V$, or we can adopt the more interpretable relative convexity assumption, which in turn only implies a Poincaré inequality (Corollary 10.2.10). We will follow the latter approach.
Upon reflection, the curvature-dimension approach for studying the mirror Langevin
diffusion is arguably the less natural one. Indeed, the curvature-dimension approach is
based on viewing the mirror Langevin diffusion from the lens of Riemannian geometry,
but the mirror descent algorithm in optimization is not typically studied via Riemannian
geometry. Instead, the study of mirror descent is based on ideas from convex analysis,
centered around the Bregman divergence. So it seems prudent at this stage to abandon
the Riemannian interpretation of mirror Langevin in favor of convex analysis tools, and
this is indeed how our discretization proof will go. In fact, in lieu of using the Poincaré
inequality in Corollary 10.2.10, we will directly use relative convexity.

10.2.2 Discretization Preliminaries


Following [AC21], we consider the following discretization of (10.2.3).
\begin{align*}
\nabla\phi(X_{kh}^+) &:= \nabla\phi(X_{kh}) - h \, \nabla V(X_{kh}) , \\
X_{(k+1)h} &:= \nabla\phi^*(X_{(k+1)h}^*) ,
\end{align*}
\hfill (MLMC)

where
\[
X_t^* = \nabla\phi(X_{kh}^+) + \sqrt{2} \int_{kh}^t [\nabla^2\phi^*(X_s^*)]^{-1/2} \, \mathrm{d}B_s \qquad\text{for } t \in [kh, (k+1)h] . \tag{10.2.11}
\]

Note that when $\phi = \frac{\|\cdot\|^2}{2}$, this reduces to the standard LMC algorithm. When generalizing LMC to different mirror maps, this discretization is chosen to preserve the “forward-flow” interpretation of [Wib18] (see Section 4.3). In particular, the update from $X_{kh}$ to $X_{kh}^+$ is a mirror descent step, while the update from $X_{kh}^+$ to $X_{(k+1)h}$ follows a “Wasserstein mirror flow” of the (negative) entropy.


However, implementing MLMC requires the exact simulation of a diffusion process, so is it truly a “discretization”? To address this, note that simulating the mirror diffusion (10.2.11) does not require additional queries to the potential $V$, since it only depends on the mirror map $\phi$ (except for the initialization). To an extent, any algorithm based on mirror maps requires the implementation of certain primitive operations involving $\phi$, such as computation of $\nabla\phi$ or inversion of $\nabla\phi$; in practice, this requires $\phi$ to have a “simple” structure such that these operations have closed-form expressions, or are at least cheap enough to be negligible relative to the cost of computing the gradient of $V$. In our consideration of MLMC, we take the diffusion (10.2.11) to be another primitive operation associated with the mirror map $\phi$. This is indeed appropriate for many applications, e.g., when $\phi$ is a separable function $\phi(x) = \sum_{i=1}^d \phi_i(x_i)$, and it will streamline our technical analysis. Nevertheless, implementation of MLMC remains a key obstacle to its practicality and necessitates further research in this direction.
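To make the two-stage structure of MLMC concrete, here is a one-dimensional sketch on the half-line (0, ∞). Everything specific in it is an assumption of this example rather than something taken from the text: the mirror map is the entropic map 𝜙(𝑥) = 𝑥 ln 𝑥 − 𝑥, so that ∇𝜙 = ln and ∇𝜙∗ = exp; and the mirror diffusion is simulated exactly using the observation (obtained here by applying Itô's formula to the dual equation, and not stated in the text) that in primal coordinates it solves d𝑋 = d𝑡 + √(2𝑋) d𝐵, so that 2𝑋 is a squared Bessel process of dimension 2, i.e., the squared norm of a planar Brownian motion. The names `mlmc_step` and `run_mlmc` are hypothetical.

```python
import math
import random

def mlmc_step(x, grad_V, h, rng):
    """One MLMC iteration on (0, infinity) with the entropic mirror map."""
    # Stage 1: mirror descent step, carried out in the dual variable log x.
    x_plus = math.exp(math.log(x) - h * grad_V(x))
    # Stage 2: run the mirror diffusion for time h, started at x_plus.
    # Here 2 X_t is the squared norm of a planar Brownian motion started
    # at (sqrt(2 x_plus), 0), which can be sampled exactly.
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return 0.5 * ((math.sqrt(2.0 * x_plus) + math.sqrt(h) * g1) ** 2 + h * g2 ** 2)

def run_mlmc(x0, grad_V, h, n_steps, rng):
    """Return the MLMC trajectory started at x0."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(mlmc_step(xs[-1], grad_V, h, rng))
    return xs

# Example target: pi proportional to exp(-x) on (0, infinity), i.e. V(x) = x.
rng = random.Random(0)
xs = run_mlmc(1.0, lambda x: 1.0, h=0.05, n_steps=20000, rng=rng)
print(min(xs) > 0, sum(xs[1000:]) / len(xs[1000:]))
```

The second stage makes no queries to the gradient of 𝑉, in line with the remark above that the mirror diffusion depends only on the mirror map; the iterates stay in the constraint set without any projection.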
Our approach to studying MLMC is to adapt the convex optimization approach introduced in Section 4.3.

Key technical results. We now establish the analogues of the various facts that we
invoked in the study of LMC.
First, in the standard gradient descent analysis, if $x^+ := x - h \, \nabla V(x)$, then we have the key identity
\[
\langle \nabla V(x), x^+ - z \rangle = \frac{1}{2h} \, \{ \|x - z\|^2 - \|x^+ - z\|^2 - \|x^+ - x\|^2 \} \qquad\text{for all } z \in \mathbb{R}^d .
\]
Remarkably, there is an analogue of this fact for mirror descent, which follows from the following identity (which can be checked by simple algebra, see Exercise 10.5):
\[
\langle \nabla\phi(\tilde{x}) - \nabla\phi(x), \tilde{x} - z \rangle = D_\phi(\tilde{x}, x) + D_\phi(z, \tilde{x}) - D_\phi(z, x) \qquad\text{for all } x, \tilde{x}, z \in \mathcal{X} . \tag{10.2.12}
\]

Lemma 10.2.13 (Bregman proximal lemma, [CT93]). For $x \in \mathcal{X}$, let $x^+$ be defined via $\nabla\phi(x^+) = \nabla\phi(x) - h \, \nabla V(x)$. Then, for all $z \in \mathcal{X}$,
\[
\langle \nabla V(x), x^+ - z \rangle = \frac{1}{h} \, \{ D_\phi(z, x) - D_\phi(z, x^+) - D_\phi(x^+, x) \} .
\]

Proof. Note that
\[
\langle \nabla V(x), x^+ - z \rangle = -\frac{1}{h} \, \langle \nabla\phi(x^+) - \nabla\phi(x), x^+ - z \rangle .
\]

Substituting the identity (10.2.12) with $\tilde{x} = x^+$ into the above equation proves the result. □
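Both the three-point identity (10.2.12) and the Bregman proximal lemma are exact algebraic identities, so they can be checked to machine precision. The sketch below does this for the entropic mirror map 𝜙(𝑥) = Σ (𝑥ᵢ ln 𝑥ᵢ − 𝑥ᵢ) on the positive orthant; this mirror map, the helper names, and the random test points are all choices made for illustration. The vector `g` plays the role of ∇𝑉(𝑥) and can be arbitrary, since the lemma uses no property of 𝑉.

```python
import math
import random

def phi(x):
    return sum(xi * math.log(xi) - xi for xi in x)

def grad_phi(x):
    return [math.log(xi) for xi in x]

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

def bregman(x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - inner(grad_phi(y), sub(x, y))

def three_point_gap(x, x_tilde, z):
    """Left-hand side minus right-hand side of the three-point identity."""
    lhs = inner(sub(grad_phi(x_tilde), grad_phi(x)), sub(x_tilde, z))
    rhs = bregman(x_tilde, x) + bregman(z, x_tilde) - bregman(z, x)
    return lhs - rhs

def proximal_gap(x, g, h, z):
    """Left-hand side minus right-hand side of the Bregman proximal lemma."""
    x_plus = [math.exp(math.log(xi) - h * gi) for xi, gi in zip(x, g)]  # mirror step
    lhs = inner(g, sub(x_plus, z))
    rhs = (bregman(z, x) - bregman(z, x_plus) - bregman(x_plus, x)) / h
    return lhs - rhs

rng = random.Random(0)
point = lambda: [rng.uniform(0.1, 3.0) for _ in range(3)]
x, x_tilde, z = point(), point(), point()
g = [rng.uniform(-1.0, 1.0) for _ in range(3)]
# Both gaps vanish up to floating-point rounding.
print(three_point_gap(x, x_tilde, z), proximal_gap(x, g, 0.1, z))
```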

Unlike the case of LMC, the presence of a non-constant diffusion matrix involving ∇2𝜙
introduces another source of discretization error. To address this, we introduce a condition
on the third derivative of 𝜙. Note however that a uniform bound on the operator norm
of ∇3𝜙 is not compatible with the assumption that 𝜙 tends to +∞ on 𝜕X. The solution to
this issue was discovered by Nesterov and Nemirovsky [NN94]: we can ask that ∇3𝜙 is
bounded with respect to the geometry induced by 𝜙. This approach also has the benefit of
being consistent with the affine invariance of Newton’s method. The precise definition of
the third derivative condition is as follows.

Definition 10.2.14 (self-concordance). The mirror map $\phi$ is said to be $M_\phi$-self-concordant if for all $x \in \mathcal{X}$ and all $v \in \mathbb{R}^d$,
\[
\nabla^3\phi(x)[v, v, v] \le 2M_\phi \, \|v\|_{\nabla^2\phi(x)}^3 := 2M_\phi \, \langle v, \nabla^2\phi(x) \, v \rangle^{3/2} .
\]

The norm $\|v\|_{\nabla^2\phi(x)} := \sqrt{\langle v, \nabla^2\phi(x) \, v \rangle}$ is called the local norm, and it is the tangent space norm for the Riemannian metric induced by $\phi$.


The definition implies the following result, stated without proof.¹

Lemma 10.2.15 ([Nes18, Corollary 5.1.1]). Suppose that $\phi$ is $M_\phi$-self-concordant. Then, for all $x \in \mathcal{X}$ and $u \in \mathbb{R}^d$,
\[
\nabla^3\phi(x) \, u \preceq 2M_\phi \, \|u\|_{\nabla^2\phi(x)} \, \nabla^2\phi(x) .
\]

Self-concordant functions are well-studied due to their central role in the theory of interior-point methods for optimization, see the monograph [NN94]. A key example of a self-concordant mirror map is when the constraint set is a polytope,
\[
\mathcal{X} = \{x \in \mathbb{R}^d : \langle a_i, x \rangle < b_i \text{ for all } i \in [N]\} ,
\]
in which case the logarithmic barrier $\phi(x) = \sum_{i=1}^N \ln(1/(b_i - \langle a_i, x \rangle))$ is self-concordant with $M_\phi = 1$.
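The self-concordance of the logarithmic barrier can be checked numerically from closed-form directional derivatives. In the sketch below, the function name `barrier_derivatives` and the example polytope (the square with sides |𝑥₁| < 1 and |𝑥₂| < 1) are hypothetical choices; the formulas follow from differentiating 𝑡 ↦ −ln(𝑠 − 𝑐𝑡) three times, where 𝑠 is the slack of a constraint at 𝑥 and 𝑐 is the inner product of its normal with the direction 𝑣.

```python
import random

def barrier_derivatives(A, b, x, v):
    """Second and third directional derivatives of the log barrier at x along v.

    With r_i = <a_i, v> / (b_i - <a_i, x>), the second derivative is the sum of
    r_i^2 and the third derivative is twice the sum of r_i^3.
    """
    second = third = 0.0
    for a_i, b_i in zip(A, b):
        slack = b_i - sum(aij * xj for aij, xj in zip(a_i, x))
        assert slack > 0, "x must lie in the interior of the polytope"
        r = sum(aij * vj for aij, vj in zip(a_i, v)) / slack
        second += r * r
        third += 2.0 * r ** 3
    return second, third

# Example polytope: the open square with sides |x_1| < 1 and |x_2| < 1.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 1.0, 1.0, 1.0]
rng = random.Random(0)
x = [rng.uniform(-0.99, 0.99) for _ in range(2)]
v = [rng.uniform(-1.0, 1.0) for _ in range(2)]
second, third = barrier_derivatives(A, b, x, v)
# Self-concordance with constant 1: |third| <= 2 * second^(3/2).
print(abs(third) <= 2.0 * second ** 1.5)
```

The inequality reduces to the elementary fact that the sum of |rᵢ|³ is at most the 3/2 power of the sum of rᵢ², which is why the constant does not depend on the number of constraints.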
Finally, a key step in our analysis of LMC was to use the convexity of the entropy
functional along 𝑊2 geodesics. In our analysis of MLMC, we will replace the 𝑊2 distance
with the Bregman transport cost D𝑉 (recall Definition 2.2.9). To study these costs, we first
state an analogue of Brenier’s theorem (Theorem 1.3.8).
¹ The proof is surprisingly difficult. The reader can try to prove the result with a worse constant factor.

Theorem 10.2.16 (Brenier’s theorem for the Bregman transport cost). Suppose that
𝜇, 𝜈 ∈ P (R𝑑 ). Then, the unique optimal Bregman transport coupling (𝑋, 𝑌 ) for 𝜇 and 𝜈
is of the form

∇𝜙 (𝑌 ) = ∇𝜙 (𝑋 ) − ∇ℎ(𝑋 ) ,

where ℎ : R𝑑 → R ∪ {−∞} is such that 𝜙 − ℎ is convex.

Proof sketch. We need facts about optimal transport with general costs 𝑐 (Exercise 1.11).
Namely, the optimal dual potentials (𝑓★, 𝑔★) are 𝑐-conjugates, meaning that

\[ f^\star(x) = \inf_{y \in \mathbb{R}^d} \{ c(x, y) - g^\star(y) \} , \qquad g^\star(y) = \inf_{x \in \mathbb{R}^d} \{ c(x, y) - f^\star(x) \} . \]

For 𝛾★-a.e. (𝑥, 𝑦), it holds that 𝑓★(𝑥) + 𝑔★(𝑦) = 𝑐(𝑥, 𝑦), so that 𝑥 attains the infimum in the
definition of 𝑔★(𝑦). If 𝑐 is smooth and such that ∇ₓ𝑐(𝑥, ·) is injective for all 𝑥 ∈ ℝ^𝑑, then the
first-order optimality condition for this infimum suggests that
∇ₓ𝑐(𝑥, 𝑦) = ∇𝑓★(𝑥) for 𝛾★-a.e. (𝑥, 𝑦). See [Vil09, Theorem 10.28] for a rigorous statement
and proof of these results.
Applying this to our cost function 𝑐 = 𝐷𝜙, we obtain 𝐷𝜙-conjugates
ℎ, ℎ̃ : ℝ^𝑑 → ℝ ∪ {−∞} such that ∇ₓ𝐷𝜙(𝑥, 𝑦) = ∇ℎ(𝑥) under the optimal plan 𝛾★.
Since ∇ₓ𝐷𝜙(𝑥, 𝑦) = ∇𝜙(𝑥) − ∇𝜙(𝑦), this gives

\[ \nabla\phi(y) = \nabla\phi(x) - \nabla h(x) , \qquad \text{for } \gamma^\star\text{-a.e. } (x, y) , \]

and

\[ h(x) = \inf_{y \in \mathbb{R}^d} \{ D_\phi(x, y) - \tilde h(y) \} . \]

By expanding out the definition of 𝐷𝜙, we rewrite this as

\[ \phi(x) - h(x) = \sup_{y \in X} \{ \langle \nabla\phi(y), x - y \rangle + \tilde h(y) + \phi(y) \} . \]

Since the right-hand side is a supremum of affine functions of 𝑥, it follows that 𝜙 − ℎ is convex. □


Recall that the usual 𝑊₂ geodesic from 𝜇₀ to 𝜇₁ is given as follows: there is a convex
function 𝜑 : ℝ^𝑑 → ℝ ∪ {∞} such that for 𝑋₀ ∼ 𝜇₀, the pair (𝑋₀, 𝑋₁) ≔ (𝑋₀, ∇𝜑(𝑋₀)) is an
optimal coupling. By taking the linear interpolation 𝑋ₜ ≔ (1 − 𝑡) 𝑋₀ + 𝑡 𝑋₁ and setting
𝜇ₜ ≔ law(𝑋ₜ), we obtain the 𝑊₂ geodesic.

Now consider the above theorem. If we let 𝑊 ≔ ∇𝜙(𝑌) = ∇𝜙(𝑋) − ∇ℎ(𝑋), then
we have the coupling (𝑋₀, 𝑋₁) ≔ (∇(𝜙 − ℎ)*(𝑊), ∇𝜙*(𝑊)) of 𝜇₀ ≔ 𝜇 and 𝜇₁ ≔ 𝜈. Then,
we can interpolate by setting 𝑋ₜ ≔ (1 − 𝑡) 𝑋₀ + 𝑡 𝑋₁ and 𝜇ₜ ≔ law(𝑋ₜ), which defines
an alternative path joining 𝜇₀ to 𝜇₁. Note that (𝑋₀, 𝑊) and (𝑋₁, 𝑊) are both optimally
coupled for the 𝑊₂ distance (since 𝜙 − ℎ is convex). This is a special case of the following.

Definition 10.2.17. A curve (𝜇ₜ)_{𝑡∈[0,1]} ⊆ P₂(ℝ^𝑑) is a generalized geodesic if there
exists another measure 𝜌 ∈ P₂(ℝ^𝑑) such that

\[ \mu_t = \mathrm{law}(X_t) , \qquad X_t := (1 - t)\, X_0 + t\, X_1 , \]

and (𝑋₀, 𝑊), (𝑋₁, 𝑊) are both optimally coupled for the 𝑊₂ metric, where 𝑊 ∼ 𝜌.

It is left as Exercise 10.6 to check that the entropy functional is also convex along
generalized geodesics.

Theorem 10.2.18. Let H(𝜇) ≔ ∫ 𝜇 ln 𝜇. Then, for any generalized geodesic (𝜇ₜ)_{𝑡∈[0,1]},

\[ t \mapsto H(\mu_t) \quad \text{is convex} . \]

In our context, this implies that if 𝜇, 𝜈 ∈ P(ℝ^𝑑) and (𝑋, 𝑌) is an optimal coupling for the
Bregman cost D𝜙(𝜇, 𝜈), then

\[ H(\nu) \ge H(\mu) + \mathbb{E}\langle \nabla \ln \mu(X),\, Y - X \rangle . \tag{10.2.19} \]
As an aside, we remark that this observation leads to another proof of the Bregman
transport inequality.
Proof of the Bregman transport inequality (Theorem 2.2.10). Let 𝜋 = exp(−𝑉), where 𝑉 is
strictly convex, and let 𝜇 ∈ P(ℝ^𝑑). Let 𝑋 ∼ 𝜇, 𝑍 ∼ 𝜋 be optimally coupled for the
Bregman transport cost. Then,

\[ \mathrm{KL}(\mu \,\|\, \pi) = \mathbb{E}\, V(X) + H(\mu) . \]

For the first term, by the definition of 𝐷𝑉,

\[ \mathbb{E}\, V(X) = \mathbb{E}\, V(Z) + \mathbb{E}\, D_V(X, Z) + \mathbb{E}\langle \nabla V(Z),\, X - Z \rangle . \]

For the second term, (10.2.19) implies

\[ H(\mu) \ge H(\pi) + \mathbb{E}\langle \nabla \ln \pi(Z),\, X - Z \rangle . \]

Hence,

\[ \mathrm{KL}(\mu \,\|\, \pi) \ge \underbrace{\mathbb{E}\, V(Z) + H(\pi)}_{= \mathrm{KL}(\pi \,\|\, \pi) = 0} + \mathbb{E}\langle \underbrace{\nabla V(Z) + \nabla \ln \pi(Z)}_{= 0},\, X - Z \rangle + \mathbb{E}\, D_V(X, Z) = \mathcal{D}_V(\mu \,\|\, \pi) , \]

which is what we wanted to show. □

10.2.3 Discretization Analysis


Analysis for the smooth case. We now prove the following result.

Theorem 10.2.20 ([AC21]). Suppose that 𝜋 = exp(−𝑉) is the target distribution and
that 𝜙 is the mirror map. Assume:

• 𝑉 is 𝛼-relatively convex and 𝛽-relatively smooth w.r.t. 𝜙.

• 𝜙 is 𝑀𝜙-self-concordant.

• 𝑉 is 𝐿-relatively Lipschitz w.r.t. 𝜙, i.e., ‖∇𝑉(𝑥)‖_{[∇²𝜙(𝑥)]⁻¹} ≤ 𝐿 for all 𝑥 ∈ X.

Let (𝜇ₖₕ)_{𝑘∈ℕ} denote the law of MLMC and let 𝛽′ ≔ 𝛽 + 2𝐿𝑀𝜙.

1. (weakly convex case) Suppose that 𝛼 = 0. For any 𝜀 ∈ [0, √𝑑], if we take step
size ℎ ≍ 𝜀²/(𝛽′𝑑), then for the mixture distribution 𝜇̄_{𝑁ℎ} ≔ 𝑁⁻¹ ∑_{𝑘=1}^{𝑁} 𝜇ₖₕ it holds that
√(KL(𝜇̄_{𝑁ℎ} ‖ 𝜋)) ≤ 𝜀 after

\[ N = O\Bigl( \frac{\beta' d\, \mathcal{D}_\phi(\pi, \mu_0)}{\varepsilon^4} \Bigr) \quad \text{iterations} . \]

2. (strongly convex case) Suppose that 𝛼 > 0 and let 𝜅 ≔ 𝛽′/𝛼 denote the “condition
number”. Then, for any 𝜀 ∈ [0, √𝑑], with step size ℎ ≍ 𝜀²/(𝛽′𝑑) we obtain
√(𝛼 D𝜙(𝜋, 𝜇_{𝑁ℎ})) ≤ 𝜀 and √(KL(𝜇̄_{𝑁ℎ,2𝑁ℎ} ‖ 𝜋)) ≤ 𝜀 after

\[ N = O\Bigl( \frac{\kappa d}{\varepsilon^2}\, \log \frac{\alpha\, \mathcal{D}_\phi(\pi, \mu_0)}{\varepsilon^2} \Bigr) \quad \text{iterations} , \]

where 𝜇̄_{𝑁ℎ,2𝑁ℎ} ≔ 𝑁⁻¹ ∑_{𝑘=𝑁+1}^{2𝑁} 𝜇ₖₕ.

Similarly to Theorem 4.3.3, the theorem follows from a key recursion.

Lemma 10.2.21. Under the assumptions of Theorem 10.2.20, if ℎ ∈ [0, 1/𝛽], then

\[ h\, \mathrm{KL}(\mu_{(k+1)h} \,\|\, \pi) \le (1 - \alpha h)\, \mathcal{D}_\phi(\pi, \mu_{kh}) - \mathcal{D}_\phi(\pi, \mu_{(k+1)h}) + \beta' d h^2 . \]
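For orientation, here is a sketch of how the recursion yields the rates in Theorem 10.2.20. The shorthand \(D_k := \mathcal{D}_\phi(\pi, \mu_{kh})\) is introduced only for this sketch, and constants are not tracked.

```latex
% Weakly convex case (alpha = 0): sum the recursion over k = 0, ..., N-1 and telescope.
\begin{align*}
h \sum_{k=1}^{N} \mathrm{KL}(\mu_{kh} \,\|\, \pi)
  &\le D_0 - D_N + N \beta' d h^2 \le D_0 + N \beta' d h^2 , \\
\mathrm{KL}(\bar\mu_{Nh} \,\|\, \pi)
  &\le \frac{1}{N} \sum_{k=1}^{N} \mathrm{KL}(\mu_{kh} \,\|\, \pi)
   \le \frac{D_0}{Nh} + \beta' d h
  && \text{(convexity of } \mathrm{KL}(\cdot \,\|\, \pi)\text{)} .
\end{align*}
% Taking h of order eps^2 / (beta' d) and N of order beta' d D_0 / eps^4 makes both terms O(eps^2).
%
% Strongly convex case (alpha > 0): drop the non-negative KL term and iterate.
\begin{align*}
D_{k+1} \le (1 - \alpha h)\, D_k + \beta' d h^2
\quad \Longrightarrow \quad
D_N \le (1 - \alpha h)^N D_0 + \frac{\beta' d h}{\alpha} .
\end{align*}
```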

We proceed to prove the lemma.


Proof. We follow the proof of Theorem 4.3.3, indicating the changes necessary to adapt
the proof to MLMC. Recall that E(𝜇) ≔ ∫ 𝑉 d𝜇.

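1. The forward step dissipates the energy. Let 𝑍 ∼ 𝜋 be optimally coupled to 𝑋ₖₕ.
Then, applying the relative convexity and relative smoothness of 𝑉,

\begin{align*}
\mathcal{E}(\mu_{kh}^+) - \mathcal{E}(\pi)
&= \mathbb{E}[V(X_{kh}^+) - V(X_{kh}) + V(X_{kh}) - V(Z)] \\
&\le \mathbb{E}\bigl[ \langle \nabla V(X_{kh}), X_{kh}^+ - X_{kh} \rangle + \beta\, D_\phi(X_{kh}^+, X_{kh}) \\
&\qquad\quad + \langle \nabla V(X_{kh}), X_{kh} - Z \rangle - \alpha\, D_\phi(Z, X_{kh}) \bigr] \\
&= \mathbb{E}\bigl[ \langle \nabla V(X_{kh}), X_{kh}^+ - Z \rangle + \beta\, D_\phi(X_{kh}^+, X_{kh}) - \alpha\, D_\phi(Z, X_{kh}) \bigr] . \tag{10.2.22}
\end{align*}

Next, by the Bregman proximal lemma (Lemma 10.2.13),

\[ \langle \nabla V(X_{kh}), X_{kh}^+ - Z \rangle = \frac{1}{h}\, \bigl\{ D_\phi(Z, X_{kh}) - D_\phi(Z, X_{kh}^+) - D_\phi(X_{kh}^+, X_{kh}) \bigr\} . \]

Substituting this into (10.2.22) and using ℎ ≤ 1/𝛽 (so that the terms involving 𝐷𝜙(𝑋ₖₕ⁺, 𝑋ₖₕ)
have a non-positive total) yields

\[ \mathcal{E}(\mu_{kh}^+) - \mathcal{E}(\pi) \le \frac{1}{h}\, \bigl\{ (1 - \alpha h)\, \mathcal{D}_\phi(\pi, \mu_{kh}) - \mathcal{D}_\phi(\pi, \mu_{kh}^+) \bigr\} . \tag{10.2.23} \]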

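2. The flow step does not substantially increase the energy. We write

\[ \mathcal{E}(\mu_{(k+1)h}) - \mathcal{E}(\mu_{kh}^+) = \mathbb{E}\bigl[ V\bigl(\nabla\phi^*(X^*_{(k+1)h})\bigr) - V\bigl(\nabla\phi^*(X^*_{kh})\bigr) \bigr] . \]

Let 𝑓(𝑥) ≔ 𝑉(∇𝜙*(𝑥)) and apply Itô’s formula. Note that

\begin{align*}
\nabla f(x) &= \nabla V(\nabla\phi^*(x))^{\mathsf T}\, \nabla^2\phi^*(x) = \nabla V(\nabla\phi^*(x))^{\mathsf T}\, [\nabla^2\phi(\nabla\phi^*(x))]^{-1} , \\
\nabla^2 f(x) &= [\nabla^2 V(\nabla\phi^*(x))]\, [\nabla^2\phi(\nabla\phi^*(x))]^{-1}\, [\nabla^2\phi^*(x)] \\
&\qquad + \nabla V(\nabla\phi^*(x))^{\mathsf T}\, [\nabla^2\phi(\nabla\phi^*(x))]^{-1}\, [\nabla^3\phi(\nabla\phi^*(x))]\, [\nabla^2\phi(\nabla\phi^*(x))]^{-2} .
\end{align*}

Itô’s formula decomposes 𝑓(𝑋*₍ₖ₊₁₎ₕ) − 𝑓(𝑋*ₖₕ) into the sum of an integral and a stochastic
integral; since the latter has mean zero, we focus on the first term.

\begin{align*}
\mathbb{E}[f(X^*_{(k+1)h}) - f(X^*_{kh})]
&= \mathbb{E} \int_{kh}^{(k+1)h} \bigl\langle \nabla^2 V(X_t)\, [\nabla^2\phi(X_t)]^{-2},\, \nabla^2\phi(X_t) \bigr\rangle \, dt \\
&\qquad + \mathbb{E} \int_{kh}^{(k+1)h} \bigl\langle \nabla V(X_t)^{\mathsf T}\, [\nabla^2\phi(X_t)]^{-1}\, [\nabla^3\phi(X_t)]\, [\nabla^2\phi(X_t)]^{-2},\, \nabla^2\phi(X_t) \bigr\rangle \, dt \\
&= \mathbb{E} \int_{kh}^{(k+1)h} \bigl\langle \nabla^2 V(X_t),\, [\nabla^2\phi(X_t)]^{-1} \bigr\rangle \, dt \tag{10.2.24} \\
&\qquad + \mathbb{E} \int_{kh}^{(k+1)h} \operatorname{tr}\bigl( \nabla V(X_t)^{\mathsf T}\, [\nabla^2\phi(X_t)]^{-1}\, [\nabla^3\phi(X_t)]\, [\nabla^2\phi(X_t)]^{-1} \bigr) \, dt . \tag{10.2.25}
\end{align*}

By relative smoothness, since ∇²𝑉 ⪯ 𝛽 ∇²𝜙,

\[ \text{(10.2.24)} \le \beta d h . \]

For (10.2.25), we use Lemma 10.2.15, which implies

\begin{align*}
\text{(10.2.25)}
&\le 2 M_\phi\, \mathbb{E} \int_{kh}^{(k+1)h} \bigl\| [\nabla^2\phi(X_t)]^{-1}\, \nabla V(X_t) \bigr\|_{\nabla^2\phi(X_t)}\, \operatorname{tr}\bigl( [\nabla^2\phi(X_t)]\, [\nabla^2\phi(X_t)]^{-1} \bigr) \, dt \\
&\le 2 M_\phi d\, \mathbb{E} \int_{kh}^{(k+1)h} \| \nabla V(X_t) \|_{[\nabla^2\phi(X_t)]^{-1}} \, dt \le 2 L M_\phi d h .
\end{align*}

Hence, we have proven

\[ \mathcal{E}(\mu_{(k+1)h}) - \mathcal{E}(\mu_{kh}^+) \le \beta' d h . \tag{10.2.26} \]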

3. The flow step dissipates the entropy. Let 𝜇ₜ ≔ law(𝑋ₜ) = law(∇𝜙*(𝑋ₜ*)). The
mirror diffusion (10.2.11) evolves according to the vector field −[∇²𝜙]⁻¹ ∇ ln 𝜇ₜ. Also, note
that ∇_𝑦 𝐷𝜙(𝑥, 𝑦) = −∇²𝜙(𝑦) (𝑥 − 𝑦). Using these, one can show that

\[ \partial_t\, \mathcal{D}_\phi(\pi, \mu_t) \le \mathbb{E}\bigl\langle [\nabla^2\phi(X_t)]^{-1}\, \nabla \ln \mu_t(X_t),\, [\nabla^2\phi(X_t)]\, (Z - X_t) \bigr\rangle = \mathbb{E}\langle \nabla \ln \mu_t(X_t),\, Z - X_t \rangle , \]

where (𝑍, 𝑋ₜ) is an optimal coupling for D𝜙(𝜋, 𝜇ₜ). Using the convexity of H along
generalized geodesics (Theorem 10.2.18),

\[ H(\pi) - H(\mu_t) \ge \mathbb{E}\langle \nabla \ln \mu_t(X_t),\, Z - X_t \rangle . \]

Using the fact that 𝑡 ↦ H(𝜇ₜ) is decreasing (prove this from the Fokker–Planck equation
for the mirror diffusion!) and integrating over 𝑡 ∈ [𝑘ℎ, (𝑘 + 1)ℎ], we then have

\[ \mathcal{D}_\phi(\pi, \mu_{(k+1)h}) - \mathcal{D}_\phi(\pi, \mu_{kh}^+) \le h\, \{ H(\pi) - H(\mu_{(k+1)h}) \} . \tag{10.2.27} \]

Concluding the proof. Multiply (10.2.23) and (10.2.26) by ℎ and add (10.2.27); since
KL(𝜇 ‖ 𝜋) = E(𝜇) + H(𝜇) and E(𝜋) + H(𝜋) = KL(𝜋 ‖ 𝜋) = 0, this yields the claimed recursion. □


To apply Theorem 10.2.20 to the problem of constrained sampling, we can choose 𝜙 to
be a logarithmic barrier for X, which is self-concordant. In the special case when 𝑉 = 𝜙,
the condition that 𝜙 is 𝐿-relatively Lipschitz with respect to itself is commonly expressed
as saying that 𝜙 is a self-concordant barrier with parameters (𝐿, 𝑀𝜙 ). Self-concordant
barriers are also a core part of the theory of interior point methods; in particular, it is
known that every convex body in R𝑑 admits a (𝑑, 2)-self-concordant barrier, and that this
is optimal (see [Che21b; LY21]). However, this situation is “cheating” because if we want
to sample from 𝜋 ∝ exp(−𝜙), it does not make sense to assume we can exactly simulate
the mirror diffusion associated with 𝜙.

Result for the non-smooth case. Although the preceding result applies when 𝜙 is a
logarithmic barrier, it does not apply to perhaps one of the most classical applications
of mirror descent: namely, X is the probability simplex in ℝ^𝑑 and 𝜙(𝑥) ≔ ∑_{𝑖=1}^{𝑑} 𝑥ᵢ ln 𝑥ᵢ is
the entropy. The next result we formulate adopts assumptions which precisely match the
usual ones for mirror descent in this context.

Theorem 10.2.28 ([AC21]). Suppose that 𝜋 = exp(−𝑉) is the target distribution and
that 𝜙 is the mirror map. Let |||·||| be a norm on ℝ^𝑑. Assume:

• 𝑉 is convex and 𝐿-Lipschitz w.r.t. the dual norm |||·|||_*, in the sense that

\[ |||\nabla V(x)|||_* \le L \quad \text{for all } x \in X . \]

• 𝜙 is 1-strongly convex w.r.t. |||·|||.

Let (𝜇ₖₕ)_{𝑘∈ℕ} denote the law of MLMC. For any 𝜀 > 0, if we take step size ℎ ≍ 𝜀²/𝐿², then
for the mixture 𝜇̄_{𝑁ℎ} ≔ 𝑁⁻¹ ∑_{𝑘=1}^{𝑁} 𝜇ₖₕ it holds that √(KL(𝜇̄_{𝑁ℎ} ‖ 𝜋)) ≤ 𝜀 after

\[ N = O\Bigl( \frac{L^2\, \mathcal{D}_\phi(\pi, \mu_0^+)}{\varepsilon^4} \Bigr) \quad \text{iterations} . \]

For example, it is a classical fact that the entropy is strongly convex w.r.t. the ℓ₁ norm.
We leave the proof of the non-smooth case as Exercise 10.7.
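The strong convexity of the entropy w.r.t. the ℓ₁ norm on the simplex is equivalent to Pinsker's inequality, because the Bregman divergence of the entropy between two points of the simplex is their KL divergence. The sketch below tests this on random points; the helper name `kl_simplex`, the dimension, and the sampling scheme are illustrative assumptions.

```python
import numpy as np

def kl_simplex(x, y):
    """Bregman divergence of the negative entropy between two points of the simplex,
    which coincides with the KL divergence sum_i x_i log(x_i / y_i)."""
    return float(np.sum(x * np.log(x / y)))

def random_simplex_point(rng, d):
    # Clip away from zero so that the logarithms stay finite, then renormalize.
    p = np.clip(rng.dirichlet(np.ones(d)), 1e-12, None)
    return p / p.sum()

rng = np.random.default_rng(2)
min_slack = np.inf
for _ in range(2000):
    x = random_simplex_point(rng, 5)
    y = random_simplex_point(rng, 5)
    # 1-strong convexity w.r.t. l1 reads D_phi(x, y) >= 0.5 * ||x - y||_1^2.
    slack = kl_simplex(x, y) - 0.5 * np.sum(np.abs(x - y)) ** 2
    min_slack = min(min_slack, slack)

print(min_slack)  # non-negative up to rounding
```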

10.3 Proximal Langevin

10.4 Stochastic Gradient Langevin

Bibliographical Notes
In the context of optimization, self-concordant barriers play a key role in interior point
methods for constrained optimization [NN94; Bub15; Nes18]. Relative convexity and
relative smoothness were introduced in [BBT17; LFN18].
The first use of mirror maps with the Langevin diffusion was via the mirrored Langevin
algorithm (which is different from the mirror Langevin diffusion) in [Hsi+18]. The mirror
Langevin diffusion was introduced in an earlier draft of [Hsi+18], as well as in [Zha+20].
In [Zha+20], Zhang et al. also studied the Euler–Maruyama discretization of the mirror
Langevin diffusion (which differs from MLMC in that it discretizes the diffusion step
as well), but they were unable to prove convergence of the algorithm; they were only
able to prove convergence to a Wasserstein ball of non-vanishing radius around 𝜋, even
as the step size tends to zero. They also conjectured that the non-vanishing bias of the
algorithm is unavoidable. Subsequently, [Che+20b] studied the mirror Langevin diffusion
in continuous time, and [AC21] introduced and studied the MLMC discretization, which
does lead to vanishing bias (as ℎ ↘ 0).
Since then, there have been further works studying the non-vanishing bias issue: [Jia21]
studied both the Euler–Maruyama and MLMC discretizations under a “mirror log-Sobolev
inequality” and was only able to prove vanishing bias for the latter discretization; [Li+22]
showed that the Euler–Maruyama discretization has vanishing bias under stronger as-
sumptions; and [GV22] studied MLMC as a special case of more general Riemannian
Langevin algorithms. The bias issue is still not settled, and it is certainly of interest to
obtain guarantees for fully discretized algorithms. Nevertheless, in our presentation, we
have stuck with the analysis of [AC21] because it is the cleanest, and because it relies on
assumptions which are well-motivated from convex optimization.

Exercises
Mirror Langevin
Exercise 10.1 (the mirror Langevin diffusion in the primal space)

Use Itô’s formula (Theorem 1.1.18) to show that the mirror Langevin diffusion (10.2.3) in
the primal space solves the SDE

\begin{align*}
dZ_t &= \bigl\{ -[\nabla^2\phi(Z_t)]^{-1}\, \nabla V(Z_t) - [\nabla^2\phi(Z_t)]^{-1}\, \nabla^3\phi(Z_t)\, [\nabla^2\phi(Z_t)]^{-1} \bigr\} \, dt \\
&\qquad + \sqrt{2}\, [\nabla^2\phi(Z_t)]^{-1/2} \, dB_t .
\end{align*}

Exercise 10.2 (Markov semigroup theory for the mirror Langevin diffusion)
1. Compute the generator of the mirror Langevin diffusion. Use this to show that 𝜋 is
stationary for the diffusion, and verify the equations (10.2.4) for the carré du champ
and Dirichlet energy.
2. Let ℒdual denote the generator for (𝑍ₜ*)_{𝑡≥0} (we write ℒdual instead of ℒ* to avoid
confusion with the adjoint of ℒ). By computing ∂ₜ E 𝑓(𝑍ₜ*), show that

\[ \mathscr{L}_{\mathrm{dual}} f = \mathscr{L}(f \circ \nabla\phi) . \]

Then, via a similar calculation to (10.2.5) and (10.2.6), show that the Dirichlet energy
for (𝑍ₜ*)_{𝑡≥0} can be expressed as

\[ \mathscr{E}_{\mathrm{dual}}(f, g) = \int \langle \nabla f,\, [\nabla^2\phi^*]^{-1}\, \nabla g \rangle \, d\pi^* , \]

where 𝜋* ≔ (∇𝜙)_# 𝜋 is the stationary distribution of (𝑍ₜ*)_{𝑡≥0}.

3. Show that the mirror Poincaré inequality in Corollary 10.2.10 implies the following
Poincaré inequality in the dual space: for all 𝑓 : ℝ^𝑑 → ℝ,

\[ \operatorname{var}_{\pi^*} f \le \frac{1}{\alpha}\, \mathbb{E}_{\pi^*} \langle \nabla f,\, [\nabla^2\phi^*]^{-1}\, \nabla f \rangle = \frac{1}{\alpha}\, \mathscr{E}_{\mathrm{dual}}(f, f) . \]

Exercise 10.3 (Newton Langevin diffusion)


Verify the SDE (10.2.7) for the Newton Langevin diffusion. What happens to the mirror
descent iteration (10.2.2) when 𝜙 = 𝑉 ?

Exercise 10.4 (affine invariance of Newton’s method)


Verify the affine invariance of Newton’s algorithm.
Exercise 10.5 (properties of the Bregman divergence)

1. Prove the alternative definition of relative convexity/smoothness (Lemma 10.2.9).

2. If 𝜙* is the convex conjugate of 𝜙, prove that 𝐷𝜙(𝑥, 𝑥′) = 𝐷_{𝜙*}(∇𝜙(𝑥′), ∇𝜙(𝑥)).

3. Check the identity (10.2.12).

Exercise 10.6 (generalized geodesic convexity of the entropy)


Generalize the proof of (1.4.3) to prove the convexity of entropy along generalized
geodesics (Theorem 10.2.18).

Exercise 10.7 (non-smooth guarantee for MLMC)


Adapt the proof of Theorem 4.3.9 using the techniques of this chapter to prove the non-
smooth guarantee for MLMC (Theorem 10.2.28).
CHAPTER 11

Non-Log-Concave Sampling

In this chapter, we study the problem of sampling from a smooth but non-log-concave
target. Although some results from previous chapters also cover some non-log-concave
targets (such as targets satisfying a Poincaré or log-Sobolev inequality), these results do
not encompass the full breadth of the non-log-concave sampling problem.
In general, one cannot hope for polynomial-time guarantees for sampling from
non-log-concave targets in usual metrics such as total variation distance. Instead, taking
inspiration from the literature on non-convex optimization, we will develop a notion of
approximate first-order stationarity for sampling, and show that this goal can be achieved
via an averaged version of the LMC algorithm. This is based on the work [Bal+22].

11.1 Approximate First-Order Stationarity via Fisher Information
Suppose that 𝑉 is smooth, but non-convex. In general, optimization lower bounds show
that finding an approximate global minimizer of 𝑉 is computationally intractable, i.e., the
oracle complexity scales exponentially in the dimension 𝑑. To circumvent this, the notion
of approximate first-order stationarity has arisen as the performance metric of choice in
the non-convex optimization literature. Under this metric, we seek to find the minimal
number of queries required to output a point 𝑥 such that ‖∇𝑉(𝑥)‖ ≤ 𝜀.
Of course, in practice we may desire stronger guarantees, but first-order stationarity
is often a useful first step towards more detailed analysis, and it has the advantage that we
can develop a general theory surrounding this notion. Note that in the convex case, finding
a global minimizer is equivalent to finding a first-order stationary point, so stationary
point analysis can be viewed as a natural generalization of the convex optimization
analysis to non-convex settings.
To develop a sampling analogue of this concept, we recall that the Langevin diffusion
is the gradient flow of the KL divergence KL(· ‖ 𝜋) w.r.t. the Wasserstein geometry
(Section 1.4). Moreover, the gradient of the KL divergence at 𝜇 is ∇ ln(𝜇/𝜋), and the
squared norm of the gradient is the Fisher information FI(𝜇 ‖ 𝜋) = E𝜇[‖∇ ln(𝜇/𝜋)‖²].
Hence, a reasonable definition of finding an approximate first-order stationary point in
sampling is to output a sample from 𝜇 satisfying √(FI(𝜇 ‖ 𝜋)) ≤ 𝜀. We will show shortly
that it is indeed possible to achieve this goal in polynomially many queries to ∇𝑉 as
soon as ∇𝑉 is Lipschitz, thereby establishing a framework for stationarity analysis in
non-log-concave sampling. Before doing so, however, we pause to gain intuition for this
solution concept.

Lack of spurious stationary points. An interesting feature of the Fisher information
is that, unlike for general non-convex optimization, if 𝜋 satisfies some mild regularity
conditions (e.g., 𝜋 has a smooth and positive density on ℝ^𝑑), then there are no spurious
stationary points: FI(𝜇 ‖ 𝜋) = 0 implies 𝜇 = 𝜋. This is a specific feature of the sampling
problem.¹ The intuition behind the proof is straightforward: if FI(𝜇 ‖ 𝜋) = 0, then
∇ ln(𝜇/𝜋) = 0 (𝜋-a.e.), so the density 𝜇 is proportional to 𝜋. Since 𝜇 is a probability
measure, 𝜇 must equal 𝜋. (See however the technical remark below.)
This might suggest that our goal of obtaining √(FI(𝜇 ‖ 𝜋)) ≤ 𝜀 is too ambitious, because
obtaining a small value of the Fisher information would solve the general problem of
non-log-concave sampling. This is in fact not the case, and the devil is in the details: it is
true that a small value of FI(𝜇 ‖ 𝜋) implies 𝜇 is close to 𝜋, but how small must FI(𝜇 ‖ 𝜋)
be? For highly non-log-concave targets 𝜋, typically the Fisher information should be
exponentially small in order for 𝜇 to be close to 𝜋 in total variation distance.
We illustrate this point with an example: suppose that the target distribution 𝜋 is a
mixture of Gaussians in one dimension, 𝜋 = ½ 𝜋₋ + ½ 𝜋₊, where 𝜋∓ ≔ normal(∓𝑚, 1). Also,
suppose that 𝜇 is a mixture of the same two Gaussians, but with the wrong mixing weights:
𝜇 = ¾ 𝜋₋ + ¼ 𝜋₊. Then, we leave the following computation to the reader (Exercise 11.1).

¹ In fact, the KL divergence KL(· ‖ 𝜋) is always strictly convex with respect to taking convex combinations
of measures, and hence always has a unique global minimum.

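Proposition 11.1.1. Let 𝑚 > 0 and let 𝜋∓ ≔ normal(∓𝑚, 1). Let 𝜇 ≔ ¾ 𝜋₋ + ¼ 𝜋₊ and
𝜋 ≔ ½ 𝜋₋ + ½ 𝜋₊. Then, it holds that

\[ \liminf_{m \to \infty} \| \mu - \pi \|_{\mathrm{TV}} > 0 \]

whereas

\[ \mathrm{FI}(\mu \,\|\, \pi) \lesssim m^2 \exp\Bigl( -\frac{m^2}{2} \Bigr) \to 0 \qquad \text{as } m \to \infty . \]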
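The contrast in Proposition 11.1.1 can be observed numerically. The sketch below approximates the total variation distance and the Fisher information on a grid using a plain Riemann sum; the grid size, the integration window, and the function names are choices made for this illustration only, and the output is not a substitute for the proof requested in Exercise 11.1.

```python
import numpy as np

def gaussian_pdf(x, mean):
    """Density of normal(mean, 1)."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

def tv_and_fisher(m, n_grid=400_001):
    """Riemann-sum approximations of ||mu - pi||_TV and FI(mu || pi)."""
    x = np.linspace(-m - 10.0, m + 10.0, n_grid)
    dx = x[1] - x[0]
    p_minus, p_plus = gaussian_pdf(x, -m), gaussian_pdf(x, m)
    mu = 0.75 * p_minus + 0.25 * p_plus
    pi = 0.5 * p_minus + 0.5 * p_plus
    tv = 0.5 * np.sum(np.abs(mu - pi)) * dx
    # With r = p_plus / p_minus = exp(2 m x), we have mu / pi = (3 + r) / (2 (1 + r)),
    # hence d/dx ln(mu / pi) = -4 m r / ((3 + r) (1 + r)).
    r = np.exp(2.0 * m * x)
    score = -4.0 * m * r / ((3.0 + r) * (1.0 + r))
    fisher = np.sum(mu * score ** 2) * dx
    return tv, fisher

results = {m: tv_and_fisher(m) for m in (2.0, 4.0, 6.0)}
for m, (tv, fisher) in results.items():
    print(f"m = {m}: TV ~ {tv:.4f}, FI ~ {fisher:.3e}")
```

As 𝑚 grows, the total variation distance stays bounded away from zero while the Fisher information decays rapidly, which is the behavior the proposition describes.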

Metastability. The example in Proposition 11.1.1 also provides an interpretation of the
Fisher information. If we try to sample from the mixture of Gaussians 𝜋, then for 𝑚 ≫ 1
it takes an exponentially long time for the Langevin diffusion to jump to one mode from
the other; this is the main reason behind the slow mixing of Langevin. Since it is hard to
jump between the modes, it is difficult for the Langevin diffusion to “learn” the global
mixing weights (½, ½) of 𝜋. On the other hand, Proposition 11.1.1 shows that even with
the wrong mixing weights (¾, ¼), the Fisher information FI(𝜇 ‖ 𝜋) is small, demonstrating
that the Fisher information is insensitive to the global weights.
The example in Proposition 11.1.1 therefore paints a cartoon picture of the behavior
of the Langevin diffusion with target 𝜋, initialized at ¾ 𝛿₋ₘ + ¼ 𝛿₊ₘ: we expect that the
Langevin diffusion quickly explores and captures the local structure of the modes but fails
to jump between the modes, arriving at a distribution which resembles 𝜇; it is this local
mixing that a Fisher information bound captures. Meanwhile, the Langevin diffusion only
obtains the correct global weights after an exponentially long waiting time.
The state 𝜇 is not truly stable for the Langevin diffusion: given enough time, the
diffusion will eventually move away from 𝜇 and reach 𝜋. However, since states like 𝜇
persist for a very long period of time, they are usually called metastable in the statistical
physics literature. A Fisher information bound can be interpreted as a way of quantitatively
measuring the metastability phenomenon.

Technical remark. One has to be slightly careful with the definition of the Fisher
information. For example, suppose that 𝜋 is the standard Gaussian, and suppose that 𝜇 is
the Gaussian restricted to the unit ball. Then, it is tempting to argue that the density of 𝜇
is proportional to that of 𝜋 on the unit ball, and hence ∇ ln(𝜇/𝜋) = 0 (𝜇-a.e.); the
expression FI(𝜇 ‖ 𝜋) = E𝜇[‖∇ ln(𝜇/𝜋)‖²] then suggests that FI(𝜇 ‖ 𝜋) = 0, and in particular
that 𝜇 is a spurious stationary point. However, this argument is not correct.
The reason is that 𝜇 does not have enough regularity w.r.t. 𝜋 in order to apply the
formula FI(𝜇 ‖ 𝜋) = E𝜇[‖∇ ln(𝜇/𝜋)‖²]. Indeed, in order to apply the formula, we must
require that the density d𝜇/d𝜋 lie in an appropriate Sobolev space w.r.t. 𝜋 (more precisely,
√(d𝜇/d𝜋) should lie in the domain of the Dirichlet energy functional). If this does not hold,
then we define the Fisher information to be infinite: FI(𝜇 ‖ 𝜋) = ∞.
In our theorem below, the Fisher information bound should be interpreted as follows:
√(FI(𝜇 ‖ 𝜋)) ≤ 𝜀 means that 𝜇 has enough regularity w.r.t. 𝜋 and E𝜇[‖∇ ln(𝜇/𝜋)‖²] ≤ 𝜀².

11.2 Fisher Information Bound

As before, we consider the interpolation of LMC,

\[ X_t = X_{kh} - (t - kh)\, \nabla V(X_{kh}) + \sqrt{2}\, (B_t - B_{kh}) , \qquad t \in [kh, (k+1)h] . \]

The Fisher information bound in the next theorem will be proven using the interpolation
technique (§4.2).

Theorem 11.2.1 ([Bal+22]). Let (𝜇ₜ)_{𝑡≥0} denote the law of the interpolation of LMC
with step size ℎ > 0. Assume that 𝜋 ∝ exp(−𝑉) where ∇𝑉 is 𝛽-Lipschitz. Then, for any
step size 0 < ℎ ≤ 1/(4𝛽), for all 𝑁 ∈ ℕ,

\[ \frac{1}{Nh} \int_0^{Nh} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt \le \frac{2\, \mathrm{KL}(\mu_0 \,\|\, \pi)}{Nh} + 6 \beta^2 d h . \]

In particular, if KL(𝜇₀ ‖ 𝜋) ≤ 𝐾₀ and we choose ℎ = √𝐾₀/(2𝛽√(𝑑𝑁)), then provided that
𝑁 ≥ 9𝐾₀/𝑑,

\[ \frac{1}{Nh} \int_0^{Nh} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt \le \frac{8 \beta \sqrt{d K_0}}{\sqrt{N}} . \]
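To see where the second bound comes from, one can substitute the stated step size into the first bound. The following worked computation is a sketch that only uses the quantities appearing in the theorem.

```latex
% Substituting h = \sqrt{K_0} / (2 \beta \sqrt{d N}) into the two terms of the first bound:
\begin{align*}
\frac{2 K_0}{N h}
  &= \frac{2 K_0 \cdot 2 \beta \sqrt{d N}}{N \sqrt{K_0}}
   = \frac{4 \beta \sqrt{d K_0}}{\sqrt{N}} , \\
6 \beta^2 d h
  &= \frac{6 \beta^2 d \sqrt{K_0}}{2 \beta \sqrt{d N}}
   = \frac{3 \beta \sqrt{d K_0}}{\sqrt{N}} , \\
\frac{1}{Nh} \int_0^{Nh} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt
  &\le \frac{7 \beta \sqrt{d K_0}}{\sqrt{N}}
   \le \frac{8 \beta \sqrt{d K_0}}{\sqrt{N}} .
\end{align*}
% The step size constraint h \le 1/(4\beta) is equivalent to N \ge 4 K_0 / d,
% so the stated condition N \ge 9 K_0 / d is sufficient.
```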

In order to translate the result into a more useful form, we recall that the Fisher
information is convex in its first argument.

Lemma 11.2.2. The Fisher information functional FI(· ‖ 𝜋) is convex.

Proof. Let 𝜇₀, 𝜇₁ ∈ P(ℝ^𝑑) be such that FI(𝜇₀ ‖ 𝜋) ∨ FI(𝜇₁ ‖ 𝜋) < ∞. For 𝑡 ∈ (0, 1), let
𝜇ₜ ≔ (1 − 𝑡) 𝜇₀ + 𝑡 𝜇₁, and write 𝑓ₜ ≔ d𝜇ₜ/d𝜋 = (1 − 𝑡) 𝑓₀ + 𝑡 𝑓₁. Then,

\begin{align*}
\mathrm{FI}(\mu_t \,\|\, \pi) = \int \frac{\|\nabla f_t\|^2}{f_t} \, d\pi
&\le (1 - t) \int \frac{\|\nabla f_0\|^2}{f_0} \, d\pi + t \int \frac{\|\nabla f_1\|^2}{f_1} \, d\pi \\
&= (1 - t)\, \mathrm{FI}(\mu_0 \,\|\, \pi) + t\, \mathrm{FI}(\mu_1 \,\|\, \pi)
\end{align*}

follows from the joint convexity of (𝑎, 𝑏) ↦ ‖𝑎‖²/𝑏 on ℝ^𝑑 × ℝ_{>0}. □


Hence, for the averaged measure 𝜇̄_{𝑁ℎ} ≔ (1/(𝑁ℎ)) ∫₀^{𝑁ℎ} 𝜇ₜ d𝑡, we have

\[ \mathrm{FI}(\bar\mu_{Nh} \,\|\, \pi) \le \frac{1}{Nh} \int_0^{Nh} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt \tag{11.2.3} \]

and the guarantees of the theorem translate into guarantees for 𝜇̄_{𝑁ℎ}. Moreover, we can
output a sample from 𝜇̄_{𝑁ℎ} via the following procedure:
1. Pick a time 𝑡 ∈ [0, 𝑁ℎ] uniformly at random.
2. Let 𝑘 be the largest integer such that 𝑘ℎ ≤ 𝑡, and let 𝑋ₖₕ denote the 𝑘-th iterate of
the LMC algorithm.
3. Perform a partial LMC update

\[ X_t = X_{kh} - (t - kh)\, \nabla V(X_{kh}) + \sqrt{2}\, (B_t - B_{kh}) \]

and output 𝑋ₜ.
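The three-step procedure above can be sketched as follows. The function name `sample_averaged_lmc` and the example potential 𝑉(𝑥) = ∑ᵢ (𝑥ᵢ²/2 + 2 cos 𝑥ᵢ), whose gradient is 3-Lipschitz and which is non-convex, are assumptions made for this illustration.

```python
import numpy as np

def sample_averaged_lmc(grad_V, x0, h, N, rng):
    """Draw one sample from the time-averaged law of the LMC interpolation on [0, N h]."""
    t = rng.uniform(0.0, N * h)   # step 1: uniformly random time
    k = int(t // h)               # step 2: largest integer k with k h <= t
    x = np.array(x0, dtype=float)
    for _ in range(k):            # run k full LMC steps to obtain X_{kh}
        x = x - h * grad_V(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    s = max(t - k * h, 0.0)       # step 3: partial step of length t - k h
    x = x - s * grad_V(x) + np.sqrt(2.0 * s) * rng.standard_normal(x.shape)
    return x

# Illustrative non-convex potential with 3-Lipschitz gradient (applied coordinatewise).
grad_V = lambda x: x - 2.0 * np.sin(x)

rng = np.random.default_rng(0)
# h = 0.05 respects the constraint h <= 1 / (4 beta) = 1/12 for beta = 3.
samples = np.array([
    sample_averaged_lmc(grad_V, np.zeros(2), h=0.05, N=200, rng=rng)
    for _ in range(50)
])
print(samples.mean(axis=0))
```

The Brownian increment over a time interval of length 𝑠 is simulated as √𝑠 times a standard Gaussian vector, so each draw costs at most 𝑁 + 1 gradient evaluations.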
Combined with Theorem 11.2.1 and (11.2.3), and assuming that KL(𝜇₀ ‖ 𝜋) = 𝑂(𝑑), we
conclude that it is possible to algorithmically obtain a sample from a measure 𝜇 with
√(FI(𝜇 ‖ 𝜋)) ≤ 𝜀 using 𝑂(𝛽²𝑑²/𝜀⁴) queries to ∇𝑉.
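The query count follows by solving the second bound of Theorem 11.2.1 for 𝑁; the short computation below is a sketch with the same notation.

```latex
% Requiring the right-hand side of the second bound to be at most eps^2:
\begin{align*}
\frac{8 \beta \sqrt{d K_0}}{\sqrt{N}} \le \varepsilon^2
\quad \Longleftrightarrow \quad
N \ge \frac{64\, \beta^2 d K_0}{\varepsilon^4} ,
\end{align*}
% and with K_0 = O(d) this gives N = O(\beta^2 d^2 / \varepsilon^4).
% Each LMC iteration uses one query to \nabla V, and the partial update uses at most one more.
```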
We now give the proof of Theorem 11.2.1, which combines the usual stationary point
analysis in non-convex optimization with the interpolation argument.
Proof of Theorem 11.2.1. Recall from the proof of Theorem 4.2.5 that

\[ \partial_t\, \mathrm{KL}(\mu_t \,\|\, \pi) \le -\frac{1}{2}\, \mathrm{FI}(\mu_t \,\|\, \pi) + 6 \beta^2 d\, (t - kh) . \]

This inequality was obtained under the sole assumption that ∇𝑉 is 𝛽-Lipschitz. In
Theorem 4.2.5, we proceeded to apply a log-Sobolev inequality, but here we will instead
telescope this inequality. By integrating over 𝑡 ∈ [𝑘ℎ, (𝑘 + 1)ℎ],

\[ \mathrm{KL}(\mu_{(k+1)h} \,\|\, \pi) - \mathrm{KL}(\mu_{kh} \,\|\, \pi) \le -\frac{1}{2} \int_{kh}^{(k+1)h} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt + 3 \beta^2 d h^2 . \]

Summing over 𝑘 = 0, 1, . . . , 𝑁 − 1, using KL(𝜇_{𝑁ℎ} ‖ 𝜋) ≥ 0, and dividing by 𝑁ℎ/2,

\[ \frac{1}{Nh} \int_0^{Nh} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt \le \frac{2\, \mathrm{KL}(\mu_0 \,\|\, \pi)}{Nh} + 6 \beta^2 d h . \]

This proves the first statement; the second statement is obtained by optimizing over the
choice of ℎ. □

11.3 Applications of the Fisher Information Bound

Asymptotic convergence of averaged LMC. Theorem 11.2.1 holds under very
weak assumptions (only smoothness of 𝑉) and implies that the Fisher information can
be driven to zero. Since FI(𝜇 ‖ 𝜋) = 0 implies that 𝜇 = 𝜋, putting these facts together
leads to a straightforward proof of the asymptotic convergence of averaged LMC.
For a sequence of positive step sizes (ℎₖ)_{𝑘∈ℕ₊}, let 𝜏ₙ ≔ ∑_{𝑘=1}^{𝑛} ℎₖ and consider the
interpolation

\[ X_t = X_{\tau_{n-1}} - (t - \tau_{n-1})\, \nabla V(X_{\tau_{n-1}}) + \sqrt{2}\, (B_t - B_{\tau_{n-1}}) , \qquad t \in [\tau_{n-1}, \tau_n] . \tag{11.3.1} \]

Theorem 11.3.2. Let (𝜇ₜ)_{𝑡≥0} denote the law of the interpolation (11.3.1) of LMC, and
suppose that the target is 𝜋 ∝ exp(−𝑉) where ∇𝑉 is 𝛽-Lipschitz. Suppose that LMC is
initialized at a measure 𝜇₀ with KL(𝜇₀ ‖ 𝜋) < ∞, and that the sequence of step sizes
satisfies 0 < ℎₖ < 1/(4𝛽) for all 𝑘 ∈ ℕ₊ together with the conditions

\[ \sum_{k=1}^{\infty} h_k = \infty , \qquad \text{and} \qquad \sum_{k=1}^{\infty} h_k^2 < \infty . \]

Write 𝜇̄_{𝜏ₙ} ≔ (1/𝜏ₙ) ∫₀^{𝜏ₙ} 𝜇ₜ d𝑡. Then, 𝜇̄_{𝜏ₙ} → 𝜋 weakly.
𝜏𝑛 0

Proof. By repeating the proof of Theorem 11.2.1 but incorporating time-varying step sizes,
we obtain for 𝑡 ∈ [𝜏ₙ, 𝜏ₙ₊₁]

\[ \partial_t\, \mathrm{KL}(\mu_t \,\|\, \pi) \le -\frac{1}{2}\, \mathrm{FI}(\mu_t \,\|\, \pi) + 6 \beta^2 d\, (t - \tau_n) . \tag{11.3.3} \]

By integrating this inequality and summing,

\[ \mathrm{KL}(\mu_{\tau_n} \,\|\, \pi) \le \mathrm{KL}(\mu_0 \,\|\, \pi) - \frac{1}{2} \int_0^{\tau_n} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt + 3 \beta^2 d \sum_{k=1}^{n} h_k^2 . \tag{11.3.4} \]

Rearranging and using the convexity of the Fisher information yields

\[ \mathrm{FI}(\bar\mu_{\tau_n} \,\|\, \pi) \le \frac{1}{\tau_n} \int_0^{\tau_n} \mathrm{FI}(\mu_t \,\|\, \pi) \, dt \le \frac{2\, \mathrm{KL}(\mu_0 \,\|\, \pi)}{\tau_n} + \frac{6 \beta^2 d}{\tau_n} \sum_{k=1}^{n} h_k^2 . \tag{11.3.5} \]

On the other hand, if 𝑡 ∈ [𝜏ₙ, 𝜏ₙ₊₁], then integrating (11.3.3) and combining with (11.3.4)
yields

\[ \mathrm{KL}(\mu_t \,\|\, \pi) \le \mathrm{KL}(\mu_{\tau_n} \,\|\, \pi) + 3 \beta^2 d\, (t - \tau_n)^2 \le \mathrm{KL}(\mu_0 \,\|\, \pi) + 6 \beta^2 d \sum_{k=1}^{\infty} h_k^2 < \infty . \]

Therefore, {KL(𝜇ₜ ‖ 𝜋) | 𝑡 ≥ 0} is bounded, and the convexity of the KL divergence implies
that {KL(𝜇̄_{𝜏ₙ} ‖ 𝜋) | 𝑛 ∈ ℕ₊} is bounded. Since the sublevel sets of KL(· ‖ 𝜋) are compact,
to prove the theorem it suffices to show that every weak limit of (𝜇̄_{𝜏ₙ})_{𝑛∈ℕ₊} is equal to 𝜋.
Consider a subsequence of (𝜇̄_{𝜏ₙ})_{𝑛∈ℕ₊} converging to a weak limit 𝜇̄.
Taking 𝑛 → ∞ in (11.3.5) and noting that 𝜏ₙ → ∞, we have FI(𝜇̄_{𝜏ₙ} ‖ 𝜋) → 0 and thus
along the subsequence as well. It is known that FI(· ‖ 𝜋) is weakly lower semicontinuous,
so FI(𝜇̄ ‖ 𝜋) = 0. However, since ∇𝑉 is Lipschitz, 𝜋 has a continuous and strictly
positive density on ℝ^𝑑, so FI(𝜇̄ ‖ 𝜋) = 0 entails 𝜇̄ = 𝜋 as desired. □
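To get a feel for the right-hand side of (11.3.5), the sketch below evaluates it for the step size schedule ℎₖ = ℎ₁/𝑘, which satisfies both summability conditions of Theorem 11.3.2. The schedule, the parameter values, and the function name `fisher_bound` are illustrative assumptions.

```python
import numpy as np

def fisher_bound(n, h1=0.2, beta=1.0, d=1.0, kl0=1.0):
    """Right-hand side of (11.3.5) for the assumed schedule h_k = h1 / k.

    Requires h1 < 1 / (4 * beta) so that every step size is admissible.
    """
    k = np.arange(1, n + 1)
    h = h1 / k
    tau_n = h.sum()
    return (2.0 * kl0 + 6.0 * beta ** 2 * d * np.sum(h ** 2)) / tau_n

for n in (10, 10**3, 10**5):
    print(n, fisher_bound(n))
```

With this schedule 𝜏ₙ grows only logarithmically in 𝑛, so the bound tends to zero slowly; the theorem is an asymptotic statement and does not claim a rate.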

Convergence in total variation distance under a Poincaré inequality. As the
example in Proposition 11.1.1 shows, for general non-log-concave targets a Fisher
information bound does not translate into guarantees in other metrics. However, this can be
carried out if 𝜋 satisfies appropriate functional inequalities. For example, by definition, a
log-Sobolev inequality for 𝜋 states that

\[ \mathrm{KL}(\mu \,\|\, \pi) \lesssim \mathrm{FI}(\mu \,\|\, \pi) \qquad \text{for all } \mu \in \mathcal{P}(\mathbb{R}^d) , \]

and in this case a Fisher information guarantee readily translates into a KL divergence
guarantee; however, this is not very interesting because we have obtained a sharper KL
divergence guarantee for targets 𝜋 satisfying a log-Sobolev inequality in Theorem 4.2.5.
Instead, we will show that under the weaker assumption of a Poincaré inequality, a Fisher
information guarantee implies a total variation guarantee.
The key observation is the following implication of a Poincaré inequality.

Proposition 11.3.6. Suppose that 𝜋 satisfies a Poincaré inequality with constant 𝐶_PI.
Then, for all 𝜇 ∈ P(ℝ^𝑑),

\[ \| \mu - \pi \|_{\mathrm{TV}}^2 \le \frac{C_{\mathrm{PI}}}{4}\, \mathrm{FI}(\mu \,\|\, \pi) . \]

Proof. We can assume 𝜇 ≪ 𝜋; let 𝑓 ≔ d𝜇/d𝜋. The total variation distance has the expressions

\[ \| \mu - \pi \|_{\mathrm{TV}} = \frac{1}{2} \int |f - 1| \, d\pi = \frac{1}{2} \int \{ (f \vee 1) - (f \wedge 1) \} \, d\pi \]

which yields ∫ (𝑓 ∧ 1) d𝜋 = 1 − ‖𝜇 − 𝜋‖_TV and ∫ (𝑓 ∨ 1) d𝜋 = 1 + ‖𝜇 − 𝜋‖_TV. Using this,

\begin{align*}
\int \sqrt{f} \, d\pi = \int \sqrt{(f \wedge 1)\, (f \vee 1)} \, d\pi
&\le \sqrt{\int (f \wedge 1) \, d\pi \int (f \vee 1) \, d\pi} \\
&= \sqrt{(1 - \| \mu - \pi \|_{\mathrm{TV}})\, (1 + \| \mu - \pi \|_{\mathrm{TV}})} = \sqrt{1 - \| \mu - \pi \|_{\mathrm{TV}}^2} .
\end{align*}

Therefore,

\[ \| \mu - \pi \|_{\mathrm{TV}}^2 \le 1 - \Bigl( \int \sqrt{f} \, d\pi \Bigr)^2 = \operatorname{var}_\pi \sqrt{f} . \]

This is sometimes called Le Cam’s inequality; in statistics, the right-hand side is often
written as 𝐻²(𝜇, 𝜋) (1 − ¼ 𝐻²(𝜇, 𝜋)), where 𝐻² denotes the squared Hellinger distance.
Applying the Poincaré inequality,

\[ \| \mu - \pi \|_{\mathrm{TV}}^2 \le C_{\mathrm{PI}}\, \mathbb{E}_\pi [ \| \nabla \sqrt{f} \|^2 ] = \frac{C_{\mathrm{PI}}}{4}\, \mathrm{FI}(\mu \,\|\, \pi) . \qquad \square \]
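The inequality of Proposition 11.3.6 can be checked in closed form on a simple example, chosen here purely for illustration: 𝜋 = normal(0, 1), which satisfies a Poincaré inequality with constant 1, and 𝜇 = normal(𝑚, 1). In this case ∇ ln(𝜇/𝜋) ≡ 𝑚, so FI(𝜇 ‖ 𝜋) = 𝑚², and ‖𝜇 − 𝜋‖_TV = erf(|𝑚|/(2√2)).

```python
import math

def tv_gaussian_shift(m):
    """Total variation distance between normal(m, 1) and normal(0, 1)."""
    return math.erf(abs(m) / (2.0 * math.sqrt(2.0)))

def fisher_gaussian_shift(m):
    """FI(normal(m, 1) || normal(0, 1)); the relative score is the constant m."""
    return m * m

C_PI = 1.0  # Poincare constant of the standard Gaussian
checks = [
    (m, tv_gaussian_shift(m) ** 2, C_PI / 4.0 * fisher_gaussian_shift(m))
    for m in (0.1, 0.5, 1.0, 2.0, 4.0)
]
for m, lhs, rhs in checks:
    print(f"m = {m}: TV^2 = {lhs:.4f} <= (C_PI / 4) FI = {rhs:.4f}")
```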

Corollary 11.3.7. Let (𝜇ₜ)_{𝑡≥0} denote the law of the interpolation of LMC with step
size ℎ > 0. Assume that 𝜋 ∝ exp(−𝑉), where ∇𝑉 is 𝛽-Lipschitz and 𝜋 satisfies the
Poincaré inequality with constant 𝐶_PI. Then, if KL(𝜇₀ ‖ 𝜋) ≤ 𝐾₀ and we choose step size
ℎ = √𝐾₀/(2𝛽√(𝑑𝑁)), then

\[ \| \bar\mu_{Nh} - \pi \|_{\mathrm{TV}}^2 := \Bigl\| \frac{1}{Nh} \int_0^{Nh} \mu_t \, dt - \pi \Bigr\|_{\mathrm{TV}}^2 \le \frac{2\, C_{\mathrm{PI}}\, \beta \sqrt{d K_0}}{\sqrt{N}} . \]

Bibliographical Notes
Exercises
Exercise 11.1 (Fisher information for mixtures of Gaussians)
Prove Proposition 11.1.1.
Bibliography

[AB15] David Alonso-Gutiérrez and Jesús Bastero. Approaching the Kannan-Lovász-Simonovits
and variance conjectures. Vol. 2131. Lecture Notes in Mathematics.
Springer, Cham, 2015, pp. x+148.
[AC21] Kwangjun Ahn and Sinho Chewi. “Efficient constrained sampling via the
mirror-Langevin algorithm”. In: Advances in Neural Information Processing
Systems. Ed. by M. Ranzato, A. Beygelzimer, K. Nguyen, P. S. Liang, J. W.
Vaughan, and Y. Dauphin. Vol. 34. Curran Associates, Inc., 2021, pp. 28405–
28418.
[AGS08] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows in metric
spaces and in the space of probability measures. Second. Lectures in Mathe-
matics ETH Zürich. Birkhäuser Verlag, Basel, 2008, pp. x+334.
[AGS15] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. “Bakry–Émery curvature-
dimension condition and Riemannian Ricci curvature bounds”. In: Ann.
Probab. 43.1 (2015), pp. 339–404.
[Bal+22] Krishna Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, and
Matthew Zhang. “Towards a theory of non-log-concave sampling: first-
order stationarity guarantees for Langevin Monte Carlo”. In: Proceedings of
Thirty Fifth Conference on Learning Theory. Ed. by Po-Ling Loh and Maxim
Raginsky. Vol. 178. Proceedings of Machine Learning Research. PMLR, Feb.
2022, pp. 2896–2923.


[Bar+18] Jean-Baptiste Bardet, Nathaël Gozlan, Florent Malrieu, and Pierre-André Zitt.
“Functional inequalities for Gaussian convolutions of compactly supported
measures: explicit bounds and dimension dependence”. In: Bernoulli 24.1
(2018), pp. 333–353.
[BB18] Adrien Blanchet and Jérôme Bolte. “A family of functional inequalities:
Łojasiewicz inequalities and displacement convex functions”. In: J. Funct.
Anal. 275.7 (2018), pp. 1650–1673.
[BB99] Jean-David Benamou and Yann Brenier. “A numerical method for the optimal
time-continuous mass transport problem and related problems”. In: Monge
Ampère equation: applications to geometry and optimization (Deerfield Beach,
FL, 1997). Vol. 226. Contemp. Math. Amer. Math. Soc., Providence, RI, 1999,
pp. 1–11.
[BBI01] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry.
Vol. 33. Graduate Studies in Mathematics. American Mathematical Society,
Providence, RI, 2001, pp. xiv+415.
[BBT17] Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. “A descent lemma
beyond Lipschitz gradient continuity: first-order methods revisited and ap-
plications”. In: Math. Oper. Res. 42.2 (2017), pp. 330–348.
[BC13] Franck Barthe and Dario Cordero-Erausquin. “Invariances in variance esti-
mates”. In: Proc. Lond. Math. Soc. (3) 106.1 (2013), pp. 33–64.
[BD01] Louis J. Billera and Persi Diaconis. “A geometric interpretation of the
Metropolis–Hastings algorithm”. In: Statist. Sci. 16.4 (2001), pp. 335–339.
[BEZ20] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. “Coupling and
convergence for Hamiltonian Monte Carlo”. In: Ann. Appl. Probab. 30.3 (2020),
pp. 1209–1250.
[BGG18] François Bolley, Ivan Gentil, and Arnaud Guillin. “Dimensional improvements
of the logarithmic Sobolev, Talagrand and Brascamp–Lieb inequalities”. In:
Ann. Probab. 46.1 (2018), pp. 261–301.
[BGL14] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of
Markov diffusion operators. Vol. 348. Grundlehren der Mathematischen Wis-
senschaften [Fundamental Principles of Mathematical Sciences]. Springer,
Cham, 2014, pp. xx+552.
[BH97] Serguei G. Bobkov and Christian Houdré. “Some connections between
isoperimetric and Sobolev-type inequalities”. In: Mem. Amer. Math. Soc.
129.616 (1997), pp. viii+111.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration in-
equalities. A nonasymptotic theory of independence, With a foreword by
Michel Ledoux. Oxford University Press, Oxford, 2013, pp. x+481.
[Bor75] Christer Borell. “The Brunn–Minkowski inequality in Gauss space”. In: Invent.
Math. 30.2 (1975), pp. 207–216.
[Bor85] Christer Borell. “Geometric bounds on the Ornstein–Uhlenbeck velocity
process”. In: Z. Wahrsch. Verw. Gebiete 70.1 (1985), pp. 1–13.
[Bub15] Sébastien Bubeck. “Convex optimization: algorithms and complexity”. In:
Foundations and Trends® in Machine Learning 8.3-4 (2015), pp. 231–357.
[BV05] François Bolley and Cédric Villani. “Weighted Csiszár–Kullback–Pinsker
inequalities and applications to transportation inequalities”. In: Ann. Fac. Sci.
Toulouse Math. (6) 14.3 (2005), pp. 331–352.
[Car92] Manfredo P. do Carmo. Riemannian geometry. Mathematics: Theory & Appli-
cations. Translated from the second Portuguese edition by Francis Flaherty.
Birkhäuser Boston, Inc., Boston, MA, 1992, pp. xiv+300.
[CBL22] Niladri S. Chatterji, Peter L. Bartlett, and Philip M. Long. “Oracle lower
bounds for stochastic gradient sampling algorithms”. In: Bernoulli 28.2 (2022),
pp. 1074–1092.
[CCN21] Hong-Bin Chen, Sinho Chewi, and Jonathan Niles-Weed. “Dimension-free
log-Sobolev inequalities for mixture distributions”. In: Journal of Functional
Analysis 281.11 (2021), p. 109236.
[CFM04] Dario Cordero-Erausquin, Matthieu Fradelizi, and Bernard Maurey. “The (B)
conjecture for the Gaussian measure of dilates of symmetric convex sets and
related problems”. In: J. Funct. Anal. 214.2 (2004), pp. 410–427.
[Cha04] Djalil Chafai. “Entropies, convexity, and functional inequalities: on Φ-
entropies and Φ-Sobolev inequalities”. In: J. Math. Kyoto Univ. 44.2 (2004),
pp. 325–363.
[Che+18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan.
“Underdamped Langevin MCMC: a non-asymptotic analysis”. In: Proceedings
of the 31st Conference on Learning Theory. Ed. by Sébastien Bubeck, Vianney
Perchet, and Philippe Rigollet. Vol. 75. Proceedings of Machine Learning
Research. PMLR, June 2018, pp. 300–323.
[Che+20a] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. “Fast mixing
of Metropolized Hamiltonian Monte Carlo: benefits of multi-step gradients”.
In: J. Mach. Learn. Res. 21 (2020), Paper No. 92, 71.
[Che+20b] Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, Philippe Rigollet, and
Austin J. Stromme. “Exponential ergodicity of mirror-Langevin diffusions”.
In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle,
M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin. Vol. 33. Curran Associates,
Inc., 2020, pp. 19573–19585.
[Che+21a] Sinho Chewi, Murat A. Erdogdu, Mufan B. Li, Ruoqi Shen, and Matthew
Zhang. “Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev”.
In: arXiv e-prints, arXiv:2112.12662 (2021).
[Che+21b] Sinho Chewi, Chen Lu, Kwangjun Ahn, Xiang Cheng, Thibaut Le Gouic, and
Philippe Rigollet. “Optimal dimension dependence of the Metropolis-adjusted
Langevin algorithm”. In: Proceedings of Thirty Fourth Conference on Learning
Theory. Ed. by Mikhail Belkin and Samory Kpotufe. Vol. 134. Proceedings of
Machine Learning Research. PMLR, 15–19 Aug 2021, pp. 1260–1300.
[Che+22a] Yongxin Chen, Sinho Chewi, Adil Salim, and Andre Wibisono. “Improved
analysis for a proximal algorithm for sampling”. In: Proceedings of Thirty
Fifth Conference on Learning Theory. Ed. by Po-Ling Loh and Maxim Ragin-
sky. Vol. 178. Proceedings of Machine Learning Research. PMLR, Feb. 2022,
pp. 2984–3014.
[Che+22b] Sinho Chewi, Patrik R. Gerber, Chen Lu, Thibaut Le Gouic, and Philippe
Rigollet. “The query complexity of sampling from strongly log-concave dis-
tributions in one dimension”. In: Proceedings of Thirty Fifth Conference on
Learning Theory. Ed. by Po-Ling Loh and Maxim Raginsky. Vol. 178. Proceed-
ings of Machine Learning Research. PMLR, Feb. 2022, pp. 2041–2059.
[Che21a] Yuansi Chen. “An almost constant lower bound of the isoperimetric coef-
ficient in the KLS conjecture”. In: Geom. Funct. Anal. 31.1 (2021), pp. 34–
61.
[Che21b] Sinho Chewi. “The entropic barrier is 𝑛-self-concordant”. In: arXiv e-prints,
arXiv:2112.10947 (2021).
[CLL19] Yu Cao, Jianfeng Lu, and Yulong Lu. “Exponential decay of Rényi divergence
under Fokker–Planck equations”. In: J. Stat. Phys. 176.5 (2019), pp. 1172–1184.
[CLW21] Yu Cao, Jianfeng Lu, and Lihan Wang. “Complexity of randomized algorithms
for underdamped Langevin dynamics”. In: Commun. Math. Sci. 19.7 (2021),
pp. 1827–1853.
[Cor02] Dario Cordero-Erausquin. “Some applications of mass transport to Gaussian-
type inequalities”. In: Arch. Ration. Mech. Anal. 161.3 (2002), pp. 257–269.
[Cor17] Dario Cordero-Erausquin. “Transport inequalities for log-concave measures,
quantitative forms, and applications”. In: Canad. J. Math. 69.3 (2017), pp. 481–
501.
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Second.
Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2006, pp. xxiv+748.
[CT93] Gong Chen and Marc Teboulle. “Convergence analysis of a proximal-like
minimization algorithm using Bregman functions”. In: SIAM J. Optim. 3.3
(1993), pp. 538–543.
[CV18] Ben Cousins and Santosh Vempala. “Gaussian cooling and $O^*(n^3)$ algorithms
for volume and Gaussian volume”. In: SIAM J. Comput. 47.3 (2018), pp. 1237–
1273.
[CV19] Zongchen Chen and Santosh S. Vempala. “Optimal convergence rate of
Hamiltonian Monte Carlo for strongly logconcave distributions”. In: Ap-
proximation, randomization, and combinatorial optimization. Algorithms and
techniques. Vol. 145. LIPIcs. Leibniz Int. Proc. Inform. Schloss Dagstuhl.
Leibniz-Zent. Inform., Wadern, 2019, Art. No. 64, 12.
[CV21] José A. Carrillo and Urbain Vaes. “Wasserstein stability estimates for
covariance-preconditioned Fokker–Planck equations”. In: Nonlinearity 34.4
(Feb. 2021), pp. 2275–2295.
[Dal17a] Arnak S. Dalalyan. “Further and stronger analogy between sampling and
optimization: Langevin Monte Carlo and gradient descent”. In: Proceedings
of the 2017 Conference on Learning Theory. Ed. by Satyen Kale and Ohad
Shamir. Vol. 65. Proceedings of Machine Learning Research. PMLR, July
2017, pp. 678–689.
[Dal17b] Arnak S. Dalalyan. “Theoretical guarantees for approximate sampling from
smooth and log-concave densities”. In: J. R. Stat. Soc. Ser. B. Stat. Methodol.
79.3 (2017), pp. 651–676.
[DM17] Alain Durmus and Éric Moulines. “Nonasymptotic convergence analysis
for the unadjusted Langevin algorithm”. In: Ann. Appl. Probab. 27.3 (2017),
pp. 1551–1587.
[DM19] Alain Durmus and Éric Moulines. “High-dimensional Bayesian inference via
the unadjusted Langevin algorithm”. In: Bernoulli 25.4A (2019), pp. 2854–
2882.
[DMM19] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. “Analysis of
Langevin Monte Carlo via convex optimization”. In: J. Mach. Learn. Res.
20 (2019), Paper No. 73, 46.
[DR20] Arnak S. Dalalyan and Lionel Riou-Durand. “On sampling from a log-concave
density using kinetic Langevin diffusions”. In: Bernoulli 26.3 (2020), pp. 1956–
1988.
[DT12] Arnak S. Dalalyan and Alexandre B. Tsybakov. “Sparse regression learning
by aggregation and Langevin Monte-Carlo”. In: J. Comput. System Sci. 78.5
(2012), pp. 1423–1443.
[Dwi+19] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. “Log-concave
sampling: Metropolis–Hastings algorithms are fast”. In: Journal of Machine
Learning Research 20.183 (2019), pp. 1–42.
[DZ10] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications.
Vol. 38. Stochastic Modelling and Applied Probability. Corrected reprint of
the second (1998) edition. Springer-Verlag, Berlin, 2010, pp. xvi+396.
[Ebe11] Andreas Eberle. “Reflection coupling and Wasserstein contractivity without
convexity”. In: C. R. Math. Acad. Sci. Paris 349.19-20 (2011), pp. 1101–1104.
[Ebe16] Andreas Eberle. “Reflection couplings and contraction rates for diffusions”.
In: Probab. Theory Related Fields 166.3-4 (2016), pp. 851–886.
[EGZ19] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. “Couplings and quan-
titative contraction rates for Langevin dynamics”. In: Ann. Probab. 47.4 (2019),
pp. 1982–2010.
[EHZ22] Murat A. Erdogdu, Rasa Hosseinzadeh, and Shunshi Zhang. “Convergence of
Langevin Monte Carlo in chi-squared and Rényi divergence”. In: Proceedings
of The 25th International Conference on Artificial Intelligence and Statistics.
Ed. by Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera. Vol. 151.
Proceedings of Machine Learning Research. PMLR, 28–30 Mar 2022, pp. 8151–
8175.
[Eld13] Ronen Eldan. “Thin shell implies spectral gap up to polylog via a stochastic
localization scheme”. In: Geom. Funct. Anal. 23.2 (2013), pp. 532–569.
[Eld15] Ronen Eldan. “A two-sided estimate for the Gaussian noise stability deficit”.
In: Invent. Math. 201.2 (2015), pp. 561–624.
[EM12] Matthias Erbar and Jan Maas. “Ricci curvature of finite Markov chains via
convexity of the entropy”. In: Arch. Ration. Mech. Anal. 206.3 (2012), pp. 997–
1038.
[Eva10] Lawrence C. Evans. Partial differential equations. Second. Vol. 19. Graduate
Studies in Mathematics. American Mathematical Society, Providence, RI,
2010, pp. xxii+749.
[FLO21] James Foster, Terry Lyons, and Harald Oberhauser. “The shifted ODE method
for underdamped Langevin MCMC”. In: arXiv e-prints, arXiv:2101.03446
(2021).
[Föl85] Hans Föllmer. “An entropy approach to the time reversal of diffusion pro-
cesses”. In: Stochastic differential systems (Marseille-Luminy, 1984). Vol. 69.
Lect. Notes Control Inf. Sci. Springer, Berlin, 1985, pp. 156–163.
[Fol99] Gerald B. Folland. Real analysis. Second. Pure and Applied Mathematics
(New York). Modern techniques and their applications, A Wiley-Interscience
Publication. John Wiley & Sons, Inc., New York, 1999, pp. xvi+386.
[FS18] Max Fathi and Yan Shu. “Curvature and transport inequalities for Markov
chains in discrete spaces”. In: Bernoulli 24.1 (2018), pp. 672–698.
[Gen+20] Ivan Gentil, Christian Léonard, Luigia Ripani, and Luca Tamanini. “An en-
tropic interpolation proof of the HWI inequality”. In: Stochastic Process. Appl.
130.2 (2020), pp. 907–923.
[GLL20] Rong Ge, Holden Lee, and Jianfeng Lu. “Estimating normalizing constants
for log-concave distributions: algorithms and lower bounds”. In: STOC ’20—
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Com-
puting. ACM, New York, [2020] ©2020, pp. 579–586.
[GLL22] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. “Private convex optimization
via exponential mechanism”. In: Proceedings of Thirty Fifth Conference on
Learning Theory. Ed. by Po-Ling Loh and Maxim Raginsky. Vol. 178. Proceed-
ings of Machine Learning Research. PMLR, Feb. 2022, pp. 1948–1989.
[GM11] Olivier Guédon and Emanuel Milman. “Interpolating thin-shell and sharp
large-deviation estimates for isotropic log-concave measures”. In: Geom.
Funct. Anal. 21.5 (2011), pp. 1043–1068.
[GM96] Wilfrid Gangbo and Robert J. McCann. “The geometry of optimal transporta-
tion”. In: Acta Math. 177.2 (1996), pp. 113–161.
[Goz+14] Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. “Dis-
placement convexity of entropy and related inequalities on graphs”. In:
Probab. Theory Related Fields 160.1-2 (2014), pp. 47–94.
[Gro07] Misha Gromov. Metric structures for Riemannian and non-Riemannian spaces.
English. Modern Birkhäuser Classics. Based on the 1981 French original,
With appendices by M. Katz, P. Pansu and S. Semmes, Translated from the
French by Sean Michael Bates. Birkhäuser Boston, Inc., Boston, MA, 2007,
pp. xx+585.
[GT20] Arun Ganesh and Kunal Talwar. “Faster differentially private samplers via
Rényi divergence analysis of discretized Langevin MCMC”. In: Advances
in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato,
R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020,
pp. 7222–7233.
[GV22] Khashayar Gatmiry and Santosh S. Vempala. “Convergence of the Rieman-
nian Langevin algorithm”. In: arXiv e-prints, arXiv:2204.10818 (2022).
[Han16] Ramon van Handel. Probability in high dimension. 2016.
[HBE20] Ye He, Krishnakumar Balasubramanian, and Murat A. Erdogdu. “On the
ergodicity, bias and asymptotic normality of randomized midpoint sampling
method”. In: Advances in Neural Information Processing Systems. Ed. by H.
Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin. Vol. 33. Curran
Associates, Inc., 2020, pp. 7366–7376.
[HG14] Matthew D. Hoffman and Andrew Gelman. “The no-U-turn sampler: adap-
tively setting path lengths in Hamiltonian Monte Carlo”. In: J. Mach. Learn.
Res. 15 (2014), pp. 1593–1623.
[Hsi+18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. “Mirrored
Langevin dynamics”. In: Advances in Neural Information Processing Systems.
Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett. Vol. 31. Curran Associates, Inc., 2018.
[Hsu02] Elton P. Hsu. Stochastic analysis on manifolds. Vol. 38. Graduate Stud-
ies in Mathematics. American Mathematical Society, Providence, RI, 2002,
pp. xiv+281.
[IM12] Marcus Isaksson and Elchanan Mossel. “Maximally stable Gaussian partitions
with discrete applications”. In: Israel J. Math. 189 (2012), pp. 347–396.
[Jia+21] He Jia, Aditi Laddha, Yin Tat Lee, and Santosh Vempala. “Reducing isotropy
and volume to KLS: an $O^*(n^3\psi^2)$ volume algorithm”. In: Proceedings of the
53rd Annual ACM SIGACT Symposium on Theory of Computing. New York,
NY, USA: Association for Computing Machinery, 2021, pp. 961–974.
[Jia21] Qijia Jiang. “Mirror Langevin Monte Carlo: the case under isoperimetry”.
In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato,
A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan. Vol. 34.
Curran Associates, Inc., 2021, pp. 715–725.
[JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto. “The variational formu-
lation of the Fokker–Planck equation”. In: SIAM J. Math. Anal. 29.1 (1998),
pp. 1–17.
[KKO18] Guy Kindler, Naomi Kirshner, and Ryan O’Donnell. “Gaussian noise sensi-
tivity and Fourier tails”. In: Israel J. Math. 225.1 (2018), pp. 71–109.
[KL22] Bo’az Klartag and Joseph Lehec. “Bourgain’s slicing problem and KLS
isoperimetry up to polylog”. In: arXiv e-prints, arXiv:2203.15551 (2022).
[Kla+16] Bo’az Klartag, Gady Kozma, Peter Ralli, and Prasad Tetali. “Discrete curvature
and abelian groups”. In: Canad. J. Math. 68.3 (2016), pp. 655–674.
[KLM06] Ravi Kannan, László Lovász, and Ravi Montenegro. “Blocking conductance
and mixing in random walks”. In: Combin. Probab. Comput. 15.4 (2006),
pp. 541–570.
[KLS95] Ravi Kannan, László Lovász, and Miklós Simonovits. “Isoperimetric problems
for convex bodies and a localization lemma”. In: Discrete Comput. Geom.
13.3-4 (1995), pp. 541–559.
[Kol14] Alexander V. Kolesnikov. “Hessian metrics, $CD(K,N)$-spaces, and optimal
transportation of log-concave measures”. In: Discrete Contin. Dyn. Syst. 34.4
(2014), pp. 1511–1532.
[LC22a] Jiaming Liang and Yongxin Chen. “A proximal algorithm for sampling”. In:
arXiv e-prints, arXiv:2202.13975 (2022).
[LC22b] Jiaming Liang and Yongxin Chen. “A proximal algorithm for sampling from
non-smooth potentials”. In: arXiv e-prints, arXiv:2110.04597 (2022).
[Le 16] Jean-François Le Gall. Brownian motion, martingales, and stochastic calculus.
French. Vol. 274. Graduate Texts in Mathematics. Springer, [Cham], 2016,
pp. xiii+273.
[Led00] Michel Ledoux. “The geometry of Markov diffusion generators”. In: vol. 9. 2.
Probability theory. 2000, pp. 305–366.
[Led01] Michel Ledoux. The concentration of measure phenomenon. Vol. 89. Mathemat-
ical Surveys and Monographs. American Mathematical Society, Providence,
RI, 2001, pp. x+181.
[Léo17] Christian Léonard. “On the convexity of the entropy along entropic interpo-
lations”. In: Measure theory in non-smooth spaces. Partial Differ. Equ. Meas.
Theory. De Gruyter Open, Warsaw, 2017, pp. 194–242.
[LFN18] Haihao Lu, Robert M. Freund, and Yurii Nesterov. “Relatively smooth convex
optimization by first-order methods, and applications”. In: SIAM J. Optim.
28.1 (2018), pp. 333–354.
[Li+19] Xuechen Li, Yi Wu, Lester Mackey, and Murat A. Erdogdu. “Stochastic
Runge–Kutta accelerates Langevin Monte Carlo and beyond”. In: Advances
in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle,
A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Asso-
ciates, Inc., 2019.
[Li+22] Ruilin Li, Molei Tao, Santosh S. Vempala, and Andre Wibisono. “The mirror
Langevin algorithm converges with vanishing bias”. In: Proceedings of The
33rd International Conference on Algorithmic Learning Theory. Ed. by Sanjoy
Dasgupta and Nika Haghtalab. Vol. 167. Proceedings of Machine Learning
Research. PMLR, 29 Mar–01 Apr 2022, pp. 718–742.
[LO00] Rafał Latała and Krzysztof Oleszkiewicz. “Between Sobolev and Poincaré”.
In: Geometric aspects of functional analysis. Vol. 1745. Lecture Notes in Math.
Springer, Berlin, 2000, pp. 147–168.
[LS88] Gregory F. Lawler and Alan D. Sokal. “Bounds on the $L^2$ spectrum for Markov
chains and Markov processes: a generalization of Cheeger’s inequality”. In:
Trans. Amer. Math. Soc. 309.2 (1988), pp. 557–580.
[LS93] László Lovász and Miklós Simonovits. “Random walks in a convex body and
an improved volume algorithm”. In: Random Structures Algorithms 4.4 (1993),
pp. 359–412.
[LST20] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Logsmooth gradient concen-
tration and tighter runtimes for Metropolized Hamiltonian Monte Carlo”.
In: Proceedings of Thirty Third Conference on Learning Theory. Ed. by Jacob
Abernethy and Shivani Agarwal. Vol. 125. Proceedings of Machine Learning
Research. PMLR, Sept. 2020, pp. 2565–2597.
[LST21a] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Lower bounds on Metropolized
sampling methods for well-conditioned distributions”. In: Advances in Neu-
ral Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y.
Dauphin, P.S. Liang, and J. Wortman Vaughan. Vol. 34. Curran Associates,
Inc., 2021, pp. 18812–18824.
[LST21b] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Structured logconcave sampling
with a restricted Gaussian oracle”. In: arXiv e-prints, arXiv:2010.03106 (2021).
[LST21c] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. “Structured logconcave sampling
with a restricted Gaussian oracle”. In: Proceedings of Thirty Fourth Conference
on Learning Theory. Ed. by Mikhail Belkin and Samory Kpotufe. Vol. 134.
Proceedings of Machine Learning Research. PMLR, 15–19 Aug 2021, pp. 2993–
3050.
[LV09] John Lott and Cédric Villani. “Ricci curvature for metric-measure spaces via
optimal transport”. In: Ann. of Math. (2) 169.3 (2009), pp. 903–991.
[LV17] Yin Tat Lee and Santosh S. Vempala. “Eldan’s stochastic localization and
the KLS hyperplane conjecture: an improved lower bound for expansion”.
In: 58th Annual IEEE Symposium on Foundations of Computer Science—FOCS
2017. IEEE Computer Soc., Los Alamitos, CA, 2017, pp. 998–1007.
[LY21] Yin Tat Lee and Man-Chung Yue. “Universal barrier is 𝑛-self-concordant”.
In: Math. Oper. Res. 46.3 (2021), pp. 1129–1148.
[LZT22] Ruilin Li, Hongyuan Zha, and Molei Tao. “Sqrt(𝑑) dimension dependence of
Langevin Monte Carlo”. In: International Conference on Learning Representa-
tions. 2022.
[Ma+21] Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L.
Bartlett, and Michael I. Jordan. “Is there an analog of Nesterov acceleration
for gradient-based MCMC?” In: Bernoulli 27.3 (2021), pp. 1942–1992.
[Maa11] Jan Maas. “Gradient flows of the entropy for finite Markov chains”. In: J.
Funct. Anal. 261.8 (2011), pp. 2250–2292.
[Mar96] Katalin Marton. “A measure concentration inequality for contracting Markov
chains”. In: Geom. Funct. Anal. 6.3 (1996), pp. 556–571.
[Mie13] Alexander Mielke. “Geodesic convexity of the relative entropy in reversible
Markov chains”. In: Calc. Var. Partial Differential Equations 48.1-2 (2013),
pp. 1–31.
[Mil09] Emanuel Milman. “On the role of convexity in isoperimetry, spectral gap
and concentration”. In: Invent. Math. 177.1 (2009), pp. 1–43.
[Mir17] Ilya Mironov. “Rényi differential privacy”. In: 2017 IEEE 30th Computer Secu-
rity Foundations Symposium (CSF). 2017, pp. 263–275.
[MN15] Elchanan Mossel and Joe Neeman. “Robust optimality of Gaussian noise
stability”. In: J. Eur. Math. Soc. (JEMS) 17.2 (2015), pp. 433–482.
[MS19] Oren Mangoubi and Aaron Smith. “Mixing of Hamiltonian Monte Carlo on
strongly log-concave distributions 2: numerical integrators”. In: Proceedings
of the Twenty-Second International Conference on Artificial Intelligence and
Statistics. Ed. by Kamalika Chaudhuri and Masashi Sugiyama. Vol. 89. Pro-
ceedings of Machine Learning Research. PMLR, 16–18 Apr 2019, pp. 586–
595.
[MV18] Oren Mangoubi and Nisheeth Vishnoi. “Dimensionally tight bounds for
second-order Hamiltonian Monte Carlo”. In: Advances in Neural Information
Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc., 2018.
[Nea11] Radford M. Neal. “MCMC using Hamiltonian dynamics”. In: Handbook of
Markov chain Monte Carlo. Chapman & Hall/CRC Handb. Mod. Stat. Methods.
CRC Press, Boca Raton, FL, 2011, pp. 113–162.
[Nes18] Yurii Nesterov. Lectures on convex optimization. Vol. 137. Springer Optimiza-
tion and Its Applications. Springer, Cham, 2018, pp. xxiii+589.
[NN94] Yurii Nesterov and Arkadii S. Nemirovskii. Interior-point polynomial algo-
rithms in convex programming. Vol. 13. SIAM Studies in Applied Mathematics.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA,
1994, pp. x+405.
[NW20] Richard Nickl and Sven Wang. “On polynomial-time computation of high-
dimensional posterior measures by Langevin-type algorithms”. In: arXiv
e-prints, arXiv:2009.05298 (2020).
[NY83] Arkadii S. Nemirovsky and David B. Yudin. Problem complexity and method
efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics.
Translated from the Russian and with a preface by E. R. Dawson. John Wiley
& Sons, Inc., New York, 1983, pp. xv+388.
[Oll07] Yann Ollivier. “Ricci curvature of metric spaces”. In: C. R. Math. Acad. Sci.
Paris 345.11 (2007), pp. 643–646.
[Oll09] Yann Ollivier. “Ricci curvature of Markov chains on metric spaces”. In: J.
Funct. Anal. 256.3 (2009), pp. 810–864.
[Ott01] Felix Otto. “The geometry of dissipative evolution equations: the porous
medium equation”. In: Comm. Partial Differential Equations 26.1-2 (2001),
pp. 101–174.
[OV00] Felix Otto and Cédric Villani. “Generalization of an inequality by Talagrand
and links with the logarithmic Sobolev inequality”. In: J. Funct. Anal. 173.2
(2000), pp. 361–400.
[OV12] Yann Ollivier and Cédric Villani. “A curved Brunn–Minkowski inequality
on the discrete hypercube, or: what is the Ricci curvature of the discrete
hypercube?” In: SIAM J. Discrete Math. 26.3 (2012), pp. 983–996.
[RS15] Firas Rassoul-Agha and Timo Seppäläinen. A course on large deviations with
an introduction to Gibbs measures. Vol. 162. Graduate Studies in Mathematics.
American Mathematical Society, Providence, RI, 2015, pp. xiv+318.
[RV08] Luis Rademacher and Santosh Vempala. “Dispersion of mass and the com-
plexity of randomized geometric algorithms”. In: Adv. Math. 219.3 (2008),
pp. 1037–1069.
[San15] Filippo Santambrogio. Optimal transport for applied mathematicians. Vol. 87.
Progress in Nonlinear Differential Equations and their Applications. Cal-
culus of variations, PDEs, and modeling. Birkhäuser/Springer, Cham, 2015,
pp. xxvii+353.
[SC74] Vladimir N. Sudakov and Boris S. Cirel’son. “Extremal properties of half-
spaces for spherically invariant measures”. In: Zap. Naučn. Sem. Leningrad.
Otdel. Mat. Inst. Steklov. (LOMI) 41 (1974). Problems in the theory of proba-
bility distributions, II, pp. 14–24, 165.
[SL19] Ruoqi Shen and Yin Tat Lee. “The randomized midpoint method for log-
concave sampling”. In: Advances in Neural Information Processing Systems.
Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and
R. Garnett. Vol. 32. Curran Associates, Inc., 2019.
[Ste01] J. Michael Steele. Stochastic calculus and financial applications. Vol. 45. Ap-
plications of Mathematics (New York). Springer-Verlag, New York, 2001,
pp. x+300.
[Stu06a] Karl-Theodor Sturm. “On the geometry of metric measure spaces. I”. In: Acta
Math. 196.1 (2006), pp. 65–131.
[Stu06b] Karl-Theodor Sturm. “On the geometry of metric measure spaces. II”. In: Acta
Math. 196.1 (2006), pp. 133–177.
[Tal91] Michel Talagrand. “A new isoperimetric inequality and the concentration of
measure phenomenon”. In: Geometric aspects of functional analysis (1989–90).
Vol. 1469. Lecture Notes in Math. Springer, Berlin, 1991, pp. 94–124.
[Tal96] Michel Talagrand. “A new look at independence”. In: Ann. Probab. 24.1 (1996),
pp. 1–34.
[Ver18] Roman Vershynin. High-dimensional probability. Vol. 47. Cambridge Series in
Statistical and Probabilistic Mathematics. An introduction with applications
in data science, With a foreword by Sara van de Geer. Cambridge University
Press, Cambridge, 2018, pp. xiv+284.
[Vil03] Cédric Villani. Topics in optimal transportation. Vol. 58. Graduate Stud-
ies in Mathematics. American Mathematical Society, Providence, RI, 2003,
pp. xvi+370.
[Vil09] Cédric Villani. Optimal transport. Vol. 338. Grundlehren der Mathematischen
Wissenschaften [Fundamental Principles of Mathematical Sciences]. Old and
new. Springer-Verlag, Berlin, 2009, pp. xxii+973.
[VW19] Santosh Vempala and Andre Wibisono. “Rapid convergence of the unadjusted
Langevin algorithm: isoperimetry suffices”. In: Advances in Neural Informa-
tion Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer,
F. d’Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., 2019, pp. 8094–
8106.
[Wib18] Andre Wibisono. “Sampling as optimization in the space of measures: the
Langevin dynamics as a composite optimization problem”. In: Proceedings of
the 31st Conference on Learning Theory. Ed. by Sébastien Bubeck, Vianney
Perchet, and Philippe Rigollet. Vol. 75. Proceedings of Machine Learning
Research. PMLR, June 2018, pp. 2093–3027.
[WSC21] Keru Wu, Scott Schmidler, and Yuansi Chen. “Minimax mixing time of the
Metropolis-adjusted Langevin algorithm for log-concave sampling”. In: arXiv
e-prints, arXiv:2109.13055 (2021).
[Zha+20] Kelvin S. Zhang, Gabriel Peyré, Jalal Fadili, and Marcelo Pereyra. “Wasserstein
control of mirror Langevin Monte Carlo”. In: Proceedings of Thirty Third
Conference on Learning Theory. Ed. by Jacob Abernethy and Shivani Agarwal.
Vol. 125. Proceedings of Machine Learning Research. PMLR, Sept. 2020,
pp. 3814–3841.
Index

𝑐-concavity, 60
𝑐-conjugate potentials, 60
𝑓-divergence, 53, 224

absolute continuity, 38
Alexandrov curvature, 114
Azuma–Hoeffding inequality, 132

Bakry–Émery theorem, 29
Benamou–Brenier formula, 61
blow-up, 88
Bobkov–Götze theorem, 93
Bochner identity, 68
Bonnet–Myers diameter bound, 117
bounded differences inequality, 132
Brascamp–Lieb inequality, 69, 252
Bregman divergence, 71, 252
Bregman proximal lemma, 255
Bregman transport cost, 71, 256
Bregman transport inequality, 71, 127, 258
Brenier potential, 36
Brenier’s theorem, 33
Bregman transport costs, 257
Brownian motion, 4

Caffarelli’s contraction theorem, 79
carré du champ, 23
    iterated operator, 29, 68
Cheeger’s inequality, 203
chi-squared divergence, 26
coarea inequality, 101
concentration function, 89, 97
conductance, 203
    𝑠-conductance, 207
continuity equation, 39
covariant derivative, 109
curvature
    Gaussian, 111
    Ricci, 112
        coarse, 123
        synthetic, 119
    Riemann, 111
    scalar, 112
    sectional, 112
curvature-dimension condition, 29, 68, 116, 118

data-processing inequality, 54
diffusion coefficient, 9
diffusion semigroup, 29, 73
Dirichlet energy, 23, 202
displacement interpolation, 44
divergence, 110
Donsker–Varadhan variational principle, 54
Doob martingale, 56
Doob’s maximal inequality, 17
    $L^p$ version, 18
drift coefficient, 9

Efron–Stein inequality, 57
elementary process, 5, 15
entropy, 242
Euler–Maruyama discretization, 143
exponential map, 44

Fano’s inequality, 242
feasible start, 199
finite variation, 135
first variation, 45
Fisher information, 28, 74
flow map, 176
Fokker–Planck equation, 21
friction, 180

generalized geodesic, 258
geodesic, 42, 110, 114
geodesic convexity, 44
Girsanov’s theorem, 156
Gozlan’s theorem, 96
Grönwall’s lemma, 11
gradient, 108
Gromov–Hausdorff convergence, 119
    measured, 120

Hörmander’s $L^2$ method, 69
Hamilton’s equations of motion, 176
Hamiltonian, 176
Hamiltonian Monte Carlo (HMC)
    ideal, 176
    Metropolized (MHMC), 196
Hamilton–Jacobi equation, 61
Hellinger distance, 274
Herbst argument, 91
Hoeffding’s lemma, 131
Holley–Stroock perturbation, 77
Hopf–Lax semigroup, 61
hypercontractivity, 128, 167

infinitesimal generator, 19, 201
isoperimetric profile, 99
isoperimetry
    Cheeger, 101, 205
    Gaussian, 98, 107, 212
    sphere, 98, 132
Itô integral, 4, 16
Itô isometry, 7
Itô process, 9
Itô’s formula, 10, 139

Jacobi equation, 121

Kannan–Lovász–Simonovits (KLS) conjecture, 125
Kantorovich problem, 31
Kolmogorov’s backward equation, 20
Kolmogorov’s forward equation, 21
Kullback–Leibler (KL) divergence, 27
    chain rule, 54

Langevin diffusion, 3
Langevin Monte Carlo (LMC), 143
Laplace–Beltrami operator, 111
Latała–Oleszkiewicz inequality (LOI), 125
lazy chain, 199
Le Cam’s inequality, 274
leapfrog integrator, 196
length, 113
Levi–Civita connection, 109
Lie bracket, 111
local martingale, 8
localization, 7
localizing sequence, 7
log-Sobolev inequality (LSI), 27, 65, 148, 163
    defective, 82
    modified, 122, 202
logarithmic map, 44

manifold, 42
    Hessian, 109
Markov semigroup, 19
martingale, 5
Marton’s tensorization, 85
McCann’s interpolation, 44
mean squared analysis, 161
mesh, 135
metastability, 269
metric derivative, 38
metric geometry, 113
Metropolis-adjusted Langevin algorithm (MALA), 196
Metropolis–Hastings (MH) filter, 193
Metropolized random walk (MRW), 196
Minkowski content, 99
mirror descent, 249
mirror Langevin, 249
mirror Langevin Monte Carlo (MLMC), 254
mirror map, 249
model space, 114
Monge problem, 31
Monge–Ampère equation, 126
mutual information, 242

Newton Langevin diffusion, 251
no-U-turn sampler (NUTS), 186

optimal transport, 30
    dual, 32
    fundamental theorem, 33
optimal transport plan, 30
Orlicz function, 90
Orlicz norm, 90
Ornstein–Uhlenbeck (OU) process, 58, 189
Otto calculus, 47
Otto–Villani theorem, 50

parallel transport, 110
Picard iteration, 13
Pinsker’s inequality, 53, 131
Poincaré inequality (PI), 26, 65, 202
    $L^p$–$L^q$, 103
    local, 72
    mirror, 253
Poisson bracket, 188
Poisson equation, 70, 126
Polyak–Łojasiewicz (PL) inequality, 49, 235
progressive process, 15
proximal sampler, 219

quadratic variation, 137

Rényi divergence, 75, 163
randomized midpoint discretization, 171
reflection coupling, 161
rejection sampling, 191
relative convexity, 252
relative smoothness, 252
restricted Gaussian oracle (RGO), 220
reversibility, 22, 194
Ricci curvature, 68
Riemannian metric, 42, 108
Rothaus lemma, 82

Sanov’s theorem, 96
self-concordance, 256
semimartingale, 137
shifted ODE discretization, 186
Sobolev norm, 127
    inverse, 127
square integrable process, 15
stochastic calculus, 3
stochastic differential equation (SDE), 11
stopping time, 7
submartingale, 17
symplectic integrator, 196
synchronous coupling, 161

Talagrand’s T1 inequality, 66
Talagrand’s T2 inequality, 50, 65
tangent bundle, 108
tangent space, 42, 108
time reversal, 226
total variation, 136
total variation (TV) distance, 52, 136
total variation norm, 136

underdamped Langevin diffusion, 180

vector field, 109
volume measure, 111

warm start, 199
Wasserstein gradient flow, 46
Wasserstein metric, 31, 60
