Lecture 03
Key concepts:
• Langevin diffusion
• Convergence of ULA (the note is included here, but this topic is only discussed in the 4th lecture)
3.1 Introduction
In the previous lecture, we saw that the corners of a convex body cause difficulties for the Ball walk sampling algorithm: Ball walk has to choose a small step size, since otherwise it would have a close-to-zero acceptance rate in many places. In practice, however, we do not always encounter distributions that are as nonsmooth as the uniform distribution on a convex body. In this lecture, we avoid the nonsmoothness issue entirely by making the simplifying assumption that we are dealing with smooth densities of the form
$$\mu \propto e^{-f},$$
where $f$ is twice continuously differentiable. We would like to know whether there exist sampling algorithms better than Ball walk.
3. (independent increments) For any $k \in \mathbb{N}$ and $\{t_i\}_{i=0}^{k}$ with $t_0 = 0 < t_1 < \cdots < t_k$, the random variables $B_{t_{i+1}} - B_{t_i}$, $0 \leq i \leq k-1$, are mutually independent.
4. (Gaussian increments) For any $0 \leq s < t$, $B_t - B_s$ is distributed as $N(0, (t-s)I_n)$.
Intuitively, we may think of $dB_t$ in Eq. (3.1) as Gaussian noise with mean $0$ and variance $dt$. Then Eq. (3.1) may be thought of as a noisy gradient descent, with a deterministic gradient step $-\nabla f(X_t)\,dt$ and a diffusion component $\sqrt{2}\,dB_t$. For a rigorous treatment of Brownian motion and stochastic calculus, readers are referred to [Pro04].
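As a quick illustration, here is a minimal numpy sketch of simulating a Brownian path on a time grid directly from properties 3 and 4 above; the function name brownian_path and the grid parameters are arbitrary choices, not notation from these notes.

```python
import numpy as np

def brownian_path(T, num_steps, n, rng=None):
    """Simulate B_t on a uniform grid of [0, T]: increments over disjoint
    intervals are independent and distributed as N(0, dt * I_n)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / num_steps
    increments = np.sqrt(dt) * rng.standard_normal((num_steps, n))  # B_{t+dt} - B_t
    # Row i of the output is B_{i*dt}, with B_0 = 0.
    return np.vstack([np.zeros((1, n)), np.cumsum(increments, axis=0)])
```

In the same spirit, over a small step $h$ the increment $B_{t+h} - B_t \sim N(0, hI_n)$ is exactly the noise injected per step when Eq. (3.1) is discretized.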
• Accept-reject step: set
$$X_{k+1} = \begin{cases} Z_{k+1} & \text{with probability } \min\Big\{1,\ \frac{\mu(Z_{k+1})\, P_{Z_{k+1}}(X_k)}{\mu(X_k)\, P_{X_k}(Z_{k+1})}\Big\},\\ X_k & \text{otherwise.}\end{cases}$$
Note that, conditioned on $X_k$, the proposal step boils down to drawing a Gaussian with mean $X_k - h\nabla f(X_k)$ and covariance $2hI_n$. Hence, the proposal kernel has an explicit form:
$$P_z(x) = \frac{1}{(2\pi \cdot 2h)^{n/2}} \exp\left(-\frac{\|x - (z - h\nabla f(z))\|_2^2}{4h}\right).$$
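For concreteness, here is a minimal sketch of one proposal plus accept-reject step with this Metropolis-adjusted kernel, assuming Python callables f and grad_f for $f$ and $\nabla f$; the normalizing factor $(2\pi \cdot 2h)^{-n/2}$ cancels in the acceptance ratio and is dropped.

```python
import numpy as np

def log_q(x_to, x_from, grad_f, h):
    # log P_{x_from}(x_to) up to an additive constant, with P_z = N(z - h*grad_f(z), 2h*I_n)
    diff = x_to - (x_from - h * grad_f(x_from))
    return -np.dot(diff, diff) / (4.0 * h)

def mala_step(x, f, grad_f, h, rng):
    # Proposal: Z ~ N(x - h*grad_f(x), 2h*I_n)
    z = x - h * grad_f(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    # Accept with probability min{1, mu(z) P_z(x) / (mu(x) P_x(z))}, mu ∝ exp(-f), in log space
    log_ratio = (-f(z) + log_q(x, z, grad_f, h)) - (-f(x) + log_q(z, x, grad_f, h))
    return z if np.log(rng.uniform()) < log_ratio else x
```

Working in log space avoids numerical overflow when $f$ takes large values.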
• Accept-reject step: set
$$X_{k+1} = \begin{cases} Z_{k+1} & \text{with probability } \min\Big\{1,\ \frac{\mu(Z_{k+1})\, P^{\mathrm{MRW}}_{Z_{k+1}}(X_k)}{\mu(X_k)\, P^{\mathrm{MRW}}_{X_k}(Z_{k+1})}\Big\},\\ X_k & \text{otherwise.}\end{cases}$$
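For comparison, here is an analogous sketch of one MRW step, under the assumption of the standard random-walk proposal $Z_{k+1} \sim N(X_k, 2hI_n)$ (the exact step-size convention is not fixed here); since this proposal is symmetric, the proposal densities cancel in the accept-reject ratio.

```python
import numpy as np

def mrw_step(x, f, h, rng):
    # Symmetric proposal: Z ~ N(x, 2h*I_n), so P^MRW_z(x) = P^MRW_x(z) cancels in the ratio
    z = x + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    # Accept with probability min{1, mu(z)/mu(x)} = min{1, exp(f(x) - f(z))}
    return z if np.log(rng.uniform()) < f(x) - f(z) else x
```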
where Bt is the Brownian motion in R. We assume the following fact without proving
it.
Fokker-Planck equation. Let {Xt }t≥0 be a drift-diffusion process following SDE (3.3),
starting from X0 ∼ µ0 . Then for all t ≥ 0, denoting the law of Xt by µt , we have
$$\frac{\partial}{\partial t}\mu_t(x) = -\frac{\partial}{\partial x}\big[a(x, t)\,\mu_t(x)\big] + \frac{\partial^2}{\partial x^2}\big[D(x, t)\,\mu_t(x)\big], \qquad \forall x \in \mathbb{R}, \tag{3.4}$$
where $D(x, t) = b(x, t)^2/2$. The above equation is called the Fokker-Planck equation
associated to the drift-diffusion process {Xt }t≥0 . The Fokker-Planck equation describes
the time evolution of the probability density function via a partial differential equa-
tion (PDE). Unlike Eq. (3.3), the Fokker-Planck equation in Eq. (3.4) is completely
deterministic.
In general, there are two main approaches to interpreting a drift-diffusion process as in Eq. (3.3), as illustrated in Figure 3.1. The first approach is the pathwise view: given a random draw of the Brownian motion $B_t$, Eq. (3.3) becomes an ordinary differential equation and generates a continuous path in $\mathbb{R}$. Each random draw of the Brownian motion generates a path, and the collection of all paths describes the SDE. The second approach is the density evolution view: since we do not really care about the identity of each path, we can focus on the evolution of the density (law) of $X_t$ at any time $t > 0$. The Fokker-Planck equation enables this second approach via a PDE. The two approaches are complementary and are related via Markov semigroup theory and Kolmogorov's forward and backward equations. For a detailed exposition and a proof of the Fokker-Planck equation, see Chapter 1.2 of [Che23].
Example 1 (heat equation). Taking a = 0, b = 1 in Eq. (3.4), we obtain the heat
equation
$$\frac{\partial}{\partial t}\mu_t = \frac{1}{2}\,\frac{\partial^2}{\partial x^2}\mu_t.$$
Starting from $X_0 = 0$ (i.e., $\mu_0 = \delta_0$), the PDE has the closed-form solution
$$\mu_t(x) = \frac{1}{\sqrt{2\pi t}} \exp\left(-\frac{x^2}{2t}\right).$$
The above density is exactly the law of $X_t$ defined via $dX_t = dB_t$.
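Indeed, one can verify directly that this density solves the heat equation:
$$\frac{\partial}{\partial t}\mu_t = \Big(\frac{x^2}{2t^2} - \frac{1}{2t}\Big)\mu_t, \qquad \frac{\partial}{\partial x}\mu_t = -\frac{x}{t}\,\mu_t, \qquad \frac{\partial^2}{\partial x^2}\mu_t = \Big(\frac{x^2}{t^2} - \frac{1}{t}\Big)\mu_t,$$
so that $\frac{1}{2}\frac{\partial^2}{\partial x^2}\mu_t = \frac{\partial}{\partial t}\mu_t$.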
Finally, one can also introduce a higher-dimensional formulation of the Fokker-Planck equation. Consider a drift-diffusion process $\{X_t\}_{t \geq 0}$ on $\mathbb{R}^n$ driven by a drift term $a : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$ and a diffusion term $b : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^{n \times m}$, characterized by the following SDE
$$dX_t = a(X_t, t)\,dt + b(X_t, t)\,dB_t, \tag{3.5}$$
where $B_t$ is the Brownian motion in $\mathbb{R}^m$. The corresponding Fokker-Planck equation reads
$$\frac{\partial}{\partial t}\mu_t(x) = -\nabla \cdot \big(a(x, t)\,\mu_t(x)\big) + \sum_{i,j=1}^{n} \frac{\partial^2}{\partial x_i \partial x_j}\big[D_{ij}(x, t)\,\mu_t(x)\big],$$
where $D = \frac{1}{2}bb^\top$.
Figure 3.1. Two interpretations of a drift-diffusion process. Left: pathwise view (sample paths of $X_t$ started from $X_0$). Right: density evolution view (the laws $\mu_t$, $\mu_{t+\delta}$, ...).
Example 2 (Langevin diffusion). Taking $a(x, t) = -\nabla f(x)$ and $b(x, t) = \sqrt{2}\,I_n$, the SDE corresponds to the Langevin diffusion
$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t.$$
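Plugging these choices into the Fokker-Planck equation above (with $D = \frac{1}{2}bb^\top = I_n$), the law $\mu_t$ of $X_t$ evolves as
$$\frac{\partial}{\partial t}\mu_t = \nabla \cdot (\mu_t \nabla f) + \Delta \mu_t,$$
where $\Delta g := \sum_{i=1}^n \partial_i^2 g$ denotes the Laplacian of a twice-differentiable function $g$.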
Note that the Laplacian is also the divergence of the gradient $\nabla g$, i.e., $\Delta g = \nabla \cdot \nabla g$.
To see that $\mu \propto e^{-f}$ is stationary for the Langevin diffusion, write $\mu = c e^{-f}$ with $c$ the normalizing constant and plug $\mu_t = \mu$ into the right-hand side above:
$$\nabla \cdot (\mu \nabla f) + \Delta \mu = \sum_{i=1}^n \partial_i(\mu\, \partial_i f) + \sum_{i=1}^n \partial_i(\partial_i \mu) \overset{(i)}{=} \sum_{i=1}^n \partial_i(\mu\, \partial_i f) - \sum_{i=1}^n \partial_i(\mu\, \partial_i f) = 0,$$
where $\partial_i$ is used as a shorthand for $\frac{\partial}{\partial x_i}$, and (i) used the assumption $\mu = c e^{-f}$, which gives $\partial_i \mu = -\mu\, \partial_i f$. So the two terms sum to $0$, and this proves that $\mu$ is a stationary measure.
The Wasserstein-2 distance between two probability measures $\mu$ and $\nu$ is defined as
$$W_2^2(\mu, \nu) := \inf_{\gamma \in \mathcal{C}(\mu, \nu)} \int \|x - y\|_2^2 \, d\gamma(x, y),$$
where $\mathcal{C}(\mu, \nu)$ is the set of all couplings of $\mu$ and $\nu$. We say $\gamma$ is a coupling of $\mu$ and $\nu$ if its marginal on the first variable is $\mu$ and its marginal on the second is $\nu$.
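When $f$ is $m$-strongly convex, the Langevin diffusion contracts exponentially fast towards $\mu$ in this distance: for all $t \geq 0$,
$$W_2(\mu_t, \mu) \leq e^{-mt}\, W_2(\mu_0, \mu).$$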
Proof. The main proof strategy is to construct a coupling between $\mu_t$ and $\mu$ by taking advantage of the Langevin SDE (3.1), and then to bound the expected squared distance between the two coupled processes. We construct a coupling as follows. Let $\gamma_0$ be an optimal coupling of $(\mu_0, \mu)$ which achieves $W_2^2(\mu_0, \mu)$. Draw $(X_0, X_0^*) \sim \gamma_0$. Let $X_0$ and $X_0^*$ evolve through the Langevin SDE with the same copy of Brownian motion $\{B_s\}_{s \geq 0}$. Let $\gamma_t$ denote the law of the resulting pair $(X_t, X_t^*)$. Then $\gamma_t$ is a coupling of $(\mu_t, \mu)$ because:
• Marginally, $X_t$ just follows the Langevin SDE started from $X_0 \sim \mu_0$, so the law of $X_t$ is $\mu_t$.
• Similarly, $X_t^*$ follows the Langevin SDE started from $X_0^* \sim \mu$; since $\mu$ is stationary, the law of $X_t^*$ remains $\mu$.
(i) uses the fact that $X_t$ and $X_t^*$ share the same Brownian motion. (ii) uses the mean value theorem with $\omega_t := (1-t)x + ty$. (iii) uses the mean value theorem for the function $t \mapsto \langle y - x, \nabla f(\omega_t) - \nabla f(\omega_0)\rangle$, whose derivative is $\langle y - x, \nabla^2 f(\omega_t)(y - x)\rangle$; so there exists $\tau \in [0, 1]$ such that (iii) holds. The last step follows from the $m$-strong convexity of $f$ (i.e., $\nabla^2 f \succeq m I_n$).
Solving the ODE inequality (3.8) or applying Grönwall’s inequality, we obtain
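$$W_2^2(\mu_t, \mu) \leq e^{-2mt}\, W_2^2(\mu_0, \mu), \qquad \text{i.e.,} \qquad W_2(\mu_t, \mu) \leq e^{-mt}\, W_2(\mu_0, \mu).$$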
$$\mathbb{E}_{X \sim \mu}\|X - x^*\|_2^2 = c\int \|x - x^*\|_2^2 \exp(-f(x))\,dx \overset{(i)}{\leq} \frac{2c}{m} \int \langle \nabla f(x), x - x^* \rangle \exp(-f(x))\,dx \overset{(ii)}{=} \frac{2c}{m}\int \operatorname{trace}(I_n) \exp(-f(x))\,dx = \frac{2n}{m}.$$
(i) follows from the strong convexity of $f$: $\langle \nabla f(x) - \nabla f(x^*), x - x^* \rangle \geq \frac{m}{2}\|x - x^*\|_2^2$ and $\nabla f(x^*) = 0$. (ii) follows from integration by parts: for a differentiable function $g : \mathbb{R}^n \to \mathbb{R}$ and a vector field $v : \mathbb{R}^n \to \mathbb{R}^n$ with sufficiently fast decay at infinity, we have
$$\int \langle v(x), \nabla g(x) \rangle \, dx = -\int g(x)\,(\nabla \cdot v)(x)\,dx. \tag{3.9}$$
Here $\nabla g$ is integrated (to $g$) and $v$ is differentiated (to $\nabla \cdot v$); the boundary term is $0$ because of the decay at infinity.
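As a sanity check, consider the Gaussian case $f(x) = \frac{m}{2}\|x\|_2^2$, so that $\mu = \mathcal{N}(0, \frac{1}{m}I_n)$ and $x^* = 0$: there $\mathbb{E}_{X \sim \mu}\|X - x^*\|_2^2 = \frac{n}{m}$, consistent with the bound $\frac{2n}{m}$.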
The $\chi^2$-divergence upper bounds the total variation distance, by the Cauchy-Schwarz inequality: $\mathrm{TV}(\nu, \mu) \leq \frac{1}{2}\sqrt{\chi^2(\nu \,\|\, \mu)}$.
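We say that $\mu$ satisfies a Poincaré inequality with constant $C_{\mathrm{PI}} > 0$ if, for all smooth functions $g : \mathbb{R}^n \to \mathbb{R}$,
$$\mathrm{Var}_\mu[g] \leq C_{\mathrm{PI}}\, \mathbb{E}_\mu\big[\|\nabla g\|_2^2\big].$$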
Here $\mathbb{E}_\mu$ and $\mathrm{Var}_\mu$ denote the expectation and the variance with respect to $\mu$, respectively:
$$\mathbb{E}_\mu[g] := \int g(x)\,\mu(x)\,dx, \qquad \mathrm{Var}_\mu[g] := \mathbb{E}_\mu\big[(g - \mathbb{E}_\mu[g])^2\big].$$
Figure 3.2. Illustration of a large Poincaré constant. $\mu$ is bimodal, with one mode in region A and the other in region C, and a bottleneck in region B where the density is close to $0$. When the two modes are far apart, one can design a $g$ whose gradient is small (its transition hides inside region B) but whose variance is large. In this case, the Poincaré constant of $\mu$ has to be large.
(iii) follows from integration by parts (3.9). (iv) follows from the Poincaré inequality. Solving the resulting differential inequality for the $\chi^2$-divergence, or applying Grönwall's inequality, we obtain the desired result.
Let µk denote the law of Xk . Then we have the following mixing time result.
Theorem 3.3.1. Assume that the target measure $\mu \propto \exp(-f)$ satisfies $m I_n \preceq \nabla^2 f \preceq L I_n$, and let $\kappa := \frac{L}{m}$. Then, given $h \lesssim \frac{1}{L\kappa}$, for $K \geq 1$,
$$W_2(\mu_K, \mu) \leq \exp\Big(-\frac{mhK}{2}\Big)\, W_2(\mu_0, \mu) + c\, h^{1/2} n^{1/2} \kappa,$$
where $c > 0$ is a universal constant.
A few remarks are in order. To obtain $W_2(\mu_K, \mu) \lesssim \epsilon/\sqrt{m}$, it suffices to take the step size
$$h \lesssim \frac{\epsilon^2}{L\kappa n},$$
and the number of steps
$$K \gtrsim \frac{\kappa^2 n}{\epsilon^2} \log\Big(\frac{\sqrt{m}\, W_2(\mu_0, \mu)}{\epsilon}\Big).$$
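As a concrete illustration, here is a minimal sketch of the ULA recursion itself (the Euler discretization of Eq. (3.1)), assuming a gradient oracle grad_f; the step size h and iteration count K would be chosen according to the remarks above.

```python
import numpy as np

def ula(grad_f, x0, h, K, rng=None):
    # Unadjusted Langevin Algorithm: x_{k+1} = x_k - h*grad_f(x_k) + sqrt(2h)*xi_k, xi_k ~ N(0, I_n)
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(K):
        x = x - h * grad_f(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return x

# Example: approximate sampling from N(0, I_n), i.e. f(x) = ||x||^2 / 2 and grad_f(x) = x.
sample = ula(grad_f=lambda x: x, x0=np.zeros(5), h=0.01, K=2000)
```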
Proof sketch. Since ULA is the Euler discretization of the continuous Langevin diffusion in Eq. (3.1), it is natural to analyze the convergence of ULA by comparing it to the continuous Langevin diffusion. We saw in Section 3.2.3 that the continuous Langevin diffusion converges exponentially fast to the target measure $\mu$, with a rate that depends on the strong log-concavity parameter $m$. It remains to analyze the discretization error and how it accumulates as a function of the total number of steps $K$.
Given the above intuition, the main problem becomes how to write $W_2(\mu_{k+1}, \mu)$ as a function of $W_2(\mu_k, \mu)$. In other words, we want to upper bound $\mathbb{E}\big\|X_{k+1} - X_{(k+1)h}\big\|_2^2$ as a function of $\mathbb{E}\big\|X_k - X_{kh}\big\|_2^2$. This analysis is separated into two parts (combined below):
• The one-step discretization error when both the discrete and the continuous process are started from the same distribution: the distance between $X_{k+1}$ and $\bar{X}_{(k+1)h}$ in Figure 3.3.
Figure 3.3. The ULA iterates, with laws $\mu_0, \mu_1, \mu_2, \ldots, \mu_k$, compared against the continuous Langevin diffusion (LD) run for time $h$, which produces $X_{(k+1)h}$.
• The Wasserstein-distance contraction of the continuous Langevin diffusion run for time $h$, which we already established in Section 3.2.3: the distance between $\bar{X}_{(k+1)h}$ and $X_{(k+1)h}$ in Figure 3.3.
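Schematically, writing $\bar{\mu}_{(k+1)h}$ for the law of $\bar{X}_{(k+1)h}$, the two parts combine via the triangle inequality:
$$W_2(\mu_{k+1}, \mu) \;\leq\; \underbrace{W_2\big(\mu_{k+1}, \bar{\mu}_{(k+1)h}\big)}_{\text{one-step discretization error}} \;+\; \underbrace{W_2\big(\bar{\mu}_{(k+1)h}, \mu\big)}_{\leq\, e^{-mh}\, W_2(\mu_k, \mu)\ \text{by Section 3.2.3}},$$
and iterating this recursion over $k$ yields the two terms in Theorem 3.3.1.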
See Section 4.1 of [Che23] for a full proof, as well as for other proof techniques in the subsequent sections there.