
401-4634-24L: Diffusion Models, Sampling and Stochastic Localization

Lecture 3 – Langevin algorithms


Lecturer: Yuansi Chen Spring 2024

Key concepts:

• Sampling from a smooth density

• Langevin diffusion

• Unadjusted Langevin Algorithm (ULA) and Metropolis-adjusted Langevin Algorithm (MALA)

• Convergence of continuous Langevin diffusion

– in Wasserstein distance, using strong logconcavity via coupling

– in χ² distance, using a Poincaré inequality via the Fokker-Planck equation

• Convergence of ULA (included in these notes but only discussed in the 4th lecture)

The material of this lecture is based on Chapters 1 and 4 of [Che23].

3.1 Introduction
In the previous lecture, we saw that the corners of a convex body cause a lot of
problems for the Ball walk sampling algorithm: Ball walk has to choose a small
step-size, otherwise it would have a close-to-zero acceptance rate in many places. In
practice, we do not always encounter distributions that are as nonsmooth as the
uniform distribution on a convex body. In this lecture, we avoid the nonsmoothness
problem entirely by making the simplifying assumption that we are dealing with
smooth densities of the form

µ ∝ e−f

where f is twice continuously differentiable. We would like to know whether there exist
sampling algorithms better than Ball walk.


3.1.1 Langevin diffusion


Given a twice-differentiable function f : Rn → R, the Langevin diffusion is the
following stochastic differential equation (SDE)

$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t, \qquad (3.1)$$

where Bt is the Brownian motion in Rn.

Brownian motion. We define Brownian motion in Rn , denoted by {Bt }t≥0 , to be a


stochastic process, i.e. a collection of random variables in Rn indexed by t ≥ 0, satisfying
the following four properties
1. B0 = 0
2. {Bt }t≥0 is continuous with probability 1

3. (independent increments) For any k ∈ N and times 0 = t0 < t1 < · · · < tk,
the random variables Bti+1 − Bti, for 0 ≤ i ≤ k − 1, are mutually independent
4. (Gaussian increment) for any 0 ≤ s < t, Bt − Bs is distributed as N (0, (t − s)In ).
Intuitively, we may think of dBt in Eq. (3.1) as Gaussian noise with mean 0 and variance
dt. Then Eq. (3.1) may be thought of as a noisy gradient descent, with a deterministic
gradient step −∇f(Xt) dt and a diffusion component √2 dBt. For a rigorous treatment
of Brownian motion and stochastic calculus, readers are referred to [Pro04].
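As a quick illustration (our own, not part of the lecture notes), the following Python sketch simulates Brownian motion paths via independent Gaussian increments, i.e. property 4 above; all function names and parameter choices are illustrative assumptions.

```python
import numpy as np

def simulate_brownian_motion(n_dim, T, n_steps, rng=None):
    """Simulate one Brownian motion path in R^n on [0, T] using Gaussian increments."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    # Property 4: increments B_{t+dt} - B_t ~ N(0, dt * I_n), mutually independent.
    increments = rng.normal(scale=np.sqrt(dt), size=(n_steps, n_dim))
    path = np.vstack([np.zeros((1, n_dim)), np.cumsum(increments, axis=0)])  # B_0 = 0
    return path  # shape (n_steps + 1, n_dim)

# Sanity check: the marginal B_1 should be approximately N(0, I_n).
paths = np.stack([simulate_brownian_motion(2, T=1.0, n_steps=100) for _ in range(2000)])
print(np.cov(paths[:, -1, :].T))  # close to the 2x2 identity
```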

3.1.2 Sampling algorithms connected to Langevin diffusion


Langevin diffusion in Eq (3.1) is a continuous process. To simulate it in practice, we
need to discretize it.

Unadjusted Langevin Algorithm (ULA). Unadjusted Langevin Algorithm is the


outcome of the Euler discretization of the Langevin diffusion. Starting from X0 drawn from
an initial distribution, it iterates as follows: from the current state Xk, it produces the
next state by

$$X_{k+1} = X_k - h\nabla f(X_k) + \sqrt{2h}\,\xi_k, \qquad (3.2)$$
where h > 0 is the step-size (or the discretization size) to be chosen by the user and
ξk ∼ N (0, In ) is independent Gaussian noise. Intuitively, taking the limit h → 0 in
ULA would get us back to the Langevin diffusion in Eq (3.1). For small step-size and
large k, we expect the distribution of Xk to be close to the stationary measure of the
Langevin diffusion (hopefully the target measure µ, but we haven’t proved it yet) with
an error that depends on h.
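For concreteness, here is a minimal Python sketch of the ULA iteration (3.2); the quadratic test potential and the function names are our own illustrative choices, not part of the lecture notes.

```python
import numpy as np

def ula(grad_f, x0, h, n_steps, rng=None):
    """Unadjusted Langevin Algorithm: X_{k+1} = X_k - h grad_f(X_k) + sqrt(2h) xi_k."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = [x.copy()]
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)            # xi_k ~ N(0, I_n), independent
        x = x - h * grad_f(x) + np.sqrt(2 * h) * xi
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: f(x) = ||x||^2 / 2, so mu = N(0, I_n).
grad_f = lambda x: x
samples = ula(grad_f, x0=np.zeros(2), h=0.05, n_steps=20000)
print(samples[1000:].var(axis=0))  # roughly 1, up to an O(h) discretization bias
```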


Metropolis-adjusted Langevin Algorithm (MALA). To ensure that a Markov


chain has the correct stationary measure, we can always add a Metropolis-Hastings filter
(or accept-reject step) to it. This is what Metropolis-adjusted Langevin Algorithm does
in addition to ULA. It iterates as follows: from the current state Xk , it has a proposal
step and an accept-reject step
• Proposal step: same as in ULA

$$Z_{k+1} = X_k - h\nabla f(X_k) + \sqrt{2h}\,\xi_k$$

• Accept-reject step: set

$$X_{k+1} = \begin{cases} Z_{k+1} & \text{with probability } \min\Big\{1, \dfrac{\mu(Z_{k+1})\,P_{Z_{k+1}}(X_k)}{\mu(X_k)\,P_{X_k}(Z_{k+1})}\Big\}, \\ X_k & \text{with the remaining probability.} \end{cases}$$

Note that conditioned on Xk, the proposal step boils down to drawing a Gaussian with
mean Xk − h∇f(Xk) and covariance 2hIn. Hence, the proposal kernel has an explicit
form

$$P_z(x) = \frac{1}{(2\pi \cdot 2h)^{n/2}} \exp\left(-\frac{\|x - (z - h\nabla f(z))\|_2^2}{4h}\right).$$

Then, the acceptance rate also has an explicit form

$$\min\left\{1, \frac{\mu(z)P_z(x)}{\mu(x)P_x(z)}\right\} = \min\left\{1, \exp\left(-f(z) - \frac{1}{4h}\|x - (z - h\nabla f(z))\|_2^2 + f(x) + \frac{1}{4h}\|z - (x - h\nabla f(x))\|_2^2\right)\right\}.$$
In addition to one gradient evaluation step in the proposal step, MALA requires two
more gradient evaluation steps and two function evaluation steps per iteration.
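A hedged Python sketch of one MALA iteration follows; here f denotes the potential (so log µ = −f up to a constant), and the helper names and test target are our own illustrative choices.

```python
import numpy as np

def mala_step(x, f, grad_f, h, rng):
    """One MALA iteration: Langevin proposal followed by a Metropolis-Hastings accept-reject."""
    # Proposal: Z ~ N(x - h grad_f(x), 2h I_n), i.e. one ULA step.
    z = x - h * grad_f(x) + np.sqrt(2 * h) * rng.standard_normal(x.shape)

    def log_q(to, frm):
        # log proposal density P_frm(to): Gaussian with mean frm - h grad_f(frm), cov 2h I_n.
        diff = to - (frm - h * grad_f(frm))
        return -np.dot(diff, diff) / (4 * h)     # normalizing constants cancel in the ratio

    # log acceptance ratio: log[mu(z) P_z(x) / (mu(x) P_x(z))] = -f(z) + f(x) + log P_z(x) - log P_x(z)
    log_alpha = -f(z) + f(x) + log_q(x, z) - log_q(z, x)
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return z, True
    return x, False

# Illustrative use with f(x) = ||x||^2 / 2:
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):
    x, _ = mala_step(x, f=lambda v: 0.5 * (v @ v), grad_f=lambda v: v, h=0.1, rng=rng)
```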

Metropolized random walk (MRW). We can always introduce a Ball-walk-like


sampling algorithm for sampling a smooth density. In each iteration, it has a Gaussian
proposal followed by an accept-reject step.
• Proposal step:

$$Z_{k+1} = X_k + \sqrt{2h}\,\xi_k.$$

• Accept-reject step: set

$$X_{k+1} = \begin{cases} Z_{k+1} & \text{with probability } \min\Big\{1, \dfrac{\mu(Z_{k+1})\,P^{\mathrm{MRW}}_{Z_{k+1}}(X_k)}{\mu(X_k)\,P^{\mathrm{MRW}}_{X_k}(Z_{k+1})}\Big\}, \\ X_k & \text{with the remaining probability.} \end{cases}$$



Here, because of the symmetry of the proposal kernel P^MRW, the term P^MRW_{Z_{k+1}}(X_k) cancels with P^MRW_{X_k}(Z_{k+1}), and the acceptance rate boils down to min{1, µ(Z_{k+1})/µ(X_k)}.
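Because the Gaussian proposal is symmetric, the MRW acceptance step needs only function values of f. A minimal sketch (illustrative names, same conventions as the MALA sketch above):

```python
import numpy as np

def mrw_step(x, f, h, rng):
    """One Metropolized random walk step: symmetric Gaussian proposal,
    accepted with probability min{1, mu(Z)/mu(X)}."""
    z = x + np.sqrt(2 * h) * rng.standard_normal(x.shape)
    # Symmetric proposal => acceptance ratio reduces to mu(z)/mu(x) = exp(f(x) - f(z)).
    if np.log(rng.uniform()) < min(0.0, f(x) - f(z)):
        return z
    return x
```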

Main questions. We are interested in the convergence of the continuous Langevin


diffusion and of the three sampling algorithms for sampling a smooth density. In this
lecture, we ask the following three main questions and try to answer them in the
sections that follow:
1. What is the stationary measure of Langevin diffusion (3.1)? We hope it to be
µ ∝ e−f .
2. How fast does Langevin diffusion converge to its stationary measure?
3. What is the mixing time of ULA?

3.2 Convergence of Langevin diffusion


Because both sampling algorithms ULA and MALA are closely related to the Langevin
diffusion, it is natural to make use of the convergence of the Langevin diffusion in
continuous time to analyze the two discrete-time algorithms. We call this the SDE-based
mixing proof technique, in contrast to the conductance-based mixing proof technique
of Lecture 2.
We first introduce the Fokker-Planck equation associated with the Langevin diffu-
sion in Eq. (3.1), assume its correctness, and then analyze the Langevin diffusion based
on it. Once we have a good understanding of the convergence of Langevin diffusion,
the mixing time analysis of ULA follows from a careful discretization analysis.

3.2.1 Fokker-Planck equation


Consider a drift-diffusion process {Xt }t≥0 on R driven by a drift term a : R × R → R
and a diffusion term b : R × R → R, and characterized by the following SDE

dXt = a(Xt , t)dt + b(Xt , t)dBt , (3.3)

where Bt is the Brownian motion in R. We assume the following fact without proving
it.

Fokker-Planck equation. Let {Xt }t≥0 be a drift-diffusion process following SDE (3.3),
starting from X0 ∼ µ0 . Then for all t ≥ 0, denoting the law of Xt by µt , we have
$$\frac{\partial}{\partial t}\mu_t(x) = -\frac{\partial}{\partial x}\big[a(x,t)\,\mu_t(x)\big] + \frac{\partial^2}{\partial x^2}\big[D(x,t)\,\mu_t(x)\big], \qquad \forall x \in \mathbb{R}, \qquad (3.4)$$


where D(x, t) = b(x, t)²/2. The above equation is called the Fokker-Planck equation
associated to the drift-diffusion process {Xt }t≥0 . The Fokker-Planck equation describes
the time evolution of the probability density function via a partial differential equa-
tion (PDE). Unlike Eq. (3.3), the Fokker-Planck equation in Eq. (3.4) is completely
deterministic.
In general, there are two main approaches to interpret a drift-diffusion process in
Eq. (3.3) as illustrated in Figure 3.1. The first approach is the pathwise view: given
a random draw of the Brownian motion Bt , Eq. (3.3) becomes an ordinary differential
equation and, it generates a continuous path in R. Each random draw of the Brownian
motion generates a path. The collection of all paths describes the SDE. The second
approach is the density evolution view: since we do not really care about the identity
of each path, we can focus on the evolution of the law (i.e., the density) of Xt at any
time t > 0. The Fokker-Planck equation enables this second approach via a PDE. The
two approaches are complementary and are related via Markov semigroup theory and
Kolmogorov’s forward and backward equations. For a detailed exposition and a proof
of the Fokker-Planck equation, see Chapter 1.2 of [Che23].
Example 1 (heat equation). Taking a = 0, b = 1 in Eq. (3.4), we obtain the heat
equation
$$\frac{\partial}{\partial t}\mu_t = \frac{1}{2}\frac{\partial^2}{\partial x^2}\mu_t.$$

Starting from a point mass at 0, the PDE has the closed-form solution

$$\mu_t(x) = \frac{1}{\sqrt{2\pi t}} \exp\left(-\frac{x^2}{2t}\right).$$
The above density is exactly the law of Xt defined via dXt = dBt .
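A quick Monte Carlo sanity check of this example (our own illustration, not part of the notes): simulate dXt = dBt from X0 = 0 by Euler steps and compare the empirical variance at time t with the variance t of the closed-form solution.

```python
import numpy as np

t, n_steps, n_paths = 2.0, 200, 50_000
dt = t / n_steps
rng = np.random.default_rng(1)
# Euler simulation of dX_t = dB_t from X_0 = 0: just sum the Gaussian increments.
x_t = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_steps)).sum(axis=1)
# The closed-form solution of the heat equation says X_t ~ N(0, t).
print("empirical variance:", x_t.var(), " closed-form variance:", t)
```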
Finally, one can also introduce a higher dimensional formulation of the Fokker-
Planck equation. Consider a drift-diffusion process {Xt }t≥0 on Rn driven by a drift
term a : Rn × R → Rn and a diffusion term b : Rn × R → Rn×m , characterized by the
following SDE
dXt = a(Xt , t)dt + b(Xt , t)dBt , (3.5)
where Bt is the Brownian motion in Rm .

Fokker-Planck equation in n-dimension. Let {Xt }t≥0 be a drift-diffusion process


following SDE (3.5), starting from X0 ∼ µ0 . Then for all t ≥ 0, denoting the law of Xt
by µt , we have
$$\frac{\partial}{\partial t}\mu_t(x) = -\sum_{i=1}^n \frac{\partial}{\partial x_i}\big[a_i(x,t)\,\mu_t(x)\big] + \sum_{i=1}^n\sum_{j=1}^n \frac{\partial^2}{\partial x_i \partial x_j}\big[D_{ij}(x,t)\,\mu_t(x)\big], \qquad \forall x \in \mathbb{R}^n, \qquad (3.6)$$

where D = ½ bb⊤.


Figure 3.1. Two interpretations of a drift-diffusion process. Left: pathwise view.
Right: density evolution view.


Example 2 (Langevin diffusion). Taking a(x, t) = −∇f(x) and b(x, t) = √2 In, the
SDE corresponds to the Langevin diffusion

$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t.$$

The associated Fokker-Planck equation is

$$\frac{\partial}{\partial t}\mu_t = \nabla \cdot (\mu_t \nabla f) + \Delta \mu_t. \qquad (3.7)$$

Differential operator notation.

• The divergence of a continuously differentiable vector function F : Rn → Rn is

$$\nabla \cdot F = \sum_{i=1}^n \frac{\partial}{\partial x_i} F_i,$$

where Fi : Rn → R is the i-th coordinate output of F.

• The Laplacian of a twice-differentiable function g : Rn → R is

$$\Delta g = \sum_{i=1}^n \frac{\partial^2}{\partial x_i^2}\, g.$$
∂x i

Note that it is also the divergence of the gradient (∇g), i.e., ∆g = ∇ · ∇g.


3.2.2 The stationary measure of Langevin diffusion


Assuming the correctness of the Fokker-Planck equation (3.6), we are ready to show
that µ is a stationary measure of Langevin diffusion. To show that µ is a stationary
measure, it suffices to show that

$$\frac{\partial}{\partial t}\mu_t$$

vanishes pointwise when µt is evaluated at µ. We already know the Fokker-Planck
equation of the Langevin diffusion in Eq. (3.7). It remains to show that

$$0 \overset{?}{=} \nabla \cdot (\mu \nabla f) + \Delta\mu.$$
We have by definition of the divergence

$$\nabla \cdot (\mu \nabla f) = \sum_{i=1}^n \partial_i (\mu \cdot \partial_i f),$$

and

$$\Delta\mu = \sum_{i=1}^n \partial_i^2 \mu \overset{(i)}{=} \sum_{i=1}^n \partial_i(-\mu \cdot \partial_i f) = -\sum_{i=1}^n \partial_i (\mu \cdot \partial_i f),$$

where ∂i is shorthand for ∂/∂xi and (i) uses the assumption µ = c e−f with c a
constant. So the two terms above sum to 0, which proves that µ is a stationary measure.
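The cancellation can also be checked symbolically. Here is a small one-dimensional sketch with SymPy (our own illustration; the Gaussian potential f(x) = x²/2 is an assumed example, not from the notes).

```python
import sympy as sp

x = sp.symbols('x', real=True)
f = x**2 / 2                      # illustrative potential; mu is proportional to exp(-f)
mu = sp.exp(-f)                   # the normalizing constant plays no role in the identity
# Fokker-Planck right-hand side evaluated at mu: d/dx (mu * f') + d^2/dx^2 (mu)
rhs = sp.diff(mu * sp.diff(f, x), x) + sp.diff(mu, x, 2)
print(sp.simplify(rhs))           # prints 0, so mu is stationary for this example
```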

3.2.3 Convergence of Langevin diffusion in Wasserstein distance

We prove the convergence of Langevin diffusion in Wasserstein distance under strong
logconcavity.

Wasserstein distance. Let µ, ν be two measures on Rn with finite second moments,


i.e., EX∼ν [∥X∥22 ] < ∞ and EX∼µ [∥X∥22 ] < ∞. We define the Wasserstein-2 distance
between µ and ν by
$$W_2(\mu, \nu) := \inf_{\gamma \in \mathcal{C}(\mu,\nu)} \left( \int \|x - y\|_2^2\, \gamma(x, y)\, dx\, dy \right)^{1/2},$$

where C(µ, ν) is the set of all couplings of µ and ν. We say γ is a coupling of µ and ν,
if its marginal on the first variable is µ and its marginal on the second is ν.


Strong logconcavity. We say a measure µ is m-strongly logconcave if µ ∝ exp(−f )


with f being m-strongly convex, i.e., mIn ⪯ ∇2 f .
Theorem 3.2.1 (Convergence of Langevin diffusion in Wasserstein distance). Let
{Xt }t≥0 be generated according to the Langevin diffusion (3.1) with initialization X0 ∼
µ0 and stationary measure µ ∝ e−f . Assume µ is m-strongly logconcave. Let µt denote
the law of Xt , then

$$W_2^2(\mu_t, \mu) \le \exp(-2mt)\, W_2^2(\mu_0, \mu).$$

Proof. The main proof strategy is to construct a coupling between µt and µ by taking
advantage of the Langevin SDE (3.1), and then show that the expected squared distance
under this coupling decays exponentially. We construct
a coupling as follows. Let γ0 be an optimal coupling of (µ0, µ) which achieves W₂²(µ0, µ).
Draw (X0 , X0∗ ) ∼ γ0 . Let X0 and X0∗ evolve through the Langevin SDE with the same
copy of Brownian motion {Bs }s≥0 . Let γt denote the law of the resulting (Xt , Xt∗ ). γt
is a coupling of (µt , µ) because
• Marginally, we just followed the Langevin SDE. So the law of Xt is µt

• µ is a stationary measure, so the law of Xt∗ remains µ.


Next, we control $\mathbb{E}_{(X_t, X_t^*) \sim \gamma_t} \|X_t - X_t^*\|_2^2$. We have

$$d\left(\|X_t - X_t^*\|_2^2\right) = 2\,\langle X_t - X_t^*,\, dX_t - dX_t^* \rangle \overset{(i)}{=} -2\,\langle X_t - X_t^*,\, \nabla f(X_t) - \nabla f(X_t^*) \rangle\, dt \overset{(ii)}{\le} -2m\, \|X_t - X_t^*\|_2^2\, dt. \qquad (3.8)$$

(i) uses the fact that Xt and Xt∗ share the same Brownian motion. (ii) uses the mean
value theorem in the following way:

$$\langle y - x, \nabla f(y) - \nabla f(x) \rangle = \langle y - x, \nabla f(\omega_t) - \nabla f(\omega_0) \rangle \big|_{t=1} \overset{(iii)}{=} \langle y - x, \nabla^2 f(\omega_\tau)(y - x) \rangle \ge m\, \|y - x\|_2^2,$$

where ωt = (1 − t)x + ty. (iii) uses the mean value theorem for the function
t ↦ ⟨y − x, ∇f(ωt) − ∇f(ω0)⟩, whose derivative is ⟨y − x, ∇²f(ωt)(y − x)⟩, so there exists
τ ∈ [0, 1] such that (iii) holds. The last step follows from m-strong convexity.
Solving the ODE inequality (3.8) or applying Grönwall’s inequality, we obtain

∥Xt − Xt∗ ∥22 ≤ exp(−2mt) ∥X0 − X0∗ ∥22 .


Taking expectation on both sides, we obtain

$$\mathbb{E}_{\gamma_t} \|X_t - X_t^*\|_2^2 \le \exp(-2mt)\, \mathbb{E}_{\gamma_0} \|X_0 - X_0^*\|_2^2 = \exp(-2mt)\, W_2^2(\mu_0, \mu).$$

We complete the proof by noticing that γt is one particular coupling, while W₂²(µt, µ) takes the
infimum over all couplings.
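To illustrate the synchronous coupling argument numerically, the sketch below runs two Euler-discretized Langevin chains driven by the same Gaussian noise and tracks their squared distance; with f m-strongly convex, the ratio should decay roughly like exp(−2mt), up to discretization error. The quadratic target and step size are our own assumed choices.

```python
import numpy as np

m, h, n_steps, n_dim = 1.0, 0.01, 500, 5
rng = np.random.default_rng(2)
grad_f = lambda v: m * v                      # f(x) = m ||x||^2 / 2 is m-strongly convex

x, x_star = np.full(n_dim, 5.0), rng.standard_normal(n_dim)   # two different starting points
d0 = np.sum((x - x_star) ** 2)
for _ in range(n_steps):
    xi = rng.standard_normal(n_dim)           # the SAME Brownian increment for both chains
    x      = x      - h * grad_f(x)      + np.sqrt(2 * h) * xi
    x_star = x_star - h * grad_f(x_star) + np.sqrt(2 * h) * xi

t = n_steps * h
print(np.sum((x - x_star) ** 2) / d0, "vs theory", np.exp(-2 * m * t))
```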
The following result shows that, when sampling a strongly log-concave density, it is not
hard to obtain reasonable control of the initial Wasserstein distance.
Lemma 1. Let µ ∝ e−f, where f is m-strongly convex and minimized at x∗. Then

$$\mathbb{E}_{X \sim \mu} \|X - x^*\|_2^2 \le \frac{2n}{m}.$$
Remark that when f satisfies mIn ⪯ ∇²f ⪯ LIn, x∗ can be obtained up to ε-error
in (L/m) log(1/ε) iterations via the gradient descent method (see e.g., [B+15]).
Proof. Let µ = c exp(−f ), where c is a constant. We have
$$\mathbb{E}_{X \sim \mu} \|X - x^*\|_2^2 = c \int \|x - x^*\|_2^2 \exp(-f(x))\, dx \overset{(i)}{\le} \frac{2c}{m} \int \langle \nabla f(x), x - x^* \rangle \exp(-f(x))\, dx \overset{(ii)}{=} \frac{2c}{m} \int \operatorname{trace}(I_n) \exp(-f(x))\, dx = \frac{2n}{m}.$$
(i) follows from the strong convexity of f: ⟨∇f(x) − ∇f(x∗), x − x∗⟩ ≥ (m/2)‖x − x∗‖₂²
and ∇f(x∗) = 0. (ii) follows from integration by parts: for a differentiable function
g : Rn → R and a vector field v : Rn → Rn with sufficiently fast decay at infinity, we
have

$$\int \langle v(x), \nabla g(x) \rangle\, dx = -\int g(x)\, (\nabla \cdot v)(x)\, dx. \qquad (3.9)$$

Here we moved the derivative from g onto v (with g = exp(−f) and v(x) = x − x∗); the
boundary term vanishes because of the decay at infinity.
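As a quick numerical illustration of Lemma 1 (our own example, not part of the notes): for the Gaussian µ = N(x∗, (1/m) In), whose potential is m-strongly convex, E‖X − x∗‖² equals n/m, comfortably below the 2n/m bound.

```python
import numpy as np

n, m = 10, 4.0
rng = np.random.default_rng(3)
x_star = rng.standard_normal(n)
# Sample from mu = N(x*, (1/m) I_n); its potential f is m-strongly convex, minimized at x*.
samples = x_star + rng.standard_normal((200_000, n)) / np.sqrt(m)
second_moment = np.mean(np.sum((samples - x_star) ** 2, axis=1))
print(second_moment, "<=", 2 * n / m)   # observe roughly n/m = 2.5; the bound is 5.0
```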

3.2.4 Convergence of Langevin diffusion in χ2 -divergence


To show that the above convergence is not merely an artifact of the choice of the Wasserstein
distance, we prove the convergence of Langevin diffusion in χ²-divergence under a Poincaré
inequality.


χ²-divergence. Let ν, µ be two measures on Rn. We define the χ²-divergence between
ν and µ by

$$\chi^2(\nu \,\|\, \mu) := \operatorname{Var}_\mu\!\left[\frac{\nu}{\mu}\right] = \int \left(\frac{\nu(x)}{\mu(x)}\right)^2 \mu(x)\, dx - 1.$$

The χ²-divergence upper bounds the squared total variation distance (up to a constant),
by the Cauchy-Schwarz inequality.

Poincaré inequality. We say a measure µ satisfies a Poincaré inequality with constant
CPI if for every differentiable function g that is square-integrable with respect to µ, we
have

$$\operatorname{Var}_\mu[g] \le C_{\mathrm{PI}}\, \mathbb{E}_\mu\!\left[\|\nabla g(x)\|_2^2\right].$$

Here Eµ and Varµ denote the expectation and the variance with respect to µ, respectively:

$$\mathbb{E}_\mu[g] := \int g(x)\, \mu(x)\, dx, \qquad \operatorname{Var}_\mu[g] := \mathbb{E}_\mu[g^2] - (\mathbb{E}_\mu[g])^2.$$

Similar to the isoperimetry in Lecture 2, the Poincaré inequality is an intrinsic property
of the measure µ, and this definition has nothing to do with the sampling algorithm.
Intuitively, a large Poincaré constant also indicates that the measure µ has
a bottleneck (see Figure 3.2). Additionally, the isoperimetric constant and the Poincaré
constant are related as ψ ≤ 2/√CPI according to [Maz60] and [Che69].
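A hedged numerical illustration (our own example): the standard Gaussian is known to satisfy a Poincaré inequality with CPI = 1, and a Monte Carlo estimate confirms Varµ[g] ≤ Eµ‖∇g‖² for a simple test function.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal((500_000, 3))         # mu = N(0, I_3); Gaussian Poincare constant C_PI = 1

g      = np.sin(x[:, 0]) + x[:, 1] ** 2       # a smooth test function g(x) = sin(x1) + x2^2
grad_g = np.stack([np.cos(x[:, 0]), 2 * x[:, 1], np.zeros(len(x))], axis=1)

var_g = g.var()
dirichlet = np.mean(np.sum(grad_g ** 2, axis=1))
print(var_g, "<=", 1.0 * dirichlet)           # Var_mu[g] <= C_PI * E_mu ||grad g||^2
```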

Figure 3.2. Illustration of a large Poincaré constant. µ is bimodal, with one mode
in region A and the other in region C, and has a bottleneck in region B where the
density is close to 0. When the two modes are far apart, it becomes possible to design
a g with small gradient, whose variation hides inside region B, yet which has a large
variance. In this case, the Poincaré constant of µ has to be large.


Theorem 3.2.2 (Convergence of Langevin diffusion in χ²-divergence). Let {Xt}t≥0 be
generated according to the Langevin diffusion (3.1) with initialization X0 ∼ µ0 and
stationary measure µ ∝ e−f. Assume µ satisfies a Poincaré inequality with constant CPI.
Let µt denote the law of Xt, then

$$\chi^2(\mu_t \,\|\, \mu) \le \exp\left(-\frac{2t}{C_{\mathrm{PI}}}\right) \chi^2(\mu_0 \,\|\, \mu).$$

Proof. Taking the derivative with respect to t, we have

$$\begin{aligned}
\frac{d}{dt}\chi^2(\mu_t \,\|\, \mu) &= \frac{d}{dt} \int \left(\frac{\mu_t(x)^2}{\mu(x)^2} - 1\right) \mu(x)\, dx \\
&\overset{(i)}{=} 2 \int \frac{\mu_t(x)}{\mu(x)}\, \frac{\partial}{\partial t}\!\left(\frac{\mu_t(x)}{\mu(x)}\right) \mu(x)\, dx \\
&\overset{(ii)}{=} 2 \int \frac{\mu_t(x)}{\mu(x)} \cdot \frac{\nabla \cdot \big(\mu \nabla \tfrac{\mu_t}{\mu}\big)(x)}{\mu(x)}\, \mu(x)\, dx \\
&\overset{(iii)}{=} -2 \int \left\|\nabla \frac{\mu_t}{\mu}\right\|_2^2 \mu(x)\, dx \\
&\overset{(iv)}{\le} -\frac{2}{C_{\mathrm{PI}}}\, \chi^2(\mu_t \,\|\, \mu).
\end{aligned}$$
In (i) we switched the order of derivative and integral, which can be done after verifying
the conditions for dominated convergence. (ii) follows from the Fokker-Planck equation for
µt in Eq. (3.7) and the observation that

$$\nabla \cdot (\mu_t \nabla f) + \Delta\mu_t = \nabla \cdot \left(\mu \nabla \frac{\mu_t}{\mu}\right).$$

(iii) follows from integration by parts (3.9). (iv) follows from the Poincaré inequality applied
to g = µt/µ, noting that Varµ[µt/µ] = χ²(µt ‖ µ). Solving the ODE for the χ²-divergence,
or applying Grönwall's inequality, we obtain the desired result.

3.3 Mixing time of ULA


Recall the iteration of ULA from Eq. (3.2),

$$X_{k+1} = X_k - h\nabla f(X_k) + \sqrt{2h}\,\xi_k. \qquad (3.10)$$

Let µ^k denote the law of the ULA iterate Xk. Then we have the following mixing time result.


Theorem 3.3.1. Assume that the target measure µ ∝ exp(−f) satisfies mIn ⪯ ∇²f ⪯ LIn.
Let κ := L/m. Then, given h ≲ 1/(Lκ), for K ≥ 1,

$$W_2(\mu^K, \mu) \le \exp\left(-\frac{mhK}{2}\right) W_2(\mu^0, \mu) + c\, h^{1/2} n^{1/2} \kappa,$$

where c is a universal constant.

A few remarks

• If we set the initial measure µ^0 to be the point mass at x∗, the mode of µ, then

$$W_2(\mu^0, \mu)^2 = \mathbb{E}_{X \sim \mu} \|X - x^*\|_2^2 \le \frac{n}{m},$$

as a result of integration by parts and strong log-concavity.

• For mixing, we want to achieve √m W₂ ≤ ε. It is more convenient to use the
metric √m W₂ instead of W₂ because the former is scale-invariant.

• In order to have √m W₂ ≤ ε in Theorem 3.3.1, we need both terms to be less than ε.
This results in the step-size choice

$$h \lesssim \frac{\epsilon^2}{L\kappa n},$$

and the choice of the number of steps K

$$K \gtrsim \frac{\kappa^2 n}{\epsilon^2} \log\left(\frac{\sqrt{m}\, W_2(\mu^0, \mu)}{\epsilon}\right).$$
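The remarks above translate into a simple back-of-the-envelope recipe. The sketch below (our own illustration, ignoring the universal constants in Theorem 3.3.1) computes the suggested step size and iteration count from m, L, n, ε, and an initial Wasserstein distance.

```python
import numpy as np

def ula_parameters(m, L, n, eps, w2_init):
    """Back-of-the-envelope ULA tuning from Theorem 3.3.1 (universal constants omitted)."""
    kappa = L / m
    h = eps ** 2 / (L * kappa * n)                                   # h ~ eps^2 / (L kappa n)
    K = (kappa ** 2 * n / eps ** 2) * np.log(np.sqrt(m) * w2_init / eps)
    return h, int(np.ceil(K))

# Example: kappa = 10, n = 100, target sqrt(m) W2 accuracy eps = 0.1,
# starting from the mode so that W2(mu^0, mu) <= sqrt(n / m).
print(ula_parameters(m=1.0, L=10.0, n=100, eps=0.1, w2_init=np.sqrt(100 / 1.0)))
```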

Proof sketch. Since ULA is the Euler discretization of the continuous Langevin
diffusion in Eq. (3.1), it is natural to analyze the convergence of ULA by comparing
it to the continuous Langevin diffusion. We know from Section 3.2.3 that the continuous
Langevin diffusion converges exponentially fast to the target measure µ, with a rate
that depends on the strong log-concavity parameter m. It remains to analyze the discretization
error and how it accumulates as a function of the total number of steps K.
Given the above intuition, the main problem becomes how to write W₂(µ^{k+1}, µ) as
a function of W₂(µ^k, µ). In other words, we want to upper bound E‖X_{k+1} − X_{(k+1)h}‖₂²
as a function of E‖X_k − X_{kh}‖₂², where X_k is the ULA iterate and X_{kh} is the continuous
Langevin diffusion at time kh. This analysis separates into two parts:

• The one-step discretization error when both the discrete process and the continuous
process are started at the same distribution: the distance between X_{k+1} and
X̄_{(k+1)h} in Figure 3.3.

Figure 3.3: Illustration of the ULA discretization analysis. Panel A: the laws µ^1, µ^2, ..., µ^k
of the ULA iterates and the laws µ_h, µ_{2h}, ..., µ_{kh} of the Langevin diffusion (LD) run for
time h between consecutive laws, both started from µ_0. Panel B: one ULA step from X_k to
X_{k+1}, compared with the Langevin diffusion run for time h from X_k (one-step discretization,
reaching X̄_{(k+1)h}) and from X_{kh} (one-step coupling, reaching X_{(k+1)h}); the distance
between X_{k+1} and X_{(k+1)h} is what we want to bound.

• The Wasserstein distance contraction result for the continuous Langevin diffusion run
for time h, which we already know how to handle from Section 3.2.3: the distance
between X̄_{(k+1)h} and X_{(k+1)h} in Figure 3.3.

See Section 4.1 of [Che23] for a full proof and for further proof techniques.

Bibliography

[B+15] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity.
Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[Che69] Jeff Cheeger. A lower bound for the smallest eigenvalue of the Laplacian. In
Proceedings of the Princeton conference in honor of Professor S. Bochner,
pages 195–199, 1969.

[Che23] Sinho Chewi. Log-concave sampling. Book draft available at
https://chewisinho.github.io, 2023.

[Maz60] Vladimir Gilelevich Maz’ya. Classes of domains and imbedding theorems for
function spaces. In Doklady Akademii Nauk, volume 133, pages 527–530. Rus-
sian Academy of Sciences, 1960.

[Pro04] Philip E. Protter. Stochastic integration and differential equations. Springer, 2004.

