
Inductive Moment Matching

Linqi Zhou¹  Stefano Ermon²  Jiaming Song¹


arXiv:2503.07565v1 [cs.LG] 10 Mar 2025

Figure 1. Generated samples on ImageNet-256×256 using 8 steps.

Abstract

Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Inductive Moment Matching (IMM), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, IMM does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, IMM guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. IMM surpasses diffusion models on ImageNet-256×256 with 1.99 FID using only 8 inference steps and achieves a state-of-the-art 2-step FID of 1.98 on CIFAR-10 for a model trained from scratch.

¹Luma AI  ²Stanford University. Correspondence to: Linqi Zhou <[email protected]>.

1. Introduction

Generative models for continuous domains have enabled numerous applications in images (Rombach et al., 2022; Saharia et al., 2022; Esser et al., 2024), videos (Ho et al., 2022a; Blattmann et al., 2023; OpenAI, 2024), and audio (Chen et al., 2020; Kong et al., 2020; Liu et al., 2023), yet achieving high-fidelity outputs, efficient inference, and stable training remains a core challenge: a trilemma that continues to motivate research in this domain. Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020b), one of the leading techniques, require many inference steps for high-quality results, while step-reduction methods, such as diffusion distillation (Yin et al., 2024; Sauer et al., 2025; Zhou et al., 2024; Luo et al., 2024a) and Consistency Models (Song et al., 2023; Geng et al., 2024; Lu & Song, 2024; Kim et al., 2023), often risk training collapse without careful tuning and regularization (such as pre-generating data-noise pairs and early stopping).

To address the aforementioned trilemma, we introduce Inductive Moment Matching (IMM), a stable, single-stage training procedure that learns generative models from scratch for single- or multi-step inference.
IMM operates on the time-dependent marginal distributions of stochastic interpolants (Albergo et al., 2023), continuous-time stochastic processes that connect two arbitrary probability density functions (data at t = 0 and prior at t = 1). By learning a (stochastic or deterministic) mapping from any marginal at time t to any marginal at time s < t, it can naturally support one- or multi-step generation (Figure 2).

IMM models can be trained efficiently via mathematical induction. For times s < r < t, we form two distributions at s by running a one-step IMM from samples at r and at t. We then minimize their divergence, enforcing that the distributions at s are independent of the starting time-steps. This construction by induction guarantees convergence to the data distribution. To help with training stability, we model IMM based on certain stochastic interpolants and optimize the objective with stable sample-based divergence estimators such as moment matching (Gretton et al., 2012). Notably, we prove that Consistency Models (CMs) are a single-particle, first-moment-matching special case of IMM, which partially explains the training instability of CMs.

On ImageNet-256×256, IMM surpasses diffusion models and achieves 1.99 FID with only 8 inference steps using standard transformer architectures. On CIFAR-10, IMM similarly achieves a state-of-the-art FID of 1.98 with 2-step generation for a model trained from scratch.

Figure 2. Using an interpolation from data to prior, we define a one-step sampler that moves from any t to s < t, directly transforming q_t(x_t) to q_s(x_s). This can be repeated by jumping to an intermediate r < t before moving to s < r.

2. Preliminaries

2.1. Diffusion, Flow Matching, and Interpolants

For a data distribution q(x), Variance-Preserving (VP) diffusion models (Ho et al., 2020; Song et al., 2020b) and Flow Matching (FM) (Lipman et al., 2022; Liu et al., 2022) construct time-augmented variables x_t as an interpolation between data x ∼ q(x) and prior ϵ ∼ N(0, I) such that x_t = α_t x + σ_t ϵ, where α_0 = σ_1 = 1 and α_1 = σ_0 = 0. VP diffusion commonly chooses α_t = cos(πt/2), σ_t = sin(πt/2), and FM chooses α_t = 1 − t, σ_t = t. Both v-prediction diffusion (Salimans & Ho, 2022) and FM are trained by matching the conditional velocity v_t = α′_t x + σ′_t ϵ such that a neural network G_θ(x_t, t) approximates E_{x,ϵ}[v_t | x_t]. Samples can then be generated via the probability-flow ODE (PF-ODE) dx_t/dt = G_θ(x_t, t) starting from ϵ ∼ N(0, I).

Stochastic interpolants. Unifying diffusion models and FM, stochastic interpolants (Albergo et al., 2023; Albergo & Vanden-Eijnden, 2022) construct a conditional interpolation q_t(x_t | x, ϵ) = N(I_t(x, ϵ), γ_t² I) between any data x ∼ q(x) and prior ϵ ∼ p(ϵ) and set the constraints I_1(x, ϵ) = ϵ, I_0(x, ϵ) = x, and γ_1 = γ_0 = 0. Similar to FM, a deterministic sampler can be learned by explicitly matching the conditional interpolant velocity v_t = ∂_t I_t(x, ϵ) + γ̇_t z, where z ∼ N(0, I), such that G_θ(x_t, t) ≈ E_{x,ϵ,z}[v_t | x_t]. Sampling is performed following the PF-ODE dx_t/dt = G_θ(x_t, t), similarly starting from the prior ϵ ∼ p(ϵ).

When γ_t ≡ 0 and I_t(x, ϵ) = α_t x + σ_t ϵ for α_t, σ_t defined in FM, the intermediate variable x_t = α_t x + σ_t ϵ becomes a deterministic interpolation and its interpolant velocity v_t = α′_t x + σ′_t ϵ reduces to the FM velocity. Thus, its training and inference both reduce to those of FM. When ϵ ∼ N(0, I), stochastic interpolants reduce to v-prediction diffusion.

2.2. Maximum Mean Discrepancy

Maximum Mean Discrepancy (MMD, Gretton et al. (2012)) between distributions p(x), q(y) for x, y ∈ R^D is an integral probability metric (Müller, 1997) commonly defined on a Reproducing Kernel Hilbert Space (RKHS) H with a positive definite kernel k : R^D × R^D → R as

MMD²(p(x), q(y)) = ∥E_x[k(x, ·)] − E_y[k(y, ·)]∥²_H   (1)

where the norm is taken in H. Choices such as the RBF kernel imply an inner product of infinite-dimensional feature maps consisting of all moments of p(x) and q(y), i.e. E[x^j] and E[y^j] for integer j ≥ 1 (Steinwart & Christmann, 2008).
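As a concrete reference for Eq. (1), the squared MMD can be estimated directly from pairwise kernel evaluations on samples. The sketch below is a minimal numpy implementation with an RBF kernel; the function names, bandwidth, and toy distributions are illustrative choices and not part of the paper.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), evaluated pairwise.
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, kernel=rbf_kernel):
    # Biased (V-statistic) estimate of MMD^2(p, q) from samples x ~ p, y ~ q:
    # E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]; always non-negative.
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p1 = rng.normal(0.0, 1.0, size=(512, 2))
    p2 = rng.normal(0.0, 1.0, size=(512, 2))   # same distribution as p1
    q  = rng.normal(0.5, 1.0, size=(512, 2))   # shifted distribution
    print(mmd_squared(p1, p2))  # small: matching distributions
    print(mmd_squared(p1, q))   # larger: mismatched distributions
```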
3. Inductive Moment Matching

We introduce Inductive Moment Matching (IMM), a method that trains a model of both high quality and sampling efficiency in a single stage. To do so, we assume a time-augmented interpolation between data (the distribution at t = 0) and prior (the distribution at t = 1) and propose learning an implicit one-step model (i.e. a one-step sampler) that transforms the distribution at time t to the distribution at time s for any s < t (Section 3.1). The model enables direct one-step sampling from t = 1 to s = 0 and few-step sampling via recursive application from any t to any r < t and then to any s < r until s = 0; this allows us to learn the model from its own samples via bootstrapping (Section 3.2).

3.1. Model Construction via Interpolants

Given data x ∼ q(x) and prior ϵ ∼ p(ϵ), the time-augmented interpolation x_t defined in Albergo et al. (2023) follows x_t ∼ q_t(x_t | x, ϵ). This implies a marginal interpolating distribution

q_t(x_t) = ∫∫ q_t(x_t | x, ϵ) q(x) p(ϵ) dx dϵ.   (2)
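To make Eq. (2) concrete: a sample from the marginal q_t(x_t) is obtained by drawing a (data, prior) pair and interpolating them. The sketch below assumes the deterministic interpolant I_t(x, ϵ) = α_t x + σ_t ϵ with γ_t ≡ 0 from Section 2.1 and a toy 1-D "dataset"; every name and value here is an illustrative assumption.

```python
import numpy as np

def cosine_schedule(t):
    # Cosine schedule from Section 2.1: alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2).
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ot_fm_schedule(t):
    # OT-FM schedule: alpha_t = 1 - t, sigma_t = t.
    return 1.0 - t, t

def sample_marginal(data, t, schedule=ot_fm_schedule, rng=None):
    # Draw x ~ q(x) (here: resample the toy dataset), eps ~ N(0, I),
    # and return x_t = alpha_t * x + sigma_t * eps, a sample from q_t(x_t).
    rng = rng or np.random.default_rng()
    x = data[rng.integers(len(data), size=len(data))]
    eps = rng.normal(size=x.shape)
    alpha_t, sigma_t = schedule(t)
    return alpha_t * x + sigma_t * eps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_data = rng.normal(2.0, 0.5, size=(4096, 1))  # stand-in for q(x)
    for t in (0.0, 0.5, 1.0):
        xt = sample_marginal(toy_data, t, rng=rng)
        print(f"t={t:.1f}  mean={xt.mean():+.2f}  std={xt.std():.2f}")
```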
We learn a model distribution implicitly defined by a one-step sampler that transforms q_t(x_t) into q_s(x_s) for some s ≤ t. This can be done via a special class of interpolants which preserves the marginal distribution q_s(x_s) while interpolating between x and x_t. We term these marginal-preserving interpolants among a class of generalized interpolants. Formally, we define x_s as a generalized interpolant between x and x_t if, for all s ∈ [0, t], its distribution follows

q_{s|t}(x_s | x, x_t) = N(I_{s|t}(x, x_t), γ²_{s|t} I)   (3)

and satisfies the constraints I_{t|t}(x, x_t) = x_t, I_{0|t}(x, x_t) = x, γ_{t|t} = γ_{0|t} = 0, and q_{t|1}(x_t | x, ϵ) ≡ q_t(x_t | x, ϵ). When t = 1, it reduces to a regular stochastic interpolant. Next, we define marginal-preserving interpolants.

Definition 1 (Marginal-Preserving Interpolants). A generalized interpolant x_s is marginal-preserving if for all t ∈ [0, 1] and for all s ∈ [0, t], the following equality holds:

q_s(x_s) = ∫∫ q_{s|t}(x_s | x, x_t) q_t(x | x_t) q_t(x_t) dx_t dx,   (4)

where

q_t(x | x_t) = ∫ q_t(x_t | x, ϵ) q(x) p(ϵ) / q_t(x_t) dϵ.   (5)

That is, this class of interpolants has the same marginal at s regardless of t. For all t ∈ [0, 1], we define our noisy model distribution at s ∈ [0, t] as

p^θ_{s|t}(x_s) = ∫∫ q_{s|t}(x_s | x, x_t) p^θ_{s|t}(x | x_t) q_t(x_t) dx_t dx   (6)

where the interpolant is marginal-preserving and p^θ_{s|t}(x | x_t) is our clean model distribution, implicitly parameterized as a one-step sampler. This definition also enables multi-step sampling. To produce a clean sample x given x_t ∼ q_t(x_t) in two steps via an intermediate s: (1) we sample x̂ ∼ p^θ_{s|t}(x | x_t) followed by x̂_s ∼ q_{s|t}(x_s | x̂, x_t), and (2) if the marginal of x̂_s matches q_s(x_s), we can obtain x by x ∼ p^θ_{0|s}(x | x̂_s). We are therefore motivated to minimize the divergence between Eq. (4) and (6) using the objective below.

Naïve objective. As one can easily draw samples from the model, it can be naïvely learned by directly minimizing

L(θ) = E_{s,t}[D(q_s(x_s), p^θ_{s|t}(x_s))]   (7)

with time distribution p(s, t) and a sample-based divergence metric D(·, ·) such as MMD or GAN (Goodfellow et al., 2020). If an interpolant x_s is marginal-preserving, then the minimum loss is 0 (see Lemma 3). One might also notice the similarity between the right-hand sides of Eq. (4) and (6). However, q_s(x_s) = p^θ_{s|t}(x_s) does not necessarily imply p^θ_{s|t}(x | x_t) = q_t(x | x_t). In fact, the minimizer p^θ_{s|t}(x | x_t) is not unique and, under mild assumptions, a deterministic minimizer exists (see Section 4).

3.2. Learning via Inductive Bootstrapping

While sound, the naïve objective in Eq. (7) is difficult to optimize in practice because when t is far from s, the input distribution q_t(x_t) can be far from the target q_s(x_s). Fortunately, our interpolant construction implies that the model definition in Eq. (6) satisfies the boundary condition q_s(x_s) = p^θ_{s|s}(x_s) regardless of θ (see Lemma 4), which indicates that p^θ_{s|t}(x_s) ≈ q_s(x_s) when t is close to s. Furthermore, the interpolant enforces p^θ_{s|t}(x_s) ≈ p^θ_{s|r}(x_s) for any r < t close to t, as long as the model is continuous around t. Therefore, we can construct an inductive learning algorithm for p^θ_{s|t}(x_s) by using samples from p^θ_{s|r}(x_s).

For better analysis, we define a sequence number n for parameters θ_n and a function r(s, t) with s ≤ r(s, t) < t such that p^{θ_n}_{s|t}(x_s) learns to match p^{θ_{n−1}}_{s|r}(x_s).¹ We omit r's arguments when context is clear and let r(s, t) be a finite decrement from t, truncated at s ≤ t (see Appendix B.3 for well-conditioned r(s, t)).

General objective. With marginal-preserving interpolants and mapping r(s, t), we learn θ_n with the following objective:

L(θ_n) = E_{s,t}[ w(s, t) MMD²(p^{θ_{n−1}}_{s|r}(x_s), p^{θ_n}_{s|t}(x_s)) ]   (8)

where w(s, t) is a weighting function. We choose MMD as our objective due to its superior optimization stability and show that this objective learns the correct data distribution.

Theorem 1. Assuming r(s, t) is well-conditioned, the interpolant is marginal-preserving, and θ_n* is a minimizer of Eq. (8) for each n with infinite data and network capacity, for all t ∈ [0, 1], s ∈ [0, t],

lim_{n→∞} MMD²(q_s(x_s), p^{θ_n*}_{s|t}(x_s)) = 0.   (9)

In other words, θ_n eventually learns the target distribution q_s(x_s) by parameterizing a one-step sampler p^{θ_n}_{s|t}(x | x_t).

4. Simplified Formulation and Practice

We present algorithmic and practical decisions below.

4.1. Algorithmic Considerations

Despite theoretical soundness, it remains unclear how to empirically choose a marginal-preserving interpolant. We present a sufficient condition for marginal preservation.

Definition 2 (Self-Consistent Interpolants). Given s, t ∈ [0, 1], s ≤ t, an interpolant x_s ∼ q_{s|t}(x_s | x, x_t) is self-consistent if for all r ∈ [s, t], the following holds:

q_{s|t}(x_s | x, x_t) = ∫ q_{s|r}(x_s | x, x_r) q_{r|t}(x_r | x, x_t) dx_r   (10)

¹ Note that n is different from the optimization step. Advancing from n − 1 to n can take an arbitrary number of optimization steps.
In other words, x_s has the same distribution whether one (1) directly samples it by interpolating x and x_t, or (2) first samples any x_r (given x and x_t) and then samples x_s (given x and x_r). Furthermore, self-consistency implies marginal preservation (Lemma 5).

DDIM interpolant. Denoising Diffusion Implicit Models (DDIM, Song et al., 2020a) were introduced as a fast ODE sampler for diffusion models, defined as

DDIM(x_t, x, s, t) = (α_s − (σ_s/σ_t) α_t) x + (σ_s/σ_t) x_t   (11)

and the sample x_s = DDIM(x_t, E_x[x | x_t], s, t) can be drawn when E_x[x | x_t] is approximated by a network. We show in Appendix C.1 that DDIM as an interpolant, i.e. γ_{s|t} ≡ 0 and I_{s|t}(x, x_t) = DDIM(x_t, x, s, t), is self-consistent. Moreover, with deterministic interpolants such as DDIM, there exists a deterministic minimizer p^θ_{s|t}(x | x_t) of Eq. (7).

Proposition 1. (Informal) If γ_{s|t} ≡ 0 and I_{s|t}(x, x_t) satisfies mild assumptions, there exists a deterministic p^θ_{s|t}(x | x_t) that attains 0 loss for Eq. (7).

See Appendix B.6 for the formal statement and proof. This allows us to define p^θ_{s|t}(x | x_t) = δ(x − g_θ(x_t, s, t)) for a neural network g_θ(x_t, s, t) with parameters θ by default.
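A minimal numerical illustration of the DDIM interpolant in Eq. (11) and of the self-consistency property in Eq. (10): because the interpolant is deterministic (γ_{s|t} ≡ 0), self-consistency reduces to composing DDIM from t to r and then from r to s and landing on the same point as a direct jump from t to s. The schedule and the test values below are illustrative assumptions.

```python
import numpy as np

def alpha_sigma(t):
    # OT-FM schedule from Section 4.2 (illustrative choice): alpha_t = 1 - t, sigma_t = t.
    return 1.0 - t, t

def ddim(x_t, x, s, t):
    # Eq. (11): DDIM(x_t, x, s, t) = (alpha_s - sigma_s/sigma_t * alpha_t) * x + sigma_s/sigma_t * x_t.
    alpha_s, sigma_s = alpha_sigma(s)
    alpha_t, sigma_t = alpha_sigma(t)
    return (alpha_s - sigma_s / sigma_t * alpha_t) * x + sigma_s / sigma_t * x_t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 3))            # clean data points
    eps = rng.normal(size=(8, 3))          # prior samples
    s, r, t = 0.2, 0.5, 0.9
    alpha_t, sigma_t = alpha_sigma(t)
    x_t = alpha_t * x + sigma_t * eps      # interpolant sample at time t

    direct = ddim(x_t, x, s, t)            # jump t -> s in one step
    x_r = ddim(x_t, x, r, t)               # jump t -> r ...
    composed = ddim(x_r, x, s, r)          # ... then r -> s
    print(np.max(np.abs(direct - composed)))  # ~1e-16: deterministic self-consistency holds
```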
Eliminating stochasticity. We use the DDIM interpolant, a deterministic model, and prior p(ϵ) = N(0, σ_d² I), where σ_d is the data standard deviation (Lu & Song, 2024). As a result, one can draw x_s from the model via x_s = f^θ_{s,t}(x_t) := DDIM(x_t, g_θ(x_t, s, t), s, t), where x_t ∼ q_t(x_t).

Re-using x_t for x_r. Inspecting Eq. (8) and (6), one requires x_r ∼ q_r(x_r) to generate samples from the target distribution. Instead of sampling x_r given a new (x, ϵ) pair, we can reduce variance by reusing x_t and x such that x_r = DDIM(x_t, x, r, t). This is justified because x_r derived from x_t preserves the marginal distribution q_r(x_r) (see Appendix C.2).

Stop gradient. We set n to the optimization step number, i.e. advancing from n − 1 to n is a single optimizer step where θ_n is initialized from θ_{n−1}. Equivalently, we can omit n from θ_n and write θ_{n−1} as the stop-gradient parameter θ−.

Simplified objective. Let x_t, x′_t be i.i.d. random variables from q_t(x_t) and let x_r, x′_r be the variables obtained by reusing x_t, x′_t respectively. The training objective can then be derived from the MMD definition in Eq. (1) (see Appendix C.3) as

L_IMM(θ) = E_{x_t,x′_t,x_r,x′_r,s,t}[ w(s, t) ( k(y_{s,t}, y′_{s,t}) + k(y_{s,r}, y′_{s,r}) − k(y_{s,t}, y′_{s,r}) − k(y′_{s,t}, y_{s,r}) ) ]   (12)

where y_{s,t} = f^θ_{s,t}(x_t), y′_{s,t} = f^θ_{s,t}(x′_t), y_{s,r} = f^{θ−}_{s,r}(x_r), y′_{s,r} = f^{θ−}_{s,r}(x′_r), k(·, ·) is a kernel function, and w(s, t) is a prior weighting function.

Figure 3. With self-consistent interpolants, IMM uses M particle samples (M = 2 is shown above) for moment matching. Samples from p^θ_{s|t}(x_s) are obtained by drawing from p^θ_{s|t}(x | x_t) followed by q_{s|t}(x_s | x, x_t). Solid and dashed red lines indicate sampling with and without gradient propagation respectively. After M samples are drawn, sample x_s is repulsed by x′_s and attracted towards samples of x̃_s and x̃′_s through the kernel function k(·, ·).

An empirical estimate of the above objective uses M particle samples to approximate each distribution indexed by t. In practice, we divide a batch of model outputs with size B into B/M groups which share the same (s, t) sample, and the objective is approximated by instantiating B/M matrices of size M × M. Note that the number of model passes does not change with respect to M (see Appendix C.4). An M = 2 version is visualized in Figure 3 and a simplified training algorithm is shown in Algorithm 1. A full training algorithm is shown in Appendix D.

4.2. Other Implementation Choices

We defer detailed analysis of each decision to Appendix C.

Flow trajectories. We investigate the two most used flow trajectories (Nichol & Dhariwal, 2021; Lipman et al., 2022); a combined sketch of these schedules and the coefficients below follows the list.
• Cosine. α_t = cos(πt/2), σ_t = sin(πt/2).
• OT-FM. α_t = 1 − t, σ_t = t.

Network g_θ(x_t, s, t). We set g_θ(x_t, s, t) = c_skip(t) x_t + c_out(t) G_θ(c_in(t) x_t, c_noise(s), c_noise(t)) with a neural network G_θ, following EDM (Karras et al., 2022). For all choices we let c_in(t) = 1/(σ_d √(α_t² + σ_t²)) (Lu & Song, 2024). Listed below are valid choices for the other coefficients.
• Identity. c_skip(t) = 0, c_out(t) = 1.
• Simple-EDM (Lu & Song, 2024). c_skip(t) = α_t/(α_t² + σ_t²), c_out(t) = −σ_d σ_t/√(α_t² + σ_t²).
• Euler-FM. c_skip(t) = 1, c_out(t) = −t σ_d. This is specific to the OT-FM schedule.
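The schedules and output parameterizations just listed can be written as plain functions. The sketch below tabulates them as reconstructed from the text above (cosine / OT-FM schedules; Identity, Simple-EDM, and Euler-FM choices for c_skip, c_out, plus the shared c_in); the value of σ_d, the dummy network, and the example call are placeholders, and the noise conditioning c_noise is omitted for brevity.

```python
import numpy as np

SIGMA_D = 0.5  # data standard deviation; dataset-dependent placeholder

def cosine(t): return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
def ot_fm(t):  return 1.0 - t, t

def c_in(t, schedule):
    # Shared input scaling as reconstructed above: 1 / (sigma_d * sqrt(alpha_t^2 + sigma_t^2)).
    a, s = schedule(t)
    return 1.0 / (SIGMA_D * np.sqrt(a**2 + s**2))

def identity_coeffs(t, schedule):
    return 0.0, 1.0                                           # (c_skip, c_out)

def simple_edm_coeffs(t, schedule):
    a, s = schedule(t)
    return a / (a**2 + s**2), -SIGMA_D * s / np.sqrt(a**2 + s**2)

def euler_fm_coeffs(t, schedule=ot_fm):
    # Specific to the OT-FM schedule.
    return 1.0, -t * SIGMA_D

def g_theta(x_t, s, t, net, coeffs=simple_edm_coeffs, schedule=ot_fm):
    # g_theta(x_t, s, t) = c_skip(t) * x_t + c_out(t) * G_theta(c_in(t) * x_t, s, t).
    c_skip, c_out = coeffs(t, schedule)
    return c_skip * x_t + c_out * net(c_in(t, schedule) * x_t, s, t)

if __name__ == "__main__":
    dummy_net = lambda x, s, t: x   # stand-in for G_theta
    print(g_theta(np.ones(4), s=0.1, t=0.7, net=dummy_net))
```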
We show in Appendix C.5 that f^θ_{s,t}(x_t) similarly follows the EDM parameterization of the form f^θ_{s,t}(x_t) = c_skip(s, t) x_t + c_out(s, t) G_θ(c_in(t) x_t, c_noise(s), c_noise(t)).

Noise conditioning c_noise(·). We choose c_noise(t) = ct for some constant c ≥ 1. We find our model convergence relatively insensitive to c but recommend using a larger c, e.g. 1000 (Song et al., 2020b; Peebles & Xie, 2023), because it enables sufficient distinction between nearby r and t.

Mapping function r(s, t). We find that r(s, t) via constant decrement in η_t = σ_t/α_t works well, where the decrement is chosen in the form (η_max − η_min)/2^k for some appropriate k (details in Appendix C.7).
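A sketch of the mapping function described above: r(s, t) takes a constant decrement of size (η_max − η_min)/2^k in η_t = σ_t/α_t, truncated so that r never drops below s, mirroring the well-conditioned form r(s, t) = max(s, t − ∆(t)) in Appendix B.3. The OT-FM schedule (which admits a closed-form inverse of η), the value of k, and the end-point times are illustrative assumptions.

```python
import numpy as np

def eta(t):
    # eta_t = sigma_t / alpha_t for the OT-FM schedule (alpha_t = 1 - t, sigma_t = t).
    return t / (1.0 - t)

def eta_inv(e):
    # Inverse of eta_t for OT-FM: t = eta / (1 + eta).
    return e / (1.0 + e)

def r_mapping(s, t, k=12, t_min=1e-3, t_max=0.994):
    # Constant decrement in eta-space with gap (eta_max - eta_min) / 2^k,
    # clamped at eta(t_min) and truncated at s so that s <= r(s, t) < t.
    gap = (eta(t_max) - eta(t_min)) / 2 ** k
    r = eta_inv(max(eta(t) - gap, eta(t_min)))
    return max(s, r)

if __name__ == "__main__":
    for t in (0.2, 0.6, 0.994):
        print(t, r_mapping(s=0.0, t=t))
```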
Kernel function. We use time-dependent Laplace kernels of the form k_{s,t}(x, y) = exp(−w̃(s, t) max(∥x − y∥₂, ϵ)/D) for x, y ∈ R^D, some ϵ > 0 to avoid undefined gradients, and w̃(s, t) = 1/c_out(s, t). We find Laplace kernels provide better gradient signals than RBF kernels (see Appendix C.8).

Weighting w(s, t) and distribution p(s, t). We follow VDM (Kingma et al., 2021; Kingma & Gao, 2024) and define p(t) = U(ϵ, T) and p(s|t) = U(ϵ, t) for constants ϵ, T ∈ [0, 1]. Similarly, the weighting is defined as

w(s, t) = (1/2) σ(b − λ_t) (−dλ_t/dt) · α_t^a / (α_t² + σ_t²)   (13)

where σ(·) is the sigmoid function, λ_t denotes the log-SNR at time t, and a ∈ {1, 2}, b ∈ R are constants (see Appendix C.9).

Algorithm 1 Training (see Appendix D for full version)
  Input: parameters θ, DDIM(x_t, x, s, t), B, M, p
  Output: learned θ
  while model not converged do
    Sample data x, label c, and prior ϵ with batch size B and split into B/M groups. Each group shares a (s, r, t) sample.
    For each group, x_t ← DDIM(ϵ, x, t, 1).
    For each group, x_r ← DDIM(x_t, x, r, t).
    For each instance, set c = ∅ with prob. p.
    Minimize the empirical loss L̂_IMM(θ) in Eq. (67).
  end while

Algorithm 2 Pushforward Sampling (details in Appendix F)
  Input: model f^θ, {t_i}_{i=0}^N, N(0, σ_d² I), (optional) w
  Output: x_{t_0}
  Sample x_{t_N} ∼ N(0, σ_d² I)
  for i = N, . . . , 1 do
    x_{t_{i−1}} ← f^θ_{t_{i−1},t_i}(x_{t_i}) or f^θ_{t_{i−1},t_i,w}(x_{t_i}).
  end for
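To complement Algorithm 1, the sketch below spells out how one group of the empirical objective behaves, following Eq. (12): the group shares one (s, r, t) triple, y_{s,t} comes from the online model, y_{s,r} from the stop-gradient model applied to the reused x_r, and the M × M kernel matrices are combined in the usual MMD² form with a Laplace kernel. It is written with plain numpy and stand-in arrays so it runs on its own; a real implementation would use a neural network, the full B/M grouping, the w̃(s, t) = 1/c_out(s, t) bandwidth, and the stop-gradient semantics of an autodiff framework, all of which are assumed rather than shown.

```python
import numpy as np

def laplace_kernel(a, b, scale, eps=1e-8):
    # Time-dependent Laplace kernel from Section 4.2:
    # k(a, b) = exp(-scale * max(||a - b||_2, eps) / D), computed pairwise.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-scale * np.maximum(d, eps) / a.shape[-1])

def group_loss(y_st, y_sr, weight, scale):
    # One group's contribution to the Eq. (12) estimate: y_st are M samples
    # f_{s,t}(x_t) from the online model, y_sr are M samples f_{s,r}(x_r) from
    # the stop-gradient model; the kernel matrices give the MMD^2 combination.
    k_tt = laplace_kernel(y_st, y_st, scale).mean()
    k_rr = laplace_kernel(y_sr, y_sr, scale).mean()
    k_tr = laplace_kernel(y_st, y_sr, scale).mean()
    return weight * (k_tt + k_rr - 2.0 * k_tr)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, D = 4, 8
    y_st = rng.normal(size=(M, D))                    # stand-in for online f_{s,t}(x_t)
    y_sr = y_st + 0.05 * rng.normal(size=(M, D))      # stand-in for stop-grad f_{s,r}(x_r)
    print(group_loss(y_st, y_sr, weight=1.0, scale=2.0))
```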
4.3. Sampling

Pushforward sampling. A sample x_s can be obtained by directly pushing x_t ∼ q_t(x_t) through f^θ_{s,t}(x_t). This can be iterated for an arbitrary number of steps starting from ϵ ∼ N(0, σ_d² I) until s = 0. We note that, by definition, one application of f^θ_{s,t}(x_t) is equivalent to one DDIM step using the learned network g_θ(x_t, s, t) as the x-prediction. This sampler can then be viewed as a few-step DDIM sampler in which g_θ(x_t, s, t) outputs a realistic sample x instead of its expectation E_x[x | x_t] as in diffusion models.

Restart sampling. Similar to Xu et al. (2023); Song et al. (2023), one can introduce stochasticity during sampling by re-noising a sample to a higher noise level before sampling again. For example, a two-step restart sampler from x_t requires s ∈ (0, t) for drawing sample x̂ = f^θ_{0,s}(x_s), where x_s ∼ q_s(x_s | f^θ_{0,t}(x_t)).

Classifier-free guidance. Given a data-label pair (x, c), at inference time classifier-free guidance (Ho & Salimans, 2022) with weight w replaces the conditional model output G_θ(x_t, s, t, c) by the reweighted model output

w G_θ(x_t, s, t, c) + (1 − w) G_θ(x_t, s, t, ∅)   (14)

where ∅ denotes the null token indicating unconditional output. Similarly, we define our guided model as f^θ_{s,t,w}(x_t) = c_skip(s, t) x_t + c_out(s, t) G^w_θ(x_t, s, t, c), where G^w_θ(x_t, s, t, c) is as defined in Eq. (14) and we drop c_in(·) and c_noise(·) for notational simplicity. We justify this decision in Appendix E. Similar to diffusion models, c is randomly dropped with probability p during training without special practices.

We present pushforward sampling in Algorithm 2 and detail both samplers in Appendix F.
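A sketch of the pushforward sampler in Algorithm 2 with the classifier-free-guidance combination of Eq. (14). The "model" here is a stand-in callable g(x_t, s, t, c) returning an x-prediction; because the guidance weights sum to one, combining x-predictions is equivalent to combining G_θ outputs as in Eq. (14). The toy network, time grid, and σ_d value are placeholders, not the paper's settings.

```python
import numpy as np

SIGMA_D = 0.5  # data standard deviation (placeholder)

def alpha_sigma(t):                       # OT-FM schedule (illustrative)
    return 1.0 - t, t

def ddim(x_t, x, s, t):                   # Eq. (11)
    a_s, s_s = alpha_sigma(s)
    a_t, s_t = alpha_sigma(t)
    return (a_s - s_s / s_t * a_t) * x + s_s / s_t * x_t

def f_step(g, x_t, s, t, c=None, w=None):
    # One application of f_{s,t}(x_t) = DDIM(x_t, g(x_t, s, t), s, t), with the
    # Eq. (14)-style combination applied at the x-prediction level when w is given.
    if w is None or c is None:
        x_pred = g(x_t, s, t, c)
    else:
        x_pred = w * g(x_t, s, t, c) + (1.0 - w) * g(x_t, s, t, None)
    return ddim(x_t, x_pred, s, t)

def pushforward_sample(g, times, shape, c=None, w=None, rng=None):
    # Algorithm 2: start from x ~ N(0, sigma_d^2 I) at times[0] and step down the grid.
    rng = rng or np.random.default_rng()
    x = SIGMA_D * rng.normal(size=shape)
    for t, s in zip(times[:-1], times[1:]):
        x = f_step(g, x, s, t, c=c, w=w)
    return x

if __name__ == "__main__":
    toy_g = lambda x_t, s, t, c: np.zeros_like(x_t)   # dummy x-prediction network
    times = [0.994, 0.75, 0.5, 0.25, 0.0]             # illustrative 4-step grid
    print(pushforward_sample(toy_g, times, shape=(2, 3), w=1.5, c=1).shape)
```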
5. Connection with Prior Works

Our work is closely connected with many prior works. Detailed analysis is found in Appendix G.

Consistency Models. Consistency models (CMs) (Song et al., 2023; Song & Dhariwal, 2023; Lu & Song, 2024) use a network g_θ(x_t, t) that outputs clean data given noisy input x_t. They require point-wise consistency g_θ(x_t, t) = g_θ(x_r, r) for any r < t, where x_r is obtained via an ODE solver from x_t using either a pretrained model or ground-truth data. Discrete-time CMs must satisfy g_θ(x_0, 0) = x_0 and train via the loss E_{x_t,x,t}[d(g_θ(x_t, t), g_{θ−}(x_r, r))], where d(·, ·) is commonly chosen as L2 or LPIPS (Zhang et al., 2018).

We show in the following Lemma that the CM objective with L2 distance is a single-particle estimate of the IMM objective with the energy kernel.

Lemma 1. When x_t = x′_t, x_r = x′_r, k(x, y) = −∥x − y∥², and s > 0 is a small constant, Eq. (12) reduces to the CM loss E_{x_t,x,t}[w(t)∥g_θ(x_t, t) − g_θ(x_r, r)∥²] + C for a valid r(t) < t and some constant C.

Notice that a single-particle estimate ignores the repulsion force imposed by k(·, ·), thus further biasing the estimate of MMD. The energy kernel also only matches the first moment, ignoring all higher moments. These decisions can be significant contributors to the training instability and performance degradation of CMs.

Improved CMs (Song & Dhariwal, 2023) propose the pseudo-Huber loss as d(·, ·), which we justify in the Lemma below.

Lemma 2. The negative pseudo-Huber loss k_c(x, y) = c − √(∥x − y∥² + c²) for c > 0 is a conditionally positive definite kernel that matches all moments of x and y, where the weights on higher moments depend on c.

From a moment-matching perspective, the improved performance is explained by the loss matching all moments of the distributions. In addition to the pseudo-Huber loss, many other kernels (Laplace, RBF, etc.) are all valid choices in the design space.

We also extend the IMM loss to the differential limit by taking r(s, t) → t, the result of which subsumes continuous-time CM (Lu & Song, 2024) as a single-particle estimate (Appendix H). We leave experiments for this to future work.

Diffusion GAN and Adversarial Consistency Distillation. Diffusion GAN (Xiao et al., 2021) parameterizes the generative distribution as p^θ_{s|t}(x_s | x_t) = ∫∫ q_{s|t}(x_s | x, x_t) δ(x − G_θ(x_t, z, t)) p(z) dz dx for s a fixed decrement from t and p(z) a noise distribution. It defines the interpolant q_{s|t}(x_s | x, x_t) as the DDPM posterior distribution, which is self-consistent (see Appendix G.2), and introduces randomness to the sampling process to match q_t(x | x_t) instead of the marginal. Both Diffusion GAN and Adversarial Consistency Distillation (Sauer et al., 2025) use a GAN objective, which shares similarity with MMD in that MMD is an integral probability metric whose optimal discriminator is chosen in an RKHS. This eliminates the need for explicit adversarial optimization of a neural-network discriminator.

Generative Moment Matching Network. GMMN (Li et al., 2015) directly applies MMD to train a generator G_θ(z), where z ∼ N(0, I), to match the data distribution. It is a special case of IMM in that when t = 1 and r(s, t) ≡ s = 0 our loss reduces to the naïve GMMN objective.

6. Related Works

Diffusion, Flow Matching, and stochastic interpolants. Diffusion models (Sohl-Dickstein et al., 2015; Song et al., 2020b; Ho et al., 2020; Kingma et al., 2021) and Flow Matching (Lipman et al., 2022; Liu et al., 2022) are widely used generative frameworks that learn a score or velocity field of a noising process from data into a simple prior. They have been scaled successfully for text-to-image (Rombach et al., 2022; Saharia et al., 2022; Podell et al., 2023; Chen et al., 2023; Esser et al., 2024) and text-to-video (Ho et al., 2022a; Blattmann et al., 2023; OpenAI, 2024) tasks. Stochastic interpolants (Albergo et al., 2023; Albergo & Vanden-Eijnden, 2022) extend these ideas by explicitly defining a stochastic path between data and prior, then matching its velocity to facilitate distribution transfer. While IMM builds on top of the interpolant construction, it directly learns one-step mappings between any intermediate marginal distributions.

Diffusion distillation. To resolve diffusion models' sampling inefficiency, recent methods (Salimans & Ho, 2022; Meng et al., 2023; Yin et al., 2024; Zhou et al., 2024; Luo et al., 2024a; Heek et al., 2024) focus on distilling one-step or few-step models from pre-trained diffusion models. Some approaches (Yin et al., 2024; Zhou et al., 2024) propose jointly optimizing two networks, but the training requires careful tuning in practice and can lead to mode collapse (Yin et al., 2024). Another recent work (Salimans et al., 2024) explicitly matches the first moment of the data distribution available from pre-trained diffusion models. In contrast, our method implicitly matches all moments using MMD and can be trained from scratch with a single model.

Few-step generative models from scratch. Early one-step generative models primarily relied on GANs (Goodfellow et al., 2020; Karras et al., 2020; Brock, 2018) and MMD (Li et al., 2015; 2017) (or their combination), but scaling adversarial training remains challenging. Recent independent classes of few-step models, e.g. Consistency Models (CMs) (Song et al., 2023; Song & Dhariwal, 2023; Lu & Song, 2024), Consistency Trajectory Models (CTMs) (Kim et al., 2023; Heek et al., 2024), and Shortcut Models (SMs) (Frans et al., 2024), still face training instability and require specialized components (Lu & Song, 2024) (e.g., JVP for flash attention) or other special practices (e.g., high weight decay for SMs, combined LPIPS (Zhang et al., 2018) and GAN losses for CTMs, and special training schedules (Geng et al., 2024)) to remain stable. In contrast, our method trains stably with a single loss and achieves strong performance without special training practices.

7. Experiments

We evaluate IMM's empirical performance (Section 7.1), training stability (Section 7.2), sampling choices (Section 7.3), and scaling behavior (Section 7.4), and ablate our practical decisions (Section 7.5).

7.1. Image Generation

We present FID (Heusel et al., 2017) results for unconditional CIFAR-10 and class-conditional ImageNet-256×256 in Tables 1 and 2. For CIFAR-10, we separate baselines into diffusion and flow models, distillation models, and few-step models trained from scratch.
Table 1. CIFAR-10 results trained without label conditions.

Family | Method | FID (↓) | Steps (↓)
Diffusion & Flow | DDPM (Ho et al., 2020) | 3.17 | 1000
 | DDPM++ (Song et al., 2020b) | 3.16 | 1000
 | NCSN++ (Song et al., 2020b) | 2.38 | 1000
 | DPM-Solver (Lu et al., 2022) | 4.70 | 10
 | iDDPM (Nichol & Dhariwal, 2021) | 2.90 | 4000
 | EDM (Karras et al., 2022) | 2.05 | 35
 | Flow Matching (Lipman et al., 2022) | 6.35 | 142
 | Rectified Flow (Liu et al., 2022) | 2.58 | 127
Few-Step via Distillation | PD (Salimans & Ho, 2022) | 4.51 | 2
 | 2-Rectified Flow (Salimans & Ho, 2022) | 4.85 | 1
 | DFNO (Zheng et al., 2023) | 3.78 | 1
 | KD (Luhman & Luhman, 2021) | 9.36 | 1
 | TRACT (Berthelot et al., 2023) | 3.32 | 2
 | Diff-Instruct (Luo et al., 2024a) | 5.57 | 1
 | PID (LPIPS) (Tee et al., 2024) | 3.92 | 1
 | DMD (Yin et al., 2024) | 3.77 | 1
 | CD (LPIPS) (Song et al., 2023) | 2.93 | 2
 | CTM (w/ GAN) (Kim et al., 2023) | 1.87 | 2
 | SiD (Zhou et al., 2024) | 1.92 | 1
 | SiM (Luo et al., 2024b) | 2.06 | 1
 | sCD (Lu & Song, 2024) | 2.52 | 2
Few-Step from Scratch | iCT (Song & Dhariwal, 2023) | 2.83 | 1
 | | 2.46 | 2
 | ECT (Geng et al., 2024) | 3.60 | 1
 | | 2.11 | 2
 | sCT (Lu & Song, 2024) | 2.97 | 1
 | | 2.06 | 2
 | IMM (ours) | 3.20 | 1
 | | 1.98 | 2

Table 2. Class-conditional ImageNet-256×256 results.

Family | Method | FID (↓) | Steps (↓) | #Params
GAN | BigGAN (Brock, 2018) | 6.95 | 1 | 112M
 | GigaGAN (Kang et al., 2023) | 3.45 | 1 | 569M
 | StyleGAN-XL (Karras et al., 2020) | 2.30 | 1 | 166M
Masked & AR | VQGAN (Esser et al., 2021) | 26.52 | 1024 | 227M
 | MaskGIT (Chang et al., 2022) | 6.18 | 8 | 227M
 | MAR (Li et al., 2024) | 1.98 | 100 | 400M
 | VAR-d20 (Tian et al., 2024a) | 2.57 | 10 | 600M
 | VAR-d30 (Tian et al., 2024a) | 1.92 | 10 | 2B
Diffusion & Flow | ADM (Dhariwal & Nichol, 2021) | 10.94 | 250 | 554M
 | CDM (Ho et al., 2022b) | 4.88 | 8100 | -
 | SimDiff (Hoogeboom et al., 2023) | 2.77 | 512 | 2B
 | LDM-4-G (Rombach et al., 2022) | 3.60 | 250 | 400M
 | U-DiT-L (Tian et al., 2024b) | 3.37 | 250 | 916M
 | DiT-XL/2 (w = 1.0) (Peebles & Xie, 2023) | 9.62 | 250 | 675M
 | DiT-XL/2 (w = 1.25) (Peebles & Xie, 2023) | 3.22 | 250 | 675M
 | DiT-XL/2 (w = 1.5) (Peebles & Xie, 2023) | 2.27 | 250 | 675M
 | SiT-XL/2 (w = 1.0) (Ma et al., 2024) | 9.35 | 250 | 675M
 | SiT-XL/2 (w = 1.5) (Ma et al., 2024) | 2.15 | 250 | 675M
Few-Step from Scratch | iCT (Song et al., 2023) | 34.24 | 1 | 675M
 | | 20.3 | 2 | 675M
 | Shortcut (Frans et al., 2024) | 10.60 | 1 | 675M
 | | 7.80 | 4 | 675M
 | | 3.80 | 128 | 675M
 | IMM (ours) (XL/2, w = 1.25) | 7.77 | 1 | 675M
 | | 5.33 | 2 | 675M
 | | 3.66 | 4 | 675M
 | | 2.77 | 8 | 675M
 | IMM (ours) (XL/2, w = 1.5) | 8.05 | 1 | 675M
 | | 3.99 | 2 | 675M
 | | 2.51 | 4 | 675M
 | | 1.99 | 8 | 675M

Figure 4. FID convergence for different embeddings (left). CIFAR-10 samples from Fourier embedding (scale = 16) (right).
Figure 5. More particles indicate more stable training on ImageNet-256×256.
Figure 6. ImageNet-256×256 FID with different sampler types.

IMM belongs to the last category, in which it achieves state-of-the-art performance of 1.98 FID using the pushforward sampler. For ImageNet-256×256, we use the popular DiT (Peebles & Xie, 2023) architecture because of its scalability, and compare with GANs, masked and autoregressive models, diffusion and flow models, and few-step models trained from scratch.

We observe decreasing FID with more steps, and IMM achieves 1.99 FID with 8 steps (with w = 1.5), surpassing DiT and SiT (Ma et al., 2024) using the same architecture except for trivially injecting time s (see Appendix I). Notably, we also achieve better 8-step FID than the 10-step VAR (Tian et al., 2024a) of comparable size and approach the performance of its 2B-parameter variant. However, different from VAR, IMM grants the flexibility of a variable number of inference steps, thus allowing scaling at inference time. IMM also matches MAR's (Li et al., 2024) performance while achieving a 10× speedup, once again demonstrating its efficient inference-time scaling capability. Lastly, we similarly surpass Shortcut models' (Frans et al., 2024) best performance with only 8 steps. We defer inference details to Section 7.3 and Appendix I.2.

7.2. IMM Training is Stable

We show that IMM is stable and achieves reasonable performance across a range of parameterization choices.

Positional vs. Fourier embedding. A known issue for CMs (Song et al., 2023) is their training instability when using Fourier embeddings with scale 16, which forces reliance on positional embeddings for stability. We find that IMM does not face this problem (see Figure 4). For Fourier embeddings we use the standard NCSN++ (Song et al., 2020b) architecture and set the embedding scale to 16; for positional embeddings, we adopt DDPM++ (Song et al., 2020b). Both embedding types converge reliably, and we include samples from the Fourier embedding model in Figure 4.

Particle number. The particle number M for estimating MMD is an important parameter for empirical success (Gretton et al., 2012; Li et al., 2015), where the estimate is more accurate with larger M. In our case, naïvely increasing M can slow down convergence because we have a fixed batch size B in which the samples are grouped into B/M groups of M, where each group shares the same t. A larger M means that fewer t's are sampled. On the other hand, using extremely small numbers of particles, e.g. M = 2, leads to training instability and performance degradation, especially at large scale with DiT architectures. We find that there exists a sweet spot where a few particles effectively help with training stability while further increasing M slows down convergence (see Figure 5). We see that on ImageNet-256×256, training collapses when M = 1 (which is CM) and M = 2, and achieves the lowest FID under the same computation budget with M = 4. We hypothesize that M < 4 does not allow sufficient mixing between particles, while larger M means fewer t's are sampled for each step, thus slowing convergence. A general rule of thumb is to use an M large enough for stability, but not so large that convergence slows.
Figure 7. IMM scales with both training and inference compute, and exhibits strong correlation between model size and performance.

Figure 8. Sample visual quality increases with increase in both model size and sampling compute.

Figure 9. Log distance in embedding space for c_noise(t) = ct (left). Similar ImageNet-256×256 convergence across different c (right).

Noise embedding c_noise(·). We plot in Figure 9 the log absolute mean difference of t and r(s, t) in the positional embedding space. Increasing c increases the distinguishability of nearby distributions. We also observe similar convergence on ImageNet-256×256 across different c, demonstrating the insensitivity of our framework w.r.t. the noise function.

7.3. Sampling

We investigate different sampling settings for best performance. One-step sampling is performed by a simple pushforward from T to ϵ (concrete values in Appendix I.2). On CIFAR-10 we use 2 steps and set the intermediate time t_1 such that η_{t_1} = 1.4, a choice we find to work well empirically. On ImageNet-256×256 we go beyond 2 steps and, for simplicity, investigate (1) uniform decrement in t and (2) the EDM (Karras et al., 2024) schedule (detailed in Appendix I.2). We plot the FID of all sampler settings in Figure 6 with guidance weight w = 1.5.
Table 3. FID results with different flow schedules and network parameterization.

 | id/cos | id/FM | sEDM/cos | sEDM/FM | eFM
CIFAR-10 | 3.77 | 3.45 | 2.39 | 2.10 | 2.53
ImageNet-256×256 | 46.44 | 47.32 | 27.33 | 28.67 | 27.01

Table 4. Ablation of weighting function w(s, t) on ImageNet-256×256.

Weighting | FID-50k
w(s, t) = 1 | 40.19
+ ELBO weight | 96.43
+ α_t | 33.44
+ 1/(α_t² + σ_t²) | 27.43

Figure 10. ImageNet-256×256 FID progression with different t, r gap with M = 4.
Figure 11. FID progression on different types of mapping function r(s, t) for CIFAR-10 (left) and ImageNet-256×256 (right).

We find pushforward samplers with the uniform schedule to work the best on ImageNet-256×256 and use this as our default setting for multi-step generation. Additionally, we concede that pushforward combined with restart samplers can achieve superior results. We leave such experiments to future work.

7.4. Scaling Behavior

Similar to diffusion models, IMM scales with training and inference compute as well as model size on ImageNet-256×256. We plot in Figure 7 FID vs. training and inference compute in GFLOPs and find a strong correlation between compute used and performance. We further visualize samples in Figure 8 with increasing model size, i.e. DiT-S, DiT-B, DiT-L, DiT-XL, and increasing inference steps, i.e. 1, 2, 4, 8 steps. The sample quality increases along both axes as larger transformers with more inference steps capture more complex distributions. This also explains why more compute can sometimes yield different visual content from the same initial noise, as shown in the visual results.

7.5. Ablation Studies

All ablation studies are done with the DDPM++ architecture for CIFAR-10 and DiT-B for ImageNet-256×256. FID comparisons use 2-step samplers by default.

Flow schedules and parameterization. We investigate all combinations of network parameterization and flow schedules: Simple-EDM + cosine (sEDM/cos), Simple-EDM + OT-FM (sEDM/FM), Euler-FM + OT-FM (eFM), Identity + cosine (id/cos), and Identity + OT-FM (id/FM). The Identity parameterization consistently falls behind the other parameterizations, which all show similar performance across datasets (see Table 3). We see that on a smaller scale (CIFAR-10) sEDM/FM works the best, but on a larger scale (ImageNet-256×256) eFM works the best, indicating that the OT-FM schedule and Euler parameterization may be more scalable than other choices.

Mapping function r(s, t). Our choices for ablation are (1) constant decrement in η_t, (2) constant decrement in t, (3) constant decrement in λ_t = log(α_t²/σ_t²), and (4) constant increment in 1/η_t (see Appendix C.6). For a fair comparison, we choose the decrement gap so that the minimum t − r(s, t) is ≈ 10⁻³ and use the same network parameterization. The FID progression in Figure 11 shows that (1) consistently outperforms the other choices. We additionally ablate the mapping gap using M = 4 in (1). The constant decrement is of the form (η_max − η_min)/2^k for an appropriately chosen k. We show in Figure 10 that the performance is relatively stable across k ∈ {11, 12, 13} but experiences instability for k = 14. This suggests that, for a given particle number, there exists a largest k for stable optimization.

Weighting function. In Table 4 we first ablate the weighting factors in three groups: (1) the VDM ELBO factor ½σ(b − λ_t)(−dλ_t/dt), (2) the weighting α_t (i.e. when a = 1), and (3) the weighting 1/(α_t² + σ_t²). We find it necessary to use α_t jointly with the ELBO weighting because it converts the v-prediction network to an ϵ-prediction parameterization (see Appendix C.9), consistent with the diffusion ELBO objective. The factor 1/(α_t² + σ_t²), which upweights middle time-steps, further boosts performance, a helpful practice also known for FM training (Esser et al., 2024). We leave additional study of the exponent a to Appendix I.4 and find that a = 2 emphasizes optimizing the loss when t is small while a = 1 distributes weights more equally to larger t. As a result, a = 2 achieves higher-quality multi-step generation than a = 1.

8. Conclusion

We present Inductive Moment Matching, a framework that learns a few-step generative model from scratch. It trains by first leveraging self-consistent interpolants to interpolate between data and prior and then matching all moments of its own distribution interpolated to be closer to that of the data. Our method guarantees convergence in distribution and generalizes many prior works. It also achieves state-of-the-art performance across benchmarks while achieving orders-of-magnitude faster inference. We hope it provides a new perspective on training few-step models from scratch and inspires a new generation of generative models.
9. Impact Statement

This paper advances research in diffusion models and generative AI, which enable new creative possibilities and democratize content creation but also raise important considerations. Potential benefits include expanding artistic expression, assisting content creators, and generating synthetic data for research. However, we acknowledge challenges around potential misuse for deepfakes, copyright concerns, and impacts on creative industries. While our work aims to advance technical capabilities, we encourage ongoing discussion about responsible development and deployment of these technologies.

10. Acknowledgement

We thank Wanqiao Xu, Bokui Shen, Connor Lin, and Samrath Sinha for additional technical discussions and helpful suggestions.

References

Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
Auffray, Y. and Barbillon, P. Conditionally positive definite kernels: theoretical contribution, application to interpolation and approximation. PhD thesis, INRIA, 2009.
Berthelot, D., Autef, A., Lin, J., Yap, D. A., Zhai, S., Hu, S., Zheng, D., Talbott, W., and Gu, E. TRACT: Denoising diffusion models with transitive closure time-distillation. arXiv preprint arXiv:2303.04248, 2023.
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
Brock, A. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325, 2022.
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883, 2021.
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
Frans, K., Hafner, D., Levine, S., and Abbeel, P. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024.
Geng, Z., Pokle, A., Luo, W., Lin, J., and Kolter, J. Z. Consistency models made easy. arXiv preprint arXiv:2406.14548, 2024.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
Heek, J., Hoogeboom, E., and Salimans, T. Multistep consistency models. arXiv preprint arXiv:2403.06807, 2024.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022b.
Hoogeboom, E., Heek, J., and Salimans, T. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pp. 13213–13232. PMLR, 2023.
Kang, M., Zhu, J.-Y., Zhang, R., Park, J., Shechtman, E., Paris, S., and Park, T. Scaling up GANs for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10124–10134, 2023.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020.
Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., and Laine, S. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24174–24184, 2024.
Kim, D., Lai, C.-H., Liao, W.-H., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., and Ermon, S. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.
Kingma, D. and Gao, R. Understanding diffusion objectives as the ELBO with simple data augmentation. Advances in Neural Information Processing Systems, 36, 2024.
Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. Advances in Neural Information Processing Systems, 30, 2017.
Li, T., Tian, Y., Li, H., Deng, M., and He, K. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024.
Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727. PMLR, 2015.
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., and Plumbley, M. D. AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023.
Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
Lu, C. and Song, Y. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024.
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
Luhman, E. and Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.
Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., and Zhang, Z. Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems, 36, 2024a.
Luo, W., Huang, Z., Geng, Z., Kolter, J. Z., and Qi, G.-j. One-step diffusion distillation through score implicit matching. arXiv preprint arXiv:2410.16794, 2024b.
Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., and Xie, S. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.
Meng, C., Rombach, R., Gao, R., Kingma, D., Ermon, S., Ho, J., and Salimans, T. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14297–14306, 2023.
Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
OpenAI. Video generation models as world simulators. https://openai.com/sora/, 2024.
Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
Salimans, T., Mensink, T., Heek, J., and Hoogeboom, E. Multistep distillation of diffusion models via moment matching. arXiv preprint arXiv:2406.04103, 2024.
Sauer, A., Lorenz, D., Blattmann, A., and Rombach, R. Adversarial diffusion distillation. In European Conference on Computer Vision, pp. 87–103. Springer, 2025.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
Song, Y. and Dhariwal, P. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.
Tee, J. T. J., Zhang, K., Yoon, H. S., Gowda, D. N., Kim, C., and Yoo, C. D. Physics informed distillation for diffusion models. arXiv preprint arXiv:2411.08378, 2024.
Tian, K., Jiang, Y., Yuan, Z., Peng, B., and Wang, L. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024a.
Tian, Y., Tu, Z., Chen, H., Hu, J., Xu, C., and Wang, Y. U-DiTs: Downsample tokens in U-shaped diffusion transformers. arXiv preprint arXiv:2405.02730, 2024b.
Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion GANs. arXiv preprint arXiv:2112.07804, 2021.
Xu, Y., Deng, M., Cheng, X., Tian, Y., Liu, Z., and Jaakkola, T. Restart sampling for improving generative processes. Advances in Neural Information Processing Systems, 36:76806–76838, 2023.
Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W. T., and Park, T. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6613–6623, 2024.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.
Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., and Anandkumar, A. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pp. 42390–42402. PMLR, 2023.
Zhou, M., Zheng, H., Wang, Z., Yin, M., and Huang, H. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024.
A. Background: Properties of Stochastic Interpolants

We note some relevant properties of stochastic interpolants for our exposition.

Boundary satisfaction. For an interpolant distribution q_t(x_t | x, ϵ) defined in Albergo et al. (2023), and the marginal q_t(x_t) as defined in Eq. (2), we can check that q_1(x_1) = p(x_1) and q_0(x_0) = q(x_0), so that x_1 = ϵ and x_0 = x.

q_1(x_1) = ∫∫ q_1(x_1 | x, ϵ) q(x) p(ϵ) dx dϵ   (15)
        = ∫∫ δ(x_1 − ϵ) q(x) p(ϵ) dx dϵ   (16)
        = ∫ q(x) p(x_1) dx   (17)
        = p(x_1)   (18)

q_0(x_0) = ∫∫ q_0(x_0 | x, ϵ) q(x) p(ϵ) dx dϵ   (19)
        = ∫∫ δ(x_0 − x) q(x) p(ϵ) dx dϵ   (20)
        = ∫ q(x_0) p(ϵ) dϵ   (21)
        = q(x_0)   (22)

Joint distribution. The joint distribution of x and x_t is written as

q_t(x, x_t) = ∫ q_t(x_t | x, ϵ) q(x) p(ϵ) dϵ.   (23)

Independence of the joint at t = 1.

q_1(x, x_1) = ∫ q_1(x_1 | x, ϵ) q(x) p(ϵ) dϵ   (24)
           = ∫ δ(x_1 − ϵ) q(x) p(ϵ) dϵ   (25)
           = q(x) p(x_1)   (26)
           = q(x) p(ϵ)   (27)

in which case x_1 = ϵ.

B. Theorems and Derivations

B.1. Divergence Minimizer

Lemma 3. Assuming a marginal-preserving interpolant and metric D(·, ·), a minimizer θ* of Eq. (7) exists, i.e. p^{θ*}_{s|t}(x | x_t) = q_t(x | x_t), and the minimum is 0.

Proof. We directly substitute q_t(x | x_t) into the objective to check. First,

p^{θ*}_{s|t}(x_s) = ∫∫ q_{s|t}(x_s | x, x_t) p^{θ*}_{s|t}(x | x_t) q_t(x_t) dx_t dx   (28)
               = ∫∫ q_{s|t}(x_s | x, x_t) q_t(x | x_t) q_t(x_t) dx_t dx   (29)
               = q_s(x_s)   (30)   [equality (a)]
Inductive Moment Matching

where (a) is due to definition of marginal preservation. So the objective becomes


h i
θ∗
Es,t w(s, t)D(qs (xs ), ps|t (xs )) = Es,t [w(s, t)D(qs (xs ), qs (xs ))] = 0 (31)

In general, the minimizer qt (x|xt ) exists. However, this does not show that the minimizer is unique. In fact, the minimizer
is not unique in general because a deterministic minimizer can also exist under certain assumptions on the interpolant (see
Appendix B.6).

Failure Case without Marginal Preservation. We additionally show that the marginal-preservation property of the interpolant q_{s|t}(xs|x, xt) is important for the naïve objective in Eq. (7) to attain 0 loss (Lemma 3). Consider the failure case below, where the constructed interpolant is a generalized interpolant but not necessarily marginal-preserving. We show that there exists a t such that p^θ_{s|t}(xs) can never reach qs(xs) regardless of θ.
Proposition 2 (Example Failure Case). Let q(x) = δ(x), p(ϵ) = δ(ϵ − 1), and suppose an interpolant I_{s|t}(x, xt) = (1 − s/t)x + (s/t)xt and γ_{s|t} = (s/t)√(1 − s²/t²) 1(t < 1). Then D(qs(xs), p^θ_{s|t}(xs)) > 0 for all 0 < s < t < 1 regardless of the learned distribution p^θ_{s|t}(x|xt), for any metric D(·, ·).

Proof. This example first implies the learning target


qs(xs) = ∬ qs(xs|x, ϵ) q(x) p(ϵ) dx dϵ    (32)
       = ∬ δ(xs − ((1 − s)x + sϵ)) δ(x) δ(ϵ − 1) dx dϵ    (33)
       = δ(xs − s)    (34)

is a delta distribution. However, we show that if we select any t < 1 and 0 < s < t, pθs|t (xs ) can never be a delta distribution.
p^θ_{s|t}(xs) = ∬ q_{s|t}(xs|x, xt) p^θ_{s|t}(x|xt) qt(xt) dxt dx    (35)
             = ∬ N(xs; (1 − s/t)x + (s/t)xt, (s²/t²)(1 − s²/t²) I) p^θ_{s|t}(x|xt) δ(xt − t) dxt dx    (36)
             = ∫ N(xs; (1 − s/t)x + s, (s²/t²)(1 − s²/t²) I) p^θ_{s|t}(x|xt = t) dx    (37)

Now, we show the model distribution has non-zero variance under these choices of t and s. Expectations are over pθs|t (xs )
or conditional interpolant qs|t (xs |x, xt ) for all equations below.

Var(xs) = E[xs²] − E[xs]²    (38)
        = ∬ E[xs²|x, xt] p^θ_{s|t}(x|xt) qt(xt) dx dxt − E[xs]²    (39)
        = ∬ [Var(xs|x, xt) + E[xs|x, xt]²] p^θ_{s|t}(x|xt) qt(xt) dx dxt − E[xs]²    (40)
        = ∬ [Var(xs|x, xt) + E[xs|x, xt]²] p^θ_{s|t}(x|xt) qt(xt) dx dxt − 2E[xs]² + E[xs]²    (41)
        = ∬ [Var(xs|x, xt) + E[xs|x, xt]²] p^θ_{s|t}(x|xt) qt(xt) dx dxt − 2E[xs] ∬ E[xs|x, xt] p^θ_{s|t}(x|xt) qt(xt) dx dxt + E[xs]²    (42)
        = ∬ [ Var(xs|x, xt) + E[xs|x, xt]² − 2 E[xs] E[xs|x, xt] + E[xs]² ] p^θ_{s|t}(x|xt) qt(xt) dx dxt    (43)

where the bracketed integrand in (43) is denoted (a).


where (a) can be simplified as


Var(xs|x, xt) + (E[xs|x, xt] − E[xs])² > 0    (44)

because Var(xs |x, xt ) > 0 for all 0 < s < t < 1 due to its non-zero Gaussian noise. Therefore, Var(xs ) > 0, implying
pθs|t (xs ) can never be a delta function regardless of model pθs|t (x|xt ). A valid metric D(·, ·) over probability pθs|t (xs ) and
qs (xs ) implies
D(qs (xs ), pθs|t (xs )) = 0 ⇐⇒ pθs|t (xs ) = qs (xs )
which means
pθs|t (xs ) ̸= qs (xs ) =⇒ D(qs (xs ), pθs|t (xs )) > 0

B.2. Boundary Satisfaction of Model Distribution


The operator output pθs|t (xs ) satisfies boundary condition.
Lemma 4 (Boundary Condition). For all s ∈ [0, 1] and all θ, the following boundary condition holds.

qs (xs ) = pθs|s (xs ) (45)

Proof.
p^θ_{s|s}(xs) = ∬ q_{s|s}(xs|x, x̄s) p^θ_{s|s}(x|x̄s) q_{s,1}(x̄s) dx̄s dx,   where q_{s|s}(xs|x, x̄s) = δ(xs − x̄s)
             = ∫ p^θ_{s|s}(x|xs) qs(xs) dx
             = qs(xs) ∫ p^θ_{s|s}(x|xs) dx
             = qs(xs)

B.3. Definition of Well-Conditioned r(s, t)


For simplicity, the mapping function r(s, t) is well-conditioned if

r(s, t) = max(s, t − ∆(t)) (46)

where ∆(t) ≥ ϵ > 0 is a positive function such that r(s, t) is increasing for t ≥ s + c0 (s) where c0 (s) is the largest t that
is mapped to s. Formally, c0 (s) = sup{t : r(s, t) = s}. For t ≥ s + c0 (s), the inverse w.r.t. t exists, i.e. r−1 (s, ·) and
r−1 (s, r(s, t)) = t. All practical implementations follow this general form, and are detailed in Appendix C.6.

B.4. Main Theorem


Theorem 1. Assuming r(s, t) is well-conditioned, the interpolant is marginal-preserving, and θ*_n is a minimizer of Eq. (8) for each n with infinite data and network capacity, then for all t ∈ [0, 1], s ∈ [0, t],

lim_{n→∞} MMD²(qs(xs), p^{θ*_n}_{s|t}(xs)) = 0.    (9)

Proof. We prove by induction on the sequence number n. First, r(s, t) is well-conditioned, following the definition in Eq. (46). For notational convenience, we let r_n^{-1}(s, ·) := r^{-1}(s, r^{-1}(s, r^{-1}(s, ...))) denote n nested applications of r^{-1}(s, ·) on the second argument, with r_0^{-1}(s, t) = t.


Base case: n = 1. Given any s ≥ 0, r(s, u) = s for all s < u ≤ c_0(s), implying

MMD²(p^{θ*_0}_{s|s}(xs), p^{θ*_1}_{s|u}(xs)) =(a) MMD²(qs(xs), p^{θ*_1}_{s|u}(xs)) =(b) 0    (47)

for u ≤ c_0(s), where (a) is implied by Lemma 4 and (b) by Lemma 3.
Inductive assumption: n − 1. We assume p^{θ*_{n−1}}_{s|u}(xs) = qs(xs) for all s ≤ u ≤ r_{n−2}^{-1}(s, c_0(s)).

We inspect the target distribution p^{θ*_{n−1}}_{s|r(s,u)}(xs) in Eq. (8) when optimized on s ≤ u ≤ r_{n−1}^{-1}(s, c_0(s)). On this interval, applying r(s, ·) to the inequality gives s = r(s, s) ≤ r(s, u) ≤ r(s, r_{n−1}^{-1}(s, c_0(s))) = r_{n−2}^{-1}(s, c_0(s)) since r(s, ·) is increasing. By the inductive assumption, p^{θ*_{n−1}}_{s|r(s,u)}(xs) = qs(xs) for s ≤ r(s, u) ≤ r_{n−2}^{-1}(s, c_0(s)), so minimizing

E_{s,u}[ w(s, u) MMD²(p^{θ*_{n−1}}_{s|r(s,u)}(xs), p^{θ_n}_{s|u}(xs)) ]

on s ≤ u ≤ r_{n−1}^{-1}(s, c_0(s)) is equivalent to minimizing

E_{s,u}[ w(s, u) MMD²(qs(xs), p^{θ_n}_{s|u}(xs)) ]    (48)

for s ≤ u ≤ r_{n−1}^{-1}(s, c_0(s)). Lemma 3 implies that the minimum achieves p^{θ*_n}_{s|u}(xs) = qs(xs).

Lastly, taking n → ∞ implies lim_{n→∞} r_n^{-1}(s, s) = 1, so the induction covers the entire interval [s, 1] for each s. Therefore, lim_{n→∞} MMD(qs(xs), p^{θ*_n}_{s|t}(xs)) = 0 for all 0 ≤ s ≤ t ≤ 1.

B.5. Self-Consistency Implies Marginal Preservation


Without assuming marginal preservation, it is important to define the marginal distribution of xs under generalized
interpolants qs|t (xs |x, xt ) as
q_{s|t}(xs) = ∬ q_{s|t}(xs|x, xt) qt(x|xt) qt(xt) dxt dx    (49)

and we show that with self-consistent interpolants, this distribution is invariant of t, i.e. qs|t (xs ) = qs (xs ).
Lemma 5. If the interpolant qs|t (xs |x, xt ) is self-consistent, the marginal distribution qs|t (xs ) as defined in Eq. (49)
satisfies qs (xs ) = qs|t (xs ) for all t ∈ [s, 1].

Proof. For t ∈ [s, 1],


q_{s|t}(xs) = ∬ q_{s|t}(xs|x, xt) qt(x|xt) qt(xt) dxt dx    (50)
           = ∬∫ q_{s|t}(xs|x, xt) qt(xt) [ qt(xt|x, ϵ) q(x) p(ϵ) / qt(xt) ] dϵ dxt dx    (51)
           = ∬∫ q_{s|t}(xs|x, xt) q(x) qt(xt|x, ϵ) p(ϵ) dϵ dxt dx    (52)
           = ∬ q(x) p(ϵ) [ ∫ q_{s|t}(xs|x, xt) qt(xt|x, ϵ) dxt ] dϵ dx    (53)
           = ∬ q(x) p(ϵ) [ ∫ q_{s|t}(xs|x, xt) q_{t,1}(xt|x, ϵ) dxt ] dϵ dx    (54)
           =(a) ∬ q_{s,1}(xs|x, ϵ) q(x) p(ϵ) dϵ dx    (55)
           =(b) ∬ qs(xs|x, ϵ) q(x) p(ϵ) dϵ dx    (56)
           = qs(xs)    (57)
where (a) uses definition of self-consistent interpolants and (b) uses definition of our generalized interpolant.


We show in Appendix C.1 that DDIM is an example of a self-consistent interpolant. Furthermore, the DDPM posterior (Ho et al., 2020; Kingma et al., 2021) is also self-consistent (see Lemma 6).

B.6. Existence of Deterministic Minimizer


We present the formal statement for the deterministic minimizer.
Proposition 3. If for all t ∈ [0, 1], s ∈ [0, t], γs|t ≡ 0, Is|t (x, xt ) is invertible w.r.t. x, and there exists C1 < ∞ such that
It|1 (x, ϵ) < C1 ∥x − ϵ∥, then there exists a function hs|t : RD → RD such that
ZZ
qs (xs ) = qs|t (xs |x, xt )δ(x − hs|t (xt ))qt (xt )dxdxt . (58)

Proof. Let I^{-1}_{s|t}(·, xt) be the inverse of I_{s|t} w.r.t. x, such that I^{-1}_{s|t}(I_{s|t}(x, xt), xt) = x and I_{s|t}(I^{-1}_{s|t}(y, xt), xt) = y for all x, xt, y ∈ R^D. Since there exists C1 < ∞ such that I_{t|1}(x, ϵ) < C1 ∥x − ϵ∥ for all t ∈ [0, 1], the PF-ODE of the original interpolant I_{t|1}(x, ϵ) = I_t(x, ϵ) exists for all t ∈ [0, 1] (Albergo et al., 2023). Then, for all t ∈ [0, 1], s ∈ [0, t], we let

ĥ_{s|t}(xt) = xt + ∫_t^s E_{x,ϵ}[ ∂/∂u I_u(x, ϵ) | x_u ] du,    (59)

which pushes forward the measure qt(xt) to qs(xs). We define

h_{s|t}(xt) = I^{-1}_{s|t}(ĥ_{s|t}(xt), xt).    (60)

Then, since γ_{s|t} ≡ 0, we have q_{s|t}(xs|x, xt) = δ(xs − I_{s|t}(x, xt)) with x ∼ δ(x − h_{s|t}(xt)). Therefore,

xs = I_{s|t}(h_{s|t}(xt), xt) = I_{s|t}(I^{-1}_{s|t}(ĥ_{s|t}(xt), xt), xt) = ĥ_{s|t}(xt),    (61)

whose marginal follows qs(xs), since it is the result of PF-ODE trajectories starting from qt(xt).

Concretely, the DDIM interpolant satisfies the deterministic assumption, the regularity condition, and the invertibility assumption because it is a linear function of x and xt. Therefore, any diffusion or FM schedule with the DDIM interpolant enjoys a deterministic minimizer p^θ_{s|t}(x|xt).

C. Analysis of Simplified Parameterization


C.1. DDIM Interpolant
We check that DDIM interpolant is self-consistent. By definition, qs|t (xs |x, xt ) = δ(xs − DDIM(xt , x, t, s)). We check
that for all s ≤ r ≤ t,
∫ q_{s|r}(xs|x, xr) q_{r|t}(xr|x, xt) dxr = ∫ δ(xs − DDIM(xr, x, r, s)) δ(xr − DDIM(xt, x, t, r)) dxr
                                          = δ(xs − DDIM(DDIM(xt, x, t, r), x, r, s))


where
DDIM(DDIM(xt , x, t, r), x, r, s) = αs x + (σs /σr )([αr x + (σr /σt )(xt − αt x)] − αr x)
= αs x + (σs /σr )(σr /σt )(xt − αt x)
= αs x + (σs /σt )(xt − αt x)
= DDIM(xt , x, t, s)
Therefore, δ(xs − DDIM(DDIM(xt , x, t, r), x, r, s)) = δ(xs − DDIM(xt , x, t, s)). So DDIM is self-consistent.
It also implies a Gaussian forward process qt (xt |x) = N (αt x, σt2 σd2 I) as in diffusion models. By definition,
q_{t,1}(xt|x) = qt(xt|x) = ∫ qt(xt|x, ϵ) p(ϵ) dϵ

so that xt is a deterministic transform given x and ϵ, i.e., xt = DDIM(ϵ, x, t, 1) = αt x + σt ϵ, which implies qt (xt |x) =
N (αt x, σt2 σd2 I).
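As a small illustration, the sketch below numerically checks the self-consistency of the DDIM interpolant. The schedule choice (OT-FM, α_t = 1 − t, σ_t = t) and the function names are assumptions made for this check, not the paper's released code.

```python
import torch

def ddim(x_t, x, t, s, alpha, sigma):
    """DDIM interpolant from time t to s: alpha_s * x + (sigma_s / sigma_t) * (x_t - alpha_t * x)."""
    return alpha(s) * x + (sigma(s) / sigma(t)) * (x_t - alpha(t) * x)

# Assumed OT-FM schedule for this check.
alpha = lambda t: 1.0 - t
sigma = lambda t: t

x = torch.randn(4, 3)        # clean data
eps = torch.randn(4, 3)      # noise from the prior
t, r, s = 0.9, 0.5, 0.2
x_t = ddim(eps, x, 1.0, t, alpha, sigma)   # x_t = alpha_t x + sigma_t eps

# Self-consistency: composing t -> r -> s equals going t -> s directly.
x_r = ddim(x_t, x, t, r, alpha, sigma)
x_s_two_hop = ddim(x_r, x, r, s, alpha, sigma)
x_s_direct = ddim(x_t, x, t, s, alpha, sigma)
assert torch.allclose(x_s_two_hop, x_s_direct, atol=1e-6)
```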


C.2. Reusing xt for xr


We propose that instead of sampling xr via forward flow αr x + σr ϵ we reuse xt such that xr = DDIM(xt , x, r, t) to
reduce variance. In fact, for any self-consistent interpolant, one can reuse xt via xr ∼ qr|t (xr |x, xt ) and xr will follow
qr (xr ) marginally. We check
q_r(xr) =(a) q_{r|t}(xr) = ∬ q_{r|t}(xr|x, xt) qt(x|xt) qt(xt) dx dxt = ∬∫ q_{r|t}(xr|x, xt) qt(xt|x, ϵ) q(x) p(ϵ) dϵ dx dxt

where (a) is due to Lemma 5. We can see that sampling x, xt first and then xr ∼ q_{r|t}(xr|x, xt) respects the marginal distribution q_r(xr).

C.3. Simplified Objective


We derive our simplified objective. Given MMD defined in Eq. (1), we write our objective as

L_IMM(θ) = E_{s,t}[ w(s, t) ∥ E_{xt}[k(f^θ_{s,t}(xt), ·)] − E_{xr}[k(f^{θ−}_{s,r}(xr), ·)] ∥²_H ]    (62)
        =(a) E_{s,t}[ w(s, t) ∥ E_{xt,xr}[ k(f^θ_{s,t}(xt), ·) − k(f^{θ−}_{s,r}(xr), ·) ] ∥²_H ]    (63)
        = E_{s,t}[ w(s, t) ⟨ E_{xt,xr}[ k(f^θ_{s,t}(xt), ·) − k(f^{θ−}_{s,r}(xr), ·) ], E_{x't,x'r}[ k(f^θ_{s,t}(x't), ·) − k(f^{θ−}_{s,r}(x'r), ·) ] ⟩ ]    (64)
        = E_{s,t}[ w(s, t) E_{xt,xr,x't,x'r}[ ⟨k(f^θ_{s,t}(xt), ·), k(f^θ_{s,t}(x't), ·)⟩ + ⟨k(f^{θ−}_{s,r}(xr), ·), k(f^{θ−}_{s,r}(x'r), ·)⟩ − ⟨k(f^θ_{s,t}(xt), ·), k(f^{θ−}_{s,r}(x'r), ·)⟩ − ⟨k(f^θ_{s,t}(x't), ·), k(f^{θ−}_{s,r}(xr), ·)⟩ ] ]    (65)
        = E_{xt,xr,x't,x'r,s,t}[ w(s, t) ( k(f^θ_{s,t}(xt), f^θ_{s,t}(x't)) + k(f^{θ−}_{s,r}(xr), f^{θ−}_{s,r}(x'r)) − k(f^θ_{s,t}(xt), f^{θ−}_{s,r}(x'r)) − k(f^θ_{s,t}(x't), f^{θ−}_{s,r}(xr)) ) ]    (66)

where ⟨·, ·⟩ is the RKHS inner product and (a) is due to the correlation between xr and xt induced by re-using xt.

C.4. Empirical Estimation


As proposed in Gretton et al. (2012), MMD is typically estimated with V-statistics by instantiating a matrix of size M × M: a batch of B samples {x^(i)}_{i=1}^B is separated into B/M groups of M particles (assume B is divisible by M), {x^(i,j)}, where each group shares one (s_i, r_i, t_i) sample. The Monte Carlo estimate becomes

L̂_IMM(θ) = (1/(B/M)) Σ_{i=1}^{B/M} w(s_i, t_i) (1/M²) Σ_{j=1}^M Σ_{k=1}^M [ k(f^θ_{s_i,t_i}(x^{(i,j)}_{t_i}), f^θ_{s_i,t_i}(x^{(i,k)}_{t_i})) + k(f^{θ−}_{s_i,r_i}(x^{(i,j)}_{r_i}), f^{θ−}_{s_i,r_i}(x^{(i,k)}_{r_i})) − 2 k(f^θ_{s_i,t_i}(x^{(i,j)}_{t_i}), f^{θ−}_{s_i,r_i}(x^{(i,k)}_{r_i})) ]    (67)

Computational efficiency. First, we note that regardless of M, we require only 2 model forward passes - one with and one without stop gradient - since the model takes all B instances in the batch together and produces outputs for the entire batch. For the loss calculation, although the need for M particles may suggest inefficient computation, the cost of this matrix computation is negligible in practice compared to the model forward pass. Suppose a forward pass for a single instance is O(K); then the total computation of the loss for a batch of B instances is O(BK) + O(BM). Deep neural networks often have K ≫ M, so O(BK) dominates the computation.
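To make the estimator in Eq. (67) concrete, here is a minimal sketch of the grouped V-statistic loss with a Laplace kernel. Tensor shapes, variable names, and the per-group weighting interface are illustrative assumptions rather than the paper's implementation.

```python
import torch

def laplace_kernel(a, b, w_tilde, eps=1e-8):
    """Laplace kernel k(x, y) = exp(-w_tilde * max(||x - y||_2, eps) / D) over the last dimension."""
    D = a.shape[-1]
    dist = (a - b).norm(dim=-1).clamp_min(eps)
    return torch.exp(-w_tilde * dist / D)

def imm_mmd_loss(f_t, f_r, w, w_tilde):
    """V-statistic MMD estimate of Eq. (67).

    f_t: outputs f^theta_{s,t}(x_t) with gradients, shape [G, M, D]
    f_r: stop-gradient outputs f^{theta-}_{s,r}(x_r), shape [G, M, D]
    w, w_tilde: per-group weights w(s_i, t_i) and kernel bandwidths, shape [G]
    (G = B/M groups of M particles.)
    """
    a, b = f_t.unsqueeze(2), f_t.unsqueeze(1)        # pairs within each group: [G, M, M, D]
    c, d = f_r.unsqueeze(2), f_r.unsqueeze(1)
    wt = w_tilde.view(-1, 1, 1)
    k_tt = laplace_kernel(a, b, wt)                  # k(f_t^(j), f_t^(k))
    k_rr = laplace_kernel(c, d, wt)                  # k(f_r^(j), f_r^(k))
    k_tr = laplace_kernel(a, d, wt)                  # k(f_t^(j), f_r^(k))
    per_group = (k_tt + k_rr - 2.0 * k_tr).mean(dim=(1, 2))
    return (w * per_group).mean()
```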

C.5. Simplified Parameterization


We derive f^θ_{s,t}(xt) for each parameterization, which generally follows the form

f^θ_{s,t}(xt) = c_skip(s, t) xt + c_out(s, t) G_θ(c_in(s, t) xt, c_noise(s), c_noise(t))


Identity. This is simply DDIM with an x-prediction network.

f^θ_{s,t}(xt) = (α_s − (σ_s/σ_t) α_t) G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) + (σ_s/σ_t) xt
Simple-EDM.

f^θ_{s,t}(xt) = (α_s − (σ_s/σ_t) α_t) [ (α_t/(α_t² + σ_t²)) xt − (σ_t/√(α_t² + σ_t²)) σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) ] + (σ_s/σ_t) xt
             = ((α_s α_t + σ_s σ_t)/(α_t² + σ_t²)) xt − ((α_s σ_t − σ_s α_t)/√(α_t² + σ_t²)) σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) )

When the noise schedule is cosine, f^θ_{s,t}(xt) = cos(½π(t − s)) xt − sin(½π(t − s)) σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ). Similar to Lu & Song (2024), we can show that predicting xs = DDIM(xt, x, s, t) with an L2 loss is equivalent to v-prediction with the cosine schedule.

w(s, t) ∥ f^θ_{s,t}(xt) − [ (α_s − (σ_s/σ_t) α_t) x + (σ_s/σ_t) xt ] ∥²
= w(s, t) ∥ ((α_s α_t + σ_s σ_t)/(α_t² + σ_t²)) xt − ((α_s σ_t − σ_s α_t)/√(α_t² + σ_t²)) σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) − [ (α_s − (σ_s/σ_t) α_t) x + (σ_s/σ_t) xt ] ∥²
= w̃(s, t) ∥ σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) − G_target ∥²

where w̃(s, t) absorbs the squared coefficient of G_θ and

G_target = −(√(α_t² + σ_t²)/(α_s σ_t − σ_s α_t)) [ (α_s − (σ_s/σ_t) α_t) x + (σ_s/σ_t) xt − ((α_s α_t + σ_s σ_t)/(α_t² + σ_t²)) xt ]
         = −(√(α_t² + σ_t²)/(α_s σ_t − σ_s α_t)) [ ((α_s σ_t − σ_s α_t) σ_t/(α_t² + σ_t²)) x − ((α_s σ_t − σ_s α_t) α_t/(α_t² + σ_t²)) ϵ ]    (substituting xt = α_t x + σ_t ϵ)
         = (α_t ϵ − σ_t x)/√(α_t² + σ_t²).

This reduces to the v-target if the cosine schedule is used, and it deviates from the v-target if the FM schedule is used instead.


Euler-FM. We assume the OT-FM schedule.

f^θ_{s,t}(xt) = ((1 − s) − (s/t)(1 − t)) [ xt − t σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) ] + (s/t) xt
             = (1 − s/t) [ xt − t σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) ] + (s/t) xt
             = xt − (t − s) σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) )

This results in an Euler ODE step from xt to xs. We also show that the network output reduces to v-prediction when matched with xs = DDIM(xt, x, s, t). To see this,

w(s, t) ∥ f^θ_{s,t}(xt) − [ ((1 − s) − (s/t)(1 − t)) x + (s/t) xt ] ∥²
= w(s, t) ∥ f^θ_{s,t}(xt) − [ (1 − s/t) x + (s/t) xt ] ∥²
= w(s, t) ∥ xt − (t − s) σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) − [ (1 − s/t) x + (s/t) xt ] ∥²
= w̃(s, t) ∥ σ_d G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ) − G_target ∥²

where

G_target = −(1/(t − s)) [ (1 − s/t) x + (s/t − 1) xt ]
         = −(1/(t − s)) [ (1 − s/t) x + (s/t − 1)((1 − t) x + t ϵ) ]
         = −(1/(t − s)) [ (1 − s/t) x + (s/t − s − 1 + t) x + (s − t) ϵ ]
         = −(1/(t − s)) [ (t − s) x + (s − t) ϵ ]
         = ϵ − x,

which is the v-target under the OT-FM schedule. This parameterization naturally allows zero-SNR sampling and satisfies the boundary condition at s = 0, similar to Simple-EDM above. This is not true for the Identity parameterization using G_θ, which satisfies the boundary condition only at s > 0.

C.6. Mapping Function r(s, t)


We discuss below the concrete choices for r(s, t). We use a constant decrement ϵ > 0 in different spaces.
Constant decrement in η(t) := η_t = σ_t/α_t. This is the choice that we find works best in practice. Let η^{-1}(·) denote its inverse; then

r(s, t) = max( s, η^{-1}(η(t) − ϵ) )

We choose ϵ = (η_max − η_min)/2^k for some k. We generally choose η_max ≈ 160 and η_min ≈ 0, and find k ∈ {10, ..., 15} works well enough depending on the dataset.
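The sketch below illustrates this mapping under the OT-FM schedule η_t = t/(1 − t); the schedule choice, the clamp at η_min, and the default k are assumptions for illustration.

```python
import torch

def eta(t):
    return t / (1.0 - t)          # assumed OT-FM schedule: alpha_t = 1 - t, sigma_t = t

def eta_inv(e):
    return e / (1.0 + e)

def r_mapping(s, t, k=12, eta_max=160.0, eta_min=0.0):
    """r(s, t) = max(s, eta^{-1}(eta(t) - eps)) with eps = (eta_max - eta_min) / 2^k."""
    eps = (eta_max - eta_min) / 2 ** k
    return torch.maximum(s, eta_inv(torch.clamp(eta(t) - eps, min=eta_min)))

s = torch.tensor([0.1, 0.3])
t = torch.tensor([0.8, 0.9])
print(r_mapping(s, t))            # each r_i lies in [s_i, t_i)
```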


Constant decrement in t.

r(s, t) = max(s, t − ϵ)

We choose ϵ = (T − ϵ)/2^k.

Constant decrement in λ(t) := log-SNR_t = 2 log(α_t/σ_t). Let λ^{-1}(·) be its inverse; then

r(s, t) = max( s, λ^{-1}(λ(t) − ϵ) )

We choose ϵ = (λ_max − λ_min)/2^k. This choice comes close to the first choice, but we refrain from it because r(s, t) becomes close to t both when t ≈ 0 and when t ≈ 1, instead of only t ≈ 1. This gives more chance of training instability than the first choice.

Constant increment in 1/η(t).

r(s, t) = max( s, η^{-1}( 1 / (1/η(t) − ϵ) ) )

We choose ϵ = (1/η_min − 1/η_max)/2^k.

C.7. Time Distribution p(s, t)


In all cases we choose p(t) = U(ϵ, T ) and p(s|t) = U(ϵ, t) for some ϵ ≥ 0 and T ≤ 1. The decision for time distribution is
coupled with r(s, t). We list the constraints on p(s, t) for each r(s, t) choice below.
Constant decrement in η(t). We need to choose T < 1 because, for example, assuming OT-FM schedule, ηt = t/(1 − t),
one can observe that constant decrement in ηt when t ≈ 1 results in r(s, t) that is too close to t due to ηt ’s exploding gradient
around 1. We need to define T < 1 such that r(s, T ) is not too close to T for s reasonably far away. With ηmax ≈ 160, we
can choose T = 0.994 for OT-FM and T = 0.996 for VP-diffusion.
Constant decrement in t. No constraints needed. T = 1, ϵ = 0.
Constant decrement in λt . One can similarly observe exploding gradient causing r(s, t) to be too close to t at both
t ≈ 0 and t ≈ 1, so we can choose ϵ > 0, e.g. 0.001, in addition to choosing T = 0.994 for OT-FM and T = 0.996 for
VP-diffusion.
Constant increment in 1/η_t. This experiences an exploding gradient for t ≈ 0, so we require ϵ > 0, e.g. 0.005, and T = 1.

C.8. Kernel Function


For our Laplace kernel k(x, y) = exp(−w̃(s, t) max(∥x − y∥₂, ϵ)/D), we let ϵ > 0 be a reasonably small constant, e.g. 10⁻⁸. Looking at its gradient w.r.t. x,

∇_x e^{−w̃(s,t) max(∥x−y∥₂, ϵ)/D} = −(w̃(s,t)/D) e^{−w̃(s,t)∥x−y∥₂/D} (x − y)/∥x − y∥₂   if ∥x − y∥₂ > ϵ,   and 0 otherwise,

one can notice that the gradient direction is self-normalized to a unit vector, which is helpful in practice. In comparison, the gradient of an RBF kernel of the form k(x, y) = exp(−½ w̃(s, t)∥x − y∥²/D) is

∇_x e^{−½ w̃(s,t)∥x−y∥²/D} = −(w̃(s, t)/D) e^{−½ w̃(s,t)∥x−y∥²/D} (x − y)    (68)

whose magnitude can vary a lot depending on how far x is from y.
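A quick numerical illustration of this point is sketched below (not from the paper's code); the dimensionality D and bandwidth w̃ are placeholder values.

```python
import torch

# For the Laplace kernel, |grad|/k stays at w_tilde/D regardless of ||x - y||;
# for the RBF kernel, |grad|/k grows linearly with ||x - y||.
D, w_tilde = 16, 1.0
y = torch.zeros(D)

for dist in [0.1, 1.0, 10.0]:
    x = torch.full((D,), dist / D ** 0.5, requires_grad=True)   # ||x - y|| == dist
    k_lap = torch.exp(-w_tilde * (x - y).norm() / D)
    (g_lap,) = torch.autograd.grad(k_lap, x)

    x = torch.full((D,), dist / D ** 0.5, requires_grad=True)
    k_rbf = torch.exp(-0.5 * w_tilde * (x - y).norm() ** 2 / D)
    (g_rbf,) = torch.autograd.grad(k_rbf, x)

    print(dist, (g_lap.norm() / k_lap).item(), (g_rbf.norm() / k_rbf).item())
```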
For w̃(s, t), we find it helpful to write out the L2 distance between the kernel arguments. For simplicity, denote Ĝ_θ(xt, s, t) = G_θ( xt / (σ_d √(α_t² + σ_t²)), c_noise(s), c_noise(t) ). Then

w̃(s, t) ∥ f^θ_{s,t}(xt) − f^θ_{s,r}(x'_r) ∥₂
= w̃(s, t) ∥ c_skip(s, t) xt + c_out(s, t) Ĝ_θ(xt, s, t) − [ c_skip(s, r) x'_r + c_out(s, r) Ĝ_θ(x'_r, s, r) ] ∥₂
= w̃(s, t) c_out(s, t) ∥ Ĝ_θ(xt, s, t) − (1/c_out(s, t)) [ c_skip(s, r) x'_r + c_out(s, r) Ĝ_θ(x'_r, s, r) − c_skip(s, t) xt ] ∥₂

We simply set w̃(s, t) = 1/c_out(s, t) so that the overall weighting is 1. This makes the magnitude of the kernel invariant w.r.t. t.

C.9. Weighting Function w(s, t)


To review VDM (Kingma et al., 2021), the negative ELBO loss for a diffusion model is

L_ELBO(θ) = E_{x,ϵ,t}[ (−½ dλ_t/dt) ∥ϵ_θ(xt, t) − ϵ∥² ]    (69)

where ϵ_θ is the noise-prediction network and λ_t = log-SNR_t. The weighted-ELBO loss proposed in Kingma & Gao (2024) introduces an additional weighting function w(t), monotonically increasing in t (monotonically decreasing in log-SNR_t), understood as a form of data augmentation. Specifically, they use a sigmoid so that the weighted ELBO is written as

L_w-ELBO(θ) = E_{x,ϵ,t}[ σ(b − λ_t) (−½ dλ_t/dt) ∥ϵ_θ(xt, t) − ϵ∥² ]    (70)

where σ(·) is the sigmoid function.


The αt is tailored towards the Simple-EDM and Euler-FM parameterization as we have shown in Appendix C.5 that the
networks σd Gθ amounts to v-prediction in cosine and OT-FM schedules. Notice that ELBO diffusion loss matches
ϵ instead of v. Inspecting the gradient of Laplace kernel, we have (again, for simplicity we let Ĝθ (xt , s, t) =
Gθ ( √xt2 2 , cnoise (s), cnoise (t)))
σd αt +σt

∂ −w̃(s,t) fs,t
θ θ−
(xt )−fs,r (xr ) /D
e 2
∂θ

θ θ
w̃(s, t) −w̃(s,t) fs,t
θ θ−
(xt )−fs,r (xr ) /D fs,t (xt ) − fs,r (xr ) ∂ θ
=− e 2 f (xt )
D θ (x )
fs,t t − θ − (x )
fs,r r 2 ∂θ s,t
w̃(s, t) −w̃(s,t) θ θ− Ĝθ (xt , s, t) − Ĝtarget ∂ θ
fs,t (xt )−fs,r (xr ) /D
=− e 2 f (xt )
D Ĝθ (xt , s, t) − Ĝtarget ∂θ s,t
2

∂ θ
for some constant Ĝtarget . We can see that gradient ∂θ fs,t (xt ) is guided by vector Ĝθ (xt , s, t) − Ĝtarget . Assuming
Ĝθ (xt , s, t) is v-prediction, as is the case for Simple-EDM parameterization with cosine schedule and Euler-FM parame-
terization with OT-FM schedule, we can reparameterize v- to ϵ-prediction with ϵθ as the new parameterization. We omit
arguments to network for simplicity.
We show below that for both cases ϵθ − ϵtarget = αt (Ĝθ − Ĝtarget ) for some constants ϵtarget and Ĝtarget . For Simple-EDM,
we know x-prediction from v-prediction parameterization (Salimans & Ho, 2022), xθ = αt xt − σt Ĝθ , and we also know


x-prediction from ϵ-prediction, xθ = (xt − σt ϵθ )/αt . We have

xt − σt ϵθ
= αt xt − σt Ĝθ (71)
αt
⇐⇒ xt − σt ϵθ = αt2 xt − αt σt Ĝθ (72)
⇐⇒ (1 − αt2 )xt + αt σt Ĝθ = σt ϵθ (73)
⇐⇒ σt2 xt + αt σt Ĝθ = σt ϵθ (74)
⇐⇒ ϵθ = σt xt + αt Ĝθ (75)
 
⇐⇒ ϵθ − ϵtarget = σt xt + αt Ĝθ − σt xt + αt Ĝtarget (76)

⇐⇒ ϵθ − ϵtarget = αt (Ĝθ − Ĝtarget ) (77)

For Euler-FM, x-prediction from the v-prediction parameterization is x_θ = xt − t Ĝ_θ, and x-prediction from ϵ-prediction is x_θ = (xt − t ϵ_θ)/(1 − t). We have

(xt − t ϵ_θ)/(1 − t) = xt − t Ĝ_θ    (78)
⟺ xt − t ϵ_θ = (1 − t) xt − t(1 − t) Ĝ_θ    (79)
⟺ t xt + t(1 − t) Ĝ_θ = t ϵ_θ    (80)
⟺ ϵ_θ = xt + (1 − t) Ĝ_θ    (81)
⟺ ϵ_θ = xt + α_t Ĝ_θ    (82)
⟺ ϵ_θ − ϵ_target = xt + α_t Ĝ_θ − (xt + α_t Ĝ_target)    (83)
⟺ ϵ_θ − ϵ_target = α_t (Ĝ_θ − Ĝ_target)    (84)

In both cases, (Ĝ_θ(xt, s, t) − Ĝ_target) can be rewritten as (ϵ_θ(xt, s, t) − ϵ_target) by multiplying a factor α_t, and the guidance vector then matches that of the ELBO diffusion loss. Therefore, we are motivated to incorporate α_t into w(s, t) as proposed.

The exponent a in α_t^a takes a value of either 1 or 2. We explain the reason for each choice here. When a = 1, we guide the gradient ∂/∂θ f^θ_{s,t}(xt) with the score difference (ϵ_θ(xt, s, t) − ϵ_target). To motivate a = 2, we first note that the weighted gradient is

w̃(s, t) ∂/∂θ f^θ_{s,t}(xt) = (1/c_out(s, t)) ∂/∂θ f^θ_{s,t}(xt) = ∂/∂θ Ĝ_θ(xt, s, t),

and, as shown above, ϵ-prediction is parameterized as

ϵ_θ(xt, s, t) = xt + α_t Ĝ_θ(xt, s, t).

Multiplying an additional α_t onto ∂/∂θ Ĝ_θ(xt, s, t) therefore implicitly reparameterizes our model into an ϵ-prediction model. The case a = 2 thus yields an ϵ-prediction model guided by the score difference (ϵ_θ(xt, s, t) − ϵ_target). Empirically, α_t² additionally downweights the loss for larger t compared to α_t, allowing the model to train on smaller time-steps more effectively.

Lastly, the division by α_t² + σ_t² is inspired by the increased weighting of middle time-steps (Esser et al., 2024) in Flow Matching training. This is purely an empirical decision.

D. Training Algorithm

The full training procedure is summarized in Algorithm 3; the minimized loss is the estimator in Eq. (67).

Algorithm 3 IMM Training

Input: model f^θ, data distribution q(x) and label distribution q(c|x) (if labels are used), prior distribution N(0, σ_d² I), time distributions p(t) and p(s|t), DDIM interpolant DDIM(xt, x, s, t) and its flow coefficients α_t, σ_t, mapping function r(s, t), kernel function k(·, ·), weighting function w(s, t), batch size B, particle number M, label dropout probability p
Output: learned model f^θ_{s,t}
Initialize n ← 0, θ⁰ ← θ
while model not converged do
  Sample a batch of data, labels, and prior, and split into B/M groups {(x^(i,j), c^(i,j), ϵ^(i,j))}, i = 1, ..., B/M, j = 1, ..., M
  For each group, sample (s_i, t_i) and set r_i = r(s_i, t_i), giving the tuples {(s_i, r_i, t_i)}_{i=1}^{B/M}
  x^(i,j)_{t_i} ← DDIM(ϵ^(i,j), x^(i,j), t_i, 1) = α_{t_i} x^(i,j) + σ_{t_i} ϵ^(i,j), ∀(i, j)
  x^(i,j)_{r_i} ← DDIM(x^(i,j)_{t_i}, x^(i,j), r_i, t_i), ∀(i, j)
  (Optional) Randomly drop each label c^(i,j) to the null token ∅ with probability p
  θ^{n+1} ← optimizer step minimizing L̂_IMM(θ^n) using model f^{θ^n} (see Eq. (67)), optionally feeding c^(i,j) into the network
end while

E. Classifier-Free Guidance

We refer readers to Appendix C.5 for an analysis of each parameterization. Most notably, the network G_θ in both (1) Simple-EDM with the cosine diffusion schedule and (2) Euler-FM with the OT-FM schedule is equivalent to the v-prediction parameterization in diffusion (Salimans & Ho, 2022) and FM (Lipman et al., 2022). When conditioned on a label c during sampling, it is customary to use classifier-free guidance to reweight this v-prediction network via

G^w_θ(c_in(t) xt, c_noise(s), c_noise(t), c) = w G_θ(c_in(t) xt, c_noise(s), c_noise(t), c) + (1 − w) G_θ(c_in(t) xt, c_noise(s), c_noise(t), ∅)    (85)

with guidance weight w, so that the classifier-free guided f^θ_{s,t,w}(xt) is

f^θ_{s,t,w}(xt) = c_skip(s, t) xt + c_out(s, t) G^w_θ(c_in(t) xt, c_noise(s), c_noise(t), c)    (86)
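A minimal sketch of Eqs. (85)-(86) is given below. The function `net`, the coefficient functions, and `null_label` are assumed placeholder interfaces rather than the released implementation.

```python
def f_guided(net, x_t, s, t, label, null_label, w, c_in, c_skip, c_out, c_noise):
    """Classifier-free guided f_{s,t,w}(x_t), assuming `net` returns the G_theta output."""
    g_cond = net(c_in(t) * x_t, c_noise(s), c_noise(t), label)
    g_uncond = net(c_in(t) * x_t, c_noise(s), c_noise(t), null_label)
    g_w = w * g_cond + (1.0 - w) * g_uncond            # Eq. (85): guided v-prediction
    return c_skip(s, t) * x_t + c_out(s, t) * g_w      # Eq. (86): guided model output
```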

F. Sampling Algorithms
Pushforward sampling. We assume a series of N time steps {t_i}_{i=0}^N with T = t_N > t_{N−1} > ... > t_2 > t_1 > t_0 = ϵ for a maximum time T and minimum time ϵ. Denote σ_d as the data standard deviation.

Algorithm 4 Pushforward Sampling

Input: model f^θ, time steps {t_i}_{i=0}^N, prior distribution N(0, σ_d² I), (optional) guidance weight w
Output: x_{t_0}
Sample x_{t_N} ∼ N(0, σ_d² I)
for i = N, ..., 1 do
  (Optional) w ← 1 if N = 1   // can optionally discard the unconditional branch for N = 1
  x_{t_{i−1}} ← f^θ_{t_{i−1},t_i}(x_{t_i}) or f^θ_{t_{i−1},t_i,w}(x_{t_i})
end for
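A minimal sketch of Algorithm 4, assuming `f_theta(x_t, s, t)` implements f^θ_{s,t}(x_t) (with any guidance folded in) and `timesteps` is an increasing list t_0 = ϵ, ..., t_N = T:

```python
import torch

@torch.no_grad()
def pushforward_sample(f_theta, timesteps, shape, sigma_d=0.5, device="cpu"):
    x = sigma_d * torch.randn(shape, device=device)      # x_{t_N} ~ N(0, sigma_d^2 I)
    for i in range(len(timesteps) - 1, 0, -1):
        x = f_theta(x, timesteps[i - 1], timesteps[i])   # jump from t_i to t_{i-1}
    return x
```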

Restart sampling. Different from pushforward sampling, the N time steps {t_i}_{i=0}^N need not be strictly decreasing, e.g. T = t_N ≥ t_{N−1} ≥ ... ≥ t_2 ≥ t_1 ≥ t_0 = ϵ (assuming T > ϵ). Restart sampling first predicts a clean sample, then resamples noise and adds it to this clean sample before predicting a clean sample again. The process is iterated for N steps.

G. Connection with Prior Works


G.1. Consistency Models
Consistency models explicitly match PF-ODE trajectories using a network gθ (xt , t) that directly outputs a sample given any
xt ∼ qt (xt ). The
h network explicitly uses EDM
i parameterization to satisfy boundary condition gθ (x0 , 0) = x0 and trains
2
via loss Ext ,x,t ∥gθ (xt , t) − gθ− (xr , r)∥ where xr is a deterministic function of xt from an ODE solver.


Algorithm 5 Restart Sampling

Input: model f^θ, time steps {t_i}_{i=0}^N, prior distribution N(0, σ_d² I), DDIM interpolant coefficients α_t and σ_t, (optional) guidance weight w
Output: x_{t_0}
Sample x_{t_N} ∼ N(0, σ_d² I)
for i = N, ..., 1 do
  (Optional) w ← 1 if N = 1   // can optionally discard the unconditional branch for N = 1
  x̃ ← f^θ_{t_0,t_i}(x_{t_i}) or f^θ_{t_0,t_i,w}(x_{t_i})
  if i ≠ 1 then
    ϵ̃ ∼ N(0, σ_d² I)
    x_{t_{i−1}} ← α_{t_{i−1}} x̃ + σ_{t_{i−1}} ϵ̃   // or more generally x_{t_{i−1}} ∼ q_{t_{i−1}}(x_{t_{i−1}}|x̃, ϵ̃)
  else
    x_{t_0} ← x̃
  end if
end for
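A minimal sketch of Algorithm 5 under the OT-FM schedule α_t = 1 − t, σ_t = t (an assumption), with the same `f_theta(x_t, s, t)` interface as before:

```python
import torch

@torch.no_grad()
def restart_sample(f_theta, timesteps, shape, sigma_d=0.5, device="cpu"):
    t0 = timesteps[0]
    x = sigma_d * torch.randn(shape, device=device)
    for i in range(len(timesteps) - 1, 0, -1):
        x_clean = f_theta(x, t0, timesteps[i])                # predict a clean sample at t_0
        if i != 1:
            t_prev = timesteps[i - 1]
            noise = sigma_d * torch.randn_like(x_clean)
            x = (1.0 - t_prev) * x_clean + t_prev * noise     # re-noise the clean sample to t_{i-1}
        else:
            x = x_clean
    return x
```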

We show that the CM loss is a special case of our simplified IMM objective.

Lemma 1. When x_t = x'_t, x_r = x'_r, k(x, y) = −∥x − y∥², and s > 0 is a small constant, Eq. (12) reduces to the CM loss E_{x_t,x,t}[ w(t) ∥g_θ(x_t, t) − g_{θ−}(x_r, r)∥² ] + C for a valid r(t) < t and some constant C.

Proof. Since x_t = x'_t and x_r = x'_r, we have f^θ_{s,t}(x_t) = f^θ_{s,t}(x'_t) and f^θ_{s,r}(x_r) = f^θ_{s,r}(x'_r). So k(f^θ_{s,t}(x_t), f^θ_{s,t}(x'_t)) = k(f^θ_{s,r}(x_r), f^θ_{s,r}(x'_r)) = 1 by definition. Since k(x, y) = ∥x − y∥², it is easy to see that Eq. (12) reduces to

E_{x_t,t}[ w(s, t) ∥ f^θ_{s,t}(x_t) − f^θ_{s,r}(x_r) ∥² ] + C    (87)

where C = 2 and w(s, t) is a weighting function. If s is a small positive constant, we further have f^θ_{s,t}(x_t) ≈ g_θ(x_t, t), where we drop s as an input. If g_θ(x_t, t) itself satisfies the boundary condition at s = 0, we can directly take s = 0, in which case f^θ_{0,t}(x_t) = g_θ(x_t, t). Under these assumptions, our loss becomes

E_{x_t,x,t}[ w(t) ∥ g_θ(x_t, t) − g_{θ−}(x_r, r) ∥² ] + C,    (88)

which is simply a CM loss using the ℓ2 distance.

However, one can notice that from a moment-matching perspective, this loss is problematic in two aspects. First, it assumes
single particle estimate, which now ignores the entropy repulsion term in MMD that arises only during multi-particle
estimation. This can contribute to mode collapse and training instability of CM. Second, the choice of energy kernel only
matches the first moment, which is insufficient for matching two complex distributions! We should use kernels that match
higher moments in practice. In fact, we show in the following Lemma that the pseudo-huber loss proposed in Song &
Dhariwal (2023) matches higher moments as a kernel.
Lemma 2. The negative pseudo-huber loss k_c(x, y) = c − √(∥x − y∥² + c²) for c > 0 is a conditionally positive definite kernel that matches all moments of x and y, where the weights on higher moments depend on c.

Proof. We first check that the negative pseudo-huber loss c − √(∥x − y∥² + c²) is a conditionally positive definite kernel (Auffray & Barbillon, 2009). By definition, k(x, y) is conditionally positive definite if for x_1, ..., x_n ∈ R^D and c_1, ..., c_n ∈ R with Σ_{i=1}^n c_i = 0,

Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0    (89)


We know that the negative L2 distance −∥x − y∥ is conditionally positive definite. We prove this below for completeness. Due to the triangle inequality, −∥x − y∥ ≥ −∥x∥ − ∥y∥. Then

Σ_{i=1}^n Σ_{j=1}^n c_i c_j (−∥x_i − x_j∥) ≥ Σ_{i=1}^n Σ_{j=1}^n c_i c_j (−∥x_i∥ − ∥x_j∥)    (90)
= −(Σ_{i=1}^n c_i)(Σ_{j=1}^n c_j ∥x_j∥) − (Σ_{j=1}^n c_j)(Σ_{i=1}^n c_i ∥x_i∥)    (91)
=(a) 0    (92)

where (a) is due to Σ_{i=1}^n c_i = 0. Now, since c − √(∥z∥² + c²) ≥ −∥z∥ for all c > 0, we have

Σ_{i=1}^n Σ_{j=1}^n c_i c_j ( c − √(∥x_i − x_j∥² + c²) ) ≥ Σ_{i=1}^n Σ_{j=1}^n c_i c_j (−∥x_i − x_j∥) ≥ 0    (93)

So the negative pseudo-huber loss is a valid conditionally positive definite kernel.


Next, we analyze the pseudo-huber loss's effect on higher-order moments by directly Taylor expanding √(∥z∥² + c²) − c at z = 0:

√(∥z∥² + c²) − c = (1/2c)∥z∥² − (1/8c³)∥z∥⁴ + (1/16c⁵)∥z∥⁶ − (5/128c⁷)∥z∥⁸ + O(∥z∥⁹)    (94)
                 = (1/2c)∥x − y∥² − (1/8c³)∥x − y∥⁴ + (1/16c⁵)∥x − y∥⁶ − (5/128c⁷)∥x − y∥⁸ + O(∥x − y∥⁹)    (95)

where we substitute z = x − y. Each higher-order term ∥x − y∥^k for k > 2 expands to a polynomial containing up to k-th moments, i.e., {x, x², ..., x^k}, {y, y², ..., y^k}; thus the implicit feature map contains all higher moments, where c contributes to the weighting in front of each term.
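A short numerical sanity check of Eqs. (94)-(95) is sketched below (not from the paper); c and the range of z are arbitrary illustrative values.

```python
import torch

c = 2.0
z = torch.linspace(0.01, 0.5, 5)
exact = torch.sqrt(z ** 2 + c ** 2) - c
approx = z ** 2 / (2 * c) - z ** 4 / (8 * c ** 3) + z ** 6 / (16 * c ** 5) - 5 * z ** 8 / (128 * c ** 7)
print((exact - approx).abs().max())   # small residual, consistent with the truncated expansion
```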

Furthermore, we extend our finite difference (between r(s, t) and t) IMM objective to the differential limit by taking
r(s, t) → t in Appendix H. This results in a new objective that similarly subsumes continuous-time CM (Song et al., 2023;
Lu & Song, 2024) as a single-particle special case.

G.2. Diffusion GAN and Adversarial Consistency Distillation


Diffusion GAN (Xiao et al., 2021) parameterizes its generative distribution as

p^θ_{s|t}(xs|xt) = ∫ q_{s|t}(xs|G_θ(xt, z), xt) p(z) dz

where G_θ is a neural network, p(z) is a standard Gaussian distribution, and q_{s|t}(xs|x, xt) is the DDPM posterior

q_{s|t}(xs|x, xt) = N(µ_Q, σ_Q² I),    (97)
µ_Q = (α_t σ_s²)/(α_s σ_t²) xt + α_s (1 − (α_t² σ_s²)/(α_s² σ_t²)) x,
σ_Q² = σ_s² (1 − (α_t² σ_s²)/(α_s² σ_t²)).
αs2 σt2
Note that DDPM posterior is a stochastic interpolant, and more importantly, it is self-consistent, which we show in the
Lemma below.
Lemma 6. For all 0 ≤ s < t ≤ 1, DDPM posterior distribution from t to s as defined in Eq. (97) is a self-consistent
Gaussian interpolant between x and xt .


Proof. Let x_r ∼ q_{r|t}(x_r|x, x_t) and x_s ∼ q_{s|r}(x_s|x, x_r); we show that x_s follows q_{s|t}(x_s|x, x_t). Write

x_r = (α_t σ_r²)/(α_r σ_t²) x_t + α_r (1 − (α_t² σ_r²)/(α_r² σ_t²)) x + σ_r √(1 − (α_t² σ_r²)/(α_r² σ_t²)) ϵ_1    (98)
x_s = (α_r σ_s²)/(α_s σ_r²) x_r + α_s (1 − (α_r² σ_s²)/(α_s² σ_r²)) x + σ_s √(1 − (α_r² σ_s²)/(α_s² σ_r²)) ϵ_2    (99)

where ϵ_1, ϵ_2 ∼ N(0, I) are i.i.d. Gaussian noise. Directly expanding,

x_s = (α_r σ_s²)/(α_s σ_r²) [ (α_t σ_r²)/(α_r σ_t²) x_t + α_r (1 − (α_t² σ_r²)/(α_r² σ_t²)) x + σ_r √(1 − (α_t² σ_r²)/(α_r² σ_t²)) ϵ_1 ] + α_s (1 − (α_r² σ_s²)/(α_s² σ_r²)) x + σ_s √(1 − (α_r² σ_s²)/(α_s² σ_r²)) ϵ_2    (100)
    = (α_t σ_s²)/(α_s σ_t²) x_t + [ (α_r² σ_s²)/(α_s σ_r²)(1 − (α_t² σ_r²)/(α_r² σ_t²)) + α_s (1 − (α_r² σ_s²)/(α_s² σ_r²)) ] x + (α_r σ_s²)/(α_s σ_r²) σ_r √(1 − (α_t² σ_r²)/(α_r² σ_t²)) ϵ_1 + σ_s √(1 − (α_r² σ_s²)/(α_s² σ_r²)) ϵ_2    (101)
    = (α_t σ_s²)/(α_s σ_t²) x_t + α_s (1 − (α_t² σ_s²)/(α_s² σ_t²)) x + (α_r σ_s²)/(α_s σ_r²) σ_r √(1 − (α_t² σ_r²)/(α_r² σ_t²)) ϵ_1 + σ_s √(1 − (α_r² σ_s²)/(α_s² σ_r²)) ϵ_2    (102)
    =(a) (α_t σ_s²)/(α_s σ_t²) x_t + α_s (1 − (α_t² σ_s²)/(α_s² σ_t²)) x + σ_s √(1 − (α_t² σ_s²)/(α_s² σ_t²)) ϵ_3    (103)

where (a) is due to the fact that the sum of two independent Gaussian variables with variances a² and b² is Gaussian with variance a² + b², and ϵ_3 ∼ N(0, I) is another independent Gaussian noise. The variance computation:

(α_r² σ_s⁴)/(α_s² σ_r²)(1 − (α_t² σ_r²)/(α_r² σ_t²)) + σ_s²(1 − (α_r² σ_s²)/(α_s² σ_r²)) = (α_r² σ_s⁴)/(α_s² σ_r²) − (α_t² σ_s⁴)/(α_s² σ_t²) + σ_s² − (α_r² σ_s⁴)/(α_s² σ_r²) = σ_s²(1 − (α_t² σ_s²)/(α_s² σ_t²)).

This shows that x_s follows q_{s|t}(x_s|x, x_t) and completes the proof.

This shows another possible design of the interpolant that can be used, and diffusion GAN’s formulation generally complies
with our design of the generative distribution, except that it learns this conditional distribution of x given xt directly while
we learn a marginal distribution. When they directly learn the conditional distribution by matching pθs|t (x|xt ) with qt (x|xt ),
the model is forced to learn qt (x|xt ) and there only exists one minimizer. However, in our case, the model can learn multiple
different solutions because we match the marginals instead.
GAN loss and MMD loss. We also want to draw attention to similarity between GAN loss used in Xiao et al. (2021); Sauer
et al. (2025) and MMD loss. MMD is an integral probability metric over a set of functions F in the following form

D_F(P, Q) = sup_{f∈F} [ E_{x∼P(x)} f(x) − E_{y∼Q(y)} f(y) ]

where a supremum is taken on this set of functions. This naturally gives rise to an adversarial optimization algorithm if F
is defined as the set of neural networks. However, MMD bypasses this by selecting F as the RKHS where the optimal f
can be analytically found. This eliminates the adversarial objective and gives a stable minimization objective in practice.
However, this is not to say that RKHS is the best function set. With the right optimizers and training scheme, the adversarial
objective may achieve better empirical performance, but this also makes the algorithm difficult to scale to large datasets.

G.3. Generative Moment Matching Network


It is trivial to check that GMMN is a special parameterization. We fix t = 1, and due to the boundary condition, r(s, t) ≡ s = 0 implies the training target p^θ_{s|r(s,t)}(xs) = qs(xs) is the data distribution. Additionally, p^θ_{s|t}(xs) is a simple pushforward of the prior p(ϵ) through a network g_θ(ϵ), where we drop the dependency on t and s since they are constant.


H. Differential Inductive Moment Matching


Similar to the continuous-time CMs presented in Lu & Song (2024), our MMD objective can be taken to the differential limit. Consider the simplified loss and parameterization in Eq. (12); we use the RBF kernel as our kernel of choice for simplicity.

Theorem 2 (Differential Inductive Moment Matching). Let f^θ_{s,t}(xt) be a twice continuously differentiable function with bounded first and second derivatives, let k(·, ·) be the RBF kernel with unit bandwidth, x, x' ∼ q(x), xt ∼ qt(xt|x), x't ∼ qt(x't|x'), xr = DDIM(xt, x, t, r) and x'r = DDIM(x't, x', t, r). The following objective

L_IMM-∞(θ, t) = lim_{r→t} (1/(t − r)²) E_{xt,x't,xr,x'r}[ k(f^θ_{s,t}(xt), f^θ_{s,t}(x't)) + k(f^θ_{s,r}(xr), f^θ_{s,r}(x'r)) − k(f^θ_{s,t}(xt), f^θ_{s,r}(x'r)) − k(f^θ_{s,t}(x't), f^θ_{s,r}(xr)) ]    (104)

can be analytically derived as

E_{xt,x't}[ e^{−½ ∥f^θ_{s,t}(xt) − f^θ_{s,t}(x't)∥²} ( (df^θ_{s,t}(xt)/dt)^⊤ (df^θ_{s,t}(x't)/dt) − (df^θ_{s,t}(xt)/dt)^⊤ (f^θ_{s,t}(xt) − f^θ_{s,t}(x't)) (f^θ_{s,t}(xt) − f^θ_{s,t}(x't))^⊤ (df^θ_{s,t}(x't)/dt) ) ]    (105)
θ θ
(xt ) − fs,t (x′t )
dt dt

Proof. Firstly, the limit can be exchanged with the expectation due to dominated convergence theorem where the integrand
consists of kernel functions which can be assumed to be upper bounded by 1, e.g. RBF kernels are upper bounded by 1,
and thus integrable. It then suffices to check the limit of the integrand. Before that, let us review the first and second-order
2
Taylor expansion of e−∥x−y∥ . We let a, b ∈ RD be constants to be expanded around. We note down the Taylor expansion
to second-order below for notational convenience.
2
• e−∥x−b∥ /2
at x = a:
2 2 1 2
e−∥a−b∥ /2
− e−∥a−b∥ /2
(a − b)⊤ (x − a) + (x − a)⊤ e−∥a−b∥ /2 (a − b)(a − b)⊤ − I (x − a)

2
2
• e−∥y−a∥ /2
at y = b:

2 2 1 2
e−∥b−a∥ /2
− e−∥b−a∥ /2
(b − a)⊤ (y − b) + (y − b)⊤ e−∥b−a∥ /2 (b − a)(b − a)⊤ − I (y − b)

2
2
• e−∥x−y∥ /2
at x = a, y = b:
2 2 2
e−∥a−b∥ /2 − e−∥a−b∥ /2 (a − b)⊤ (x − a) − e−∥b−a∥ /2 (b − a)⊤ (y − b)
1 2
+ (y − b)⊤ e−∥b−a∥ /2 (b − a)(b − a)⊤ − I (y − b)

2
1 2
+ (y − b)⊤ e−∥b−a∥ /2 (b − a)(b − a)⊤ − I (y − b)

2
2
+ (x − a)⊤ e−∥b−a∥ /2 I − (a − b)(a − b)⊤ (y − b)


Putting it together, the above results imply


2 2 2 2
e−∥x−y∥ /2
+ e−∥a−b∥ /2
− e−∥x−b∥ − e−∥y−a∥ /2
/2

2
≈ (x − a)⊤ e−∥b−a∥ /2
I − (a − b)(a − b)⊤ (y − b)


since it is easy to check that the remaining terms cancel.

28
Inductive Moment Matching

θ
Substituting x = fs,t θ
(xt ), a = fs,r θ
(xr ), y = fs,t (x′t ), b = fs,r
θ
(x′r ), we furthermore have

1 1  θ θ

lim (x − a) = lim fs,t (xt ) − fs,r (xr ) (106)
r→t t − r r→t t − r
" #
θ
1 dfs,t (xt ) 2
= lim (t − r) + O(|t − r| ) (107)
r→t t − r dt
θ
dfs,t (xt )
= (108)
dt
Similarly,

1 1  θ ′ θ
 dfs,t (x′t )
lim (y − b) = lim θ
fs,t (xt ) − fs,r (x′r ) = (109)
r→t t − r r→t t − r dt

Therefore, LIMM-∞ (θ, t) can be derived as


"
θ ⊤
′ 2 dfs,t (xt )
Ext ,x′t e− 2 ∥fs,t (xt )−fs,t (xt )∥
1 θ θ
(110)
dt
#
   ⊤  df θ (x′ )
s,t t
θ
I − fs,t θ
(xt ) − fs,t (x′t ) fs,t
θ θ
(xt ) − fs,t (x′t ) (111)
dt
" ⊤
′ 2
θ
dfs,t θ
(xt ) dfs,t (x′t )
= Ext ,x′t e− 2 ∥fs,t (xt )−fs,t (xt )∥
1 θ θ
(112)
dt dt
θ ⊤ ⊤ df θ (x′ )
!#
dfs,t (xt )  θ 
s,t t
− θ
fs,t (xt ) − fs,t (x′t ) fs,t
θ θ
(xt ) − fs,t (x′t )
dt dt

H.1. Pseudo-Objective
Due to the stop-gradient operation, we can similarly find a pseudo-objective whose gradient matches the gradient of
LIMM-∞ (θ, t) in the limit of r → t.
Theorem 3. Let f^θ_{s,t}(xt) be a twice continuously differentiable function with bounded first and second derivatives, k(·, ·) be the RBF kernel with unit bandwidth, x, x' ∼ q(x), xt ∼ qt(xt|x), x't ∼ qt(x't|x'), xr = DDIM(xt, x, t, r) and x'r = DDIM(x't, x', t, r). The gradient of the following pseudo-objective,

∇_θ L⁻_IMM-∞(θ, t) = lim_{r→t} (1/(t − r)) ∇_θ E_{xt,x't,xr,x'r}[ k(f^θ_{s,t}(xt), f^θ_{s,t}(x't)) + k(f^{θ−}_{s,r}(xr), f^{θ−}_{s,r}(x'r)) − k(f^θ_{s,t}(xt), f^{θ−}_{s,r}(x'r)) − k(f^θ_{s,t}(x't), f^{θ−}_{s,r}(xr)) ],    (113)

can be used to optimize θ and can be analytically derived as

E_{xt,x't}[ e^{−½ ∥f^{θ−}_{s,t}(x't) − f^{θ−}_{s,t}(xt)∥²} ( [ df^{θ−}_{s,t}(x't)/dt − (∆^⊤ df^{θ−}_{s,t}(x't)/dt) ∆ ]^⊤ ∇_θ f^θ_{s,t}(xt) + [ df^{θ−}_{s,t}(xt)/dt − (∆^⊤ df^{θ−}_{s,t}(xt)/dt) ∆ ]^⊤ ∇_θ f^θ_{s,t}(x't) ) ],    (114)

where we write ∆ := f^{θ−}_{s,t}(xt) − f^{θ−}_{s,t}(x't) for brevity.

Proof. Similar to the derivation of L_IMM-∞(θ, t), let x = f^θ_{s,t}(xt), a = f^{θ−}_{s,r}(xr), y = f^θ_{s,t}(x't), b = f^{θ−}_{s,r}(x'r). We have

lim_{r→t} (1/(t − r)) ∇_θ (x − a)^⊤ e^{−∥b−a∥²/2} ( I − (a − b)(a − b)^⊤ ) (y − b)
= lim_{r→t} (1/(t − r)) e^{−∥b−a∥²/2} [ ( (y − b) − ((a − b)^⊤ (y − b)) (a − b) )^⊤ ∇_θ x + ( (x − a) − ((a − b)^⊤ (x − a)) (a − b) )^⊤ ∇_θ y ]

where

lim_{r→t} (x − a)/(t − r) = df^{θ−}_{s,t}(xt)/dt,    lim_{r→t} (y − b)/(t − r) = df^{θ−}_{s,t}(x't)/dt.

Note that df^{θ−}_{s,t}(xt)/dt is now parameterized by θ− instead of θ because the gradient is already taken w.r.t. f^θ_{s,t}(xt) outside of the brackets, so (x − a) and (y − b) merely require evaluation at the current θ with no gradient information, which θ− satisfies. The objective can therefore be derived as

∇_θ L⁻_IMM-∞(θ, t) = E_{xt,x't}[ e^{−½ ∥f^{θ−}_{s,t}(x't) − f^{θ−}_{s,t}(xt)∥²} ( [ df^{θ−}_{s,t}(x't)/dt − (∆^⊤ df^{θ−}_{s,t}(x't)/dt) ∆ ]^⊤ ∇_θ f^θ_{s,t}(xt) + [ df^{θ−}_{s,t}(xt)/dt − (∆^⊤ df^{θ−}_{s,t}(xt)/dt) ∆ ]^⊤ ∇_θ f^θ_{s,t}(x't) ) ]

with ∆ := f^{θ−}_{s,t}(xt) − f^{θ−}_{s,t}(x't) as in Theorem 3.

H.2. Connection with Continuous-Time CMs


Observing Eq. (105) and Eq. (114), we can see that when xt = x't and xr = x'r, with s a small positive constant, we have f^θ_{s,t}(x't) = f^θ_{s,t}(xt), exp(−½ ∥f^θ_{s,t}(x't) − f^θ_{s,t}(xt)∥²) = 1, and f^θ_{s,t}(xt) ≈ g_θ(xt, t), where since s is fixed we discard the dependency on s as an input. Then, Eq. (105) reduces to

E_{xt}[ ∥ df^θ_{s,t}(xt)/dt ∥² ]    (115)

which is the same as the differential consistency loss (Song et al., 2023; Geng et al., 2024). And Eq. (114) reduces to

E_{xt}[ (df^{θ−}_{s,t}(xt)/dt)^⊤ ∇_θ f^θ_{s,t}(xt) ]    (116)

which is the pseudo-objective for continuous-time CMs (Song et al., 2023; Lu & Song, 2024), up to a weighting function of choice.

I. Experiment Settings
I.1. Training & Parameterization Settings
We summarize our best runs in Table 5.

| Setting | CIFAR-10 | ImageNet-256×256 | ImageNet-256×256 | ImageNet-256×256 | ImageNet-256×256 | ImageNet-256×256 |
| Parameterization Setting | | | | | | |
| Architecture | DDPM++ | DiT-S | DiT-B | DiT-L | DiT-XL | DiT-XL |
| GFlops | 21.28 | 6.06 | 23.01 | 80.71 | 118.64 | 118.64 |
| Params (M) | 55 | 33 | 130 | 458 | 675 | 675 |
| c_noise(t) | 1000t | 1000t | 1000t | 1000t | 1000t | 1000t |
| 2nd time conditioning | s | s | s | s | s | t − s |
| Flow Trajectory | OT-FM | OT-FM | OT-FM | OT-FM | OT-FM | OT-FM |
| g_θ(xt, s, t) | Simple-EDM | Euler-FM | Euler-FM | Euler-FM | Euler-FM | Euler-FM |
| σ_d | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Training iter | 400K | 1.2M | 1.2M | 1.2M | 1.2M | 1.2M |
| Training Setting | | | | | | |
| Dropout | 0.2 | 0 | 0 | 0 | 0 | 0 |
| Optimizer | RAdam | AdamW | AdamW | AdamW | AdamW | AdamW |
| Optimizer ϵ | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ |
| β1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |
| β2 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| Learning Rate | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
| Weight Decay | 0 | 0 | 0 | 0 | 0 | 0 |
| Batch Size | 4096 | 4096 | 4096 | 4096 | 4096 | 4096 |
| M | 4 | 4 | 4 | 4 | 4 | 4 |
| Kernel | Laplace | Laplace | Laplace | Laplace | Laplace | Laplace |
| r(s, t) | max(s, η⁻¹(η_t − (η_max − η_min)/2¹⁵)) | max(s, η⁻¹(η_t − (η_max − η_min)/2¹²)) | same | same | same | same |
| Minimum t, r gap | - | - | - | - | - | 10⁻⁴ |
| p(t) | U(0.006, 0.994) | U(0, 0.994) | U(0, 0.994) | U(0, 0.994) | U(0, 0.994) | U(0, 0.994) |
| a | 1 | 1 | 1 | 1 | 1 | 2 |
| b | 5 | 4 | 4 | 4 | 4 | 4 |
| EMA Rate | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| Label Dropout | - | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| x-flip | True | False | False | False | False | False |
| EDM augment (Karras et al., 2022) | True | False | False | False | False | False |
| Inference Setting | | | | | | |
| Sampler Type | Pushforward | Pushforward | Pushforward | Pushforward | Pushforward | Pushforward |
| Number of Steps | 2 | 8 | 8 | 8 | 8 | 8 |
| Schedule Type | t1 = 1.4 | Uniform | Uniform | Uniform | Uniform | Uniform |
| FID-50K (w = 0) | 1.98 | - | - | - | - | - |
| FID-50K (w = 1.0, i.e. no guidance) | - | 42.28 | 26.02 | 9.33 | 7.25 | 6.69 |
| FID-50K (w = 1.5) | - | 20.36 | 9.69 | 2.80 | 2.13 | 1.99 |

Table 5. Experimental settings for different architectures and datasets.

Specifically, for ImageNet-256×256, we adopt a latent-space paradigm for computational efficiency. For its autoencoder, we follow EDM2 (Karras et al., 2024) and pre-encode all images from ImageNet into latents without flipping, and calculate the channel-wise mean and std for normalization. We use the Stable Diffusion VAE² and rescale the latents by channel mean [0.86488, −0.27787343, 0.21616915, 0.3738409] and channel std [4.85503674, 5.31922414, 3.93725398, 3.9870003]. After this normalization, we further multiply the latents by 0.5 so that the latents roughly have std 0.5. For DiT architectures of different sizes, we use the same hyperparameters in all experiments.
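A minimal sketch of this latent normalization is shown below; the channel statistics are the values listed above, while the function name and call site are illustrative assumptions rather than the released preprocessing code.

```python
import torch

LATENT_MEAN = torch.tensor([0.86488, -0.27787343, 0.21616915, 0.3738409]).view(1, 4, 1, 1)
LATENT_STD = torch.tensor([4.85503674, 5.31922414, 3.93725398, 3.9870003]).view(1, 4, 1, 1)

def normalize_latents(z):
    """Channel-wise normalize pre-encoded VAE latents, then scale so the result has std ~0.5."""
    return 0.5 * (z - LATENT_MEAN) / LATENT_STD
```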
Choices for T and ϵ. By default assuming we are using mapping function r(s, t) by constant decrement in ηt , we keep
ηmax ≈ 160. This implies that for time distribution of the form U(ϵ, T ), we set T = 0.996 for cosine diffusion and
T = 0.994 for OT-FM. For ϵ, we set it differently for pixel-space and latent-space model. For pixel-space on CIFAR-10,
we follow Nichol & Dhariwal (2021) and set ϵ to a small positive constant because pixel quantization makes smaller
noise imperceptible. We find 0.006 to work well. For latent-space on ImageNet-256×256, we have no such intuition as in
pixel-space. We simply set ϵ = 0 in this case.
Exceptions occur when we ablate other choices of r(s, t), e.g. constant decrement in λ_t, in which case we set ϵ = 0.001 to prevent r(s, t) from being too close to t when t is small.
Injecting time s. The design for additionally injecting s can be categorized into 2 types - injecting s directly and injecting the stride size (t − s). In both cases, the architectural design exactly follows the time injection of t. We compute the positional time embedding of s (or t − s), feed it through a 2-layer MLP (same as for t), and add this new embedding to the MLP-processed embedding of t. The summed embedding is then fed through all Transformer blocks as in the standard DiT architecture.
² https://huggingface.co/stabilityai/sd-vae-ft-mse


| | 1-step | 2-step | 4-step | 8-step |
| a = 1 | 7.97 | 4.01 | 2.61 | 2.13 |
| a = 2 | 8.28 | 4.08 | 2.60 | 2.01 |

Table 6. Ablation of exponent a in the weighting function on ImageNet-256×256. We see that a = 2 excels at multi-step generation while lagging slightly behind in 1-step generation.

| | 1-step | 2-step | 4-step | 8-step |
| TF32 w/ a = 1 | 7.97 | 4.01 | 2.61 | 2.13 |
| FP16 w/ a = 1 | 8.73 | 4.54 | 3.03 | 2.38 |
| FP16 w/ a = 2 | 8.05 | 3.99 | 2.51 | 1.99 |

Table 7. Ablation of lower precision training on ImageNet-256×256. For lower precision training, we employ both the minimum gap ∆ = 10⁻⁴ and (t − r) conditioning.

Improved CT baseline. For ImageNet-256×256, we implement iCT baseline by using our improved parameterization with
Simple-EDM and OT-FM schedule. We use the proposed pseudo-huber loss for training but find training often collapses
using the same r(s, t) schedule as ours. We carefully tune the gap to achieve reasonable performance without collapse and
present our results in Table 2.

I.2. Inference Settings


Inference schedules. For all one-step inference, we directly map from ϵ ∼ N(0, σ_d² I) at time T to time ϵ through pushforward sampling. For all 2-step methods, we set the intermediate timestep t1 such that η_{t1} = 1.4; this choice is purely empirical and we find it to work well. For N ≥ 4 steps we explore two types of time schedules: (1) uniform decrement in t with η_0 < η_1 < ... < η_N, where

t_i = T + ((N − i)/N)(ϵ − T),    (117)

and (2) the EDM (Karras et al., 2022) time schedule, which specifies η_0 < η_1 < ... < η_N where

η_i = ( η_max^{1/ρ} + ((N − i)/N)(η_min^{1/ρ} − η_max^{1/ρ}) )^ρ   with ρ = 7.    (118)

We slightly modify the schedule so that η_0 = η_min is the endpoint instead of η_1 = η_min and η_0 = 0 as originally proposed, since our η_0 can be set to 0 without numerical issue.
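The two schedules can be sketched as below; the conversion from η back to t assumes the OT-FM relation η_t = t/(1 − t), which is an assumption for illustration.

```python
import torch

def uniform_schedule(N, T=0.994, eps=0.0):
    i = torch.arange(N + 1)
    return T + (N - i) / N * (eps - T)           # t_0 = eps, ..., t_N = T, Eq. (117)

def edm_schedule(N, eta_max=160.0, eta_min=0.0, rho=7.0):
    i = torch.arange(N + 1)
    eta = (eta_max ** (1 / rho) + (N - i) / N * (eta_min ** (1 / rho) - eta_max ** (1 / rho))) ** rho
    return eta / (1.0 + eta)                      # convert eta_i to t_i under OT-FM, Eq. (118)
```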
We also specify the time schedule type used for our best runs in Table 5 and their results.

I.3. Scaling Settings


Model GFLOPs. We reuse numbers from DiT (Peebles & Xie, 2023) for each model architecture.
Training compute. Following Peebles & Xie (2023), we compute training compute as model GFLOPs · batch size · training steps · 4, where, different from DiT, the constant is 4 because each iteration involves 2 forward passes and 1 backward pass, the latter estimated as twice the forward compute.
Inference compute. We calculate inference compute via model GFLOPs · number of steps.

I.4. Ablation on exponent a


We compare the performance between a = 1 and a = 2 on full DiT-XL architecture in Table 6, which shows how a affects
results of different sampling steps. We observe that a = 2 causes slightly higher 1-step sampling FID but outperforms a = 1
in the multi-step regime.

I.5. Caveats for Lower-Precision Training


For all experiments, we follow the original works (Song et al., 2020b; Peebles & Xie, 2023) and use the default TF32
precision for training and evaluation. When switching to lower precision such as FP16, we find that our mapping function,
i.e. constant decrement in ηt , can cause indistinguishable time embedding after some MLP layers when t is large. To
mitigate this issue, we simply impose a minimum gap ∆ between any t and r, for example, ∆ = 10−4 . Our resulting


mapping function becomes

r(s, t) = max( s, min( t − ∆, η⁻¹(η(t) − ϵ) ) )

Optionally, we can also increase distinguishability between nearby time-steps inside the network by injecting (r − s) instead
of s as our second time condition. We use this as default for FP16 training. With these simple changes, we observe minimal
impact on generation performance.
Lastly, if training from scratch with lower precision, we recommend FP16 instead of BF16 because of higher precision that
is needed to distinguish between nearby t and r.
We show results in Table 7. For FP16, a = 1 causes slight performance degradation because of the small gap issue at large
t. This is effectively resolved by a = 2 which downweights losses at large t by focusing on smaller t instead. At lower
precision, while not necessary, a = 2 is an effective solution to achieve good performance that matches or even surpasses
that of TF32.

J. Additional Visualization
We present additional visualization results in the following page.


Figure 12. Uncurated samples on CIFAR-10, unconditional, 2 steps.


Figure 13. Uncurated samples on ImageNet-256×256 using DiT-XL/2 architecture. Guidance w = 1.5, 8 steps.


Figure 14. Uncurated samples on ImageNet-256×256 using DiT-XL/2 architecture. Guidance w = 1.5, 8 steps.
