
Preprint

Consistent Flow Distillation for Text-to-3D Generation

Runjie Yan∗   Yinbo Chen∗   Xiaolong Wang
UC San Diego

arXiv:2501.05445v1 [cs.CV] 9 Jan 2025

Abstract

Score Distillation Sampling (SDS) has made significant strides in distilling image generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation. Project page: https://runjie-yan.github.io/cfd/.

1 Introduction

3D content generation has been gaining increasing attention in recent years for its wide range of applications. However, it is expensive to create high-quality 3D assets or scan objects in the real world. The scarcity of 3D data has been a primary challenge in 3D generation. On the other hand, image synthesis has witnessed great progress, particularly with diffusion models trained on large-scale datasets with massive high-quality and diverse images. Leveraging the 2D generative knowledge for 3D generation by model distillation has become a research direction of key importance.
Score Distillation Sampling (Poole et al., 2023) (SDS) pioneered the paradigm. It uses a pretrained text-to-image diffusion model to optimize a single 3D representation such that the rendered views seek a maximum-likelihood objective. Several subsequent efforts (Zhu et al., 2024; Liang et al., 2023; Katzir et al., 2024; Huang et al., 2024; Tang et al., 2023; Wang et al., 2023b; Armandpour et al., 2023) have been made to improve SDS, but the maximum-likelihood-seeking behavior remains, which is detrimental to visual quality and diversity. Variational Score Distillation (Wang et al., 2024a) (VSD) tackles this issue by treating the 3D representation as a random variable instead of a single point as in SDS. However, the random variable is simulated by particles in VSD: single-particle VSD is theoretically equivalent to SDS (Wang et al., 2023b), assuming the LoRA network in VSD is always trained to optimality, while the optimization-based sampling of VSD is k times slower with k particles.
In this work, we propose Consistent Flow Distillation (CFD), which distills 3D representations through gradient-based diffusion sampling of consistent 2D image probability flows across different views. We provide theoretical analysis of this process and extend it to a wide range of deterministic and stochastic diffusion sampling processes. In the distillation process, we identify that a key is to apply consistent flows to the 3D representation. Intuitively, in 2D image generation, the same region is always associated with the same fixed noise for the correct flow sampling. Analogously, in 3D generation, the 2D image flows from different camera views should also use noise patterns that are consistent on the object surface with correct correspondence. To achieve this, we design a multi-view consistent Gaussian noise based on the Noise Transport Equation (Chang et al., 2024), which can compute the multi-view consistent noise with negligible cost. During the distillation process, the

∗ Equal contribution


Figure 1: Text-to-3D samples of CFD. (a) NeRFs generated by CFD from scratch, for prompts such as “A pirate galleon with a bioluminescent hull that glows faintly in the dark ocean waters, illuminating the ship's intricate carvings and sails as it silently navigates the waves”, “A cute cat covered by snow”, and “A futuristic space station”. (b) 3D textured meshes generated by CFD from scratch, for prompts such as “A steampunk owl with mechanical wings”, “An astronaut is riding a horse”, and “A polar bear surfing a big wave”. (c) CFD can generate diverse and high-quality 3D samples from scratch, e.g. for “A treasure chest full of gold coins and jewels, high resolution, sharp” and “A 3D model of a toy fighter plane, sharp”. CFD can generate diverse 3D samples by distilling text-to-image diffusion models. See videos on our project page for additional generation results.


multi-view consistent Gaussian noise is rendered from different views to compute the gradient of the 2D image flow. Finally, our method can create high-quality and diverse 3D objects by following the diffusion ODE or SDE sampling process.

We evaluate our method with different types of pretrained 2D image diffusion models, and compare it with state-of-the-art text-to-3D score distillation methods. Both qualitative and quantitative experiments show the effectiveness of our approach compared with prior works. Our method generates 3D assets with realistic appearance and shape (Fig. 1(a), 1(b)) and can sample diverse 3D objects for the same text prompt (Fig. 1(c)) with negligible extra computation cost compared with SDS.
In summary, our main contributions are:

• An in-depth discussion of using the image diffusion PF-ODE or SDE to directly guide 3D generation. We present equivalent forms of the ODE and SDE such that their random variables are clean images at any time in the diffusion process, and identify that flow consistency is key in this process.
• A multi-view consistent Gaussian noise on the 3D object that preserves the pixel-wise i.i.d. Gaussian property in any single view and has correct correspondence on the object surface between different views.
• A method to distill image diffusion models for 3D generation. It is as simple and efficient as SDS while having significantly better quality and diversity.

2 Preliminaries

2.1 Diffusion Models and Probability Flow Ordinary Differential Equation (PF-ODE)

A forward diffusion process (Sohl-Dickstein et al., 2015; Ho et al., 2020) gradually adds noise to a data point x0 ∼ p0(x0), such that the intermediate distribution pt0(xt | x0) conditioned on the initial sample x0 at diffusion timestep t is N(αt x0, σt² I), which can be equivalently written as
$$x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \tag{1}$$
where α0 = 1, σ0 = 0 at the beginning and αT ≈ 0, σT ≈ 1 at the end, such that pT(xT) is approximately the Gaussian N(0, σT² I). A diffusion model ϵϕ is learned to reverse this process, typically with the following denoising training objective (Ho et al., 2020):
$$\mathcal{L}_{\mathrm{DM}}(\phi) = \mathbb{E}_{x_0, \epsilon, t}\big[\, w_t\, \|\epsilon_\phi(x_t, t) - \epsilon\|_2^2 \,\big]. \tag{2}$$
After training, ϵϕ(xt, t) ≈ −σt ∇xt log pt(xt), where ∇xt log pt(xt) is termed the score function.
A Probability Flow Ordinary Differential Equation (PF-ODE) has the same marginal distribution as the forward diffusion process at any time t (Song et al., 2021b). The PF-ODE can be written as
$$\frac{d(x_t/\alpha_t)}{dt} = \frac{d(\sigma_t/\alpha_t)}{dt}\,\big(-\sigma_t \nabla_{x_t}\log p_t(x_t)\big) \tag{3}$$
$$\frac{d(x_t/\alpha_t)}{dt} = \frac{d(\sigma_t/\alpha_t)}{dt}\,\epsilon_\phi(x_t, t), \qquad x_T \sim p_T(x_T). \tag{4}$$
A data point x0 can be sampled by starting from Gaussian noise xT ∼ N(0, σT² I) and following the PF-ODE trajectory from t = T to t = 0, typically with discretized timesteps and an ODE solver.
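To make this procedure concrete, the following is a minimal sketch of first-order (DDIM-style) integration of the PF-ODE with discretized timesteps; `eps_model`, `alphas`, and `sigmas` are hypothetical stand-ins for a trained network ϵϕ and its noise schedule, not part of the paper's code.

```python
import torch

@torch.no_grad()
def sample_pf_ode(eps_model, alphas, sigmas, shape, num_steps=50, device="cuda"):
    """First-order (DDIM-style) integration of the PF-ODE in Eq. 3-4 (a sketch).

    eps_model(x_t, t) is assumed to predict the noise; alphas[t] and sigmas[t]
    give the schedule values alpha_t and sigma_t at discrete timesteps.
    """
    # Discretized timesteps from t = T down to t = 0.
    ts = torch.linspace(len(alphas) - 1, 0, num_steps + 1).long()
    # Start from pure Gaussian noise x_T ~ N(0, sigma_T^2 I).
    x = sigmas[ts[0]] * torch.randn(shape, device=device)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t_cur)
        # Euler step on d(x_t / alpha_t) = d(sigma_t / alpha_t) * eps.
        x_over_alpha = x / alphas[t_cur] + (
            sigmas[t_next] / alphas[t_next] - sigmas[t_cur] / alphas[t_cur]) * eps
        x = alphas[t_next] * x_over_alpha
    return x
```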

2.2 Differentiable 3D Representations

Differentiable 3D representations are typically parameterized by learnable parameters θ and a differentiable rendering function gθ(c) that renders images corresponding to camera views c. In many tasks, the gradient is first obtained on the rendered images gθ(c) and then backpropagated through the Jacobian matrix ∂gθ(c)/∂θ of the renderer to the learnable parameters θ.
Common 3D neural representations include Neural Radiance Fields (NeRF) (Mildenhall et al., 2021; Müller et al., 2022; Wang et al., 2021; Barron et al., 2021; Xu et al., 2022), 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023), and meshes (Laine et al., 2020; Shen et al., 2021). In this work, we perform experiments on various 3D representations and validate that our method is applicable to generation across a wide range of 3D representations.

3 Consistent Flow Distillation


We present Consistent Flow Distillation (CFD), which takes a pretrained and frozen text-to-image diffusion model and distills a 3D representation using the gradient from the probability flow of the 2D image diffusion model. We propose to guide 3D generation with 2D clean flow gradients operating jointly on a 3D object. We identify that a key in this process is to make the flow guidance consistent across different camera views (see Sec. 3.1). We further propose an SDE, a generalization of the clean flow ODE, that incorporates noise injection during optimization to enhance generation quality (see Sec. 3.2). To achieve the consistent flow, we propose an algorithm to compute a multi-view consistent Gaussian noise, which provides noise for different views with the noise texture exactly aligned on the surface of the 3D object (see Sec. 3.3). Finally, we draw connections between CFD and other score distillation methods (see Sec. 3.4).

3.1 3D Generation with 2D Clean Flow Gradient

Given a pretrained text-to-image diffusion model ϵϕ(xt, t, y), where y denotes the condition (text prompt), the conditional distribution p(x0 | y) can be sampled by following the PF-ODE (Song et al., 2021b) trajectory from t = T to t = 0, which takes the form
$$d\!\left(\frac{x_t}{\alpha_t}\right) = \underbrace{d\!\left(\frac{\sigma_t}{\alpha_t}\right)}_{-lr} \cdot\, \underbrace{\epsilon_\phi(x_t, t, y)}_{\nabla L}. \tag{5}$$
By following the diffusion PF-ODE, pure Gaussian noise is transformed into an image in the target distribution p(x0 | y). Thus the PF-ODE can be interpreted as guiding the refinement of a noisy image into a realistic image. Can we use the image PF-ODE to directly guide the generation of a differentiable 3D representation θ through this refining process, with θ as its learnable parameters and gθ as its differentiable rendering function?

A direct implementation is to substitute the noisy images in Eq. 5 with the rendered images gθ(c) at camera view c by letting xt/αt = gθ(c). By viewing d(σt/αt) as the learning rate lr of an optimizer and ϵϕ(xt, t, y) as the loss gradient with respect to xt/αt, the gradient can be backpropagated through the Jacobian matrix of the renderer gθ(c) to update θ according to
$$\Delta\theta = -lr \cdot \epsilon_\phi(\alpha_t g_\theta(c), t, y)\, \frac{\partial g_\theta(c)}{\partial \theta}. \tag{6}$$
However, such a direct attempt may not work (see Fig. 5 (a)), since the image xt at diffusion timestep t contains Gaussian noise. It is hard for the images rendered by a 3D representation to match the noisy images xt/αt of an image PF-ODE, particularly near the beginning t = T, where xT is per-pixel independent Gaussian noise. It is generally impossible for a continuous 3D representation to be rendered as per-pixel independent Gaussian noise from all camera views simultaneously. As a result, the rendered views may be out-of-distribution (OOD) as inputs to the pretrained image diffusion model, and therefore cannot provide meaningful gradients as guidance.
To resolve the OOD issue, we use a change of variable (Gu et al., 2023; Yan et al., 2024) to transform the original noisy variable xt in the PF-ODE (Eq. 5) into a new variable that is free of Gaussian noise at any time t ∈ [0, T]. For each trajectory {xt}t∈[0,T] of the original PF-ODE, the new variable x̂ct is defined as
$$\hat{x}^c_t \triangleq \frac{x_t - \sigma_t \tilde{\epsilon}}{\alpha_t}, \tag{7}$$
where ϵ̃ is set to the initial noise ϵ̃ = xT/σT and is a constant for each ODE trajectory {xt}t∈[0,T]. By Eq. 5 and Eq. 7, the evolution of the new variable x̂ct is derived as
$$d\hat{x}^c_t = \underbrace{d\!\left(\frac{\sigma_t}{\alpha_t}\right)}_{-lr} \cdot\, \underbrace{\big(\epsilon_\phi(\alpha_t \hat{x}^c_t + \sigma_t \tilde{\epsilon}, t, y) - \tilde{\epsilon}\big)}_{\nabla L}. \tag{8}$$


Figure 2: Overview of CFD. The 3D representation θ is generated with decreasing timesteps. At each timestep t, different views gθ(c) are rendered. The 2D image clean flow provides the gradient at timestep t to the views, which is backpropagated to θ. The right part of the figure shows the gradient computation in detail: we add a multi-view consistent noise (see Fig. 3) to the rendered image and pass it into the frozen text-to-image diffusion model; the gradient is calculated using the model prediction and then backpropagated to θ.

Changing the variable xt of the original diffusion PF-ODE to the variable x̂ct makes it possible to directly use the PF-ODE as a 3D guidance, owing to the following properties (the proof is in Appx. G.3): (i) x̂ct is a clean image for all t ∈ [0, T] (see Appx. Fig. 17); therefore, it can be substituted with the rendered clean images gθ(c). (ii) x̂ct is initialized from zero, x̂cT = 0, which can be consistent with the 3D representation initialization (e.g. NeRF, where the entire scene is initialized to a uniform gray). (iii) The endpoint of the new ODE trajectory, x̂c0 = x0, is a sample following the target distribution p0(x0) and is completely determined by the constant ϵ̃ (thus ϵ̃ can be viewed as the identity of the trajectory). The new variable x̂ct is therefore termed the clean variable. Note that x̂ct is different from the “sample prediction” x̂gtt ≜ (xt − σt ϵϕ(xt, t, y))/αt of the diffusion network for xt, which is not directly usable in this framework; we discuss this in more detail in Appx. H. We use clean flow to denote the ODE (Eq. 8) of the clean variable x̂ct.
Similar to Eq. 6, we use the following gradient to update the 3D representation θ:
$$\nabla_\theta \mathcal{L}_{\mathrm{CFD}}(\theta) = \mathbb{E}_c\!\left[ \big(\epsilon_\phi(\alpha_t g_\theta(c) + \sigma_t \tilde{\epsilon}(\theta, c), t, y) - \tilde{\epsilon}(\theta, c)\big)\, \frac{\partial g_\theta(c)}{\partial \theta} \right], \tag{9}$$

where t = t(τ) is a predefined, monotonically decreasing timestep annealing function of the optimization time τ, and ϵ̃(θ, c) is a multi-view consistent Gaussian noise function whose design we detail in Sec. 3.3. We let ϵ̃(θ, c) be a deterministic function of θ and c, ensuring that the noise remains constant for a fixed camera view and geometry, since ϵ̃ is constant for a single flow trajectory in the clean flow ODE. Because a set of 2D image flows jointly operates on one 3D object, the gradient updates from different camera views in Eq. 9 may interfere with each other. We identify that the key to the 3D sampling process is to make the 2D image flows consistent on the 3D object surface. This requires a multi-view consistent Gaussian noise function ϵ̃(θ, c) that is not only view-dependent but also provides the correct local correlation on the object surface. The multi-view consistent Gaussian noise function should apply a similar noise pattern to the same region of the object surface, even when seen from different camera views. This mirrors the 2D image clean flow ODE, where a fixed noise pattern is always added to the same region of the clean variable. The overall process of CFD is summarized in Fig. 2.
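For illustration, a minimal sketch of one CFD optimization step implementing Eq. 9 is shown below; `render`, `consistent_noise`, `diffusion_eps`, and the scheduled timestep `t` are hypothetical placeholders for the renderer gθ(c), the multi-view consistent noise ϵ̃(θ, c) of Sec. 3.3, the frozen diffusion model ϵϕ, and the annealed timestep t(τ).

```python
import torch

def cfd_step(render, consistent_noise, diffusion_eps,
             alphas, sigmas, t, camera, text_emb, optimizer):
    """One CFD update implementing Eq. 9 (a sketch, not the paper's implementation)."""
    optimizer.zero_grad()
    img = render(camera)                      # g_theta(c), differentiable w.r.t. theta
    with torch.no_grad():
        eps_tilde = consistent_noise(camera)  # multi-view consistent noise (Sec. 3.3)
        x_t = alphas[t] * img + sigmas[t] * eps_tilde   # re-noised rendering
        eps_pred = diffusion_eps(x_t, t, text_emb)      # frozen epsilon_phi prediction
        grad = eps_pred - eps_tilde           # clean-flow gradient on the image
    # Backpropagate grad through the renderer Jacobian d g_theta(c) / d theta.
    img.backward(gradient=grad)
    optimizer.step()
```

In practice, when distilling a latent diffusion model such as Stable Diffusion, the "image" here would be a latent, and the model prediction would additionally be combined with classifier-free guidance.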

3.2 Guiding 3D Generation with Diffusion SDE

Although the PF-ODE and the diffusion SDE recover the same marginal distributions in theory, SDE-based stochastic sampling may result in better generation quality, as reported in prior works (Song et al., 2021b;a; Karras et al., 2022). Motivated by this, we also propose to use the image diffusion SDE to guide 3D generation.

Rasterize & Aggregate


Reference Space 𝑬𝒓𝒆𝒇

𝑛 𝑛
1 1
𝑥= 𝑥=
𝑛 𝑛

Warping 𝒯 −1

Camera-to-World Camera-to-World
Projection Projection

Camera View 𝒄𝟏 3D World Space 𝑬𝒘𝒐𝒓𝒍𝒅 Camera View 𝒄𝟐

Figure 3: Warping consistent noise for query views. To obtain a query view noise map, for
each pixel, its vertices are projected onto the object surface, then wrapped to the coordinates in a
high-resolution noise map. The values within the region specified by the coordinates on the high-
resolution noise map are summed and normalized as the return pixel value in the query view noise.

To achieve this, we propose a reverse-time SDE with a form similar to the clean flow ODE (Eq. 8):
$$d\hat{x}^c_t = \underbrace{\left( d\!\left(\frac{\sigma_t}{\alpha_t}\right) + \frac{\sigma_t}{\alpha_t}\beta_t\, dt \right)}_{-lr} \cdot\, \underbrace{\big(\epsilon_\phi(\alpha_t \hat{x}^c_t + \sigma_t \tilde{\epsilon}_t, t, y) - \tilde{\epsilon}_t\big)}_{\nabla L}, \qquad d\tilde{\epsilon}_t = \tilde{\epsilon}_t \beta_t\, dt + \sqrt{2\beta_t}\, d\bar{w}_t, \tag{10}$$

with initial condition x̂cT = 0 and ϵ̃T ∼ N(0, I), where w̄t is a standard Wiener process in reverse time from T to 0. It can further be shown that this SDE and its forward-time form are equivalent to the diffusion SDE presented by Song et al. (2021b) and EDM (Karras et al., 2022). When we set βt = 0, the SDE becomes deterministic and reduces to the clean flow ODE. When βt ≠ 0, new Gaussian noise is injected into ϵ̃t during the diffusion process, but ϵ̃t remains of unit variance throughout the whole process from T to 0. Furthermore, x̂ct in this SDE still retains the “clean” properties of x̂ct in the clean flow ODE. Thus, we also use clean flow to refer to this SDE. We provide detailed discussions and proofs about this SDE in Appx. G.
The clean flow SDE implies that a simple modification of Eq. 9 makes ∇θ LCFD(θ) correspond to SDE guidance. As detailed in Appx. G.4.1, we only need to inject new Gaussian noise into ϵ̃(θ, c) during optimization by
$$\tilde{\epsilon}(\tau + 1) = \sqrt{1 - \gamma}\;\tilde{\epsilon}(\tau) + \sqrt{\gamma}\,\epsilon, \tag{11}$$
where γ is a predefined noise injection rate, τ is the optimization step, and ϵ ∼ N(0, I) is sampled at each optimization step.
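As a minimal sketch under these definitions, the update of Eq. 11 can be implemented as follows; `noise_state` is a hypothetical buffer holding the current ϵ̃ values used by the consistent-noise function.

```python
import torch

def inject_noise(noise_state: torch.Tensor, gamma: float) -> torch.Tensor:
    """Eq. 11: blend fresh Gaussian noise into the persistent noise so that the
    result stays unit-variance (SDE guidance); gamma = 0 recovers the ODE case."""
    fresh = torch.randn_like(noise_state)
    return (1.0 - gamma) ** 0.5 * noise_state + gamma ** 0.5 * fresh
```

With the default γ = 0.0001 used later in the paper, the persistent noise drifts only slowly, close to the small stochasticity of a DDPM-like sampler.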

3.3 Multi-View Consistent Gaussian Noise ϵ̃

To get consistent flow, a multi-view consistent Gaussian noise function ϵ̃(θ, c) is required, which (i) yields per-pixel independent Gaussian noise for every camera view c, and (ii) produces noise patterns from different views with the correct correspondence on the 3D object surface. It is non-trivial to satisfy both properties with common warping and interpolation methods. The query rays from a camera view c take continuous coordinates, and simply using common interpolation methods such as bilinear interpolation may break the per-pixel independence and result in poor quality (see Fig. 5 (b)).


Inspired by Integral Noise (Chang et al., 2024), we develop an algorithm that implements the multi-view consistent Gaussian noise with the Noise Transport Equation. The Noise Transport Equation was originally proposed for warping noise between two frames of a video (Chang et al., 2024). To use it in the 3D task, we generalize the Noise Transport Equation to warping between two different manifolds and compute the warping from different query camera views to the same reference space Eref. As shown in Fig. 3, given a camera view c, a query pixel p is first projected onto the surface of the object via the camera-to-world projection ctwc(p) in the world space Eworld; we then map those surface points to a reference space Eref through a predefined mapping function T⁻¹ (design details are in Appx. D). We define a high-resolution Gaussian noise map W on Eref. Finally, we aggregate and return the noise value G(p) for the query pixel p according to
$$G(p) = \frac{1}{\sqrt{|\Omega_p|}} \sum_{A_i \in \Omega_p} W(A_i), \tag{12}$$
where Ωp = T⁻¹(ctwc(p)) is the area covered by p after being warped to Eref, Ai is a noise cell in Eref, and W(Ai) is the unit-variance noise value at Ai. By first projecting query pixels from different camera views onto the object surface in the world space Eworld, two query pixels p1, p2 from two different camera views that look at the same region of the object are projected to overlapping regions ctwc1(p1), ctwc2(p2) on the object. After being warped by the same function T⁻¹, they cover overlapping regions Ωp1, Ωp2 and obtain the correct correlation in the noise values G(p1), G(p2).
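A minimal sketch of the aggregation in Eq. 12 is given below. It assumes a precomputed mapping that lists, for each query pixel, the indices of the noise cells Ai it covers in the high-resolution reference noise map W; the rasterization that produces this mapping is not shown, and giving fresh i.i.d. noise to pixels whose rays miss the surface is an assumption for background handling.

```python
import torch

def aggregate_noise(W: torch.Tensor, cell_indices: list[torch.Tensor]) -> torch.Tensor:
    """Noise Transport Equation (Eq. 12): sum the unit-variance noise cells covered
    by each warped pixel and divide by sqrt(|Omega_p|) so the result stays N(0, 1)."""
    flat_W = W.flatten()
    values = []
    for idx in cell_indices:                  # idx: indices of cells A_i in Omega_p
        count = idx.numel()
        if count > 0:
            values.append(flat_W[idx].sum() / count ** 0.5)
        else:
            # Assumed fallback: pixels covering no surface get fresh i.i.d. noise.
            values.append(torch.randn((), device=W.device))
    return torch.stack(values)
```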
Our method can also be viewed as deriving a rendering function for the noisy variable xt in the original form of the PF-ODE (Eq. 5):
$$x_t(\theta, c) = \alpha_t g_\theta(c) + \sigma_t \tilde{\epsilon}(\theta, c). \tag{13}$$
As discussed in Integral Noise (Chang et al., 2024), the warping of an image gθ(c) follows a transport equation of a similar form to Eq. 12, but with denominator |Ωp| instead of √|Ωp| as for ϵ̃(θ, c). Common 3D representations are therefore incapable of rendering the Gaussian noise ϵ̃(θ, c), and the noisy variable needs to be disentangled into the clean part gθ(c) and the noise part ϵ̃(θ, c). With this disentanglement, we can handle the two parts, which follow different rendering equations, separately, and render the noisy variable needed for using the image PF-ODE (or diffusion SDE) as guidance for 3D generation.

3.4 Comparison with Other Score Distillation Methods

Comparison with SDS. Both SDS and our CFD share a similar gradient form, (ϵϕ(xt, t, y) − ϵ̃) ∂gθ(c)/∂θ, to update the 3D representation θ from a sampled rendered view. In SDS, t is typically randomly sampled from a range [tmin, tmax], and ϵ̃ is a noise randomly sampled at each step. In contrast, our CFD uses an annealed timestep t(τ) that decreases from tmax to tmin, and the deterministic noise ϵ̃(θ, c) depends on both the object surface and the camera view; it is designed so that the noise from different views has the correct correspondence on the object surface. Notably, SDS with an annealed timestep schedule can be viewed as setting γ = 1 in CFD, where significant stochasticity is injected into the optimization. As a comparison, for typical diffusion sampling processes, γ ≈ 0.00024 in DDPM and γ = 0 in DDIM (see Appx. G.4.2). In our CFD, the definition of γ requires γ < 1 (Appx. Eq. 37), which implies a difference between CFD and SDS.
Theoretically, when restricted to 2D image generation where x = gθ(c), SDS is equivalent to seeking the maximum-likelihood point of the noisy distribution pt with a Gaussian distribution N(αt x, σt² I) centered at the image x. When the optimization of the SDS loss is near optimal, the generation results are centered around a few modes (Poole et al., 2023). In contrast, our CFD samples from the whole distribution p0 and is equivalent to a diffusion ODE or SDE sampling process with first-order discretization. Thus, CFD can generate more diverse results with better quality.

              loss gradient              noising method
SDS           ϵϕ(xt) − ϵ                 ϵ ∼ N(0, I)
VSD           ϵϕ(xt) − ϵlora(xt)         ϵ ∼ N(0, I)
ISM           ϵϕ(xt) − ϵϕ(xs)            DDIM inversion(gθ(c))
CFD (ours)    ϵϕ(xt) − ϵ̃                 ϵ̃ = ϵ̃t(θ, c)

Table 1: Comparison between score distillation gradients.


DreamFusion (SDS)    ProlificDreamer (VSD)    HiFA    LucidDreamer (ISM)    CFD (ours)

Figure 4: Visual comparison to baseline methods. We compare rendered images of our method with baselines including DreamFusion (Poole et al., 2023), ProlificDreamer (Wang et al., 2024a), HiFA (Zhu et al., 2024), and LucidDreamer (Liang et al., 2023). The images of the baselines are from their official implementations. Prompts: “A 3D model of an adorable cottage with a thatched roof” (top) and “A DSLR photo of an ice cream sundae” (bottom).

Comparison with other score distillation methods. We list the loss gradient and noising of different methods in Tab. 1. ISM (Liang et al., 2023) incorporates DDIM-inversion noising in its score distillation. While this approach can yield finer details than SDS, computing the inversion significantly increases computational cost. We discuss the connection between our method and ISM in Appx. H. We also list the differences between the proposed pipeline and the baseline methods in Appx. E.2.

4 Experiments

For comparisons with prior methods, we distill Stable Diffusion (Rombach et al., 2022) and use the same codebase, threestudio (Guo et al., 2023). We compare CFD with various prior state-of-the-art methods, including SDS (Poole et al., 2023; Wang et al., 2023a), VSD (Wang et al., 2024a), and ISM (Liang et al., 2023). Specifically, VSD incorporates LoRA network training in its score distillation, and ISM incorporates DDIM inversion in its score distillation. Since timestep annealing has been shown to help improve generation quality (Zhu et al., 2024; Wang et al., 2024a; Huang et al., 2024), we also apply timestep annealing to all baseline methods. We use results from the official implementations of the baselines in qualitative comparisons unless otherwise specified. In addition, we show results of a 2-stage pipeline in Fig. 1(a), 1(b), where we first distill MVDream (Shi et al., 2024) and then distill Stable Diffusion, which alleviates the multi-face issue (Poole et al., 2023; Armandpour et al., 2023; Hong et al., 2023a). We provide implementation details in Appx. A and details of the experiment metrics in Appx. B.

4.1 Comparison with Baselines

We compute 3D-FID following VSD (Wang et al., 2024a) to evaluate the quality and diversity of different score distillation methods, and compute 3D-CLIP to evaluate prompt alignment. We provide a qualitative comparison in Fig. 4 and quantitative results in Tab. 2, 3, and Appx. Tab. 5. We also provide additional comparisons with VSD in Appx. Fig. 9, ISM in Appx. Fig. 10, and SDS in Appx. Fig. 11. As shown in both the quantitative and qualitative results, CFD outperforms all baseline methods, with better generation quality (Fig. 4 and Appx. Fig. 9, 10, 11) and diversity (Appx. Fig. 9, 10, 11). Our method produces rich details and more photorealistic results. Additional results and comparisons are in Appx. C.

              3D-FID ↓   3D-CLIP ↑
SDS           88.06      35.07±0.20
ISM           86.00      34.99±0.26
VSD           83.02      35.10±0.20
CFD (ours)    78.13      35.16±0.23

Table 2: Comparison with baselines on quality, diversity, and prompt alignment. We report the CLIP score averaged over different versions of CLIP backbones. We use 10 seeds for each of the 10 different prompts.


(a) Original PF-ODE    (b) w/ bilinear noise    (c) w/ random noise    (d) w/ consistent noise

Figure 5: Ablation on the noise design and the flow space. (a) Directly training θ with the original PF-ODE using Eq. 6 on the noisy variable. (b) Distilling with a bilinear-interpolated noise map. (c) Distilling with random noise. (d) Distilling with our multi-view consistent Gaussian noise, which has the best visual quality.

Ranker          Aesthetics   PickScore
Ours vs. SDS    0.54         0.64
Ours vs. VSD    0.60         0.68
Ours vs. ISM    0.56         0.66
Ours vs. FSD    0.54         0.78

Table 3: Automated win rates comparison under reward models. We compare the performance of our CFD method against baseline models using Aesthetics Scores (Schuhmann, 2022) and PickScores (Kirstain et al., 2023). Our method consistently achieves a winning rate higher than 0.5, which demonstrates its effectiveness.

4.2 Ablation Studies

Ablation on the flow space. As shown in Fig. 5: (a) when directly training θ with the original PF-ODE using Eq. 6 on the noisy variable, the training fails after several iterations; (b) simply using bilinear interpolation instead of the Noise Transport Equation leads to correlated pixel noise and generates blurry results; (c) when using random noise as in SDS, the results are over-smoothed; (d) our consistent flow distillation with multi-view consistent Gaussian noise generates high-quality results. By using a multi-view consistent Gaussian noise, the flow for a fixed camera is more closely aligned with a diffusion sampling process, and the quality improves. We also provide additional ablations on our design choices in Appx. E.

Ablation on noise injection rate γ. The noise injection rate γ in Eq. 11 determines the rate at which new noise is injected into the noise function. When γ = 0, no noise is injected, ϵ̃ stays constant as long as the geometry and camera view are fixed, and CFD corresponds to using ODE guidance. When γ > 0, new noise is injected and ϵ̃(θ, c) gradually changes; in this case, CFD corresponds to using SDE guidance. Using SDE-based stochastic samplers may help improve image generation quality, as reported in prior works (Song et al., 2021b;a; Karras et al., 2022). In Tab. 4, we observe that using a small nonzero γ helps improve the performance of CFD. In practice, we found that using γ larger than 0.0001 could result in over-smoothed textures, so we set γ = 0.0001 by default in our experiments for CFD. As a reference, we calculated a typical equivalent γ value of DDPM to be γ ≈ 0.00024 (see Appx. G.4.2).

5 Related Work

Diffusion models Diffusion models (Sohl-Dickstein et al., 2015; Sharma et al., 2018; Ho et al., 2020; Song et al., 2021b; Changpinyo et al., 2021; Schuhmann et al., 2022) are generative models that learn to reverse a diffusion process. A diffusion process gradually adds noise to a data distribution, and the diffusion model is trained to reverse this iterative process based on the score function. Denoising Diffusion Implicit Models (DDIM) (Song et al., 2021a) proposed a deterministic sampling method to speed up sampling.

γ           0.0         0.0001      0.001       0.01        1.0
3D-IS (↑)   2.24±0.12   2.60±0.21   2.47±0.39   2.08±0.04   1.77±0.13

Table 4: Ablation on noise injection rate γ. We ablate the impact of γ on 3D generation diversity and quality. We generate samples with 16 random seeds.

Meanwhile, it has been shown that a diffusion process corresponds to a Probability Flow Ordinary Differential Equation (PF-ODE) (Song et al., 2021b), which yields the same marginal distributions as the forward diffusion process at any timestep. Later works (Salimans & Ho, 2022; Karras et al., 2022; Lu et al., 2022) demonstrate that DDIM can be viewed as a first-order discretization of the PF-ODE.

Score distillation sampling The score distillation sampling (SDS) paradigm for distilling 2D text-to-image diffusion models for 3D generation was proposed in DreamFusion (Poole et al., 2023) and SJC (Wang et al., 2023a). During the distillation process, a learnable 3D representation with differentiable rendering is optimized by the gradient so that the rendered views match the given text. Many recent works follow the SDS paradigm and study various aspects, including timestep annealing (Huang et al., 2024; Wang et al., 2024a; Zhu et al., 2024), coarse-to-fine training (Lin et al., 2023; Wang et al., 2024a; Chen et al., 2023), analysis of the components (Katzir et al., 2024), formulation refinement (Zhu et al., 2024; Wang et al., 2024a; Liang et al., 2023; Tang et al., 2023; Wang et al., 2023b; Yu et al., 2024; Armandpour et al., 2023; Wu et al., 2024b; Yan et al., 2024), geometry-texture disentanglement (Chen et al., 2023; Ma et al., 2023; Wang et al., 2024a), and addressing the multi-face Janus problem by replacing the text-to-image diffusion model with novel-view-synthesis diffusion (Liu et al., 2023; Long et al., 2023; Liu et al., 2024b; Weng et al., 2023; Ye et al., 2023; Wang & Shi, 2023) or multi-view diffusion (Shi et al., 2024).

Reconstruction models Another prevailing paradigm for 3D generation is to reconstruct the 3D shape given an input image. A typical pipeline first generates sparse-view images and then reconstructs the 3D shape using reconstruction methods (Wu et al., 2024a; Li et al., 2024) or models (Hong et al., 2023b; Liu et al., 2024a; Wang et al., 2024b; Tang et al., 2024). By directly training on relatively large-scale 3D datasets such as Objaverse (Deitke et al., 2023), these methods are usually capable of generating plausible shapes quickly, but their performance is usually limited on out-of-domain input images.

6 Conclusion

In this paper, we proposed Consistent Flow Distillation. We leverage the gradient of the diffusion ODE or SDE sampling process to guide 3D generation. From a sampling perspective, we identified that using a consistent flow to guide the 3D generation is the key to this process. We developed a multi-view consistent Gaussian noise with correct correspondence on the object surface and used it to implement the consistent flow. Our method can generate high-quality 3D representations by distilling 2D image diffusion models and shows improvements in quality and diversity compared with prior score distillation methods.

Limitations and broader impact. Although CFD can generate 3D assets of high fidelity and diversity, similar to prior works (SDS, ISM, and VSD), generation can take one to a few hours. When distilling a text-to-image diffusion model, due to the properties of the teacher model, the distilled 3D representation may exhibit the multi-face Janus problem and may not handle complex prompts well. Besides, due to the flexibility of 3D representations and interference from other views, it is hard in practice to guarantee that the sampling process for a rendered view of the 3D object is exactly the same as sampling a 2D image given the text. While our 3D consistent noise reduces this interference and achieves better results, the flow for 3D rendered views may not exactly match the 2D flows of the initial noise. Finally, as with other generative models, care must be taken to avoid generating fake or malicious content.


References
Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan
Zhou. Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus
problem and beyond. arXiv preprint arXiv:2304.04968, 2023.

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas
Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 843–852, 2023.

Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and
Pratul P. Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,
2021.

Pascal Chang, Jingwei Tang, Markus Gross, and Vinicius C. Azevedo. How i warped your noise:
a temporally-correlated noise prior for diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=pzElnMrgSD.

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing
web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558–3568, 2021.

Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and
appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision (ICCV), October 2023.

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig
Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of anno-
tated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 13142–13153, 2023.

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, and Joshua M Susskind. Boot: Data-free distillation of denoising diffusion models with bootstrapping. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.

Yuan-Chen Guo, Ying-Tian Liu, Ruizhi Shao, Christian Laforte, Vikram Voleti, Guan Luo, Chia-
Hao Chen, Zi-Xin Zou, Chen Wang, Yan-Pei Cao, and Song-Hai Zhang. threestudio: A unified
framework for 3d content generation. https://github.com/threestudio-project/threestudio, 2023.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in
neural information processing systems, 33:6840–6851, 2020.

Susung Hong, Donghoon Ahn, and Seungryong Kim. Debiasing scores and prompts of 2d diffusion
for robust text-to-3d generation. arXiv preprint arXiv:2303.15413, 2023a.

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli,
Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint
arXiv:2311.04400, 2023b.

Yukun Huang, Jianan Wang, Yukai Shi, Boshi Tang, Xianbiao Qi, and Lei Zhang. Dreamtime: An
improved optimization strategy for diffusion-guided 3d generation. In The Twelfth International
Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=1bAUywYJTU.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-
based generative models. Advances in Neural Information Processing Systems, 35:26565–26577,
2022.


Oren Katzir, Or Patashnik, Daniel Cohen-Or, and Dani Lischinski. Noise-free score distillation.
In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=dlIMcmlAdk.
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splat-
ting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-
a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural
Information Processing Systems, 36:36652–36663, 2023.
Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular
primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6),
2020.
Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang
Zhang, Wenhan Luo, Ping Tan, et al. Era3d: High-resolution multiview diffusion using efficient
row-wise attention. arXiv preprint arXiv:2405.11616, 2024.
Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. Lucid-
dreamer: Towards high-fidelity text-to-3d generation via interval score matching. arXiv preprint
arXiv:2311.11284, 2023.
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten
Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d con-
tent creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 300–309, 2023.
Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and
sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of
computer vision, pp. 5404–5411, 2024.
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-
2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in
Neural Information Processing Systems, 36, 2024a.
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick.
Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 9298–9309, 2023.
Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang.
Syncdreamer: Generating multiview-consistent images from a single-view image. In The Twelfth
International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=MN3yH2ovHb.
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma,
Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d
using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast
ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural
Information Processing Systems, 35:5775–5787, 2022.
Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Campagnolo Guizilini,
Timur Bagautdinov, Vincent Sitzmann, and Justin Solomon. Score distillation via reparametrized
ddim. arXiv preprint arXiv:2405.15891, 2024.
Baorui Ma, Haoge Deng, Junsheng Zhou, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Geo-
dream: Disentangling 2d and geometric priors for high-fidelity and consistent 3d generation.
arXiv preprint arXiv:2311.17971, 2023.
David McAllister, Songwei Ge, Jia-Bin Huang, David W Jacobs, Alexei A Efros, Aleksander Holyn-
ski, and Angjoo Kanazawa. Rethinking score distillation as a bridge between image distributions.
arXiv preprint arXiv:2406.09417, 2024.


Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and
Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications
of the ACM, 65(1):99–106, 2021.

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics prim-
itives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15,
2022.

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d
diffusion. In The Eleventh International Conference on Learning Representations, 2023. URL
https://openreview.net/forum?id=FjNys5c7VyY.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF confer-
ence on computer vision and pattern recognition, pp. 10684–10695, 2022.

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In
International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.

Christoph Schuhmann. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022. Accessed: 2024-11-22.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi
Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An
open large-scale dataset for training next generation image-text models. Advances in Neural
Information Processing Systems, 35:25278–25294, 2022.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
2556–2565, 2018.

Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra:
a hybrid representation for high-resolution 3d shape synthesis. In Advances in Neural Information
Processing Systems (NeurIPS), 2021.

Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view
diffusion for 3d generation. In The Twelfth International Conference on Learning Representa-
tions, 2024. URL https://openreview.net/forum?id=FUgrjq2pbB.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. In International conference on machine learn-
ing, pp. 2256–2265. PMLR, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Interna-
tional Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=St1giarCHLP.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben
Poole. Score-based generative modeling through stochastic differential equations. In Interna-
tional Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS.

Boshi Tang, Jianan Wang, Zhiyong Wu, and Lei Zhang. Stable score distillation for high-quality 3d
generation. arXiv preprint arXiv:2312.09305, 2023.

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm:
Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint
arXiv:2402.05054, 2024.


Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam,
Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using
direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 8228–8238, 2024.
Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jaco-
bian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12619–12629, 2023a.
Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan,
Yilei Li, Qiang Liu, Zhangyang Wang, et al. Taming mode collapse in score distillation for text-
to-3d generation. arXiv preprint arXiv:2401.00909, 2023b.
Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation.
arXiv preprint arXiv:2312.02201, 2023.
Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang.
Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.),
Advances in Neural Information Processing Systems, volume 34, pp. 27171–27183. Curran
Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/e41e164f7485ec4a28741a2d0ea41c74-Paper.pdf.
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Pro-
lificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation.
Advances in Neural Information Processing Systems, 36, 2024a.
Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li,
Hang Su, and Jun Zhu. Crm: Single image to 3d textured mesh with convolutional reconstruction
model. arXiv preprint arXiv:2403.05034, 2024b.
Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, CL Chen, and Lei Zhang.
Consistent123: Improve consistency for one image to 3d object synthesis. arXiv preprint
arXiv:2310.08092, 2023.
Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and
Kaisheng Ma. Unique3d: High-quality and efficient 3d mesh generation from a single image.
arXiv preprint arXiv:2405.20343, 2024a.
Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, and Hanwang Zhang. Consistent3d: Towards
consistent high-fidelity text-to-3d generation with deterministic sampling prior. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9892–9902, 2024b.
Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neu-
mann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pp. 5438–5448, 2022.
Runjie Yan, Kailu Wu, and Kaisheng Ma. Flow score distillation for diverse text-to-3d generation,
2024.
Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, and Heng Wang. Consistent-1-to-3: Consistent im-
age to 3d view synthesis via geometry-aware diffusion models. arXiv preprint arXiv:2310.03020,
2023.
Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, and Jian Zhang. Freedom: Training-free
energy-guided conditional diffusion model. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pp. 23174–23184, 2023.
Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, and XIAOJUAN QI. Text-
to-3d with classifier score distillation. In The Twelfth International Conference on Learning Rep-
resentations, 2024. URL https://openreview.net/forum?id=ktG8Tun1Cy.
Junzhe Zhu, Peiye Zhuang, and Sanmi Koyejo. HIFA: High-fidelity text-to-3d generation with ad-
vanced diffusion guidance. In The Twelfth International Conference on Learning Representations,
2024. URL https://openreview.net/forum?id=IZMPWmcS3H.


Appendix

A Implementation Details

In this paper, we conduct experiments primarily on a single NVIDIA GeForce RTX 3090 or NVIDIA L40 GPU (the latter only for soft shading rendering). In the quantitative experiments, we adopt similar pipelines (including the choice of 3D representation, training steps, shape initialization, teacher diffusion model, etc.) across methods. We apply timestep annealing for all methods and use the same negative prompts in the quantitative experiments. The main differences between methods lie in the loss functions used.

We use a CFG (Ho & Salimans, 2022) scale of 75 for CFD in the quantitative experiments. In practice, we found CFD works best with a CFG scale of 50-75. We apply the same fixed negative prompts (Shi et al., 2024; Katzir et al., 2024; McAllister et al., 2024) for different text prompts.

For simple prompts, we directly use CFD to distill Stable Diffusion v2.1 (Fig. 4, 12 and 13).

For mesh generation, we first use CFD to generate coarse shapes with MVDream (Shi et al., 2024). Then we use CFD and follow the geometry and mesh refinement stages of VSD (Wang et al., 2024a) with Stable Diffusion v2.1 to generate the mesh results in Fig. 1(b).

For complex prompts, we adopt a two-stage pipeline (Fig. 1(a), 1(c), 6, 7, 8, 9, 10 and 11). We first generate a coarse shape by distilling MVDream to avoid multi-face problems (stage 1). Then we distill Stable Diffusion v2.1 to refine the details and colors (stage 2). We randomly replace the rendered image with the normal map with probability 0.2 to regularize the geometry in stage 2. The total training time is approximately 3 hours on an A100 GPU.

B Experiment Details

3D-FID We compute the FID score between the rendered images of the generated 3D samples and the images generated by the teacher diffusion model, following the evaluation setting of VSD (Wang et al., 2024a). For the experiments with 10 prompts in Tab. 2, we sampled 5,000 images per prompt from Stable Diffusion, creating a real image set with a total of 50,000 images. We generated 3D objects using different score distillation methods, with 10 different seeds per prompt for each method. We rendered 60 views for each 3D object, resulting in a fake image set of 6,000 images. We use the FID implementation from the torchmetrics package with feature=2048.
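For reference, a minimal sketch of this 3D-FID computation with torchmetrics is shown below; `real_images` and `rendered_views` are hypothetical uint8 image batches standing in for the Stable Diffusion samples and the rendered views described above.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_3d_fid(real_images: torch.Tensor, rendered_views: torch.Tensor) -> float:
    """FID between teacher-model samples (real set) and rendered views (fake set).

    Both inputs are expected as uint8 tensors of shape (N, 3, H, W).
    """
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(rendered_views, real=False)
    return fid.compute().item()
```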

3D-IS We compute the Inception Score (IS) on front-view images to measure quality and diversity. We set split=2 to compute the standard deviation of the IS metric. Due to a limited compute budget, we use 16 random seeds for each parameter setting of γ and then use the rendered front views to compute the IS metric. The IS implementation used in our experiments is from the torchmetrics package.

3D-CLIP We compute the CLIP cosine similarity between the rendered images of the 3D samples and the corresponding text prompt. For each sample, we render 120 views and take the maximum CLIP score. We then average the CLIP score across different seeds and prompts (and CLIP models). We use the CLIP score implementation from the torchmetrics package.

Aesthetic evaluation Following Diffusion-DPO (Wallace et al., 2024), we conduct an automated win rate comparison under reward models in Tab. 3. The performance of our CFD method is evaluated against baseline models using Aesthetics Scores (Schuhmann, 2022) and PickScores (Kirstain et al., 2023). We calculate the scores on rendered images generated from 50 samples, each corresponding to a randomly selected prompt.

C Additional Qualitative Comparison

We present more comparisons between baseline methods and CFD in Fig. 9, Fig. 10, and Fig. 11. We present additional generation results of CFD in Fig. 6, Fig. 7, Fig. 8, and Fig. 12.


“A steampunk owl with mechanical wings”

“An astronaut riding a horse”

“A llama in a tuxedo at a fancy gala”

“A cowboy raccoon with a lasso”

Figure 6: Diverse NeRF results of CFD distilling MVDream then Stable Diffusion (Rombach
et al., 2022) on complex prompts.


“A manga magical girl with magic wand” “A 3D anime-style dragon girl with shimmering
scales, horns, and a confident expression”

“A samurai panda with a bamboo sword” “A painter hedgehog with a palette”

“A knight fox in shining armor” “A wizard frog with a spellbook”

“A 3D model of a medieval house with grass, vines, stone, wood, and medieval decor”    “A 3D model of a DSLR camera, photography, box modeling, Maya”

Figure 7: NeRF results of CFD distilling MVDream then Stable Diffusion (Rombach et al.,
2022) on complex prompts.


“A squirrel playing the guitar” “A pig wearing a backpack”

“A weathered brass compass with a cracked glass face, resting on an old map. The compass is slightly tarnished, showing signs of age, bathed in soft, diffused sunlight.”

“A cracked ceramic mug, chipped along the rim and faded from years of use, resting on a rustic wooden table, with morning sunlight casting soft shadows across its surface.”

“An old leather suitcase, its corners frayed and its surface marked with age, labeled with vintage travel tags, placed on a wooden floor bathed in soft light.”

“A delicate porcelain teacup with a gold-rimmed edge, resting on an embroidered tablecloth. Soft light gleams off the fine china, revealing its intricate floral design and subtle cracks from age.”

Figure 8: NeRF results of CFD distilling MVDream then Stable Diffusion (Rombach et al., 2022) on complex prompts. CFD successfully generates multiple objects, and most results align with the long prompts.

D Algorithms
We provide pseudo algorithms for CFD in Algorithm 1. Algorithm 2 presents how to compute the
multi-view consistent Gaussian noise ϵ̃(θ, c).

Choices of warping function T⁻¹ and reference space Eref Generally speaking, correct correspondence of the noise map between different camera views can be achieved with any choice of continuous warping function T⁻¹ and reference space Eref. In this work, we choose Eref to be a 2D square space Eref = [−1, 1]² to utilize existing fast rasterization algorithms, so that Algorithm 2 can be computed efficiently. We design a warping function T⁻¹ to map points in the 3D world space Eworld to the 2D reference space Eref. Specifically, to compute the warping T⁻¹, we first convert the

VSD CFD (ours)

“A rotary telephone carved out of wood”

“A sliced loaf of fresh bread”

“A plush dragon toy”

“A tarantula, highly detailed”

Figure 9: Additional comparison with ProlificDreamer (VSD) (Wang et al., 2024a). We show
generation results of different methods with different seeds in the last row.


ISM CFD (ours)

“A wooden car”

“A DSLR photo of A Rugged, vintage-inspired hiking boots with a weathered leather finish, best quality, 4K, HD”

“Saber from Fate stay Night, 3D, girl, anime”

“Zombie JOKER, head, HDR, photorealistic, 8K”

Figure 10: Additional comparison with LucidDreamer (ISM) (Liang et al., 2023). We show
generation results of different methods with different seeds in the last row.


“A squirrel knitting a scarf in a cozy living room”    “A teapot shaped like a toy car”    “A highly detailed 3D model of sand castle”

Figure 11: Comparison with SDS. We distill MVDream (Shi et al., 2024) and Stable Diffusion (Rombach et al., 2022) in this experiment. We first generate a coarse shape by distilling MVDream using SDS and CFD, then distill Stable Diffusion to refine the color with SDS and CFD, respectively. In this figure, the only difference between the two methods is the noise function used by SDS and CFD. We use 4 different seeds for each method. SDS tends to generate over-smoothed textures and identical simple shapes. CFD outperforms SDS with better diversity and fidelity.

3D-CLIP ↑
              B16     B32     L14     L14-336
SDS           36.30   35.99   31.82   32.42
VSD           36.58   36.27   31.97   32.67
CFD (ours)    36.79   36.32   32.44   33.10

Table 5: Comparison with baselines on prompt alignment. We use 1 random seed for each of the 128 prompts. B16, B32, L14, and L14-336 denote different versions of CLIP backbones. We observe that CFD is competitive with or outperforms SDS and VSD on prompt alignment.

points at (xp, yp, zp) to spherical coordinates (rp, θp, ϕp). For simplicity, we only present the case when ϕp ∈ [0, π/2). The point is then mapped to (xr, yr) ∈ Eref, where
$$x_r = \sqrt{1 - \cos\theta_p}, \qquad y_r = \sqrt{1 - \cos\theta_p}\cdot\left(2\cdot\frac{\phi_p}{\pi/2} - 1\right). \tag{14}$$


“A plate piled high with chocolate chip cookies”    “A ripe strawberry”    “A baby bunny sitting on top of a stack of pancakes”

“A car made out of sushi”    “A bagel filled with cream cheese and lox”    “A hotdog in a tutu skirt”

“A marble bust of a mouse”    “A delicious croissant”    “A small saguaro cactus planted in a clay pot”

Figure 12: NeRF results of CFD distilling Stable Diffusion (Rombach et al., 2022).

Under this mapping function, one can verify that dxr dyr = |∂(xr, yr)/∂(θp, ϕp)| dθp dϕp = (2/π) · sin θp dθp dϕp.
So points uniformly scattered on the sphere in 3D space Eworld will remain uniform after being
mapped to the reference 2D space Eref . This design helps to improve the fairness of Algorithm 2
so that we can use a lower resolution reference space while keeping most of the warped triangles
covering enough area in the reference space Eref . Notably, two different triangles could overlap
with the warping defined by Eq. 14, resulting in correlations across the pixels of the computed
noise function ϵ̃(θ, c) in the same camera view. This overlap occurs only when the surface of the
3D object intersects the radius of a sphere centered at the origin of the Eworld more than once.
However, we do not observe the destructive effects seen in other interpolation methods that can lead
to correlation between pixels (as in Fig. 5 (b)) in our experiments, and we believe it is unnecessary
to find a warping function that avoids such overlapping completely.
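For concreteness, a minimal sketch of the warping T −1 in Eq. 14 is given below (PyTorch-style Python; the function name and the restriction to the ϕp ∈ [0, π/2) case presented in the text are our own illustrative assumptions, not the released implementation).

import math
import torch

def warp_world_to_ref(points_world: torch.Tensor) -> torch.Tensor:
    # Map 3D world-space points to the 2D reference space E_ref (Eq. 14).
    # Only the case phi_p in [0, pi/2) from the text is handled here; the
    # remaining azimuth quadrants (and any normalization into [-1, 1]^2)
    # would have to be treated analogously.
    x, y, z = points_world.unbind(-1)
    r = points_world.norm(dim=-1).clamp_min(1e-8)
    theta = torch.acos((z / r).clamp(-1.0, 1.0))        # polar angle theta_p
    phi = torch.atan2(y, x) % (2 * math.pi)             # azimuth phi_p
    s = torch.sqrt(1.0 - torch.cos(theta))              # common factor sqrt(1 - cos theta_p)
    x_ref = s
    y_ref = s * (2.0 * phi / (math.pi / 2) - 1.0)       # Eq. 14
    return torch.stack([x_ref, y_ref], dim=-1)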

Reference space Eref resolution We use a reference space with resolution 2048 × 2048 in most of our experiments. This introduces only an 8.1% computation overhead to our training (tested on an RTX-3090 GPU). The teacher model Stable Diffusion represents the whole object with a latent at resolution 64, and MVDream (Shi et al., 2024) at resolution 32, so a noise map at resolution 2048 is sufficient. We also observe that the quality is similar for noise map resolutions from 512 to 2048.


SDS VSD FSD CFD (ours)

“A zoomed out DSLR photo of 3d model of an adorable cottage with a thatched roof, high resolution, sharp”

“A highly detailed DSLR photo of a 3d model of historical stone castle”

“A DSLR photo of 3D model of a treasure chest full of gold coins and jewels, high resolution, sharp”

Figure 13: Comparison with SDS, VSD and FSD. We distill Stable Diffusion (Rombach et al., 2022) with different score distillation methods in this experiment. CFD outperforms SDS, VSD, and FSD with better visual quality and geometry, and produces richer details.

E ADDITIONAL ABLATIONS

E.1 ABLATION ON THE DESIGN SPACE

We ablate our proposed improvements step by step in this section. Timestep annealing (Wang et al.,
2024a; Zhu et al., 2024; Huang et al., 2024) is helpful for forming finer details. Adding negative
prompts (Shi et al., 2024; Katzir et al., 2024; McAllister et al., 2024) helps to improve generation
styles. We also find that adding negative prompts is crucial when timestep t(τ ) is small. With-
out negative prompts, the color of samples will become unnatural during the optimization at small
timesteps. In this work, we apply negative prompts by directly replacing the unconditional predic-
tion of the diffusion model with the prediction conditioned on negative prompts. Finally, by replacing the randomly sampled noise in SDS with our multi-view consistent Gaussian noise, the generated samples form much richer details and are more diverse. We visualize this ablation in Fig. 15.
We propose utilizing CFD to distill the multi-view diffusion model, MVDream, in Stage 1 as shape
initialization for complex prompts. This decision is based on our observation that both baseline
methods and our CFD can experience multi-face issues when solely distilling SDv2.1 (Fig. 14(a)
and Fig. 14(b)). However, distilling only MVDream produces low-quality results (Fig. 14(c)). To
address these issues, we adopt a two-stage pipeline in our complete method, where Stage 1 initializes
the shape using MVDream, and Stage 2 refines it by distilling SDv2.1. This approach effectively
mitigates the challenges identified above, as illustrated in Fig. 14(d).

E.2 COMPARING THE PIPELINES OF DIFFERENT METHODS

We list the differences between the pipelines of different baseline methods in Tab. 6.


Algorithm 1 CFD
1: Input: 3D representation parameter θ, prompt y, pretrained diffusion model ϵϕ(xt, t, y), renderer gθ(c), annealing time schedule t(τ), learning rate lr.
2: Output: 3D representation parameter θ.
3: for τ from 0 to τend do
4:    Sample camera view c
5:    Render image gθ(c), depth map Depth(c), and opacity map Opacity(c)
6:    Get diffusion timestep t(τ)
7:    Compute 3D consistent noise ϵ̃(θ, c)    ▷ Refer to Algorithm 2
8:    xt ← αt gθ(c) + σt ϵ̃(θ, c)
9:    θ ← θ − lr · (ϵϕ(xt, t(τ), y) − ϵ̃(θ, c)) ∂gθ(c)/∂θ
10: end for

Algorithm 2 Computing 3D Consistent Noise
1: Initialization: Noise background ϵbg, high-resolution noise ϵref in reference space Eref, opacity threshold oth, noise injection rate γ.
2: Input: Depth map Depth(c), opacity map Opacity(c).
3: Output: ϵ̃(θ, c) = ϵout.
4: Triangulate the pixels to p
5: Project those triangles to the surface ctwc(p) in world space Eworld according to Depth(c)
6: Warp the triangles from world space Eworld to reference space Eref as T −1(ctwc(p))
7: Rasterize and aggregate the noise values on ϵref covered by the triangles
8: ϵout ← ϵbg
9: ϵout[Opacity(c) > oth]p ← (1/√n) Σ_{(x,y)i covered by the rasterized triangle T −1(ctwc(p))} ϵref[x, y]
10: if γ > 0 then
11:    ϵbg ← √(1 − γ) ϵbg + √γ · randn_like(ϵbg)    ▷ SDE noise injection
12:    ϵref ← √(1 − γ) ϵref + √γ · randn_like(ϵref)
13: end if
14: Return ϵout
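As a concrete illustration, a minimal PyTorch-style sketch of one CFD update (Algorithm 1, lines 4–9) is shown below. The camera sampler, renderer, consistent-noise function, and diffusion wrapper are placeholders standing in for the actual pipeline components (their names and signatures are ours), and details such as latent-space encoding and classifier-free guidance are omitted.

import torch

def cfd_step(optimizer, sample_camera, render, consistent_noise, diffusion, y, t):
    # One CFD update: render, add multi-view consistent noise, query the
    # diffusion model, and backpropagate the CFD gradient direction
    # (eps_phi(x_t, t, y) - eps_tilde) through the renderer only.
    camera = sample_camera()
    image = render(camera)                       # g_theta(c), differentiable w.r.t. theta
    eps_tilde = consistent_noise(camera)         # Algorithm 2; treated as a constant
    alpha_t, sigma_t = diffusion.alpha_sigma(t)  # noise-schedule coefficients (placeholder API)
    x_t = alpha_t * image.detach() + sigma_t * eps_tilde
    with torch.no_grad():
        eps_pred = diffusion.eps(x_t, t, y)      # eps_phi(x_t, t, y) (placeholder API)
    grad = eps_pred - eps_tilde                  # CFD gradient direction
    optimizer.zero_grad()
    image.backward(gradient=grad)                # applies grad * d g_theta(c) / d theta
    optimizer.step()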

(a) SDS (SDv2.1) (b) CFD (SDv2.1) (c) CFD (MVDream) (d) CFD (2 stage)

Figure 14: Ablation on the pipeline stages. Prompt: “A bear playing an electric bass”.

E.3 COMPARISON OF NOISING METHODS

We list the differences between the noising methods of different baseline methods in Tab. 7. Concurrent work FSD (Yan et al., 2024) also employs a deterministic, view-dependent noising function and can therefore be considered a special case of our CFD with γ = 0. The noise of FSD is aligned on a sphere independent of the 3D object surface. However, this noise design can still lead to oversmoothed textures, and the misalignment of noise with the 3D object surface can sometimes result in suboptimal geometry (see Fig. 13). The noise design of FSD is inferior to ours when the 3D object shape is nearly formed. Gradient consistency is essential for accurately constructing geometry in differentiable 3D representations like NeRF. Aligning noise in 3D space independently of the object surface can lead to deviations from the original geometry, even when a relatively good shape


(a) random timesteps (b) + annealing timesteps (c) + negative prompts (d) + consistent noise

Figure 15: Ablation on the proposed improvements.

VSD ISM CFD (ours)


Timestep schedule (sample t ∼ U(tmin , tmax ))
tmax = tmin False False True
tmax Abrupt decrease Linearly decrease (till t0 ) Linearly decrease (multi-stage)
tmin Fixed Fixed Linearly decrease (multi-stage)
Noise
Noise type Random Inversion Consistent
3D representation
Shape initialization Stable Diffusion(+VSD) point-e MVDream(+CFD)
Representation NeRF→Mesh point cloud→3DGS NeRF(→Mesh)
Uncond prompt
LoRA network True False False
Negative prompt False True True

Table 6: Comparison between pipelines of VSD, ISM and CFD.

is formed, as a highly consistent region may be located away from the surface. In contrast, our noise
design, which aligns with the object surface, avoids such issues.
Notably, the object surface can slowly change during the generation process, so the noise for the
same view in CFD is not strictly fixed even when γ = 0 in Eq. 11.

F GRADIENT VARIANCE

We compare the gradient variance of different methods during training. For convenience, we compute the scaled gradient variance from the Exponential Moving Average parameters v̂t, m̂t of the Adam optimizer. We report the scaled gradient variance σ on the parameters of the NeRF hash encoding, with 10 seeds for each of the noising methods. σ was calculated according to (where gt is the gradient):
m̂t ≈ E[gt],   v̂t ≈ E[gt²],
σ = √(sum(v̂t − m̂t²)) / √(sum(v̂t)) ≈ √(sum(Var(gt))) / √(sum(v̂t)).   (15)

We report the gradient variance during training for VSD (Wang et al., 2024a), SDS (Poole et al., 2023; Wang et al., 2023a), FSD (Yan et al., 2024), and our method in Tab. 8.
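A sketch of how σ in Eq. 15 can be read off the Adam optimizer state is shown below (PyTorch; the state-key names are those of the standard PyTorch Adam implementation, the function name and the simplified parameter handling are ours).

import torch

def scaled_gradient_variance(optimizer: torch.optim.Adam, params) -> float:
    # sigma from Eq. 15: sqrt(sum(v_hat - m_hat^2)) / sqrt(sum(v_hat)),
    # using Adam's bias-corrected first and second moment estimates.
    num, den = 0.0, 0.0
    beta1, beta2 = optimizer.param_groups[0]["betas"]
    for p in params:
        state = optimizer.state[p]
        step = float(state["step"])
        m_hat = state["exp_avg"] / (1 - beta1 ** step)     # ~ E[g_t]
        v_hat = state["exp_avg_sq"] / (1 - beta2 ** step)  # ~ E[g_t^2]
        num += (v_hat - m_hat ** 2).sum().item()
        den += v_hat.sum().item()
    return (num / den) ** 0.5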


SDS FSD CFD (ours, when γ = 0)


Timestep schedule Random Annealing Annealing
Same view noise Random Fixed Mostly fixed (surface-dependent)
Different views noise Independent Aligned on sphere Aligned on object surface

Table 7: Comparison between SDS, FSD, and CFD.

VSD SDS FSD CFD (ours)


σ (↓) 5.165± 0.458 4.670±0.066 4.580±0.081 4.521±0.090

Table 8: Scaled Gradient Variance. Our CFD has the lowest gradient variance.

G CLEAN FLOW SDE

G.1 BACKGROUND

Song et al. (Song et al., 2021b) presented an SDE that has the same marginal distribution pt(xt) as the forward diffusion process (Eq. 1). EDM (Karras et al., 2022) presented a more general form of this SDE, and the SDE corresponding to the forward process defined in Eq. 1 takes the following form:
d(x±/αt) = −σt ∇x± log pt(x±) d(σt/αt) ± βt σt ∇x± log pt(x±) (σt/αt) dt + √(2βt) (σt/αt) dwt   (16)
         = (d(σt/αt) ∓ βt (σt/αt) dt) · ϵϕ(x±, t, y) + √(2βt) (σt/αt) dwt,   (17)
with ϵϕ(x±, t, y) standing in for the score term −σt ∇x log pt(x),

where dwt is the standard Wiener process. If we set αt = 1 for all t ∈ [0, T], Eq. 17 becomes the same SDE as in EDM (Karras et al., 2022). The initial condition for the forward process is x+ ∼ pts(x+) at t = ts (ts is small enough but ts > 0 to avoid numerical issues), and for the reverse process, it is x− ∼ N(0, σT² I) at t = T (note that we also let αT be small but αT > 0 to avoid numerical issues).

G.2 CLEAN FLOW SDE

The clean flow SDE takes the following form:


dx̂c± = (d(σt/αt) ∓ (σt/αt) βt dt) · (ϵϕ(αt x̂c± + σt ϵ̃±, t, y) − ϵ̃±),
dϵ̃± = ∓ϵ̃± βt dt + √(2βt) dwt,   (18)

where dwt is the standard Wiener process. For the forward process, the initial condition at t = ts is
x̂c+ ∼ p0 (x+ ), ϵ̃+ ∼ N (0, I), and x̂c+ and ϵ̃+ are independent. For the reverse process, the initial
condition at t = T is x̂c− = 0 and ϵ̃− ∼ N (0, I).
Proposition 1 (Clean flow SDE is equivalent to diffusion SDE). In Eq. 18, if we define a new
variable x′± according to

x′± = αt x̂c± + σt ϵ̃± , (19)


then x′± and x± in Eq. 17 have the same law (probability distribution) for all t ∈ [ts , T ]. i.e. Eq. 18
and Eq. 17 are equivalent.

proof. We prove the equivalence by showing that the initial conditions and dynamics for x′± and
x± are identical.
Initial conditions. For the forward process of Eq. 18 at t = ts , x′+ = αts x̂c+ + σts ϵ̃+ . Thus,
x′± ∼ pts (x′± ) according to the definition of a forward diffusion process (Eq. 1). For the reverse
process of Eq. 18 at t = T , x′− = αT · 0 + σT ϵ̃− = σT ϵ̃− . So x′− ∼ N (0, σT2 I).


Dynamics. The dynamics of x′± can be derived according to:

d(x′±/αt) = d(x̂c± + (σt/αt) ϵ̃±)
          = dx̂c± + d(σt/αt) ϵ̃± + (σt/αt) dϵ̃±
          = (d(σt/αt) ∓ (σt/αt) βt dt) · (ϵϕ(αt x̂c± + σt ϵ̃±, t, y) − ϵ̃±) + d(σt/αt) ϵ̃± + (σt/αt) dϵ̃±
          = (d(σt/αt) ∓ (σt/αt) βt dt) · (ϵϕ(x′±, t, y) − ϵ̃±) + d(σt/αt) ϵ̃± + (σt/αt) dϵ̃±
          = (d(σt/αt) ∓ (σt/αt) βt dt) · ϵϕ(· · · ) − d(σt/αt) ϵ̃± ± (σt/αt) βt ϵ̃± dt + d(σt/αt) ϵ̃± + (σt/αt) dϵ̃±
          = (d(σt/αt) ∓ (σt/αt) βt dt) · ϵϕ(· · · ) ± (σt/αt) βt ϵ̃± dt + (σt/αt) dϵ̃±
          = (d(σt/αt) ∓ (σt/αt) βt dt) · ϵϕ(· · · ) ± (σt/αt) βt ϵ̃± dt ∓ (σt/αt) ϵ̃± βt dt + √(2βt) (σt/αt) dwt
          = (d(σt/αt) ∓ (σt/αt) βt dt) · ϵϕ(x′±, t, y) + √(2βt) (σt/αt) dwt.   (20)

So x′± and x± follow the same dynamics.

We present a stochastic sampler in Algo. 3 that is equivalent to Algorithm 2 in EDM (Karras et al., 2022), to show a practical implementation of Eq. 18 for sampling.

G.3 PROPERTIES OF x̂c±

G.3.1 x̂ct ARE CLEAN IMAGES FOR ALL t ∈ [ts, T]

Lemma 1 (Sample predictions are non-noisy images). The sample prediction of the diffusion model
x̂gt_t ≜ (xt − σt ϵϕ(xt, t, y))/αt   (21)
is a weighted average of images in the target distribution p0(x0):
x̂gt_t = E[x0|xt].   (22)
Thus, x̂gt_t are non-noisy images. Furthermore,
ϵϕ(xt, t, y) = (xt − αt E[x0|xt])/σt.   (23)


proof.
x̂gt_t = (xt − σt ϵϕ(xt, t, y))/αt
     = (1/αt) (xt + σt² ∇xt log pt(xt))
     = (1/αt) (xt + (σt²/pt(xt)) ∇xt pt(xt))
     = (1/αt) (xt + (σt²/pt(xt)) ∇xt ∫ p(xt|x0) p0(x0) dx0)
     = (1/αt) (xt + (σt²/pt(xt)) ∫ p(xt|x0) ∇xt log p(xt|x0) p0(x0) dx0)
     = (1/αt) (xt + (σt²/pt(xt)) ∫ p(xt|x0) ∇xt(−(xt − αt x0)²/(2σt²)) p0(x0) dx0)
     = (1/αt) (xt − (1/pt(xt)) ∫ p(xt|x0) (xt − αt x0) p0(x0) dx0)
     = (1/αt) (xt − (∫ p(xt|x0) p0(x0) dx0 / pt(xt)) xt + ∫ αt x0 (p(xt|x0) p0(x0)/pt(xt)) dx0)
     = (1/αt) (xt − xt + ∫ αt x0 p(x0|xt) dx0)
     = ∫ x0 p(x0|xt) dx0
     = E[x0|xt].   (24)
Thus,
ϵϕ(xt, t, y) = (xt − αt x̂gt_t)/σt = (xt − αt E[x0|xt])/σt.   (25)

Algorithm 3 An SDE sampler that is equivalent to Algorithm 2 in EDM (Karras et al., 2022)
1: Input: Diffusion model (sample prediction) Dϕ, ti∈{0,··· ,N}, γi∈{0,··· ,N−1}, Snoise.
2: Output: x̂cN.
3: Initialize ϵ̃0 ∼ N(0, I), x̂c0 = 0
4: for i ∈ {0, · · · , N − 1} do
5:    Sample ϵi ∼ N(0, Snoise² I)
6:    t̂i ← ti + γi ti
7:    ϵ̃i+1 ← (ti/t̂i) ϵ̃i + √(1 − (ti/t̂i)²) ϵi
8:    di ← (x̂ci − Dϕ(x̂ci + t̂i ϵ̃i+1, t̂i))/t̂i
9:    x̂ci+1 ← x̂ci + (ti+1 − t̂i) di
10:   if ti+1 ̸= 0 then
11:      d′i ← (x̂ci+1 − Dϕ(x̂ci+1 + ti+1 ϵ̃i+1, ti+1))/ti+1
12:      x̂ci+1 ← x̂ci + (ti+1 − t̂i)((1/2) di + (1/2) d′i)    ▷ Apply 2nd-order correction
13:   end if
14: end for
15: Return x̂cN
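A compact Python sketch of Algorithm 3 is given below, assuming an EDM-style sample-prediction model D(x, t) with σ(t) = t and α(t) = 1 (as in Karras et al., 2022); this is an illustrative re-implementation for reference, not the released sampler.

import torch

def clean_flow_sde_sample(D, shape, ts, gammas, s_noise=1.0):
    # ts: decreasing timesteps t_0 > t_1 > ... > t_N = 0; gammas: gamma_i >= 0.
    eps_tilde = torch.randn(shape)        # epsilon_tilde_0 ~ N(0, I)
    x_c = torch.zeros(shape)              # clean variable starts at 0
    for i in range(len(ts) - 1):
        t_i, t_next = ts[i], ts[i + 1]
        t_hat = t_i + gammas[i] * t_i
        eps = s_noise * torch.randn(shape)
        # Renew the noise variable so that eps_tilde stays N(0, I) (line 7).
        eps_tilde = (t_i / t_hat) * eps_tilde + (1 - (t_i / t_hat) ** 2) ** 0.5 * eps
        d = (x_c - D(x_c + t_hat * eps_tilde, t_hat)) / t_hat        # line 8
        x_next = x_c + (t_next - t_hat) * d                          # Euler step (line 9)
        if t_next != 0:
            # Heun's 2nd-order correction (lines 11-12).
            d2 = (x_next - D(x_next + t_next * eps_tilde, t_next)) / t_next
            x_next = x_c + (t_next - t_hat) * 0.5 * (d + d2)
        x_c = x_next
    return x_c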

Proposition 2 (x̂c± are non-noisy images). x̂c± in Eq. 18 are non-noisy images for all t ∈ [ts , T ].

proof. Since the initial conditions of x̂c± (x̂c− = 0 for the reverse process and x̂c+ ∼ p0(x0) for the forward process) imply that x̂c± is initialized as a non-noisy image, we only need to show that the dynamics of x̂c± do not introduce Gaussian noise into x̂c±.


The dynamics of x̂c± can be reformulated as:

dx̂c± = (d(σt/αt) ∓ (σt/αt) βt dt) · (ϵϕ(αt x̂c± + σt ϵ̃±, t, y) − ϵ̃±)
     = (d(σt/αt) ∓ (σt/αt) βt dt) · (αt x̂c± + σt ϵ̃± − αt E[x0|αt x̂c± + σt ϵ̃±] − σt ϵ̃±)/σt   (26)
     = (d log(σt/αt) ∓ βt dt) · (x̂c± − E[x0|αt x̂c± + σt ϵ̃±]).

As Eq. 26 shows, x̂c± always moves towards the non-noisy sample prediction x̂gt_t = E[x0|xt] for all t ∈ [ts, T], so x̂c± will be non-noisy for all t ∈ [ts, T].

We also visualize x̂c± at random timesteps t ∈ [0, T] of Stable Diffusion (Rombach et al., 2022) sampling processes in Appx. Fig. 17 to show that they are visually clean (non-noisy). We use the term clean variable to refer to x̂c± in this work since it is always non-noisy.

G.3.2 INITIALIZATION OF x̂ct


The initial condition of the reverse-time clean flow SDE (Eq. 18) is given by x̂c− = 0 and ϵ̃− ∼ N(0, I). This is consistent with a typical initialization of NeRF, where the whole scene is rendered as uniform grey. When we set x̂c− = 0 as the initial condition for the clean flow SDE, it corresponds to the initial condition x− ∼ N(0, σT² I) (Karras et al., 2022) in the diffusion SDE (Eq. 17). However, since we set a small nonzero αT at the beginning, the strict initial condition of the diffusion SDE should be pT(xT), which is slightly different from N(0, σT² I). In this case, we should set x̂c− ∼ p0(x0) in the clean flow SDE to make the initial conditions of the two SDEs identical. Prior works usually ignore the small difference between pT(xT) and N(0, σT² I) and start from pure noise when sampling (Lin et al., 2024). From our practical observation, given different initial values x̂c− ̸= 0 but the same ϵ̃−, the clean flow SDE yields almost identical outputs (given the same seeds), which implies that the endpoints of x̂ct are not sensitive to its initialization. So we choose x̂c− = 0 as the initial condition in this work.

G.3.3 ENDPOINTS OF x̂ct

At the end of the reverse-time clean flow SDE, x̂c− = x0 ∼ p0(x0). So x̂ct also ends as a sample from the target distribution p0(x0), just as x0 does in the reverse-time diffusion SDE.

G.4 PROPERTIES OF ϵ̃±

ϵ̃± can be seen as the “pure noise” part of the clean flow SDE (Eq. 18). Notably, the evolution of ϵ̃± does not depend on x̂ct and has a closed-form solution. The dynamics of ϵ̃± are given by
dϵ̃± = ∓ϵ̃± βt dt + √(2βt) dwt.   (27)
The initial condition for ϵ̃± in both the forward and reverse processes is ϵ̃± ∼ N(0, I).

G.4.1 CLOSED-FORM SOLUTIONS

For the forward process,
d(e^{∫_0^t βs ds} ϵ̃+) = e^{∫_0^t βs ds} dϵ̃+ + ϵ̃+ βt e^{∫_0^t βs ds} dt
                    = e^{∫_0^t βs ds} √(2βt) dwt − ϵ̃+ βt e^{∫_0^t βs ds} dt + ϵ̃+ βt e^{∫_0^t βs ds} dt   (28)
                    = e^{∫_0^t βs ds} √(2βt) dwt.
Integrating both sides of Eq. 28, we have
e^{∫_0^t βs ds} ϵ̃+ − ϵ̃0 = ∫_0^t √(2βs) e^{∫_0^s βr dr} dws.   (29)


Thus, we obtain the solution of ϵ̃+:
ϵ̃+ = e^{−∫_0^t βs ds} · ϵ̃0 + e^{−∫_0^t βs ds} ∫_0^t √(2βs) e^{∫_0^s βr dr} dws.   (30)
Similarly, we can obtain the solution of ϵ̃−:
ϵ̃− = e^{−∫_t^T βs ds} · ϵ̃T + e^{−∫_t^T βs ds} ∫_T^t √(2βs) e^{∫_s^T βr dr} dw̄s.   (31)
Specifically, we can derive a closed-form formulation to compute ϵ̃+(t) given ϵ̃+(t′) for t′ < t from Eq. 30, which takes the following form:
ϵ̃+(t) = e^{−∫_{t′}^t βs ds} ϵ̃+(t′) + √(1 − e^{−2∫_{t′}^t βs ds}) ϵ   (32)
       = √(1 − γ) ϵ̃+(t′) + √γ ϵ,   (33)
where
γ = 1 − e^{−2∫_{t′}^t βs ds},   ϵ ∼ N(0, I).   (34)
For ϵ̃−(t) and t′ > t,
ϵ̃−(t) = e^{−∫_t^{t′} βs ds} ϵ̃−(t′) + √(1 − e^{−2∫_t^{t′} βs ds}) ϵ   (35)
       = √(1 − γ) ϵ̃−(t′) + √γ ϵ,   (36)
where
γ = 1 − e^{−2∫_t^{t′} βs ds},   ϵ ∼ N(0, I).   (37)
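In discrete form, this update is exactly the noise-injection step used in Algorithm 2 (lines 11–12); a one-line Python sketch (function name ours):

import torch

def inject_noise(eps_tilde: torch.Tensor, gamma: float) -> torch.Tensor:
    # Discrete form of Eqs. 33/36: the marginal of eps_tilde stays N(0, I).
    return (1 - gamma) ** 0.5 * eps_tilde + gamma ** 0.5 * torch.randn_like(eps_tilde)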
G.4.2 SPECIAL CASE SOLUTION OF DDPM

DDPM (Ho et al., 2020) corresponds to a special choice of βt, where βt = (d(σt/αt)/dt) / (σt/αt) (Karras et al., 2022). We present the solution of Eq. 35 when βt corresponds to the choice of DDPM in the following:

ϵ̃−(t) = ((σt/αt)/(σT/αT)) · ϵ̃T + √(1 − ((σt/αt)/(σT/αT))²) ϵ.   (38)
Assuming a designed schedule such that a k-step DDPM has a constant γ between two consecutive steps as in Eq. 11, we get ϵ̃−(k) = (1 − γ)^{k/2} ϵ̃−(0) + (1 − (1 − γ)^k)^{1/2} ϵ. Thus, we obtain the value of γ in Eq. 11 that corresponds to DDPM:
γ = 1 − ((σt/αt)/(σT/αT))^{2/k} ≈ (2/k) · log((σT/αT)/(σt/αt)).   (39)
Putting a typical parameter configuration in our experiments with Stable Diffusion (DDPM sampler)
into Eq. 39, where t ≈ 0.212, σt /αt ≈ 0.60, T ≈ 0.974, σT /αT ≈ 12.59 and k = 25000, we get
γ ≈ 0.00024.
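As a quick numerical check of Eq. 39 with the values listed above (a back-of-the-envelope script, not part of the training pipeline):

import math

sig_alpha_t = 0.60    # sigma_t / alpha_t at t ~ 0.212
sig_alpha_T = 12.59   # sigma_T / alpha_T at T ~ 0.974
k = 25000             # number of optimization steps

gamma_exact = 1 - (sig_alpha_t / sig_alpha_T) ** (2 / k)
gamma_approx = 2 * math.log(sig_alpha_T / sig_alpha_t) / k
print(gamma_exact, gamma_approx)  # both ~ 2.4e-4, i.e. gamma ~ 0.00024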

G.4.3 VARIANCE OF ϵ̃±


All vector components of ϵ̃± are of unit variance for all t ∈ [0, T ]:
Var(ϵ̃+,i) = e^{−2∫_0^t βs ds} + e^{−2∫_0^t βs ds} ∫_0^t 2βs e^{2∫_0^s βr dr} ds
          = e^{−2∫_0^t βs ds} (1 + ∫_0^t 2βs e^{2∫_0^s βr dr} ds)
          = e^{−2∫_0^t βs ds} (1 + ∫_0^t d e^{2∫_0^s βr dr})   (40)
          = e^{−2∫_0^t βs ds} (1 + e^{2∫_0^t βr dr} − 1)
          = 1,


Figure 16: Visualization of noisy variable xt . Figure 17: Visualization of clean variable x̂ct .

Var(ϵ̃−,i) = e^{−2∫_t^T βs ds} + e^{−2∫_t^T βs ds} ∫_t^T 2βs e^{2∫_s^T βr dr} ds
          = e^{−2∫_t^T βs ds} (1 + ∫_t^T 2βs e^{2∫_s^T βr dr} ds)
          = e^{−2∫_t^T βs ds} (1 − ∫_t^T d e^{2∫_s^T βr dr})   (41)
          = e^{−2∫_t^T βs ds} (1 − 1 + e^{2∫_t^T βr dr})
          = 1.

G.5 CLEAN FLOW ODE

When we set βt = 0 in the clean flow SDE (Eq. 18), it becomes deterministic and reduces to an ODE (Eq. 8). Furthermore, dϵ̃± = 0 and thus ϵ̃± becomes a constant ϵ̃. This ODE is the same ODE presented in FSD (Yan et al., 2024). It is also equivalent to the signal-ODE presented in BOOT (Gu et al., 2023) when the diffusion model is changed to sample prediction.

H DISCUSSION ON THE CHOICE OF THE VARIABLE SPACE

H.1 GROUND-TRUTH VARIABLE

Apart from the clean variable x̂ct, FSD (Yan et al., 2024) also defined another variable space that is visually clean, which is the ground-truth variable x̂gt_t. x̂gt_t is defined by
x̂gt_t ≜ (xt − σt ϵϕ(xt, t, y))/αt.   (42)
x̂gt_t is also known as the “sample prediction” of the diffusion model. The ODE on x̂gt_t is given by:
dx̂gt_t = −(σt/αt) · dϵϕ(xt, t, y).   (43)
Concurrent work SDI (Lukoianov et al., 2024) shares an insight similar to ours by also using rendered images to replace the “non-noisy variables” to guide 3D generation. The difference between SDI (Lukoianov et al., 2024) and our method is that SDI replaces the ground-truth variable x̂gt_t with the rendered image gθ(c), whereas we replace the clean variable x̂ct with gθ(c).
Theoretically speaking, if the goal is only to solve the OOD problem when using the image PF-ODE as guidance for 3D generation, we think it is reasonable to replace either x̂gt_t or x̂ct with rendered images,


since they are both non-noisy throughout the diffusion process (Lemma 1 and Proposition 2). However, it is difficult to compute the update rule in Eq. 43 exactly, since xt is required on the right-hand side of Eq. 43. In order to recover xt given x̂gt_t, SDI needs to solve a fixed-point equation, which is hard to solve (Lukoianov et al., 2024). In practice, SDI uses a loss gradient similar to ISM, and interprets DDIM inversion as an approximate solution of the fixed-point equation. Difficulties also appear in works that attempt to apply guidance on the ground-truth variable x̂gt_t for conditional image generation, as seen in UGD (Bansal et al., 2023) and FreeDoM (Yu et al., 2023). In contrast, we can compute the evolution of x̂c± exactly according to Eq. 18 without the need to solve a fixed-point equation.
Additionally, another recent work, ISM (Liang et al., 2023), can also be viewed as replacing the ground-truth variable x̂gt_t, as discussed in SDI (Lukoianov et al., 2024), since the main difference between the ISM and SDI losses is whether to apply the text condition when computing DDIM inversion.

H.2 COMPARISON WITH CONSISTENT3D

Consistent3D (Wu et al., 2024b) introduced the Consistency Distillation Sampling (CDS) loss by
modifying the consistency training loss within the score distillation framework. Their insights into
the connection between SDS and Diffusion SDE align closely with ours. However, their CDS loss
stems from the consistency model training loss, similar to how SDS is derived from the diffusion
model training loss, disregarding the Jacobian term (Poole et al., 2023). In contrast, our CFD loss
directly follows the principles of diffusion model sampling through the ODE/SDE formulation. The
image rendered from a specific camera view corresponds directly to a point on the ODE/SDE tra-
jectory, resulting in distinct final training losses that differ from their CDS loss. Furthermore, our
approach integrates a multiview consistent noising strategy, enhancing both consistency and robust-
ness.
From a theoretical perspective, our work provides a more rigorous mathematical connection between
score distillation and diffusion sampling compared with Consistent3D. Specifically: (i) While Con-
sistent3D suggests that SDS can be interpreted as a form of SDE sampling, their proof relies on
approximating the diffusion process by assuming optimal training at each step, an assumption that
may not hold in practical experiments. In contrast, our approach does not rely on optimal training at
every step. Additionally, our theory (Eq. 10 in our paper) covers a broader range of diffusion SDEs,
including EDM (Karras et al., 2022) and PF-ODE as a special case. (ii) The CDS approach lacks
a direct correspondence to a probability flow ODE trajectory, while our interpretation establishes a
direct mapping between rendered images and points on the ODE/SDE trajectory.

H.3 PROPERTIES OF CLEAN VARIABLE

Since the clean flow ODE is a special case of the clean flow SDE with βt = 0, x̂ct in the ODE also maintains the “clean properties” discussed in Appx. G.3.

