
Text to 3-D Images Generation Using Interval Score

Matching for their High Fidelity

By

Mr. XYZ

Table of Contents
Abstract......................................................................................................................................................3
Chapter 1:..................................................................................................................................................4
Introduction...............................................................................................................................................4
1.1 Overview.....................................................................................................................................4
1.2 Rationale of the Research..........................................................................................................4
1.3 Objectives of the Paper..............................................................................................................5
Chapter 2:..................................................................................................................................................6
Related Works...........................................................................................................................................6
2.1 Text-to-3D Generation....................................................................................................................6
2.2 Representations of Differentiable 3D.............................................................................................7
2.3 Models of Diffusion..........................................................................................................................7
Chapter 3....................................................................................................................................................8
Research Methodology..............................................................................................................................8
3.1 Revisiting the SDS...........................................................................................................................8
3.2 SDS Analysis....................................................................................................................................9
3.3 Interval Score Matching................................................................................................................10
3.3.1 DDIM Inversion......................................................................................................................10
3.3.2 Interval Score Matching.........................................................................................................11
3.4 The Advanced Generation Pipeline..............................................................................................15
3.4.1 3D Gaussian Splatting............................................................................................................15
3.4.2 Initialization............................................................................................................................16

Abstract

Recent developments in text-to-3D generation represent an important turning point for
generative models, opening up new avenues for the creation of creative 3D assets for a variety
of real-world applications. Though recent methods have shown promise, they frequently fail to
produce 3D models that are accurate and detailed. This issue is particularly common because
many techniques rely on Score Distillation Sampling (SDS). This research points out a
significant flaw in SDS: it provides the 3D model with inconsistent, poor-quality update
directions, which results in an over-smoothing effect. We propose a novel strategy termed
Interval Score Matching (ISM) to address this. To combat over-smoothing, ISM uses
interval-based score matching in conjunction with deterministic diffusion trajectories.
Additionally, we incorporate 3D Gaussian Splatting into our text-to-3D workflow. Extensive
experiments reveal that our model performs significantly better than the state of the art in
terms of quality and training efficiency.

Chapter 1:
Introduction

1.1 Overview
In the digital age, digital 3D assets have become essential for viewing,
understanding, and interacting with intricate objects and environments that mimic our
real-life experiences. Their influence is felt in many fields, such as virtual and augmented
reality, gaming, architecture, animation, and retail, as well as online conferencing and
education. The widespread use of 3D technologies presents a major difficulty in that producing
high-quality 3D material requires a lot of time, effort, and specialized knowledge [1].

1.2 Rationale of the Research


This encourages the quick advancement of methods for creating 3D content. Text-to-3D
generation is notable among them due to its capacity to generate creative 3D models from text
descriptions alone. This is accomplished by supervising the training of a neural parameterized
3D model with a pretrained text-to-image diffusion model serving as a strong image prior,
which enables the rendering of 3D-consistent images that correspond with the text [2]. The
application of Score Distillation Sampling (SDS) is the key foundation for this capability. SDS
serves as the fundamental process that lifts the 2D outputs of diffusion models into 3D,
allowing 3D models to be trained without the use of images.

1.3 Objectives of the Paper


The goal of this work is to overcome those constraints. We demonstrate that there are
two sources of inadequate pseudo-GTs. First, these pseudo-GTs are the one-step reconstruction
results of the diffusion models, which carry large reconstruction errors. Furthermore, these
pseudo-GTs are semantically variable due to the underlying randomness in the diffusion
trajectory, which results in an averaging effect and over-smoothing. We propose a novel
strategy termed Interval Score Matching (ISM) to deal with these problems. ISM enhances SDS
through two efficient techniques. First, by using DDIM inversion, ISM creates an invertible
diffusion trajectory and thereby reduces the averaging effect caused by pseudo-GT
inconsistency. Second, instead of matching the pseudo-GTs with images rendered by the 3D
model, ISM matches between two interval steps in the diffusion trajectory, avoiding the
one-step reconstruction that causes substantial reconstruction error.
We demonstrate with highly realistic and detailed results that our ISM loss
routinely beats SDS by a significant margin. Lastly, we show that, compared with
state-of-the-art methods such as Magic3D, Fantasia3D, and ProlificDreamer, our model obtains
better results by adopting a more expressive 3D representation, namely 3D Gaussian Splatting.
Interestingly, our model does not require the multi-stage training that these competitors do.
This keeps our training pipeline straightforward while simultaneously lowering our training
costs. Our contributions can be summarized as follows.
• We offer a thorough examination of Score Distillation Sampling (SDS), the essential
element in text-to-3D generation, and pinpoint its main drawback: producing inconsistent and
subpar pseudo-GTs. This explains the over-smoothing phenomenon present in numerous
methodologies.
• We propose Interval Score Matching (ISM) as a solution to the shortcomings of
SDS. Thanks to invertible diffusion trajectories and interval-based matching, ISM performs
substantially better than SDS, with results that are highly realistic and detailed.
• Our model achieves state-of-the-art performance by integrating with 3D Gaussian
Splatting, outperforming previous approaches with lower training costs.

Chapter 2:
Related Works

2.1 Text-to-3D Generation


Text-to-3D creation can be broadly grouped into one line of work. As a trailblazer, [3]
accomplished text-to-3D distillation by training a NeRF [4] under CLIP guidance. However,
because of the inadequate supervision from the CLIP loss, the outcomes are not good. [5]
presents Score Distillation Sampling (SDS) as a way to extract 3D assets from pretrained 2D
text-to-image diffusion models. By seeking particular modes in a text-guided diffusion model,
SDS makes 3D distillation feasible and enables training a 3D model with the knowledge of the
2D diffusion model. This soon inspired a large number of follow-up works [6]. These efforts
enhance text-to-3D generation in a number of ways. For instance, some of them [7] modify NeRF
or adopt other sophisticated 3D representations to enhance the visual quality of text-to-3D
distillation. While [8] suggests fine-tuning the pretrained diffusion models to make them
3D-aware, [9] offers a novel technique that adds a 3D diffusion model for joint optimization.
A few others [9] concentrate on solving the Janus problem. All these techniques, however,
mostly depend on Score Distillation Sampling. Despite its promise, SDS has demonstrated
over-smoothing effects in numerous studies [10]. Furthermore, it requires coupling with a
large conditional guidance scale, which can result in over-saturation. Additionally, a few
relatively recent works [11] aim to enhance SDS. VSD is proposed by [12], which treats the 3D
representation as a distribution. Iterative methods are suggested by [13] to estimate a better
sampling direction. Even though there has been a noticeable improvement, these works still
require a much longer training phase. Two contemporaneous works, CSD [14] and [15], examine
the components of SDS in order to derive empirical fixes that enhance the original SDS. Our
research is fundamentally different in that it offers a systematic examination of the
inconsistent and poor pseudo-ground-truths in SDS. Additionally, it produces better results
without adding computational load by introducing Interval Score Matching.

2.2 Representations of Differentiable 3D


An essential component of text-guided 3D synthesis is the differentiable 3D representation.
Given a trainable parameter θ, a differentiable rendering equation g(θ, c) is used to produce
an image of the 3D representation at camera pose c. Because the procedure is differentiable,
we can use backpropagation to train the 3D representation to fit our condition. A variety of
representations have previously been introduced for text-to-3D generation [16]; [17] is the
most commonly used among them for text-to-3D generation tasks.

[18] finds it difficult to generate high-resolution images during distillation that match the
resolution of the diffusion model, due to the laborious rendering of implicit representations.
As a result, this restriction produces suboptimal outcomes. To solve this, complex 3D assets
are currently created in this field using textured meshes [18], which are renowned for their
efficient explicit rendering [19], improving performance. Concurrently, 3D Gaussian Splatting
[20], another powerful explicit representation, exhibits exceptional efficacy in reconstruction
tasks. In this work, we adopt 3D Gaussian Splatting [21] as the 3D representation of our
framework.

2.3 Models of Diffusion


The diffusion model is another essential part of text-to-3D creation; it supervises the 3D
model. Here, we give a quick overview and introduce a few notations. Because of its extensive
capabilities, the Denoising Diffusion Probabilistic Model (DDPM) [22] has been widely used for
text-guided 2D image generation. DDPM models p(x_t | x_{t−1}) as a diffusion process that
follows a predefined schedule β_t on timestep t, meaning that:

p(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)    (1)

And the posterior p_ϕ(x_{t−1} | x_t) is modelled with a neural network ϕ, where:

p_ϕ(x_{t−1} | x_t) = N(x_{t−1}; √ᾱ_{t−1} μ_ϕ(x_t), (1 − ᾱ_{t−1}) Σ_ϕ(x_t))    (2)

where ᾱ_t := ∏_{s=1}^{t} (1 − β_s), and μ_ϕ(x_t), Σ_ϕ(x_t) denote the predicted mean and
variance given x_t, respectively.
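As a concrete illustration of Eqs. (1)–(2), the forward process admits the closed-form marginal q(x_t | x_0) = N(√ᾱ_t x_0, (1 − ᾱ_t) I). Below is a minimal NumPy sketch; the linear β schedule and T = 1000 are common illustrative defaults, not values specified in this proposal:

```python
import numpy as np

T = 1000                                      # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)            # linear beta_t schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)          # bar(alpha)_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed-form marginal of Eq. (1): x_t = sqrt(bar_a_t) x0 + sqrt(1 - bar_a_t) eps."""
    a = alphas_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x0 = np.zeros((8, 8))                                 # a dummy "clean image"
xt = q_sample(x0, t=999, eps=np.random.randn(8, 8))   # nearly pure noise at t = T-1
```

Since ᾱ_t decreases monotonically toward zero, x_t interpolates from the clean image at small t to nearly pure Gaussian noise at t = T.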

Chapter 3
Research Methodology

3.1 Revisiting the SDS


As discussed in Sec. 2, SDS [21] pioneers text-to-3D generation by seeking modes in the DDPM
latent space that correspond to the conditional posterior. With x_0 := g(θ, c) representing a
2D view rendered by θ, the posterior of the noisy latent x_t can be expressed as follows:

q^θ(x_t) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I)    (3)

Meanwhile, SDS models the conditional posterior p_ϕ(x_t | y) using pretrained DDPMs. SDS then
seeks modes of such conditional posteriors in order to distill the 3D representation θ. This
can be done by minimizing the following KL divergence for all t:

min_{θ∈Θ} L_SDS(θ) := E_{t,c}[ ω(t) D_KL(q^θ(x_t) ∥ p_ϕ(x_t | y)) ]    (4)

Moreover, following the weighted denoising score matching objective used for DDPM training,
Eq. (4) is reparameterized as:

min_{θ∈Θ} L_SDS(θ) := E_{t,ϵ,c}[ ω(t) ||ϵ_ϕ(x_t, t, y) − ϵ||²₂ ]    (5)

where ϵ ∼ N(0, I) is the ground-truth denoising direction of x_t at timestep t, and
ϵ_ϕ(x_t, t, y) is the predicted denoising direction given condition y. Ignoring the UNet
Jacobian [23], the gradient of the SDS loss on θ is given by:

∇_θ L_SDS(θ) ≈ E_{t,ϵ,c}[ ω(t) (ϵ_ϕ(x_t, t, y) − ϵ) ∂g(θ, c)/∂θ ]    (6)

where (ϵ_ϕ(x_t, t, y) − ϵ) is the SDS update direction [5].
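The SDS update direction of Eq. (6) can be sketched in a few lines. Here `eps_phi` is a hypothetical stand-in for the pretrained DDPM's noise predictor (in practice a UNet conditioned on y), and the weighting ω(t) = 1 − ᾱ_t is one common choice, not the proposal's prescription:

```python
import numpy as np

def eps_phi(xt, t, y):
    """Hypothetical stand-in for the pretrained noise predictor eps_phi(x_t, t, y)."""
    return 0.5 * xt

def sds_update_direction(x0, t, alphas_bar, y, rng):
    """Compute omega(t) * (eps_phi(x_t, t, y) - eps), the SDS direction of Eq. (6).

    In a full pipeline this is back-propagated to theta through the renderer
    Jacobian dg(theta, c)/dtheta, which is omitted here.
    """
    eps = rng.standard_normal(x0.shape)             # ground-truth noise
    a = alphas_bar[t]
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # perturb x0 to x_t (Eq. 3)
    omega = 1.0 - a                                 # a common weighting choice (assumed)
    return omega * (eps_phi(xt, t, y) - eps)

alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
g = sds_update_direction(np.zeros((8, 8)), 500, alphas_bar,
                         y=None, rng=np.random.default_rng(0))
```

Note that the direction depends on the freshly sampled noise ϵ, which is exactly the randomness the analysis in the next section examines.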

3.2 SDS Analysis
To lay a clearer foundation for the upcoming discussion, we denote γ(t) = √(1 − ᾱ_t) / √ᾱ_t
and equivalently transform Eq. (5) into the following alternative form:

L_SDS(θ) := E_{t,ϵ,c}[ (ω(t)/γ(t)) ||γ(t)(ϵ_ϕ(x_t, t, y) − ϵ)||²₂ ]
          = E_{t,ϵ,c}[ (ω(t)/γ(t)) ||x_0 − x̂_0^t||²₂ ]    (7)

where x_t ∼ q^θ(x_t) and x̂_0^t = (x_t − √(1 − ᾱ_t) ϵ_ϕ(x_t, t, y)) / √ᾱ_t. Consequently, we
can also rewrite the gradient of the SDS loss as:

∇_θ L_SDS(θ) = E_{t,ϵ,c}[ (ω(t)/γ(t)) (x_0 − x̂_0^t) ∂g(θ, c)/∂θ ]    (8)

This means that the SDS objective can be thought of as matching the 3D model's view x_0 with
the pseudo-GT x̂_0^t that the DDPM estimates in a single step from x_t. However, we have found
that this distillation paradigm overlooks several important properties of DDPMs. Figure 2
illustrates how the pretrained DDPM predicts feature-inconsistent pseudo-GTs, which are
occasionally of poor quality for distillation. Under such unfavorable conditions, all update
directions provided by Eq. (8) would nevertheless be applied to θ, which unavoidably results
in over-smoothed outcomes. We trace the causes of these phenomena to two main factors.
First, it is crucial to understand a fundamental concept of SDS: given the input view x_0, it
creates pseudo-GTs with a 2D DDPM and subsequently utilizes these pseudo-GTs to optimize x_0.
SDS does this by first perturbing x_0 to x_t with random noise and then estimating x̂_0^t as
the pseudo-GT, as shown by Eq. (8). However, we observe that the DDPM is highly sensitive to
its input: even small changes in x_t have a large impact on the pseudo-GT's characteristics.
Such fluctuations are unavoidable during distillation, as they may be caused both by the
randomness in the camera pose of x_0 and by the randomness in the noise component of x_t.
Ultimately, optimizing x_0 towards inconsistent pseudo-GTs produces feature-averaged results,
as the final column of Figure 2 illustrates.
Second, Eq. (8) implies that SDS produces its pseudo-GTs with a single-step prediction for
every t, ignoring the fact that single-step DDPM predictions are typically unable to generate
high-quality results. Such single-step pseudo-GTs are occasionally hazy or lack detail, as the
middle columns of Fig. 2 show, which clearly impedes the distillation process. As a result, we
believe the SDS objective may be suboptimal for distilling 3D assets. Motivated by these
findings, we seek to resolve the aforementioned problems in order to improve the outcomes.
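The single-step pseudo-GT criticized above is cheap to write down, which also makes its input sensitivity easy to demonstrate. A sketch under the same notation, with a hypothetical `eps_phi` standing in for the pretrained predictor:

```python
import numpy as np

def eps_phi(xt, t, y):
    """Hypothetical noise predictor."""
    return 0.5 * xt

def pseudo_gt_one_step(xt, t, alphas_bar, y):
    """One-step estimate of Eq. (7): x_hat_0^t = (x_t - sqrt(1 - bar_a_t) eps_phi) / sqrt(bar_a_t)."""
    a = alphas_bar[t]
    return (xt - np.sqrt(1.0 - a) * eps_phi(xt, t, y)) / np.sqrt(a)

# Sensitivity illustration: two slightly different noisy latents x_t yield
# noticeably different pseudo-GTs, because dividing by sqrt(bar_a_t) amplifies
# input perturbations at large t. This mirrors the inconsistency discussed above.
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
xt_a = rng.standard_normal((8, 8))
xt_b = xt_a + 0.01 * rng.standard_normal((8, 8))   # tiny perturbation of x_t
gap = np.abs(pseudo_gt_one_step(xt_a, 800, alphas_bar, None)
             - pseudo_gt_one_step(xt_b, 800, alphas_bar, None)).max()
```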

3.3 Interval Score Matching
It should be noted that the previously described issues stem from the inconsistent and
occasionally low-quality pseudo-ground-truth x̂_0^t that is matched with x_0 = g(θ, c). In
this section, we offer a substitute for SDS that considerably reduces these issues.
Our main premise has two parts. First, we aim to obtain more consistent pseudo-GTs during
distillation, regardless of the randomness in noise and camera pose. Second, we aim to
produce pseudo-GTs of high visual quality.

3.3.1 DDIM Inversion


As stated above, our goal is to generate more consistent pseudo-GTs that are in line with
x_0. Therefore, instead of generating x_t stochastically with Eq. (3), we use DDIM inversion
to predict the noisy latent x_t. In particular, DDIM inversion iteratively predicts an
invertible noisy latent trajectory {x_δT, x_2δT, ..., x_t}:

x_t = √ᾱ_t x̂_0^s + √(1 − ᾱ_t) ϵ_ϕ(x_s, s, ∅) = √ᾱ_t (x̂_0^s + γ(t) ϵ_ϕ(x_s, s, ∅))    (9)

where s = t − δT, and x̂_0^s = x_s/√ᾱ_s − γ(s) ϵ_ϕ(x_s, s, ∅). With some simple computation,
we can expand x̂_0^s as:

x̂_0^s = x_0 − γ(δT)[ϵ_ϕ(x_δT, δT, ∅) − ϵ_ϕ(x_0, 0, ∅)] − · · ·
       − γ(s)[ϵ_ϕ(x_s, s, ∅) − ϵ_ϕ(x_{s−δT}, s − δT, ∅)]    (10)

Thanks to the invertibility of DDIM inversion, we greatly improve the consistency of the
pseudo-GT (i.e., x̂_0^t) with x_0 for all t, which is crucial for the operations that follow.
To conserve space, please refer to our supplement for the analysis.
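The inversion trajectory of Eq. (9) can be sketched as a deterministic loop. `eps_phi_uncond` is a hypothetical stand-in for the unconditional predictor ϵ_ϕ(·, ·, ∅); determinism (the same x_0 always yields the same trajectory) is the property that matters here:

```python
import numpy as np

alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule

def gamma(t):
    """gamma(t) = sqrt(1 - bar_a_t) / sqrt(bar_a_t), as defined in Sec. 3.2."""
    return np.sqrt(1.0 - alphas_bar[t]) / np.sqrt(alphas_bar[t])

def eps_phi_uncond(x, t):
    """Hypothetical stand-in for eps_phi(x_s, s, ∅)."""
    return 0.1 * x

def ddim_invert(x0, t_target, delta_T):
    """Deterministically lift x0 along {x_dT, x_2dT, ..., x_t} via Eq. (9)."""
    x, s = x0, 0
    traj = [x0]
    while s < t_target:
        t = s + delta_T
        e = eps_phi_uncond(x, s)
        x0_hat = x / np.sqrt(alphas_bar[s]) - gamma(s) * e    # current clean estimate
        x = np.sqrt(alphas_bar[t]) * (x0_hat + gamma(t) * e)  # step s -> t, Eq. (9)
        traj.append(x)
        s = t
    return traj

traj = ddim_invert(np.ones((4, 4)), t_target=600, delta_T=100)
```

Because no noise is sampled in the loop, re-running the inversion on the same rendered view reproduces the same x_t, unlike the stochastic perturbation of Eq. (3).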

3.3.2 Interval Score Matching


Another drawback of SDS is that it produces pseudo-GTs with only a single-step prediction
from x_t for every t, which makes it difficult to ensure high-quality pseudo-GTs. On this
foundation, we also want to enhance the visual quality of the pseudo-GTs. Intuitively, this
can be accomplished by replacing the single-step estimated pseudo-GT
x̂_0^t = x_t/√ᾱ_t − γ(t) ϵ_ϕ(x_t, t, y) with a multi-step one, denoted as x̃_0^t := x̃_0,
using the multi-step DDIM denoising procedure, i.e., iterating

x̃_{t−δT} = √ᾱ_{t−δT} (x̂_0^t + γ(t − δT) ϵ_ϕ(x_t, t, y))    (11)

until reaching x̃_0. Note that, unlike DDIM inversion (Eq. (9)), this denoising process is
conditioned on y. This matches the behavior of SDS (Eq. (6)), i.e., SDS imposes unconditional
noise ϵ during forwarding and denoises the noisy latent with a conditional model
ϵ_ϕ(x_t, t, y). Intuitively, by replacing x̂_0^t in Eq. (8) with x̃_0^t, we arrive at a naive
alternative to SDS, where:

∇_θ L(θ) = E_{t,c}[ (ω(t)/γ(t)) (x_0 − x̃_0^t) ∂g(θ, c)/∂θ ]    (12)

While x̃_0^t may yield better guidance, its computational overhead is prohibitively large,
restricting the algorithm's applicability. This encourages us to investigate the issue more
thoroughly and look for a more efficient solution.
To this end, we examine the inversion procedure together with the denoising of x̃_0^t. First,
we unroll the iterative procedure in Eq. (11) as:

x̃_0^t = x_t/√ᾱ_t − γ(t) ϵ_ϕ(x_t, t, y) + γ(s)[ϵ_ϕ(x_t, t, y) − ϵ_ϕ(x̃_s, s, y)] + · · ·
       + γ(δT)[ϵ_ϕ(x̃_2δT, 2δT, y) − ϵ_ϕ(x̃_δT, δT, y)]    (13)

Then, combining Eq. (9) with Eq. (13), we can transform Eq. (12) as follows:

∇_θ L(θ) = E_{t,c}[ (ω(t)/γ(t)) (γ(t)[ϵ_ϕ(x_t, t, y) − ϵ_ϕ(x_s, s, ∅)] + η_t) ∂g(θ, c)/∂θ ]    (14)

where γ(t)[ϵ_ϕ(x_t, t, y) − ϵ_ϕ(x_s, s, ∅)] contains the interval scores [5], and we
summarize the bias term η_t as:

η_t = γ(s)[ϵ_ϕ(x̃_s, s, y) − ϵ_ϕ(x_{s−δT}, s − δT, ∅)] − γ(s)[ϵ_ϕ(x_t, t, y) − ϵ_ϕ(x_s, s, ∅)]
    + · · · + γ(δT)[ϵ_ϕ(x̃_δT, δT, y) − ϵ_ϕ(x_0, 0, ∅)] − γ(δT)[ϵ_ϕ(x̃_2δT, 2δT, y) − ϵ_ϕ(x_δT, δT, ∅)]    (15)

Interestingly, η_t consists of a set of neighboring interval scores with opposite scales that
are expected to cancel each other out. Furthermore, since η_t consists of a number of score
residuals that depend primarily on δT, a hyperparameter unrelated to the 3D representation,
explicitly decreasing it is beyond our goal. Therefore, we propose to omit η_t, which
increases training efficiency without sacrificing distillation quality.

Figure 3: Overview of LucidDreamer. In our pipeline, a pretrained generator [24] is first used
to initialize the 3D representation (i.e., Gaussian Splatting [20]) θ with prompt y. Together
with a pretrained 2D DDPM [25], we use DDIM inversion to disturb random views x_0 = g(θ, c)
into unconditional noisy latent trajectories (x_0, ..., x_s, x_t). We then update θ with the
interval score. Refer to Section 3.3 for further information.
Thus, by ignoring the bias term η_t and concentrating on minimizing the interval score, we
present an efficient substitute for Eq. (12), which we call Interval Score Matching (ISM). In
particular, given a prompt y and the noisy latents x_s and x_t produced via DDIM inversion
from x_0, the ISM loss is defined as:

min_{θ∈Θ} L_ISM(θ) := E_{t,c}[ ω(t) ||ϵ_ϕ(x_t, t, y) − ϵ_ϕ(x_s, s, ∅)||² ]    (16)

Accordingly, the gradient of the ISM loss over θ is given by:

∇_θ L_ISM(θ) := E_{t,c}[ ω(t) (ϵ_ϕ(x_t, t, y) − ϵ_ϕ(x_s, s, ∅)) ∂g(θ, c)/∂θ ]    (17)

where (ϵ_ϕ(x_t, t, y) − ϵ_ϕ(x_s, s, ∅)) is the ISM update direction [5].

Despite the omission of η_t from Eq. (14), optimizing the ISM objective still updates x_0
towards high-quality, computationally friendly, and feature-consistent pseudo-GTs. Hence, ISM
is consistent with the basic ideas of SDS-like objectives [26], but realizes them in a more
refined way.
Consequently, ISM offers a number of benefits over earlier approaches. First, because ISM
consistently produces high-quality pseudo-GTs, we obtain high-fidelity distillation results
with rich features and fine structure. This removes the need for a large conditional guidance
scale [12] and increases the flexibility of 3D content creation. Second, unlike previous
efforts [27], switching from SDS to ISM incurs little computational overhead. Meanwhile,
although ISM requires additional computation for DDIM inversion, it does not impair overall
efficiency, as 3D distillation with ISM typically converges in fewer iterations. Kindly
consult our supplement for further information.
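Putting the pieces together, one ISM training step (Eqs. (16)–(17)) reduces to a difference of two score predictions along the inverted trajectory. A hedged, self-contained sketch with a hypothetical predictor `eps_phi` (the conditional/unconditional split is modeled by the `y` argument):

```python
import numpy as np

alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule

def gamma(t):
    return np.sqrt(1.0 - alphas_bar[t]) / np.sqrt(alphas_bar[t])

def eps_phi(x, t, y=None):
    """Hypothetical predictor; y=None plays the role of the empty prompt ∅."""
    return 0.3 * x if y is None else 0.3 * x + 0.05

def invert(x0, t_target, delta_T):
    """Uniform-stride DDIM inversion (Eq. (9)), returning the full trajectory."""
    x, s, traj = x0, 0, [x0]
    while s < t_target:
        t = s + delta_T
        e = eps_phi(x, s)                                     # unconditional score
        x0_hat = x / np.sqrt(alphas_bar[s]) - gamma(s) * e
        x = np.sqrt(alphas_bar[t]) * (x0_hat + gamma(t) * e)
        traj.append(x)
        s = t
    return traj

def ism_loss(x0, t, delta_T, y="prompt"):
    """L_ISM of Eq. (16): match eps_phi(x_t, t, y) against eps_phi(x_s, s, ∅)."""
    traj = invert(x0, t, delta_T)
    xs, xt = traj[-2], traj[-1]                               # s = t - delta_T
    direction = eps_phi(xt, t, y) - eps_phi(xs, t - delta_T)  # ISM update direction
    return np.sum(direction ** 2)                             # omega(t) = 1 for brevity

loss = ism_loss(np.ones((4, 4)), t=400, delta_T=100)
```

In a real pipeline the direction would be back-propagated through the renderer g(θ, c) per Eq. (17); the loss here is deterministic for a fixed rendered view, which is the key contrast with SDS.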

Figure 4: Comparison of text-to-3D generation with baseline approaches. The experiment
demonstrates that our method can produce high-resolution 3D content with fine details that
aligns well with the given text prompts. Our approach's execution time is evaluated on a
single A100 GPU with a view batch size of 4 and δS = 200. Please zoom in for further detail.

Meanwhile, because standard DDIM inversion usually takes a fixed stride, the cost of
trajectory estimation rises linearly with t. However, supervising θ at larger timesteps is
usually beneficial. Therefore, instead of estimating the latent trajectory with a uniform
stride, we propose to accelerate the process by predicting x_s with larger step sizes δS. We
find that this approach significantly shortens the training period without sacrificing
distillation quality. Furthermore, we provide a quantitative examination of the effects of δT
and δS in Sec. 4.1. Overall, Fig. 3 and Algorithm 1 summarize our proposed ISM.
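The accelerated schedule described above can be expressed as a list of visited timesteps: coarse strides of δS up to s = t − δT, then one fine step of δT. This is a sketch of one plausible reading of the description; the exact scheduling used by the method is an assumption here:

```python
def inversion_timesteps(t, delta_T, delta_S):
    """Timesteps visited when inverting to x_t with coarse stride delta_S.

    Coarse steps of size delta_S run from 0 up to s = t - delta_T; a final
    fine step of size delta_T lands exactly on t.
    """
    s = t - delta_T
    return list(range(0, s, delta_S)) + [s, t]

# e.g. t=650, delta_T=50, delta_S=200 visits far fewer steps than a uniform
# stride of 50, which is where the training-time saving comes from.
coarse = inversion_timesteps(650, 50, 200)
uniform = inversion_timesteps(650, 50, 50)
```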

3.4 The Advanced Generation Pipeline


We also investigate the variables that could impact the visual quality of text-to-3D
generation and propose an enhanced pipeline built on our ISM. In particular, we adopt 3D
Gaussian Splatting (3DGS) as our 3D model and use text-to-point generative models to produce
point clouds for initialization.

3.4.1 3D Gaussian Splatting
According to empirical observations from previous efforts, raising the rendering resolution
and training batch size considerably enhances visual quality. Nonetheless, the majority of
learnable 3D representations used in text-to-3D generation [27] require significant time and
memory. In contrast, 3D Gaussian Splatting [27] offers extremely efficient rendering and
optimization. This motivates our approach to enable large batch sizes and high-resolution
rendering even with constrained computational resources.

3.4.2 Initialization
The majority of earlier techniques [28] initialize their 3D representation with simple
geometries like cylinders, boxes, and spheres, which can yield unfavorable outcomes on
objects that are not axially symmetric. Since we introduce 3DGS as our 3D representation, we
can easily use one of several text-to-point generative models [29] to produce a coarse
initialization with human priors. As demonstrated in Sec. 4.1, this initialization method
significantly increases convergence speed.
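A coarse point cloud from a text-to-point model can seed the Gaussians directly: one Gaussian per point, with small isotropic scales and low opacity. The field names and default values below follow common 3DGS implementations and are assumptions, not this proposal's exact interface:

```python
import numpy as np

def init_gaussians_from_points(points, colors, init_scale=0.02, init_opacity=0.1):
    """Initialize a 3D Gaussian Splatting model from an (N, 3) point cloud."""
    n = points.shape[0]
    return {
        "means": points.astype(np.float32),                 # Gaussian centers
        "scales": np.full((n, 3), init_scale, np.float32),  # small isotropic extents
        "rotations": np.tile(                               # identity quaternions
            np.array([1.0, 0.0, 0.0, 0.0], np.float32), (n, 1)),
        "opacities": np.full((n, 1), init_opacity, np.float32),
        "colors": colors.astype(np.float32),                # per-point RGB
    }

pts = np.random.rand(1024, 3)    # stand-in for a text-to-point model's output
cols = np.random.rand(1024, 3)
gaussians = init_gaussians_from_points(pts, cols)
```

Starting from the object's rough shape, rather than a sphere or box, gives the distillation a geometry-aware starting point, which is what accelerates convergence on non-symmetric objects.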

References
1. Adhikari, K., et al., High‐resolution 3‐D mapping of soil texture in Denmark. Soil Science Society
of America Journal, 2013. 77(3): p. 860-876.
2. Feng, F., et al., Deep learning-enabled orbital angular momentum-based information encryption
transmission. ACS Photonics, 2022. 9(3): p. 820-829.
3. Qian, G., et al., Magic123: One image to high-quality 3d object generation using both 2d and 3d
diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
4. Qian, G., Towards Scalable Deep 3D Perception and Generation. 2023.
5. Yu, J., et al., PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score
Distillation. arXiv preprint arXiv:2310.09458, 2023.
6. Poole, B., et al., Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988,
2022.
7. Shi, Y., et al., Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512,
2023.
8. Li, W., et al., DW-GAN: Toward High-Fidelity Color-Tones of GAN-Generated Images With
Dynamic Weights. IEEE Transactions on Neural Networks and Learning Systems, 2023.
9. Song, S., et al., Near Field 3-D Millimeter-Wave SAR Image Enhancement and Detection with
Application of Antenna Pattern Compensation. Sensors, 2022. 22(12): p. 4509.
10. Bosc, E., et al., Towards a new quality metric for 3-D synthesized view assessment. IEEE Journal
of Selected Topics in Signal Processing, 2011. 5(7): p. 1332-1343.
11. Li, Z., et al., MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-
to-3D Generation. arXiv preprint arXiv:2311.14494, 2023.
12. Zhuang, J., et al. Dreameditor: Text-driven 3d scene editing with neural fields. in SIGGRAPH Asia
2023 Conference Papers. 2023.
13. Chen, Y., et al., It3d: Improved text-to-3d generation with explicit view synthesis. arXiv preprint
arXiv:2308.11473, 2023.
14. Long, X., et al., Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint
arXiv:2310.15008, 2023.
15. Jakab, T., et al., Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion. arXiv
preprint arXiv:2304.10535, 2023.
16. Li, W., et al., Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d.
arXiv preprint arXiv:2310.02596, 2023.
17. Liu, Y.-T., et al., threestudio: a modular framework for diffusion-guided 3D generation.
18. Gu, J., et al. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware
diffusion. in International Conference on Machine Learning. 2023. PMLR.
19. Bahmani, S., et al., 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling. arXiv
preprint arXiv:2311.17984, 2023.
20. Han, X., et al. Headsculpt: Crafting 3d head avatars with text. in Thirty-seventh Conference on
Neural Information Processing Systems. 2023.
21. Yu, C., et al. Points-to-3d: Bridging the gap between sparse points and shape-controllable text-
to-3d generation. in Proceedings of the 31st ACM International Conference on Multimedia. 2023.

22. Sella, E., et al. Vox-e: Text-guided voxel editing of 3d objects. in Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2023.
23. Hertz, A., K. Aberman, and D. Cohen-Or. Delta denoising score. in Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2023.
24. Yu, X., et al., Text-to-3d with classifier score distillation. arXiv preprint arXiv:2310.19415, 2023.
25. Xing, X., et al., DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion
Models. arXiv preprint arXiv:2306.14685, 2023.
26. Court, C.J., et al., 3-D inorganic crystal structure generation and property prediction via
representation learning. Journal of Chemical Information and Modeling, 2020. 60(10): p. 4518-
4535.
27. Song, J.H., et al., Automated 3-D mapping of single neurons in the standard brain atlas using
single brain slices. bioRxiv, 2018: p. 373134.
28. Tian, Q., et al., TFGAN: Time and frequency domain based generative adversarial network for
high-fidelity speech synthesis. arXiv preprint arXiv:2011.12206, 2020.
29. Choi, E., et al., A high-fidelity phantom for the simulation and quantitative evaluation of
transurethral resection of the prostate. Annals of biomedical engineering, 2020. 48: p. 437-446.
