Proposal: Text-to-3D Generation Using Interval Score Matching for High Fidelity
By
Mr. XYZ
Table of Contents

Abstract
Chapter 1: Introduction
  1.1 Overview
  1.2 Rationale of the Research
  1.3 Objectives of the Paper
Chapter 2: Related Works
  2.1 Text-to-3D Generation
  2.2 Differentiable 3D Representations
  2.3 Diffusion Models
Chapter 3: Research Methodology
  3.1 Revisiting the SDS
  3.2 SDS Analysis
  3.3 Interval Score Matching
    3.3.1 DDIM Inversion
    3.3.2 Interval Score Matching
  3.4 The Advanced Generation Pipeline
    3.4.1 3D Gaussian Splatting
    3.4.2 Initialization
Abstract
Chapter 1:
Introduction
1.1 Overview
In the digital age, 3D assets have become essential for viewing, understanding, and
interacting with intricate objects and environments that mimic our real-life experience. Their
influence spans many fields, such as virtual and augmented reality, gaming, architecture,
animation, and retail, as well as online conferencing and education. Yet the widespread use of
3D technologies faces a major difficulty: producing high-quality 3D content requires a great
deal of time, effort, and specialized expertise [1].
Chapter 2:
Related Works
Methods built on implicit representations [18] find it difficult to generate high-resolution
images during distillation that match the resolution of the diffusion model, owing to the
laborious rendering process of implicit representations; this restriction yields sub-optimal
results. To address it, recent work in this field builds complex 3D assets on textured meshes
[18], which are renowned for efficient explicit rendering [19], improving performance.
Concurrently, 3D Gaussian Splatting [20], another powerful explicit representation, exhibits
exceptional efficacy in reconstruction tasks. In this work, we adopt 3D Gaussian Splatting [21]
as the 3D representation of our framework.
where \bar\alpha_t := \prod_{s=1}^{t}(1-\beta_s), and \mu_\phi(x_t), \Sigma_\phi(x_t) denote the predicted mean and variance given x_t, respectively.
Chapter 3
Research Methodology
3.1 Revisiting the SDS

SDS models the conditional posterior p_\phi(x_t \mid y) with a pretrained DDPM, and then seeks modes of this conditional posterior in order to distill the 3D representation \theta. This can be done by reducing the KL divergence for all t, which leads to the following loss:

\min_{\theta\in\Theta}\mathcal{L}_{\mathrm{SDS}}(\theta) := \mathbb{E}_{t,\epsilon,c}\big[\omega(t)\,\|\epsilon_\phi(x_t,t,y)-\epsilon\|_2^2\big] \quad (5)

where \epsilon \sim \mathcal{N}(0, I) is the ground-truth denoising direction of x_t at timestep t, \epsilon_\phi(x_t, t, y) is the predicted denoising direction under condition y, and x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon is the noised rendering x_0 = g(\theta, c). Ignoring the UNet Jacobian [23], the gradient of the SDS loss on \theta is given by:

\nabla_\theta\mathcal{L}_{\mathrm{SDS}}(\theta) \approx \mathbb{E}_{t,\epsilon,c}\left[\omega(t)\big(\epsilon_\phi(x_t,t,y)-\epsilon\big)\frac{\partial g(\theta,c)}{\partial\theta}\right] \quad (6)
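As a toy numerical sketch (not the authors' implementation; the noise schedule and the stand-in denoiser `eps_phi` below are illustrative assumptions), the SDS update direction \omega(t)(\epsilon_\phi(x_t, t, y) - \epsilon) can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DDPM noise schedule: alpha_bar_t = prod_{s<=t} (1 - beta_s), decreasing in t.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_phi(x_t, t, y=None):
    """Stand-in for the pretrained denoiser epsilon_phi(x_t, t, y);
    a real system would query a text-conditioned diffusion U-Net here."""
    return 0.1 * np.ones_like(x_t)

def sds_update(x0, t, w=1.0):
    """SDS update direction w(t) * (eps_phi(x_t, t, y) - eps),
    with the U-Net Jacobian ignored as in the text."""
    eps = rng.standard_normal(x0.shape)                              # ground-truth noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return w * (eps_phi(x_t, t) - eps)

x0 = np.zeros(4)          # a flattened rendering g(theta, c)
g = sds_update(x0, t=500)
print(g.shape)            # (4,)
```

In practice this direction would be back-propagated through the differentiable renderer g(\theta, c) to update \theta.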
3.2 SDS Analysis
To lay a clearer foundation for the upcoming discussion, we denote \gamma(t) = \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}} and equivalently transform Eq. (5) into an alternative form as follows:

\min_{\theta\in\Theta}\mathcal{L}_{\mathrm{SDS}}(\theta) := \mathbb{E}_{t,\epsilon,c}\left[\frac{\omega(t)}{\gamma(t)}\left\|\gamma(t)\big(\epsilon_\phi(x_t,t,y)-\epsilon\big)+\frac{x_t-x_t}{\sqrt{\bar\alpha_t}}\right\|_2^2\right] = \mathbb{E}_{t,\epsilon,c}\left[\frac{\omega(t)}{\gamma(t)}\left\|x_0-\hat{x}_0^t\right\|_2^2\right] \quad (7)

where \hat{x}_0^t := \frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\phi(x_t,t,y)}{\sqrt{\bar\alpha_t}} is the single-step DDPM estimate of x_0 from x_t, and the corresponding gradient is

\nabla_\theta\mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t,\epsilon,c}\left[\frac{\omega(t)}{\gamma(t)}\big(x_0-\hat{x}_0^t\big)\frac{\partial g(\theta,c)}{\partial\theta}\right] \quad (8)
This means that the SDS objective can be viewed as matching the 3D model's view x_0
with \hat{x}_0^t (i.e., the pseudo-GT) that the DDPM estimates in a single step from x_t. However, we
find that this distillation paradigm overlooks several important properties of DDPMs. Figure 2
illustrates how the pretrained DDPM predicts feature-inconsistent pseudo-GTs, which are
occasionally of poor distillation quality. Under such unfavorable conditions, every updating
direction provided by Eq. (8) is nonetheless applied to \theta, which unavoidably results in
over-smoothed outcomes. We trace the causes of these phenomena to two main factors.
First, it is crucial to understand a fundamental assumption of SDS: using the input view x_0, it
creates pseudo-GTs with a 2D DDPM, and it then uses these pseudo-GTs to optimize x_0.
SDS does this by first perturbing x_0 to x_t with random noise and then estimating \hat{x}_0^t as the
pseudo-GT, as shown by Eq. (8). However, the DDPM is highly sensitive to its input: even
small changes in x_t can substantially alter the characteristics of the pseudo-GT. We find that
these fluctuations are unavoidable during distillation, since they stem both from the randomness
of the camera pose of x_0 and from the randomness of the noise component of x_t. Ultimately,
optimizing x_0 towards inconsistent pseudo-GTs produces feature-averaged results, as the final
column of Figure 2 illustrates.
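To make the single-step pseudo-GT concrete, here is a small numerical sketch (the schedule values are illustrative assumptions): \hat{x}_0^t = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_{\mathrm{pred}})/\sqrt{\bar\alpha_t}, and when the predicted noise equals the true noise the pseudo-GT recovers x_0 exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative DDPM schedule.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)

def pseudo_gt(x_t, eps_pred, t):
    """Single-step DDPM estimate of x_0 from x_t and a predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

x0 = rng.standard_normal(4)
t = 600
eps = rng.standard_normal(4)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# A perfect prediction recovers x0. Note the 1/sqrt(alpha_bar[t]) factor:
# small perturbations of x_t are amplified at large t, hence the sensitivity
# of the pseudo-GT discussed above.
print(np.allclose(pseudo_gt(x_t, eps, t), x0))  # True
```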
Second, Eq. (8) shows that SDS produces pseudo-GTs with a single-step prediction for
every t, ignoring the fact that single-step DDPM predictions are typically unable to reach high
quality. Such single-step pseudo-GTs are occasionally blurry or lack detail, as the middle
columns of Fig. 2 show, which clearly impedes the distillation process. We therefore argue
that the SDS objective may be sub-optimal for distilling 3D assets. Motivated by these
findings, we seek to resolve the aforementioned problems in order to improve the results.
3.3 Interval Score Matching
It should be noted that the issues described above stem from the inconsistent and
occasionally low-quality pseudo-ground-truth \hat{x}_0^t matched against x_0 = g(\theta, c). In this
section, we offer a substitute for SDS that considerably reduces these issues.

Our main premise has two parts. First, we aim to obtain more consistent pseudo-GTs
during distillation, irrespective of noise variability and camera pose. Second, we aim to
produce visually high-quality pseudo-GTs.
3.3.1 DDIM Inversion

Thanks to the invertibility of DDIM inversion, we greatly improve the consistency of the
pseudo-GT (i.e., \hat{x}_0^t) with x_0 for all t, which is crucial for the operations that follow. To
conserve space, please refer to our supplement for the analysis.
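The invertibility can be sketched numerically (a minimal toy, with an assumed schedule and a stand-in denoiser; a constant noise prediction keeps the round trip exact, whereas a real model makes it approximately exact):

```python
import numpy as np

# Illustrative DDPM schedule.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_phi(x, t):
    """Stand-in unconditional denoiser; constant so the demo inverts exactly."""
    return 0.3 * np.ones_like(x)

def ddim_step(x, t_from, t_to):
    """Deterministic DDIM transition: t_to > t_from inverts (adds noise),
    t_to < t_from denoises; both use the same noise prediction rule."""
    a_from, a_to = alpha_bar[t_from], alpha_bar[t_to]
    eps = eps_phi(x, t_from)
    x0_hat = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_hat + np.sqrt(1.0 - a_to) * eps

x_s = np.array([0.2, -0.4, 0.1, 0.7])
x_t = ddim_step(x_s, t_from=100, t_to=600)      # inversion: s -> t
x_back = ddim_step(x_t, t_from=600, t_to=100)   # denoising: t -> s
print(np.allclose(x_back, x_s))  # True
```

Because the forward (inversion) trajectory is deterministic rather than randomly noised, the pseudo-GT derived from it no longer fluctuates with the sampled noise.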
The denoising then proceeds until \tilde{x}_0. Note that, unlike DDIM inversion (Eq. (9)), this
denoising process is conditioned on y. This matches the behavior of SDS (Eq. (6)): SDS imposes
unconditional noise \epsilon during the forward process and denoises the noisy latent with a
conditional model \epsilon_\phi(x_t, t, y). Intuitively, replacing \hat{x}_0^t in Eq. (8) with \tilde{x}_0^t yields a naive
alternative to SDS, where:
While \tilde{x}_0^t may provide better guidance, its computational overhead is prohibitively large,
which restricts the algorithm's applicability. This encourages us to examine the issue more
thoroughly and look for a more efficient solution.
To this end, we examine the inversion procedure in conjunction with the denoising of \tilde{x}_0^t.
First, we collapse the iterative procedure in Eq. (11) into a single expression (Eq. (13)).
Then, combining Eq. (9) with Eq. (13), we can transform Eq. (12) as follows:
\nabla_\theta\mathcal{L}(\theta) = \mathbb{E}_{t,c}\left[\frac{\omega(t)}{\gamma(t)}\Big(\gamma(t)\underbrace{\big[\epsilon_\phi(x_t,t,y)-\epsilon_\phi(x_s,s,\varnothing)\big]}_{\text{interval scores}}+\eta_t\Big)\frac{\partial g(\theta,c)}{\partial\theta}\right] \quad (14)
Interestingly, \eta_t consists of a series of neighboring interval scores with opposite scales
that are expected to cancel each other out. Moreover, since \eta_t comprises score residuals
that are more closely tied to \delta_T, a hyperparameter unrelated to the 3D representation,
reducing it further falls outside our goal. We therefore propose ignoring \eta_t to increase
training efficiency without noticeably sacrificing performance, which leads to:
\nabla_\theta\mathcal{L}_{\mathrm{ISM}}(\theta) := \mathbb{E}_{t,c}\left[\omega(t)\underbrace{\big(\epsilon_\phi(x_t,t,y)-\epsilon_\phi(x_s,s,\varnothing)\big)}_{\text{ISM update direction}}\frac{\partial g(\theta,c)}{\partial\theta}\right]. \quad (17)
Even with \eta_t removed from Eq. (14), optimizing the ISM objective still focuses on
updating x_0 towards high-quality, computationally friendly, feature-consistent pseudo-GTs.
ISM is therefore consistent with the basic ideas of SDS-like objectives [26], but in a more
refined way.
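A toy sketch of the ISM update direction in Eq. (17), with stand-in denoisers (the conditional/unconditional split is simulated by a fixed offset, and the inversion to x_s and x_t is a single jump here; a real system would run DDIM inversion with a text-conditioned diffusion model):

```python
import numpy as np

# Illustrative DDPM schedule.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_phi(x, t, y=None):
    """Stand-in denoiser: conditioning on a prompt y shifts the prediction."""
    return 0.2 * np.ones_like(x) + (0.1 if y is not None else 0.0)

def ism_update(x0, s, t, w=1.0):
    """w(t) * (eps_phi(x_t, t, y) - eps_phi(x_s, s, None)), where x_s and x_t
    come from (here, single-jump) DDIM inversion of x0."""
    eps_s = eps_phi(x0, s)                                           # unconditional score
    x_s = np.sqrt(alpha_bar[s]) * x0 + np.sqrt(1.0 - alpha_bar[s]) * eps_s
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps_phi(x_s, s)
    # Conditional score at (x_t, t) minus unconditional score at (x_s, s).
    return w * (eps_phi(x_t, t, y="a prompt") - eps_phi(x_s, s))

u = ism_update(np.zeros(4), s=200, t=500)
print(u)  # [0.1 0.1 0.1 0.1]
```

Note that, unlike the SDS sketch, no random noise is sampled: both endpoints of the interval come from deterministic inversion, which is what stabilizes the pseudo-GT.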
Consequently, ISM offers several benefits over earlier approaches. First, because ISM
consistently produces high-quality pseudo-GTs, we obtain high-fidelity distillation results
with rich features and fine structure; this removes the need for a large conditional guidance
scale [12] and increases the flexibility of 3D content creation. Second, in contrast to previous
efforts [27], switching from SDS to ISM incurs only a little computational overhead. Moreover,
although ISM requires additional computation for DDIM inversion, 3D distillation with ISM
typically converges in fewer iterations, so overall efficiency is not impaired. Please refer to
our supplement for further details.
Meanwhile, because standard DDIM inversion usually proceeds with a fixed stride, the cost
of estimating the trajectory rises linearly with t. However, supervising \theta at larger timesteps
is usually advantageous. We therefore propose accelerating the process by predicting x_s with
larger step sizes \delta_S, rather than estimating the latent trajectory with a uniform stride. We
find that this approach significantly shortens training time without sacrificing distillation
quality. Furthermore, we provide a quantitative examination of the effects of \delta_T and \delta_S in
Sec. 4.1. Overall, Fig. 3 and Algorithm 1 summarize our proposed ISM.
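The cost saving can be illustrated with a hypothetical schedule helper (\delta_S is the inversion step size; the exact schedule used in practice may differ):

```python
def inversion_schedule(t, delta_s):
    """Timesteps visited when inverting from 0 up to t with stride delta_s
    (hypothetical helper for illustration)."""
    return list(range(0, t, delta_s)) + [t]

# Unit stride: one network evaluation per timestep -> cost grows linearly with t.
print(len(inversion_schedule(600, 1)))   # 601
# Larger stride delta_s: roughly t / delta_s evaluations.
print(len(inversion_schedule(600, 50)))  # 13
```

Each visited timestep costs one denoiser evaluation, so the stride directly controls the number of network calls per distillation step.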
3.4 The Advanced Generation Pipeline
3.4.1 3D Gaussian Splatting
According to empirical observations from previous work, raising the rendering resolution
and the training batch size considerably enhances visual quality. However, most learnable 3D
representations used in text-to-3D generation [27] demand significant time and memory. In
contrast, 3D Gaussian Splatting [27] offers highly efficient rendering and optimization, which
motivates our approach to support large batch sizes and high-resolution rendering even under
constrained computational resources.
3.4.2 Initialization
The majority of earlier techniques [28] initialize their 3D representation with simple
geometries such as cylinders, boxes, and spheres, which can yield unfavorable results on
objects that are not axially symmetric. Since we adopt 3DGS as our 3D representation, we can
easily use text-to-point-cloud generative models [29] to produce a coarse initialization
informed by human priors. As demonstrated in Sec. 4.1, this initialization method significantly
increases the convergence speed.
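Such an initialization can be sketched as follows, assuming a point cloud is already available (one isotropic Gaussian per point; the nearest-neighbor scale heuristic is our illustrative assumption, not necessarily the exact recipe used):

```python
import numpy as np

def init_gaussians_from_points(points, k=3, opacity=0.1):
    """Hypothetical 3DGS initialization: means from the point cloud,
    isotropic scales from the mean distance to the k nearest neighbors."""
    n = points.shape[0]
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # ignore self-distance
    knn = np.sort(d, axis=1)[:, :k]    # k nearest neighbors per point
    return {
        "means": points.copy(),        # Gaussian centers
        "scales": knn.mean(axis=1),    # per-point isotropic scale
        "opacities": np.full(n, opacity),
    }

pts = np.random.default_rng(2).standard_normal((100, 3))
g = init_gaussians_from_points(pts)
print(g["means"].shape, g["scales"].shape)  # (100, 3) (100,)
```

Seeding the Gaussians on the generated points gives the optimizer a rough shape to refine, rather than forcing it to carve an object out of a sphere or box.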
References
1. Adhikari, K., et al., High‐resolution 3‐D mapping of soil texture in Denmark. Soil Science Society
of America Journal, 2013. 77(3): p. 860-876.
2. Feng, F., et al., Deep learning-enabled orbital angular momentum-based information encryption
transmission. ACS Photonics, 2022. 9(3): p. 820-829.
3. Qian, G., et al., Magic123: One image to high-quality 3d object generation using both 2d and 3d
diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
4. Qian, G., Towards Scalable Deep 3D Perception and Generation. 2023.
5. Yu, J., et al., PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score
Distillation. arXiv preprint arXiv:2310.09458, 2023.
6. Poole, B., et al., Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988,
2022.
7. Shi, Y., et al., Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512,
2023.
8. Li, W., et al., DW-GAN: Toward High-Fidelity Color-Tones of GAN-Generated Images With
Dynamic Weights. IEEE Transactions on Neural Networks and Learning Systems, 2023.
9. Song, S., et al., Near Field 3-D Millimeter-Wave SAR Image Enhancement and Detection with
Application of Antenna Pattern Compensation. Sensors, 2022. 22(12): p. 4509.
10. Bosc, E., et al., Towards a new quality metric for 3-D synthesized view assessment. IEEE Journal
of Selected Topics in Signal Processing, 2011. 5(7): p. 1332-1343.
11. Li, Z., et al., MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-
to-3D Generation. arXiv preprint arXiv:2311.14494, 2023.
12. Zhuang, J., et al. Dreameditor: Text-driven 3d scene editing with neural fields. in SIGGRAPH Asia
2023 Conference Papers. 2023.
13. Chen, Y., et al., It3d: Improved text-to-3d generation with explicit view synthesis. arXiv preprint
arXiv:2308.11473, 2023.
14. Long, X., et al., Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint
arXiv:2310.15008, 2023.
15. Jakab, T., et al., Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion. arXiv
preprint arXiv:2304.10535, 2023.
16. Li, W., et al., Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d.
arXiv preprint arXiv:2310.02596, 2023.
17. Liu, Y.-T., et al., threestudio: a modular framework for diffusion-guided 3D generation.
18. Gu, J., et al. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware
diffusion. in International Conference on Machine Learning. 2023. PMLR.
19. Bahmani, S., et al., 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling. arXiv
preprint arXiv:2311.17984, 2023.
20. Han, X., et al. Headsculpt: Crafting 3d head avatars with text. in Thirty-seventh Conference on
Neural Information Processing Systems. 2023.
21. Yu, C., et al. Points-to-3d: Bridging the gap between sparse points and shape-controllable text-
to-3d generation. in Proceedings of the 31st ACM International Conference on Multimedia. 2023.
22. Sella, E., et al. Vox-e: Text-guided voxel editing of 3d objects. in Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2023.
23. Hertz, A., K. Aberman, and D. Cohen-Or. Delta denoising score. in Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2023.
24. Yu, X., et al., Text-to-3d with classifier score distillation. arXiv preprint arXiv:2310.19415, 2023.
25. Xing, X., et al., DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion
Models. arXiv preprint arXiv:2306.14685, 2023.
26. Court, C.J., et al., 3-D inorganic crystal structure generation and property prediction via
representation learning. Journal of Chemical Information and Modeling, 2020. 60(10): p. 4518-
4535.
27. Song, J.H., et al., Automated 3-D mapping of single neurons in the standard brain atlas using
single brain slices. bioRxiv, 2018: p. 373134.
28. Tian, Q., et al., TFGAN: Time and frequency domain based generative adversarial network for
high-fidelity speech synthesis. arXiv preprint arXiv:2011.12206, 2020.
29. Choi, E., et al., A high-fidelity phantom for the simulation and quantitative evaluation of
transurethral resection of the prostate. Annals of biomedical engineering, 2020. 48: p. 437-446.