
SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Yuan Liu1,2∗  Cheng Lin2,†  Zijiao Zeng2  Xiaoxiao Long1  Lingjie Liu3  Taku Komura1  Wenping Wang4,†
1 The University of Hong Kong   2 Tencent Games   3 University of Pennsylvania   4 Texas A&M University
https://liuyuan-pal.github.io/SyncDreamer/

arXiv:2309.03453v1 [cs.CV] 7 Sep 2023

Input Image Generated Multiview-consistent Images Mesh

Figure 1. SyncDreamer is able to generate multiview-consistent images from a single-view input image of arbitrary objects. The generated
multiview images can be used for mesh reconstruction by neural reconstruction methods like NeuS [74] without using SDS [49] loss.

∗ Work done during internship at Tencent Games.


† Corresponding authors.
Abstract

In this paper, we present a novel diffusion model called SyncDreamer that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, recent work Zero123 [40] demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view synthesis, text-to-3D, and image-to-3D.

1. Introduction

Humans possess a remarkable ability to perceive 3D structures from a single image. When presented with an image of an object, humans can easily imagine the other views of the object. Despite great progress [44, 69, 74, 81, 83] brought by neural networks in computer vision or graphics fields for extracting 3D information from images, generating novel view images with multiview consistency from a single-view image of an object is still a challenging problem due to the limited 3D information available in an image.

Recently, diffusion models [28, 54] have demonstrated huge success in 2D image generation, which unlocks new potential for 3D generation tasks. However, directly training a generalizable 3D diffusion model [30, 45, 46, 75] usually requires a large amount of 3D data, while existing 3D datasets are insufficient to capture the complexity of arbitrary 3D shapes. Therefore, recent methods [8, 38, 49, 73, 77] resort to distilling pretrained text-to-image diffusion models for creating 3D models from texts, which shows impressive results on this text-to-3D task. Some works [43, 52, 66, 82] extend such a distillation process to train a neural radiance field [44] (NeRF) for the image-to-3D task. In order to utilize pretrained text-to-image models, these methods have to perform textual inversion [21] to find a suitable text description of the input image. However, the distillation process along with the textual inversion usually takes a long time to generate a single shape and requires tedious parameter tuning for satisfactory quality. Moreover, due to the abundance of specific details in an image, such as object category, appearance, and pose, it is challenging to accurately represent an image using a single word embedding, which results in a decrease in the quality of 3D shapes reconstructed by the distillation method.

Instead of distillation, some recent works [5, 14, 25, 70, 72, 78, 80, 86, 89, 91] apply 2D diffusion models to directly generate multiview images for the 3D reconstruction task. The key problem is how to maintain the multiview consistency when generating images of the same object. To improve the multiview consistency, these methods allow the diffusion model to condition on the input, previously generated images [40, 72, 78, 86, 91] or renderings from a neural field [5, 25, 70]. Although some impressive results are achieved for specific object categories from ShapeNet [6] or Co3D [53], how to design a diffusion model to generate multiview-consistent images for arbitrary objects still remains unsolved.

In this paper, we propose a simple yet effective framework to generate multiview-consistent images for the single-view 3D reconstruction of arbitrary objects. The key idea is to extend the diffusion framework [28] to model the joint probability distribution of multiview images. We show that modeling the joint distribution can be achieved by introducing a synchronized multiview diffusion model. Specifically, for N target views to be generated, we construct N shared noise predictors respectively. The reverse diffusion process simultaneously generates N images by N corresponding noise predictors, where information across different images is shared among noise predictors by attention layers on every denoising step. Thus, we name our framework SyncDreamer, which synchronizes intermediate states of all noise predictors on every step in the reverse process.

SyncDreamer has the following characteristics that make it a competitive tool for lifting 2D single-view images to 3D. First, SyncDreamer retains strong generalization ability by initializing its weights from the pretrained Zero123 [40] model, which is finetuned from the Stable Diffusion model [54] on the Objaverse [13] dataset. Thus, SyncDreamer is able to reconstruct shapes from both photorealistic images and hand drawings as shown in Fig. 1. Second, SyncDreamer makes the single-view reconstruction easier than the distillation methods. Because the generated images are consistent in both geometry and appearance, we can simply run a vanilla NeRF [44] or a vanilla NeuS [74] without using any special losses for reconstruction. Given the generated images, one can easily reckon the final reconstruction quality, while it is hard for distillation methods to know the output reconstruction quality beforehand. Third, SyncDreamer maintains creativity and diversity when inferring 3D information, which enables generating multiple reasonable objects from a given image as shown in Fig. 5. In comparison, previous distillation methods can only converge to one single shape.

We quantitatively compare SyncDreamer with baseline methods on the Google Scanned Object [16] dataset. The results show that, in comparison with baseline methods, SyncDreamer is able to generate more consistent images and reconstruct better shapes from input single-view images. We further demonstrate that SyncDreamer supports various styles of 2D input like cartoons, sketches, ink paintings, and oil paintings for generating consistent views and reconstructing 3D shapes, which verifies the effectiveness of SyncDreamer in lifting 2D images to 3D shapes.

(Figure 2 panels: input view; fixed viewpoints; generated views.)
Figure 2. Given an input view of an object, SyncDreamer generates multiview-consistent images on fixed viewpoints.

2. Related Work

2.1. Diffusion models

Diffusion models [11, 28, 54] have shown impressive results on 2D image generation. Concurrent work MVDiffusion [67] also adopts the multiview diffusion formulation to synthesize textures or panoramas with known geometry. We propose similar formulations in SyncDreamer but with unknown geometry. MultiDiffusion [3] and SyncDiffusion [35] correlate multiple diffusion models for different regions of a 2D image. Many recent works [1, 7, 10, 17, 23, 27, 30, 31, 32, 34, 42, 45, 46, 48, 75, 87, 88] try to repeat the success of diffusion models on the 3D generation task. However, the scarcity of 3D data makes it difficult to directly train diffusion models on 3D and the resulting generation quality is still much worse and less generalizable than the counterpart image generation models, though some works [1, 7, 32] are trying to only use 2D images for training 3D diffusion models.

2.2. Using 2D diffusion models for 3D

Instead of directly learning a 3D diffusion model, many works resort to using high-quality 2D diffusion models [54, 55] for 3D tasks. Pioneer works DreamFusion [49] and SJC [73] propose to distill a 2D text-to-image generation model to generate 3D shapes from texts. Follow-up works [2, 8, 9, 29, 38, 59, 60, 71, 77, 79, 85, 92] improve such text-to-3D distillation methods in various aspects. Many works [43, 50, 52, 61, 66, 82] also apply such a distillation pipeline in the single-view reconstruction task. Though some impressive results are achieved, these methods usually require a long time for textual inversion [39] and NeRF optimization and they do not guarantee to get satisfactory results.

Other works [5, 14, 25, 36, 41, 65, 67, 70, 72, 78, 80, 84, 86, 91] directly apply the 2D diffusion models to generate multiview images for 3D reconstruction. [72, 86] are conditioned on the input image by attention layers for novel-view synthesis in indoor scenes. Our method also uses attention layers but is intended for object reconstruction. [80, 89] resort to estimated depth maps to warp and inpaint for novel-view image generation, which strongly relies on the performance of the external single-view depth estimator. Two concurrent works [5, 70] generate new images in an autoregressive render-and-generate manner, which demonstrates good performances on specific object categories or scenes. In comparison, SyncDreamer is targeted to reconstruct arbitrary objects and generates all images in one reverse process. The concurrent work Viewset Diffusion [65] shares a similar idea to generate a set of images. The differences between SyncDreamer and Viewset Diffusion are that SyncDreamer does not require predicting a radiance field like Viewset Diffusion but only uses attention to synchronize the states among views and SyncDreamer fixes the viewpoints of generated views for better training convergence.

2.3. Other single-view reconstruction methods

Single-view reconstruction is a challenging ill-posed problem. Before the prosperity of generative models used in 3D reconstruction, there are many works [19, 20, 33, 37, 68] that reconstruct 3D shapes from single-view images by regression [37] or retrieval [68], which have difficulty in generalizing to real data or new categories. Recent NeRF-GAN methods [4, 15, 22, 24, 47, 58] learn to generate NeRFs for specific categories like human or cat faces. These NeRF-GANs achieve impressive results on single-view image reconstruction but fail to generalize to arbitrary objects. Although some recent works also attempt to generalize NeRF-GAN to ImageNet [56, 62], training NeRF-GANs for arbitrary objects is still challenging.

3. Method

Given an input view y of an object, our target is to generate multiview images of the object. We assume that the object is located at the origin and is normalized inside a cube of length 1. The target images are generated on N fixed viewpoints looking at the object with azimuths evenly ranging from 0° to 360° and elevations of 30°, as shown in Fig. 2.

To improve the multiview consistency of generated images, we formulate this generation process as a multiview diffusion model to correlate the generation of each image. In the following, we begin with a review of diffusion models [28, 63].

3.1. Diffusion

Diffusion models [28, 63] aim to learn a probability model p_\theta(x_0) = \int p_\theta(x_{0:T}) dx_{1:T}, where x_0 is the data and x_{1:T} := x_1, ..., x_T are latent variables. The joint distribution is characterized by a Markov Chain (reverse process)

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t),    (1)

where p(x_T) = \mathcal{N}(x_T; 0, I) and p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I). \mu_\theta(x_t, t) is a trainable component while the variance \sigma_t^2 is untrained time-dependent constants [28]. The target is to learn the \mu_\theta for the generation. To learn \mu_\theta, a Markov chain called forward process is constructed as

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1}),    (2)

where q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) and \beta_t are all constants. DDPM [28] shows that by defining

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right),    (3)

where \alpha_t and \bar{\alpha}_t are constants derived from \beta_t and \epsilon_\theta is a noise predictor, we can learn \epsilon_\theta by

\ell = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t) \|^2 \right],    (4)

where \epsilon is a random variable sampled from \mathcal{N}(0, I).
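As a concrete reference for Eqs. 2-4, the sketch below implements one DDPM training step in PyTorch under common assumptions (a linear beta schedule and image-shaped data); the noise predictor `eps_theta` is a placeholder for any UNet-style network, not the actual SyncDreamer model.

```python
import torch

# A minimal DDPM training step (Eqs. 2-4), assuming a linear beta schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t
alphas = 1.0 - betas                          # alpha_t
alpha_bars = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t

def ddpm_loss(eps_theta, x0):
    """x0: (B, C, H, W) clean images; eps_theta: callable(x_t, t) -> predicted noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)        # random timestep per sample
    eps = torch.randn_like(x0)                               # Gaussian noise
    a_bar = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    # Closed-form forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # Eq. 4: regress the injected noise
    return ((eps - eps_theta(x_t, t)) ** 2).mean()
```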
3.2. Multiview diffusion

Applying the vanilla DDPM model to generate novel-view images separately would lead to difficulty in maintaining multiview consistency across different views. To address this problem, we formulate the generation process as a multiview diffusion model that correlates the generation of each view. Let us denote the N images that we want to generate on the predefined viewpoints as \{x_0^{(1)}, ..., x_0^{(N)}\}, where suffix 0 means the time step 0. We want to learn the joint distribution of all these views p_\theta(x_0^{(1:N)} | y) := p_\theta(x_0^{(1)}, ..., x_0^{(N)} | y). In the following discussion, all the probability functions are conditioned on the input view y, so we omit y for simplicity.

The forward process of the multiview diffusion model is a direct extension of the vanilla DDPM in Eq. 2, where noises are added to every view independently by

q(x_{1:T}^{(1:N)} | x_0^{(1:N)}) = \prod_{t=1}^{T} q(x_t^{(1:N)} | x_{t-1}^{(1:N)}) = \prod_{t=1}^{T} \prod_{n=1}^{N} q(x_t^{(n)} | x_{t-1}^{(n)}),    (5)

where q(x_t^{(n)} | x_{t-1}^{(n)}) = \mathcal{N}(x_t^{(n)}; \sqrt{1 - \beta_t} x_{t-1}^{(n)}, \beta_t I). Similarly, following Eq. 1, the reverse process is constructed as

p_\theta(x_{0:T}^{(1:N)}) = p(x_T^{(1:N)}) \prod_{t=1}^{T} p_\theta(x_{t-1}^{(1:N)} | x_t^{(1:N)}) = p(x_T^{(1:N)}) \prod_{t=1}^{T} \prod_{n=1}^{N} p_\theta(x_{t-1}^{(n)} | x_t^{(1:N)}),    (6)

where p_\theta(x_{t-1}^{(n)} | x_t^{(1:N)}) = \mathcal{N}(x_{t-1}^{(n)}; \mu_\theta^{(n)}(x_t^{(1:N)}, t), \sigma_t^2 I). Note that the second equation in Eq. 6 holds because we assume a diagonal variance matrix. However, the mean \mu_\theta^{(n)} of the n-th view x_{t-1}^{(n)} depends on the states of all the views x_t^{(1:N)}. Similar to Eq. 3, we define \mu_\theta^{(n)} and the training loss by

\mu_\theta^{(n)}(x_t^{(1:N)}, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t^{(n)} - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta^{(n)}(x_t^{(1:N)}, t) \right),    (7)

\ell = \mathbb{E}_{t, x_0^{(1:N)}, n, \epsilon^{(1:N)}} \left[ \| \epsilon^{(n)} - \epsilon_\theta^{(n)}(x_t^{(1:N)}, t) \|^2 \right],    (8)

where \epsilon^{(1:N)} is the standard Gaussian noise of size N \times H \times W added to all N views, \epsilon^{(n)} is the noise added to the n-th view, and \epsilon_\theta^{(n)} is the noise predictor on the n-th view.

Training procedure. In one training step, we first obtain N images x_0^{(1:N)} of the same object from the dataset. Then, we sample a timestep t and the noise \epsilon^{(1:N)}, which is added to all the images x_0^{(1:N)} to obtain x_t^{(1:N)}. After that, we randomly select a view n and apply the corresponding noise predictor \epsilon_\theta^{(n)} on the selected view to predict the noise. Finally, the L2 distance between the sampled noise \epsilon^{(n)} and the predicted noise is computed as the loss for the training.
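The training procedure described above can be summarized by the following sketch of Eq. 8: all N views are noised with a shared timestep, one view n is selected at random, and only the noise predicted for that view is supervised while the predictor receives the noisy states of all views. The signature of `eps_theta_n` is illustrative and omits the input-view condition y and the viewpoint difference described later.

```python
import torch

def multiview_diffusion_loss(eps_theta_n, x0_views, alpha_bars):
    """One training step of the synchronized multiview diffusion (Eq. 8).

    x0_views: (N, C, H, W) clean renderings of the same object.
    eps_theta_n: callable(x_t_views, t, n) -> predicted noise for view n,
                 a stand-in for the shared UNet noise predictor.
    """
    N = x0_views.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (1,), device=x0_views.device)
    eps = torch.randn_like(x0_views)                  # eps^(1:N), one noise map per view
    a_bar = alpha_bars[t].view(1, 1, 1, 1)
    # Forward process applied independently to every view (Eq. 5)
    x_t_views = a_bar.sqrt() * x0_views + (1.0 - a_bar).sqrt() * eps
    n = torch.randint(0, N, (1,)).item()              # randomly select one view
    # The predictor for view n sees the noisy states of all N views
    eps_pred = eps_theta_n(x_t_views, t, n)
    return ((eps[n] - eps_pred) ** 2).mean()          # L2 on the selected view only
```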
Synchronized N-view noise predictor. The proposed multiview diffusion model can be regarded as N synchronized noise predictors \{\epsilon_\theta^{(n)} | n = 1, ..., N\}. On each time step t, each noise predictor \epsilon_\theta^{(n)} is in charge of predicting noise on its corresponding view x_t^{(n)} to get x_{t-1}^{(n)}. Meanwhile, these noise predictors are synchronized because, on every denoising step, every noise predictor exchanges information with each other by correlating the states x_t^{(1:N)} of all the other views.

(Figure 3 components: noisy target views x_t^{(1:N)}; spatial volume of unprojected features (F, V, V, V); 3D CNN; interpolated view frustum volume (F, D, H, W); input view y and \Delta v^{(n)} processed by CLIP/text attention; depth-wise attention; pretrained Zero123 UNet with new SyncDreamer modules; concatenation of input view and target view x_t^{(n)}; output x_{t-1}^{(n)}.)

Figure 3. The pipeline of a synchronized multiview noise predictor to denoise the target view x_t^{(n)} for one step. First, a spatial feature volume is constructed from all the noisy target views x_t^{(1:N)}. Then, we construct a view frustum feature volume for x_t^{(n)} by interpolating the features of the spatial feature volume. The input view y, current target view x_t^{(n)} and viewpoint difference \Delta v^{(n)} are fed into the backbone UNet initialized from Zero123 [40]. On the intermediate feature maps of the UNet, new depth-wise attention layers are applied to extract features from the view frustum feature volume. Finally, the output of the UNet is used to denoise x_t^{(n)} to obtain x_{t-1}^{(n)}.

In practical implementation, we use a shared UNet for all N noise predictors and put the viewpoint difference \Delta v^{(n)} between the input view and the n-th target view, and the states x_t^{(1:N)} of all views, as conditions to this shared noise predictor, i.e., \epsilon_\theta^{(n)}(x_t^{(1:N)}, t) = \epsilon_\theta(x_t^{(n)}; t, \Delta v^{(n)}, x_t^{(1:N)}).
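For completeness, the corresponding reverse process (Eq. 6) can be sketched as DDPM ancestral sampling in which all N views are denoised in lockstep and the shared noise predictor sees the full set of noisy views at every step. The function signature, the image resolution and the per-step variance choice below are illustrative assumptions, and the input-view condition y is omitted.

```python
import torch

@torch.no_grad()
def synchronized_sampling(eps_theta, delta_vs, betas, shape=(16, 3, 256, 256)):
    """Generate N views jointly with ancestral sampling of the joint reverse process.

    eps_theta: callable(x_n, t, delta_v_n, x_all) -> noise for one view,
               a stand-in for the shared UNet; delta_vs holds the N viewpoint
               differences relative to the input view.
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                     # x_T^(1:N)
    for t in range(betas.shape[0] - 1, -1, -1):
        eps = torch.stack([eps_theta(x[n], t, delta_vs[n], x)   # synchronized step:
                           for n in range(shape[0])])           # every view sees x^(1:N)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                      # x_{t-1}^(1:N)
    return x
```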
3.3. 3D-aware feature attention for denoising

In this section, we discuss how to implement the synchronized noise predictor \epsilon_\theta(x_t^{(n)}; t, \Delta v^{(n)}, x_t^{(1:N)}, y) by correlating the multiview features using a 3D-aware attention scheme. The overview is shown in Fig. 3.

Backbone UNet. Similar to previous works [28, 54], our noise predictor \epsilon_\theta contains a UNet which takes a noisy image as input and then denoises the image. To ensure the generalization ability, we initialize the UNet from the pretrained weights of Zero123 [40], given that it is based on Stable Diffusion [54], which has seen billions of images and can generalize to images of various domains. Zero123 concatenates the input view with the noisy target view as the input to the UNet. Then, to encode the viewpoint difference \Delta v^{(n)} in the UNet, Zero123 reuses the text attention layers of Stable Diffusion to process the concatenation of \Delta v^{(n)} and the CLIP feature [51] of the input image. We follow the same design as Zero123 and empirically freeze the UNet and the text attention layers when training SyncDreamer. Experiments to verify these choices are presented in Sec. 4.6.

3D-aware feature attention. The remaining problem is how to correlate the states x_t^{(1:N)} of all the target views for the denoising of the current noisy target view x_t^{(n)}. To enforce consistency among multiple generated views, it is desirable for the network to perceive the corresponding features in 3D space when generating the current image. To achieve this, we first construct a 3D volume with V^3 vertices and then project the vertices onto all the target views to obtain the features. The features from each target view are concatenated to form a spatial feature volume. Next, a 3D CNN is applied to the feature volume to capture and process spatial relationships. In order to denoise the n-th target view, we construct a view frustum that is pixel-wise aligned with this view, whose features are obtained by interpolating the features from the spatial volume. Finally, on every intermediate feature map of the current view in the UNet, we apply a new depth-wise attention layer to extract features from the pixel-wise aligned view-frustum feature volume along the depth dimension.
trained weights of Zero123 [40] given that it is based on Sta- Discussion. There are two primary design considera-
ble Diffusion [54] which has seen billions of images and can tions in this 3D-aware feature attention UNet. First, the spa-
generalize to images of various domains. Zero123 concate- tial volume is constructed from all the target views and all
nates the input view with the noisy target view as the input the target views share the same spatial volume for denois-
to UNet. Then, to encode the viewpoint difference ∆v(n) in ing, which implies a global constraint that all target views
UNet, Zero123 reuses the text attention layers of Stable Dif- are looking at the same object. Second, the added new at-
fusion to process the concatenation of ∆v(n) and the CLIP tention layers only conduct attention along the depth dimen-
feature [51] of the input image. We follow the same design sion, which enforces a local epipolar line constraint that the
as Zero123 and empirically freeze the UNet and the text at- feature for a specific location should be consistent with the
tention layers when training SyncDreamer. Experiments to corresponding features on the epipolar lines of other views.
verify these choices are presented in Sec. 4.6.
3D-aware feature attention. The remaining problem 4. Experiments
(1:N )
is how to correlate the states xt of all the target views
(n) 4.1. Implementation details
for the denoising of the current noisy target view xt . To
enforce consistency among multiple generated views, it is We train SyncDreamer on the Objaverse [13] dataset
desirable for the network to perceive the corresponding fea- which contains about 800k objects. We set the viewpoint

5
Input View Ours Zero123 [40] RealFusion [43]
Figure 4. Qualitative comparison with Zero123 [40] and RealFusion [43] in multiview consistency.

Input View Generated Instance A Generated Instance B


Figure 5. Different plausible instances generated by SyncDreamer from the same input image.

We set the viewpoint number N = 16. The elevation of the target views is set to 30° and the azimuth evenly distributes in [0°, 360°]. Besides these target views, we also render 16 random views as input views on each object for training, which have the same azimuths but random elevations. We always assume that the azimuth of both the input view and the first target view is 0°. We train SyncDreamer for 80k steps (∼4 days) with 8 40G A100 GPUs using a total batch size of 192. The learning rate is annealed from 5e-4 to 1e-5. Since we need an elevation of the input view to compute the viewpoint difference \Delta v^{(n)}, we use the rendering elevation in training, while we roughly estimate an elevation angle as input in inference. To obtain surface meshes, we predict the foreground masks of the generated images using CarveKit (https://github.com/OPHoperHPO/image-background-remove-tool). Then, we train the vanilla NeuS [74] for 2k steps to reconstruct the shape, which costs about 10 mins.
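The fixed target viewpoints can be written down directly; the snippet below computes the 16 camera centers (azimuths evenly spaced over [0°, 360°), elevation 30°) on a sphere around the object. The camera radius is not specified in this section and is an assumed value.

```python
import numpy as np

def fixed_target_cameras(n_views=16, elevation_deg=30.0, radius=1.5):
    """Camera centers for the N fixed target views; radius is an assumed value."""
    azimuths = np.deg2rad(np.arange(n_views) * 360.0 / n_views)  # 0, 22.5, ..., 337.5 deg
    elevation = np.deg2rad(elevation_deg)
    x = radius * np.cos(elevation) * np.cos(azimuths)
    y = radius * np.cos(elevation) * np.sin(azimuths)
    z = radius * np.sin(elevation) * np.ones(n_views)
    return np.stack([x, y, z], axis=-1)          # (N, 3), all looking at the origin
```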
Method            PSNR↑   SSIM↑   LPIPS↓   #Points↑
Realfusion [43]   15.26   0.722   0.283    4010
Zero123 [40]      18.93   0.779   0.166      95
Ours              20.05   0.798   0.146    1123

Table 1. The quantitative comparison in novel view synthesis. We report PSNR, SSIM [76], LPIPS [90] and reconstructed point number by COLMAP [57] on the GSO [16] dataset.

4.2. Experiment protocol

Evaluation dataset. Following [39, 40], we adopt the Google Scanned Object [16] dataset as the evaluation dataset.

Input View Ours Zero123 [40] Magic123 [50] One-2-3-45 [39] Point-E [46] Shap-E [30]
Figure 6. Qualitative comparison of surface reconstruction from single view images with different methods.

To demonstrate the generalization ability to arbitrary objects, we randomly chose 30 objects ranging from daily objects to animals. For each object, we render an image with a size of 256×256 as the input view. We additionally evaluate some images collected from the Internet and the Wiki of Genshin Impact.

Method            Chamfer Dist.↓   Volume IoU↑
Realfusion [43]        0.0819        0.2741
Magic123 [50]          0.0516        0.4528
One-2-3-45 [39]        0.0629        0.4086
Point-E [46]           0.0426        0.2875
Shap-E [30]            0.0436        0.3584
Zero123 [40]           0.0339        0.5035
Ours                   0.0261        0.5421

Table 2. Quantitative comparison with baseline methods. We report Chamfer Distance and Volume IoU on the GSO [16] dataset.

Baselines. We adopt Zero123 [40], RealFusion [43], Magic123 [50], One-2-3-45 [39], Point-E [46] and Shap-E [30] as baseline methods. Given an input image of an object, Zero123 [40] is able to generate novel-view images of the same object from different viewpoints. Zero123 can also be incorporated with the SDS loss [49] for 3D reconstruction. We adopt the implementation of ThreeStudio [26] for reconstruction with Zero123, which includes many optimization strategies to achieve better reconstruction quality than the original Zero123 implementation. RealFusion [43] is based on Stable Diffusion [54] and the SDS loss for single-view reconstruction. Magic123 [50] combines Zero123 [40] with RealFusion [43] to further improve the reconstruction quality. One-2-3-45 [39] directly regresses SDFs from the output images of Zero123 and we use the official Hugging Face online demo [18] to produce the results. Point-E [46] and Shap-E [30] are 3D generative models trained on a large internal OpenAI 3D dataset, both of which are able to convert a single-view image into a point cloud or a shape encoded in an MLP. For Point-E, we convert the generated point clouds to SDFs for shape reconstruction using the official models.

Metrics. We mainly focus on two tasks, novel view synthesis (NVS) and single view 3D reconstruction (SVR). On the NVS task, we adopt the commonly used metrics, i.e., PSNR, SSIM [76] and LPIPS [90]. To further demonstrate the multiview consistency of the generated images, we also run the MVS algorithm COLMAP [57] on the generated images and report the reconstructed point number. Because MVS algorithms rely on multiview consistency to find correspondences to reconstruct 3D points, more consistent images would lead to more reconstructed points. On the SVR task, we report the commonly used Chamfer Distances (CD) and Volume IoU between ground-truth shapes and reconstructed shapes.

Input text Text to image Generated images Mesh
Figure 7. Examples of using SyncDreamer to generate 3D models from texts.

Since the shapes generated by Point-E [46] and Shap-E [30] are defined in a different canonical coordinate system, we manually align the generated shapes of these two methods to the ground-truth shapes before computing these metrics.
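For reference, the two reconstruction metrics can be computed roughly as below, with a brute-force Chamfer distance on sampled point clouds and an IoU on voxelized occupancy grids; this is a simplified sketch, not the exact evaluation protocol or alignment procedure behind the reported numbers.

```python
import numpy as np

def chamfer_distance(p1, p2):
    """Symmetric Chamfer distance between two (M, 3) and (K, 3) point sets."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)  # (M, K) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def volume_iou(occ_pred, occ_gt):
    """IoU between two boolean occupancy grids of identical shape."""
    inter = np.logical_and(occ_pred, occ_gt).sum()
    union = np.logical_or(occ_pred, occ_gt).sum()
    return inter / max(union, 1)
```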
4.3. Consistent novel-view synthesis

For this task, the quantitative results are shown in Table 1 and the qualitative results are shown in Fig. 4. By applying a NeRF model to distill the Stable Diffusion model [49, 54], RealFusion [43] shows strong multiview consistency, producing more reconstructed points, but is unable to produce visually plausible images as shown in Fig. 4. Zero123 [40] produces visually plausible images but the generated images are not multiview-consistent. Our method is able to generate images that not only are semantically consistent with the input image but also maintain multiview consistency in colors and geometry. Meanwhile, for the same input image, our method can generate different plausible instances using different random seeds as shown in Fig. 5.

4.4. Single view reconstruction

We show the quantitative results in Table 2 and the qualitative comparison in Fig. 6. Point-E [46] and Shap-E [30] tend to produce incomplete meshes. Directly distilling Zero123 [40] generates shapes that are coarsely aligned with the input image, but the reconstructed surfaces are rough and not consistent with input images in detailed parts. Magic123 [50] produces much smoother meshes but heavily relies on the estimated depth values on the input view, which may lead to incorrect results when the depth estimator is not robust. One-2-3-45 [39] reconstructs meshes from the multiview-inconsistent outputs of Zero123, which is able to capture the general geometry but also loses details. In comparison, our method achieves the best reconstruction quality with smooth surfaces and detailed geometry.

4.5. Text-to-image-to-3D

By incorporating text2image models like Stable Diffusion [54] or Imagen [55], SyncDreamer enables generating 3D models from text. Examples are shown in Fig. 7. In comparison with existing text-to-3D distillation, our method gives more flexibility because users can generate multiple images with their text2image models and select the desirable one to feed to SyncDreamer for 3D reconstruction.

4.6. Discussions

In this section, we further conduct a set of experiments to evaluate the effectiveness of our designs.

Generalization ability. To show the generalization ability, we evaluate SyncDreamer with 2D designs or hand drawings like sketches, cartoons, and traditional Chinese ink paintings, which are usually created manually by artists and exhibit differences in lighting effects and color space from real-world images. The results are shown in Fig. 8.

Input design Generated multiview-consistent images Mesh
Figure 8. Examples of using SyncDreamer to generate 3D models from 2D designs.

Despite the significant differences in lighting and shadow effects between these images and the real-world images, our algorithm is still able to perceive their reasonable 3D geometry and produce multiview-consistent images.

Performance without 3D-aware feature attention. To show how the proposed 3D-aware feature attention improves multiview consistency, we discard the 3D-aware attention module in SyncDreamer and train this model on the same training set. This actually corresponds to finetuning a Zero123 model with fixed viewpoints. As we can see in Fig. 9, such a model still cannot produce images with strong consistency, which demonstrates the necessity of the 3D-aware attention module in generating multiview-consistent images.

Initializing from Stable Diffusion instead of Zero123. An alternative strategy is to initialize our model from Stable Diffusion [54]. However, the results shown in Fig. 9 indicate that initializing from Stable Diffusion exhibits worse generalization ability than initializing from Zero123 [40]. Zero123 already enables the UNet to infer the relationship between different views, which thus reduces the difficulty in training a multiview image generator.

Training UNet. During the training of SyncDreamer, another feasible solution is to not freeze the UNet and the related layers initialized from Zero123 but to further finetune them together with the volume condition module. As shown in Fig. 9, the model without freezing these layers tends to predict the input object as a thin plate, especially when the input images are 2D hand drawings. We speculate that this phenomenon is caused by overfitting, likely due to the numerous thin-plate objects within the Objaverse dataset and the fixed viewpoints employed during our training process.
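The ablations above come down to which parameter groups receive gradients. A hedged sketch of the default configuration (freeze the Zero123 UNet and its text attention layers, train only the newly added volume-conditioning modules) is shown below; the attribute names and the optimizer choice are hypothetical stand-ins, and only the learning-rate range follows the implementation details in Sec. 4.1.

```python
import torch

def configure_trainable_parameters(model, train_unet=False):
    """Freeze the pretrained Zero123 backbone; train only the new modules.

    `model.unet`, `model.text_attention` and `model.volume_modules` are
    hypothetical attribute names for the backbone UNet, its text attention
    layers, and the SyncDreamer volume-condition / depth-attention modules.
    """
    for p in model.unet.parameters():
        p.requires_grad = train_unet          # frozen in the default setting
    for p in model.text_attention.parameters():
        p.requires_grad = train_unet
    for p in model.volume_modules.parameters():
        p.requires_grad = True                # the new modules are always trained
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Optimizer choice is illustrative; the paper anneals the lr from 5e-4 to 1e-5.
    return torch.optim.AdamW(trainable, lr=5e-4)
```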
Runtime. SyncDreamer uses about 2.7 minutes to sample 64 images (4 instances) with 200 DDIM [64] sampling steps on a 40G A100 GPU. Our runtime is slightly longer than Zero123 because we need to construct the spatial feature volume on every step.

5. Limitations and Conclusion

Limitations and future works. Though SyncDreamer shows promising performances in generating multiview-consistent images for 3D reconstruction, there are still limitations that the current framework does not fully address. First, the current SyncDreamer only generates 16 target views for an object, while reconstructing objects from such a limited number of views is associated with compromised quality. It is possible to train a SyncDreamer to generate more dense views, which would require more GPUs and larger GPU memory to train such a model. Second, the generated images are not always plausible and we may need to generate multiple instances with different seeds and select a desirable instance for 3D reconstruction.

Input SyncDreamer W/O 3D Attn Init SD Train UNet
Figure 9. Ablation studies to verify the designs of our method. “SyncDreamer” means our full model. “W/O 3D Attn” means discarding the
3D-aware attention module in SyncDreamer, which actually results in a Zero123 [40] finetuned on fixed viewpoints on the Objaverse [13]
dataset. “Init SD” means initializing the SyncDreamer noise predictor from Stable Diffusion instead of Zero123. “Train UNet” means we
train the UNet instead of freezing it.

To further increase the quality, we may need to use a larger object dataset like Objaverse-XL [12] and manually clean the dataset to exclude some uncommon shapes like complex scene representation, textureless 3D models, and point clouds. Third, the current implementation of SyncDreamer assumes a perspective image as input, but many 2D designs are drawn with orthogonal projections, which would lead to unnatural distortion of the reconstructed geometry. Applying orthogonal projection in the volume construction of SyncDreamer would alleviate this problem.

Conclusions. In this paper, we present SyncDreamer to generate multiview-consistent images from a single-view image. SyncDreamer adopts a synchronized multiview diffusion to model the joint probability distribution of multiview images, which thus improves the multiview consistency. We design a novel architecture that uses Zero123 as the backbone and a new volume condition module to model cross-view dependency. Extensive experiments demonstrate that SyncDreamer not only efficiently generates multiview images with strong consistency, but also achieves improved reconstruction quality compared to the baseline methods. Moreover, it exhibits excellent generalization to various styles of input images.

References

[1] Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation. In CVPR, 2023.
[2] Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan Zhou. Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond. arXiv preprint arXiv:2304.04968, 2023.
[3] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
[4] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022.
[5] Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3d-aware diffusion models. In ICCV, 2023.
[6] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[7] Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. In ICCV, 2023.
[8] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873, 2023.

[9] Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, and Guosheng Lin. It3d: Improved text-to-3d generation with explicit view synthesis. arXiv preprint arXiv:2308.11473, 2023.
[10] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G Schwing, and Liang-Yan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In CVPR, 2023.
[11] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. T-PAMI, 2023.
[12] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023.
[13] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
[14] Congyue Deng, Chiyu Jiang, Charles R Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov, et al. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In CVPR, 2023.
[15] Kangle Deng, Gengshan Yang, Deva Ramanan, and Jun-Yan Zhu. 3d-aware conditional image synthesis. In CVPR, 2023.
[16] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In ICRA, 2022.
[17] Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015, 2023.
[18] Hugging Face. One-2-3-45. https://huggingface.co/spaces/One-2-3-45/One-2-3-45, 2023.
[19] George Fahim, Khalid Amin, and Sameh Zarif. Single-view 3d reconstruction: A survey of deep learning methods. Computers & Graphics, 94:164–190, 2021.
[20] Kui Fu, Jiansheng Peng, Qiwen He, and Hanxiao Zhang. Single image 3d object reconstruction based on deep learning: A review. Multimedia Tools and Applications, 80:463–498, 2021.
[21] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[22] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS, 2022.
[23] Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, and Josh Susskind. Learning controllable 3d diffusion models from single-view images. arXiv preprint arXiv:2304.06700, 2023.
[24] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. In ICLR, 2021.
[25] Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion. In ICML, 2023.
[26] Yuan-Chen Guo, Ying-Tian Liu, Ruizhi Shao, Christian Laforte, Vikram Voleti, Guan Luo, Chia-Hao Chen, Zi-Xin Zou, Chen Wang, Yan-Pei Cao, and Song-Hai Zhang. threestudio: A unified framework for 3d content generation. https://github.com/threestudio-project/threestudio, 2023.
[27] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.
[28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[29] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422, 2023.
[30] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
[31] Animesh Karnewar, Niloy J Mitra, Andrea Vedaldi, and David Novotny. Holofusion: Towards photo-realistic 3d generative modeling. In ICCV, 2023.
[32] Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J Mitra. Holodiffusion: Training a 3d diffusion model using 2d images. In CVPR, 2023.
[33] Hiroharu Kato and Tatsuya Harada. Learning view priors for single-view 3d reconstruction. In CVPR, 2019.
[34] Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, and Sanja Fidler. Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. In CVPR, 2023.
[35] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. arXiv preprint arXiv:2306.05178, 2023.
[36] Jiabao Lei, Jiapeng Tang, and Kui Jia. Generative scene synthesis via incremental view inpainting using rgbd diffusion models. In CVPR, 2022.
[37] Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello, Varun Jampani, Ming-Hsuan Yang, and Jan Kautz. Self-supervised single-view 3d reconstruction via semantic consistency. In ECCV, 2020.
[38] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023.
[39] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023.
[40] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.

[41] Xinhang Liu, Shiu-hong Kao, Jiaben Chen, Yu-Wing Tai, and Chi-Keung Tang. Deceptive-nerf: Enhancing nerf reconstruction using pseudo-observations from diffusion models. arXiv preprint arXiv:2305.15171, 2023.
[42] Zhen Liu, Yao Feng, Michael J Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. Meshdiffusion: Score-based generative 3d mesh modeling. In ICLR, 2023.
[43] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360deg reconstruction of any object from a single image. In CVPR, 2023.
[44] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[45] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In CVPR, 2023.
[46] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
[47] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR, 2021.
[48] Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, and Sergey Tulyakov. Autodecoding latent 3d diffusion models. arXiv preprint arXiv:2307.05445, 2023.
[49] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
[50] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[52] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. arXiv preprint arXiv:2303.13508, 2023.
[53] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In CVPR, 2021.
[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[55] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
[56] Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, and Deqing Sun. Vq3d: Learning a 3d-aware generative model on imagenet. arXiv preprint arXiv:2302.06833, 2023.
[57] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[58] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. NeurIPS, 2020.
[59] Hoigi Seo, Hayeon Kim, Gwanghyun Kim, and Se Young Chun. Ditto-nerf: Diffusion-based iterative text to omni-directional 3d model. arXiv preprint arXiv:2304.02827, 2023.
[60] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint arXiv:2303.07937, 2023.
[61] Qiuhong Shen, Xingyi Yang, and Xinchao Wang. Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:2304.10261, 2023.
[62] Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3d generation on imagenet. arXiv preprint arXiv:2303.01416, 2023.
[63] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[64] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[65] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion: (0-)image-conditioned 3d generative models from 2d data. arXiv preprint arXiv:2306.07881, 2023.
[66] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In ICCV, 2023.
[67] Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097, 2023.
[68] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In CVPR, 2019.
[69] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. State of the art on neural rendering. In Computer Graphics Forum, 2020.
[70] Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B Tenenbaum, Frédo Durand, William T Freeman, and Vincent Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. arXiv preprint arXiv:2306.11719, 2023.

[71] Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. Textmesh: Generation of realistic 3d meshes from text prompts. arXiv preprint arXiv:2304.12439, 2023.
[72] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In CVPR, 2023.
[73] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023.
[74] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
[75] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR, 2023.
[76] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
[77] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
[78] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022.
[79] Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, and Errui Ding. Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. arXiv preprint arXiv:2307.16183, 2023.
[80] Jianfeng Xiang, Jiaolong Yang, Binbin Huang, and Xin Tong. 3d-aware image generation using 2d diffusion models. arXiv preprint arXiv:2303.17905, 2023.
[81] Yiheng Xie, Towaki Takikawa, Shunsuke Saito, Or Litany, Shiqin Yan, Numair Khan, Federico Tombari, James Tompkin, Vincent Sitzmann, and Srinath Sridhar. Neural fields in visual computing and beyond. In Computer Graphics Forum, 2022.
[82] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360 views. arXiv e-prints, pages arXiv–2211, 2022.
[83] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[84] Paul Yoo, Jiaxian Guo, Yutaka Matsuo, and Shixiang Shane Gu. Dreamsparse: Escaping from plato's cave with 2d frozen diffusion model given sparse views. CoRR, 2023.
[85] Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, and Fan Wang. Points-to-3d: Bridging the gap between sparse points and shape-controllable text-to-3d generation. arXiv preprint arXiv:2307.13908, 2023.
[86] Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, and Marcus A. Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. In ICCV, 2023.
[87] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. In NeurIPS, 2022.
[88] Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. In SIGGRAPH, 2023.
[89] Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. Text2nerf: Text-driven 3d scene generation with neural radiance fields. arXiv preprint arXiv:2305.11588, 2023.
[90] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[91] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.
[92] Joseph Zhu and Peiye Zhuang. Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint arXiv:2305.18766, 2023.

