ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models (CVPR 2024)
Lukas Höllein1,2  Aljaž Božič2  Norman Müller2  David Novotny2  Hung-Yu Tseng2
Christian Richardt2  Michael Zollhöfer2  Matthias Nießner1
1 Technical University of Munich   2 Meta
https://lukashoel.github.io/ViewDiff/
(Teaser figure: generated multi-view results for the text prompt "a stuffed bear sitting on a wooden box".)
3D assets generated by these methods exhibit compelling diversity. However, their visual quality is not consistently as high as that of the images generated by T2I models. A key step to obtaining 3D assets is the ability to generate consistent multi-view images of the desired objects and their surroundings. These images can then be fitted to 3D representations like NeRF [25] or NeuS [45]. HoloDiffusion [19] and ViewsetDiffusion [38] train a diffusion model from scratch using multi-view images and output 3D-consistent images. GeNVS [4] and DFM [41] additionally produce object surroundings, thereby increasing the realism of the generation. These methods ensure (photo-)realistic results by training on real-world 3D datasets [31, 56]. However, these datasets are orders of magnitude smaller than the 2D dataset used to train T2I diffusion models. As a result, these approaches produce realistic but less diverse 3D assets. Alternatively, recent works like Zero-1-to-3 [23] and One-2-3-45 [22] leverage a pretrained T2I model and fine-tune it for 3D consistency. These methods preserve the diversity of generated results by training on a large synthetic 3D dataset [7]. Nonetheless, the produced objects can be less photo-realistic and lack surroundings.

In this paper, we propose a method that leverages the 2D priors of pretrained T2I diffusion models to produce photo-realistic and 3D-consistent renderings of 3D assets. As shown in the first two rows of Fig. 1, the input is a text description or an image of an object, along with the camera poses of the desired rendered images. The proposed approach produces multiple images of the same object in a single forward pass. Moreover, we design an autoregressive generation scheme that allows us to render more images at any novel viewpoint (Fig. 1, third row). Concretely, we introduce projection and cross-frame-attention layers that are strategically placed into the existing U-Net architecture to encode explicit 3D knowledge about the generated object (see Fig. 2). By doing so, our approach paves the way to fine-tune T2I models on real-world 3D datasets, such as CO3D [31], while benefiting from the large 2D prior encoded in the pretrained weights. Our generated images are consistent, diverse, and realistic renderings of objects.

To summarize, our contributions are:
• a method that utilizes the pretrained 2D prior of text-to-image models and turns them into 3D-consistent image generators. We train our approach on real-world multi-view datasets, allowing us to produce realistic and high-quality images of objects and their surroundings (Sec. 3.1).
• a novel U-Net architecture that combines commonly used 2D layers with 3D-aware layers. Our projection and cross-frame-attention layers encode explicit 3D knowledge into each block of the U-Net architecture (Sec. 3.2).
• an autoregressive generation scheme that renders images of a 3D object from any desired viewpoint directly with our model in a 3D-consistent way (Sec. 3.3).

2. Related Work

Text-To-2D. Denoising diffusion probabilistic models (DDPMs) [14] model a data distribution by learning to invert a Gaussian noising process with a deep network. Recently, DDPMs were shown to be superior to generative adversarial networks [8], becoming the state-of-the-art framework for image generation. Soon after, large text-conditioned models trained on billion-scale data were proposed in Imagen [33] and DALL-E 2 [29]. While [8] achieved conditional generation via guidance with a classifier, [13] proposed classifier-free guidance. ControlNet [53] proposed a way to tune the diffusion outputs by conditioning on various modalities, such as image segmentation or normal maps. Similar to ControlNet, our method builds on the strong 2D prior of a pretrained text-to-image (T2I) model. We further demonstrate how to adjust this prior to generate 3D-consistent images of objects.

Text-To-3D. 2D DDPMs were applied to the generation of 3D shapes [28, 34, 39, 42, 44, 50, 58] or scenes [10, 15, 40] from text descriptions. DreamFusion [27] proposed score distillation sampling (SDS), which optimizes a 3D shape whose renders match the belief of the DDPM. Improved sample quality was achieved by a second-stage mesh optimization [5, 21] and smoother SDS convergence [35, 47]. Several methods use 3D data to train a novel-view synthesis model whose multi-view samples can later be converted to 3D, e.g., by conditioning a 2D DDPM on an image and a relative camera motion to generate novel views [23, 48]. However, since geometry is not modelled explicitly, the outputs are view-inconsistent. Consistency can be improved with epipolar attention [43, 57] or by optimizing a 3D shape from multi-view proposals [22]. Our work fine-tunes a 2D T2I model to generate renders of a 3D object; however, we propose explicit 3D unprojection and rendering operators to improve view-consistency. Concurrently, SyncDreamer [24] also adds 3D layers to its 2D DDPM. We differ by training on real data with backgrounds and by showing that autoregressive generation is sufficient to generate consistent images, making the second 3D reconstruction stage expendable.

Diffusion on 3D Representations. Several works model the distribution of 3D representations. While DiffRF [26] leverages ground-truth 3D shapes, HoloDiffusion [19] is supervised only with 2D images. HoloFusion [18] extends this work with a 2D diffusion render post-processor. Images can also be denoised by rendering a reconstructed 3D shape [1, 38]. Unfortunately, the limited scale of existing 3D datasets prevents these 3D diffusion models from extrapolating beyond the training distribution. Instead, we exploit a large 2D pretrained DDPM and add 3D components that are tuned on smaller-scale multi-view data. This leads to improved multi-view consistency while maintaining the expressivity brought by pretraining on billion-scale image data.
Figure 2. Method Overview. We augment the U-Net architecture of pretrained text-to-image models with new layers in every U-Net
block. These layers facilitate communication between multi-view images in a batch, resulting in a denoising process that jointly produces
3D-consistent images. First, we replace self-attention with cross-frame-attention (yellow) which compares the spatial features of all views.
We condition all attention layers on pose (RT ), intrinsics (K), and intensity (I) of each image. Second, we add a projection layer (green)
into the inner blocks of the U-Net. It creates a 3D representation from multi-view features and renders them into 3D-consistent features. We
fine-tune the U-Net using the diffusion denoising objective (Eq. 3) at timestep t, supervised from captioned multi-view images.
noise. It is therefore important to keep 2D layers that act separately on each image, which we achieve by fine-tuning the existing ResNet [11] and ViT [9] blocks. We summarize our architecture in Fig. 2. In the following, we discuss our two proposed layers in more detail.

Cross-Frame Attention. Inspired by video diffusion [49, 51], we add cross-frame-attention layers into the U-Net architecture. Concretely, we modify the existing self-attention layers to calculate

CFAttn(Q, K, V) = softmax(QK^T / √d) V  with
Q = W^Q h_i,   K = W^K [h_j]_{j≠i},   V = W^V [h_j]_{j≠i},   (4)

where W^Q, W^K, W^V are the pretrained weights for feature projection, and h_i ∈ R^{C×H×W} is the input spatial feature of each image i ∈ [1, N]. Intuitively, this matches features across all frames, which allows generating the same global style.
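To make Eq. (4) concrete, the following is a minimal PyTorch sketch of cross-frame attention: queries come from frame i, while keys and values are gathered from all other frames. The module name, the flattening of spatial positions into tokens, and the single-head formulation (d equals the feature dimension here) are our assumptions for illustration; the linear layers stand in for the pretrained W^Q, W^K, W^V.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameAttention(nn.Module):
    """Sketch of Eq. (4): queries from frame i, keys/values from the other frames."""

    def __init__(self, dim: int):
        super().__init__()
        # stand-ins for the pretrained projection weights W^Q, W^K, W^V
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, C, H, W) spatial features of the N images denoised jointly
        n, c, height, width = h.shape
        tokens = h.flatten(2).transpose(1, 2)            # (N, H*W, C)

        out = []
        for i in range(n):
            h_i = tokens[i : i + 1]                      # (1, H*W, C)
            others = torch.cat(
                [tokens[j : j + 1] for j in range(n) if j != i], dim=1
            )                                            # (1, (N-1)*H*W, C)
            q = self.to_q(h_i)
            k = self.to_k(others)
            v = self.to_v(others)
            attn = F.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)  # softmax(QK^T / sqrt(d))
            out.append(attn @ v)                         # (1, H*W, C)

        out = torch.cat(out, dim=0)                      # (N, H*W, C)
        return out.transpose(1, 2).reshape(n, c, height, width)
```

In the actual model this replaces the existing self-attention of each U-Net block, so the pretrained projection weights are reused rather than re-initialized.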
Additionally, we add a conditioning vector to all cross-frame and cross-attention layers to inform the network about the viewpoint of each image. First, we add pose information by encoding each image's camera matrix p ∈ R^{4×4} into an embedding z_1 ∈ R^4, similar to Zero-1-to-3 [23]. Additionally, we concatenate the focal length and principal point of each camera into an embedding z_2 ∈ R^4. Finally, we provide an intensity encoding z_3 ∈ R^2, which stores the mean and variance of the image RGB values. At training time, we set z_3 to the true values of each input image, and at test time, we set z_3 = [0.5, 0] for all images. This helps to reduce the view-dependent lighting differences contained in the dataset (e.g., due to different camera exposure). We construct the conditioning vector as z = [z_1, z_2, z_3] and add it through a LoRA-linear-layer [16] W'^Q to the feature projection matrix Q. Concretely, we compute the projected features as

Q = W^Q h_i + s · W'^Q [h_i; z],   (5)

where we set s = 1. Similarly, we add the condition via W'^K to K, and via W'^V to V.
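A minimal sketch of how the conditioning of Eq. (5) could be wired up is shown below. The embedding sizes follow the text (z_1 ∈ R^4, z_2 ∈ R^4, z_3 ∈ R^2), but the LoRA rank, how the 4×4 pose and the intrinsics are actually flattened into z_1 and z_2, and all module names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

def intensity_encoding(image: torch.Tensor, training: bool) -> torch.Tensor:
    """z3 in R^2: mean and variance of the RGB values (true values at train time,
    fixed [0.5, 0] at test time, as described in the text)."""
    if training:
        return torch.stack([image.mean(), image.var()])
    return torch.tensor([0.5, 0.0])

class LoRAConditionedProjection(nn.Module):
    """Sketch of Eq. (5): Q = W^Q h + s * W'^Q [h; z], where W'^Q is a low-rank
    (LoRA) layer applied to the feature concatenated with the conditioning z."""

    def __init__(self, dim: int, cond_dim: int = 10, rank: int = 4, s: float = 1.0):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)       # stand-in for the pretrained W^Q
        self.lora_down = nn.Linear(dim + cond_dim, rank, bias=False)
        self.lora_up = nn.Linear(rank, dim, bias=False)
        self.s = s

    def forward(self, h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # h: (tokens, dim) features of one image, z: (cond_dim,) = [z1, z2, z3]
        z_tok = z.unsqueeze(0).expand(h.shape[0], -1)    # broadcast z to every token
        lora = self.lora_up(self.lora_down(torch.cat([h, z_tok], dim=-1)))
        return self.w_q(h) + self.s * lora
```

The same construction is applied with W'^K to K and W'^V to V, with z concatenating the pose, intrinsics, and intensity encodings.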
Projection Layer. Cross-frame-attention layers are helpful to produce globally 3D-consistent images. However, the objects do not precisely follow the specified poses, which leads to view-inconsistencies (see Fig. 5 and Tab. 3). To this end, we add a projection layer into the U-Net architecture (Fig. 3). The idea of this layer is to create 3D-consistent features that are then further processed by the next U-Net layers (e.g., ResNet blocks). By repeating this layer across all stages of the U-Net, we ensure that the per-image features are in a 3D-consistent space. We do not add the projection layer to the first and last U-Net blocks, as we saw no benefit from them at these locations. We reason that the network processes image-specific information at those stages and thus does not need a 3D-consistent feature space.

Figure 3. Architecture of the projection layer. We produce 3D-consistent output features from posed input features. First, we unproject the compressed image features into 3D and aggregate them into a joint voxel grid with an MLP. Then we refine the voxel grid with a 3D CNN. A volume renderer similar to NeRF [25] renders 3D-consistent features from the grid. Finally, we apply a learned scale function and expand the feature dimension.

Inspired by multi-view stereo literature [3, 17, 37], we create a 3D feature voxel grid from all input spatial features h^{0:N}_in ∈ R^{C×H×W} by projecting each voxel into each image plane. First, we compress h^{0:N}_in with a 1×1 convolution to a reduced feature dimension C' = 16. We then take the bilinearly interpolated feature at the image-plane location and place it into the voxel. This way, we create a separate voxel grid per view, and merge them into a single grid through an aggregator MLP. Inspired by IBRNet [46], the MLP predicts per-view weights followed by a weighted feature average. We then run a small 3D CNN on the voxel grid to refine the 3D feature space. Afterwards, we render the voxel grid into output features h^{0:N}_out ∈ R^{C'×H×W} with volumetric rendering similar to NeRF [25]. We dedicate half of the voxel grid to foreground and half to background and apply the background model from MERF [30] during ray-marching.

We found it necessary to add a scale function after the volume rendering output. The volume renderer typically uses a sigmoid activation function as the final layer during ray-marching [25]. However, the input features are defined in an arbitrary floating-point range. To convert h^{0:N}_out back into the same range, we non-linearly scale the features with 1×1 convolutions and ReLU activations. Finally, we expand h^{0:N}_out to the input feature dimension C. We refer to the supplemental material for details about each component's architecture.
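To illustrate the unprojection step, here is a simplified sketch of how compressed per-view features could be splatted into a shared voxel grid and fused across views. The grid parameterization, the camera convention (world-to-camera poses, pinhole intrinsics scaled to the feature map), the plain mean used in place of the learned aggregator MLP, and the function name are all assumptions for illustration; the 3D CNN refinement, volume rendering, and scale function are omitted.

```python
import torch
import torch.nn.functional as F

def unproject_to_voxel_grid(feats, poses, intrinsics, grid_points):
    """Splat per-view image features into a shared voxel grid (simplified sketch).

    feats:       (N, C', H, W)  compressed per-view features
    poses:       (N, 4, 4)      world-to-camera matrices (assumed convention)
    intrinsics:  (N, 3, 3)      pinhole intrinsics scaled to the feature map
    grid_points: (V, 3)         voxel center positions in world space
    returns:     (V, C')        fused voxel features
    """
    n, c, h, w = feats.shape
    v = grid_points.shape[0]
    homog = torch.cat([grid_points, torch.ones(v, 1, device=grid_points.device)], dim=-1)

    per_view = []
    for i in range(n):
        cam = (poses[i] @ homog.T)[:3]                       # (3, V) points in camera space
        pix = intrinsics[i] @ cam                            # (3, V)
        uv = pix[:2] / pix[2].clamp(min=1e-6)                # (2, V) feature-map coordinates
        # normalize to [-1, 1] for bilinear sampling with grid_sample
        grid = torch.stack([uv[0] / (w - 1), uv[1] / (h - 1)], dim=-1) * 2 - 1
        sampled = F.grid_sample(
            feats[i : i + 1], grid.view(1, 1, v, 2),
            mode="bilinear", align_corners=True,
        )                                                    # (1, C', 1, V)
        per_view.append(sampled.view(c, v).T)                # (V, C')

    # the paper fuses views with an aggregator MLP that predicts per-view weights;
    # a plain mean over views is used here as a stand-in
    return torch.stack(per_view, dim=0).mean(dim=0)
```

In the full layer, the fused grid is refined with a small 3D CNN and rendered back into per-view features with NeRF-style volume rendering, followed by the learned scale function described above.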
3.3. Autoregressive Generation

Our method takes as input multiple samples x^{0:N}_t at once and denoises them 3D-consistently. During training, we set N = 5, but can increase it at inference time up to memory constraints, e.g., N = 30. However, we want to render an object from any possible viewpoint directly with our network. To this end, we propose an autoregressive image generation scheme, i.e., we condition the generation of next viewpoints on previously generated images. We provide the timesteps t^{0:N} of each image as input to the U-Net. By varying t^{0:N}, we can achieve different types of conditioning.

Unconditional Generation. All samples are initialized to Gaussian noise and are denoised jointly. The timesteps t^{0:N} are kept identical for all samples throughout the reverse process. We provide different cameras per image and a single text prompt. The generated images are 3D-consistent, showing the object from the desired viewpoints (Figs. 4 and 5).
Image-Conditional Generation. We divide the total number of samples N = n_c + n_g into a conditional part n_c and a generative part n_g. The first n_c samples correspond to images and cameras that are provided as input. The other n_g samples should generate novel views that are similar to the conditioning images. We start the generation from Gaussian noise for the n_g samples and provide the un-noised images for the other samples. Similarly, we set t^{0:n_c} = 0 for all denoising steps, while gradually decreasing t^{n_g:N}.

When n_c = 1, our method performs single-image reconstruction (Fig. 6). Setting n_c > 1 allows us to autoregressively generate novel views from previous images (Fig. 1, bottom). In practice, we first generate one batch of images unconditionally and then condition the next batches on a subset of previous images. This allows us to render smooth trajectories around 3D objects (see the supplemental material).
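The per-image timesteps make both modes easy to express. Below is a minimal sketch of how the timestep vector t^{0:N} could be constructed at each step of the reverse process: for unconditional generation all images share the sampler's current timestep, while for image-conditional (autoregressive) generation the first n_c images are clean conditioning views that stay at t = 0. The function name and the exact sampler interface are illustrative assumptions.

```python
import torch

def timesteps_for_step(step_t: int, num_images: int, num_cond: int = 0) -> torch.Tensor:
    """Build the per-image timestep vector t^{0:N} for one denoising step.

    step_t:     current timestep of the reverse process (from the sampler schedule)
    num_images: total number of images N denoised jointly
    num_cond:   n_c leading images provided as clean conditioning views
    """
    t = torch.full((num_images,), step_t, dtype=torch.long)
    t[:num_cond] = 0          # conditioning images are un-noised and keep t = 0
    return t

# Unconditional generation: all N images start from Gaussian noise and share the timestep.
print(timesteps_for_step(step_t=999, num_images=5))              # tensor([999, 999, 999, 999, 999])

# Image-conditional generation (n_c = 2): the first two samples are given images,
# only the remaining n_g = 3 samples are gradually denoised.
print(timesteps_for_step(step_t=999, num_images=5, num_cond=2))  # tensor([  0,   0, 999, 999, 999])
```

Autoregressive rendering then amounts to running the first batch with num_cond = 0 and subsequent batches with num_cond > 1, feeding back a subset of previously generated images.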
3.4. Implementation Details

Dataset. We train our method on the large-scale CO3Dv2 [31] dataset, which consists of posed multi-view images of real-world objects. Concretely, we choose the categories Teddybear, Hydrant, Apple, and Donut. Per category, we train on 500–1000 objects with 200 images each at resolution 256×256. We generate text captions with the BLIP-2 model [20] and sample one of 5 proposals per object.

Training. We base our model on a pretrained latent-diffusion text-to-image model. We only fine-tune the U-Net and keep the VAE encoder and decoder frozen. In each iteration, we select N = 5 images and their poses. We sample one denoising timestep t ∼ [0, 1000], add noise to the images according to Eq. 2, and compute the loss according to Eq. 3. In the projection layers, we skip the last image when building the voxel grid, which enforces learning a 3D representation that can be rendered from novel views. We train our method by varying between unconditional and image-conditional generation (Sec. 3.3). Concretely, with probabilities p_1 = 0.25 and p_2 = 0.25 we provide the first and/or second image as input and set the respective timestep to zero. Similar to Ruiz et al. [32], we create a prior dataset and use it during training to maintain the 2D prior (see supplement for details). We fine-tune the model on 2× A100 GPUs for 60K iterations (7 days) with a total batch size of 64. We set the learning rate for the volume renderer to 0.005 and for all other layers to 5×10^-5, and use the AdamW optimizer [33]. During inference, we can increase N and generate up to 30 images/batch on an RTX 3090 GPU. We use the UniPC [55] sampler with 10 denoising steps, which takes 15 seconds.
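A sketch of the conditioning logic for one training iteration is given below, under the assumption that Eq. 2 and Eq. 3 denote the standard DDPM forward process and noise-prediction loss. The model signature, the noise-schedule tensor, and the choice to exclude conditioning views from the loss are our assumptions for illustration; the projection layer's trick of skipping the last image when building the voxel grid happens inside the model and is not shown.

```python
import torch
import torch.nn.functional as F

def training_step(model, images, poses, alphas_cumprod, p1: float = 0.25, p2: float = 0.25):
    """One training iteration (sketch): joint denoising of N = 5 posed images.

    images: (N, C, H, W) clean multi-view images of one object
    poses:  per-image camera information passed to the model
    alphas_cumprod: (1000,) cumulative noise schedule of the pretrained DDPM
    """
    n = images.shape[0]
    t = torch.randint(0, 1000, (1,)).expand(n).clone()     # one shared timestep t ~ U[0, 1000)

    # with probability p1 / p2, provide the first and/or second image as clean
    # conditioning input by setting its timestep to zero (image-conditional training)
    if torch.rand(()) < p1:
        t[0] = 0
    if torch.rand(()) < p2:
        t[1] = 0

    noise = torch.randn_like(images)
    a = alphas_cumprod[t].view(n, 1, 1, 1)
    noisy = a.sqrt() * images + (1 - a).sqrt() * noise      # standard DDPM forward process
    noisy[t == 0] = images[t == 0]                           # conditioning images stay clean

    pred = model(noisy, t, poses)                            # U-Net with the new 3D-aware layers
    # supervise only the generated views (whether conditioning views contribute
    # to the loss is an assumption of this sketch)
    target_mask = (t != 0).view(n, 1, 1, 1).float()
    return (F.mse_loss(pred, noise, reduction="none") * target_mask).mean()
```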
4. Results

Baselines. We compare against recent state-of-the-art works for 3D generative modeling. Our goal is to create multi-view consistent images from real-world, realistic objects with authentic surroundings. Therefore, we consider methods that are trained on real-world datasets and select HoloFusion (HF) [18], ViewsetDiffusion (VD) [38], and DFM [41]. We show results on two tasks: unconditional generation (Sec. 4.1) and single-image reconstruction (Sec. 4.2).

Metrics. We report FID [12] and KID [2] as common metrics for 2D/3D generation and measure the multi-view consistency of generated images with peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and LPIPS [54]. To ensure comparability, we evaluate all metrics on images without backgrounds, as not every baseline models them.

Table 1. Quantitative comparison of unconditional image generation. We report average FID [12] and KID [2] per category and improve by a significant margin. This signals that our images are more similar to the distribution of real images in the dataset. We mask away the background for our method and the real images to ensure comparability of numbers with the baselines.

Category     HF [18]            VD [38]            Ours
             FID↓    KID↓       FID↓    KID↓       FID↓    KID↓
Teddybear    81.93   0.072      201.71  0.169      49.39   0.036
Hydrant      61.19   0.042      138.45  0.118      46.45   0.033
Donut        105.97  0.091      199.14  0.136      68.86   0.054
Apple        62.19   0.056      183.67  0.149      56.85   0.043

4.1. Unconditional Generation

Our method can be used to generate 3D-consistent views of an object from any pose with only text as input by using our autoregressive generation (Sec. 3.3). Concretely, we sample an (unobserved) image caption from the test set for the first batch and generate N = 10 images with a guidance scale [13] of λ_cfg = 7.5. We then set λ_cfg = 0 for subsequent batches, and create a total of 100 images per object.

We evaluate against HoloFusion (HF) [18] and ViewsetDiffusion (VD) [38]. We report quantitative results in Tab. 1 and qualitative results in Figs. 4 and 5. HF [18] creates diverse images that sometimes show view-dependent floating artifacts (see Fig. 5). VD [38] creates consistent but blurry images.
In contrast, our method produces images with backgrounds and higher-resolution object details.

Figure 4. Unconditional image generation of our method and baselines. We show renderings from different viewpoints for multiple objects and categories. Our method produces consistent objects and backgrounds. Our textures are sharper in comparison to baselines. Please see the supplemental material for more examples and animations.

4.2. Single-Image Reconstruction

Our method can be conditioned on multiple images in order to render any novel view in an autoregressive fashion (Sec. 3.3). To measure the 3D-consistency of our generated images, we compare single-image reconstruction against ViewsetDiffusion (VD) [38] and DFM [41]. Concretely, we sample one image from the dataset and generate 20 images at novel views also sampled from the dataset. We follow Szymanowicz et al. [38] and report the per-view maximum PSNR/SSIM and average LPIPS across multiple objects and viewpoints for all methods. We report quantitative results in Tab. 2 and show qualitative results in Fig. 6.
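Since all reconstruction metrics are evaluated on images without backgrounds, the comparison reduces to masked per-view errors. The sketch below shows one way such a background-masked PSNR could be computed; the source of the foreground mask and this exact formulation are our assumptions rather than the paper's evaluation code.

```python
import torch

def masked_psnr(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """PSNR restricted to foreground pixels.

    pred, target: (C, H, W) images with values in [0, 1]
    mask:         (1, H, W) binary foreground mask (backgrounds excluded,
                  matching the masked evaluation described in the text)
    """
    mse = ((pred - target) ** 2 * mask).sum() / (mask.sum() * pred.shape[0] + 1e-8)
    return -10.0 * torch.log10(mse + 1e-8)
```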
Figure 5. Multi-view consistency of unconditional image generation. HoloFusion (HF) [18] has view-dependent floating artifacts (the
base in first row). ViewsetDiffusion (VD) [38] has blurrier renderings (second row). Without the projection layer, our method has no precise
control over viewpoints (third row). Without cross-frame-attention, our method suffers from identity changes of the object (fourth row). Our
full method produces detailed images that are 3D-consistent (fifth row).
Table 2. Quantitative comparison of single-image reconstruction. Given a single image as input, we measure the quality of novel views
through average PSNR, SSIM, and LPIPS [54] per category. We mask away the generated backgrounds to ensure comparability across all
methods. We improve over VD [38] while being on-par with DFM [41].
VD [38] creates plausible results without backgrounds. DFM [41] creates consistent results with backgrounds at a lower image resolution (128×128). Our method produces higher-resolution images with similar reconstruction results and backgrounds.

4.3. Ablations

The key ingredients of our method are the cross-frame-attention and projection layers that we add to the U-Net (Sec. 3.2). We highlight their importance in Tab. 3 and Fig. 5.

How important are the projection layers? They are necessary to allow precise control over the image viewpoints (e.g., Fig. 5 row 3 does not follow the specified rotation). Our goal is to generate a consistent set of images from any viewpoint directly with our model (Sec. 3.3). Being able to control the pose of the object is therefore an essential part of our contribution. The projection layers build up a 3D representation of the object that is explicitly rendered into 3D-consistent features through volume rendering. This allows us to achieve viewpoint consistency, as also demonstrated through single-image reconstruction (Tab. 3).

How important are cross-frame-attention layers? They are necessary to create images of the same object. Without them, the teddybear in Fig. 5 (row 4) has the same general color scheme and follows the specified poses. However, differences in shape and texture lead to an inconsistent set of images.
Table 3. Quantitative comparison of our method and ablations. We report average PSNR, SSIM, LPIPS [54], FID [12], and KID [2] over the Teddybear and Hydrant categories. We compare against dropping the projection layer ("no proj") and cross-frame-attention ("no cfa") from the U-Net (see Sec. 3.2). While the ablated models still produce high-quality images with similar FID/KID, this shows that our proposed layers are necessary to obtain 3D-consistent images.
Figure 7. Diversity of generated results. We condition our method on text input, which allows us to create objects in a desired style. We show samples for hand-crafted text descriptions that combine attributes (e.g., color, shape, background) in a novel way. Each row shows a different generation proposal from our method, and we denote the object category (Teddybear, Hydrant) as [C]. This showcases the diversity of generated results, i.e., multiple different objects are generated for the same description.

4.4. Limitations

5. Conclusion

We presented ViewDiff, a method that, given text or image input, generates 3D-consistent images of real-world objects placed in authentic surroundings. Our method leverages the expressivity of large 2D text-to-image models and fine-tunes this 2D prior on real-world 3D datasets to produce diverse multi-view images in a joint denoising process. The core insights of our work are two novel layers, namely cross-frame-attention and the projection layer (Sec. 3.2). Our autoregressive generation scheme (Sec. 3.3) allows us to directly render high-quality novel views of a generated 3D object.
References

[1] Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. RenderDiffusion: Image diffusion for 3D reconstruction, inpainting and generation. In CVPR, 2023.
[2] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
[3] Aljaž Božič, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. TransformerFusion: Monocular RGB scene reconstruction using transformers. In NeurIPS, 2021.
[4] Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3D-aware diffusion models. arXiv:2304.02602, 2023.
[5] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In ICCV, 2023.
[6] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
[7] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
[8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Ayaan Haque, Matthew Tancik, Alexei Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In ICCV, 2023.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[13] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021.
[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[15] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. In ICCV, 2023.
[16] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
[17] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models. JMLR, 24(251):1–43, 2023.
[18] Animesh Karnewar, Niloy J. Mitra, Andrea Vedaldi, and David Novotny. HoloFusion: Towards photo-realistic 3D generative modeling. In ICCV, 2023.
[19] Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J. Mitra. HoloDiffusion: Training a 3D diffusion model using 2D images. In CVPR, 2023.
[20] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
[21] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In CVPR, 2023.
[22] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, Hao Su, et al. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. arXiv:2306.16928, 2023.
[23] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In ICCV, 2023.
[24] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. In ICLR, 2024.
[25] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[26] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner. DiffRF: Rendering-guided 3D radiance field diffusion. In CVPR, 2023.
[27] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
[28] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. arXiv:2306.17843, 2023.
[29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[30] Christian Reiser, Rick Szeliski, Dor Verbin, Pratul Srinivasan, Ben Mildenhall, Andreas Geiger, Jon Barron, and Peter Hedman. MERF: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023.
[31] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
[32] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
[33] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
[34] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2D diffusion model know 3D-consistency for robust text-to-3D generation. arXiv:2303.07937, 2023.
[35] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv:2308.16512, 2023.
[36] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[37] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. In CVPR, 2021.
[38] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. ViewSet Diffusion: (0-)image-conditioned 3D generative models from 2D data. In ICCV, 2023.
[39] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-It-3D: High-fidelity 3D creation from a single image with diffusion prior. In ICCV, 2023.
[40] Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. MVDiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. In NeurIPS, 2023.
[41] Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, and Vincent Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. In NeurIPS, 2023.
[42] Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. TextMesh: Generation of realistic 3D meshes from text prompts. In 3DV, 2024.
[43] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In CVPR, 2023.
[44] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In CVPR, 2023.
[45] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
[46] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021.
[47] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. arXiv:2305.16213, 2023.
[48] Daniel Watson, William Chan, Ricardo Martin Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. In ICLR, 2023.
[49] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023.
[50] Jianfeng Xiang, Jiaolong Yang, Binbin Huang, and Xin Tong. 3D-aware image generation using 2D diffusion models. In ICCV, 2023.
[51] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender A Video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia, 2023.
[52] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023.
[53] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[54] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[55] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. In NeurIPS, 2023.
[56] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics, 37(4):65:1–12, 2018.
[57] Zhizhuo Zhou and Shubham Tulsiani. SparseFusion: Distilling view-conditioned diffusion for 3D reconstruction. In CVPR, 2023.
[58] Joseph Zhu and Peiye Zhuang. HiFA: High-fidelity text-to-3D with advanced diffusion guidance. arXiv:2305.18766, 2023.