

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Lukas Höllein¹,²  Aljaž Božič²  Norman Müller²  David Novotny²  Hung-Yu Tseng²
Christian Richardt²  Michael Zollhöfer²  Matthias Nießner¹
¹Technical University of Munich   ²Meta
https://lukashoel.github.io/ViewDiff/

[Figure 1: Input (e.g., the text prompt "a stuffed bear sitting on a wooden box" or posed images) alongside multi-view generated images.]

Figure 1. Multi-view consistent image generation. Our method takes as input a text description, or any number of posed input images, and generates high-quality, multi-view consistent images of a real-world 3D object in authentic surroundings from any desired camera poses.

Abstract

3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior, and learn to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods, the results generated by our method are consistent and have favorable visual quality (−30% FID, −37% KID).

1. Introduction

In recent years, text-to-image (T2I) diffusion models [29, 33] have emerged as cutting-edge technologies, revolutionizing high-quality and imaginative 2D content creation guided by text descriptions. These frameworks have found widespread applications, including extensions such as ControlNet [53] and DreamBooth [32], showcasing their versatility and potential. An intriguing direction in this domain is to use T2I models as powerful 2D priors for generating three-dimensional (3D) assets. How can we effectively use these models to create photo-realistic and diverse 3D assets?

Existing methods like DreamFusion [27], Fantasia3D [5], and ProlificDreamer [47] have demonstrated exciting results by optimizing a 3D representation through score distillation sampling [27] from pretrained T2I diffusion models.
The 3D assets generated by these methods exhibit compelling diversity. However, their visual quality is not consistently as high as that of the images generated by T2I models. A key step to obtaining 3D assets is the ability to generate consistent multi-view images of the desired objects and their surroundings. These images can then be fitted to 3D representations like NeRF [25] or NeuS [45]. HoloDiffusion [19] and ViewsetDiffusion [38] train a diffusion model from scratch using multi-view images and output 3D-consistent images. GeNVS [4] and DFM [41] additionally produce object surroundings, thereby increasing the realism of the generation. These methods ensure (photo)-realistic results by training on real-world 3D datasets [31, 56]. However, these datasets are orders of magnitude smaller than the 2D dataset used to train T2I diffusion models. As a result, these approaches produce realistic but less-diverse 3D assets. Alternatively, recent works like Zero-1-to-3 [23] and One-2-3-45 [22] leverage a pretrained T2I model and fine-tune it for 3D consistency. These methods successfully preserve the diversity of generated results by training on a large synthetic 3D dataset [7]. Nonetheless, the produced objects can be less photo-realistic and are without surroundings.

In this paper, we propose a method that leverages the 2D priors of pretrained T2I diffusion models to produce photo-realistic and 3D-consistent 3D asset renderings. As shown in the first two rows of Fig. 1, the input is a text description or an image of an object, along with the camera poses of the desired rendered images. The proposed approach produces multiple images of the same object in a single forward pass. Moreover, we design an autoregressive generation scheme that allows us to render more images at any novel viewpoint (Fig. 1, third row). Concretely, we introduce projection and cross-frame-attention layers that are strategically placed into the existing U-Net architecture to encode explicit 3D knowledge about the generated object (see Fig. 2). By doing so, our approach paves the way to fine-tune T2I models on real-world 3D datasets, such as CO3D [31], while benefiting from the large 2D prior encoded in the pretrained weights. Our generated images are consistent, diverse, and realistic renderings of objects.

To summarize, our contributions are:
• a method that utilizes the pretrained 2D prior of text-to-image models and turns them into 3D-consistent image generators. We train our approach on real-world multi-view datasets, allowing us to produce realistic and high-quality images of objects and their surroundings (Sec. 3.1).
• a novel U-Net architecture that combines commonly used 2D layers with 3D-aware layers. Our projection and cross-frame-attention layers encode explicit 3D knowledge into each block of the U-Net architecture (Sec. 3.2).
• an autoregressive generation scheme that renders images of a 3D object from any desired viewpoint directly with our model in a 3D-consistent way (Sec. 3.3).

2. Related Work

Text-To-2D. Denoising diffusion probabilistic models (DDPM) [14] model a data distribution by learning to invert a Gaussian noising process with a deep network. Recently, DDPMs were shown to be superior to generative adversarial networks [8], becoming the state-of-the-art framework for image generation. Soon after, large text-conditioned models trained on billion-scale data were proposed in Imagen [33] or Dall-E 2 [29]. While [8] achieved conditional generation via guidance with a classifier, [13] proposed classifier-free guidance. ControlNet [53] proposed a way to tune the diffusion outputs by conditioning on various modalities, such as image segmentation or normal maps. Similar to ControlNet, our method builds on the strong 2D prior of a pretrained text-to-image (T2I) model. We further demonstrate how to adjust this prior to generate 3D-consistent images of objects.

Text-To-3D. 2D DDPMs were applied to the generation of 3D shapes [28, 34, 39, 42, 44, 50, 58] or scenes [10, 15, 40] from text descriptions. DreamFusion [27] proposed score distillation sampling (SDS), which optimizes a 3D shape whose renders match the belief of the DDPM. Improved sample quality was achieved by a second-stage mesh optimization [5, 21] and smoother SDS convergence [35, 47]. Several methods use 3D data to train a novel-view synthesis model whose multi-view samples can later be converted to 3D, e.g., conditioning a 2D DDPM on an image and a relative camera motion to generate novel views [23, 48]. However, due to no explicit modelling of geometry, the outputs are view-inconsistent. Consistency can be improved with epipolar attention [43, 57] or by optimizing a 3D shape from multi-view proposals [22]. Our work fine-tunes a 2D T2I model to generate renders of a 3D object; however, we propose explicit 3D unprojection and rendering operators to improve view-consistency. Concurrently, SyncDreamer [24] also adds 3D layers to its 2D DDPM. We differ by training on real data with backgrounds and by showing that autoregressive generation is sufficient to generate consistent images, making the second 3D reconstruction stage expendable.

Diffusion on 3D Representations. Several works model the distribution of 3D representations. While DiffRF [26] leverages ground-truth 3D shapes, HoloDiffusion [19] is supervised only with 2D images. HoloFusion [18] extends this work with a 2D diffusion render post-processor. Images can also be denoised by rendering a reconstructing 3D shape [1, 38]. Unfortunately, the limited scale of existing 3D datasets prevents these 3D diffusion models from extrapolating beyond the training distribution. Instead, we exploit a large 2D pretrained DDPM and add 3D components that are tuned on smaller-scale multi-view data. This leads to improved multi-view consistency while maintaining the expressivity brought by pretraining on billion-scale image data.
[Figure 2: Multi-branch U-Net (shared weights) with ResNet blocks, cross-attention, cross-frame-attention, projection layers, and up/downsampling. Inputs: poses RT, intrinsics K, intensities I, text, and timestep t∼U[0,1000]; an MSE loss compares predicted and sampled noise over the multi-view images.]

Figure 2. Method Overview. We augment the U-Net architecture of pretrained text-to-image models with new layers in every U-Net block. These layers facilitate communication between multi-view images in a batch, resulting in a denoising process that jointly produces 3D-consistent images. First, we replace self-attention with cross-frame-attention (yellow), which compares the spatial features of all views. We condition all attention layers on the pose (RT), intrinsics (K), and intensity (I) of each image. Second, we add a projection layer (green) into the inner blocks of the U-Net. It creates a 3D representation from multi-view features and renders them into 3D-consistent features. We fine-tune the U-Net using the diffusion denoising objective (Eq. 3) at timestep t, supervised from captioned multi-view images.

3. Method

We propose a method that produces 3D-consistent images from a given text or posed image input (see Fig. 1 top/mid). Concretely, given desired output poses, we jointly generate all images corresponding to the condition. We leverage pretrained text-to-image (T2I) models [29, 33] and fine-tune them on multi-view data [31]. We propose to augment the existing U-Net architecture by adding new layers into each block (see Fig. 2). At test time, we can condition our method on multiple images (see Fig. 1 bottom), which allows us to autoregressively render the same object from any viewpoint directly with the diffusion model (see Sec. 3.3).

3.1. 3D-Consistent Diffusion

Diffusion models [14, 36] are a class of generative models that learn the probability distribution $p_\theta(x_0) = \int p_\theta(x_{0:T})\,dx_{1:T}$ over data $x_0 \sim q(x_0)$ and latent variables $x_{1:T} = x_1, \ldots, x_T$. Our method is based on pretrained text-to-image models, which are diffusion models $p_\theta(x_0 \mid c)$ with an additional text condition $c$. For clarity, we drop the condition $c$ for the remainder of this section.

To produce multiple images $x_0^{0:N}$ at once, which are 3D-consistent with each other, we seek to model their joint probability distribution $p_\theta(x_0^{0:N}) = \int p_\theta(x_{0:T}^{0:N})\,dx_{1:T}^{0:N}$. Similarly to concurrent work by Liu et al. [24], we generate one set of images $p_\theta(x_0^{0:N})$ by adapting the reverse process of DDPMs [14] as a Markov chain over all images jointly:

p_\theta(x_{0:T}^{0:N}) := p(x_T^{0:N}) \prod_{t=1}^{T} \prod_{n=0}^{N} p_\theta(x_{t-1}^{n} \mid x_t^{0:N}),    (1)

where we start the generation from Gaussian noise sampled separately per image, $p(x_T^n) = \mathcal{N}(x_T^n; \mathbf{0}, \mathbf{I})$, $\forall n \in [0, N]$. We gradually denoise samples $p_\theta(x_{t-1}^n \mid x_t^{0:N}) = \mathcal{N}(x_{t-1}^n; \mu_\theta^n(x_t^{0:N}, t), \sigma_t^2 \mathbf{I})$ by predicting the per-image mean $\mu_\theta^n(x_t^{0:N}, t)$ through a neural network $\mu_\theta$ that is shared between all images. Importantly, at each step, the model uses the previous states $x_t^{0:N}$ of all images, i.e., there is communication between images during the model prediction. We refer to Sec. 3.2 for details on how this is implemented. To train $\mu_\theta$, we define the forward process as a Markov chain:

q(x_{1:T}^{0:N} \mid x_0^{0:N}) = \prod_{t=1}^{T} \prod_{n=0}^{N} q(x_t^{n} \mid x_{t-1}^{n}),    (2)

where $q(x_t^n \mid x_{t-1}^n) = \mathcal{N}(x_t^n; \sqrt{1-\beta_t}\, x_{t-1}^n, \beta_t \mathbf{I})$ and $\beta_1, \ldots, \beta_T$ define a constant variance schedule, i.e., we apply separate noise per image to produce training samples. We follow Ho et al. [14] by learning a noise predictor $\epsilon_\theta$ instead of $\mu_\theta$. This allows us to train $\epsilon_\theta$ with an L2 loss:

\mathbb{E}_{x_0^{0:N},\, \epsilon^{0:N} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\, n}\left[ \| \epsilon^{n} - \epsilon_\theta^{n}(x_t^{0:N}, t) \|^2 \right].    (3)
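To make the joint objective concrete, the following is a minimal PyTorch-style sketch of one training step implementing Eqs. (2) and (3); the `model` call and its keyword arguments are illustrative stand-ins for the augmented U-Net of Sec. 3.2, not the authors' implementation.

```python
# Minimal sketch of the joint multi-view denoising objective (Eqs. 2-3).
# `model` is a hypothetical stand-in for the augmented U-Net of Sec. 3.2
# that processes all views of an object at once.
import torch

def train_step(model, x0, poses, K, text_emb, alphas_cumprod, optimizer):
    """x0: (B, N, C, H, W) clean latent images of the same object."""
    B, N = x0.shape[:2]
    # One denoising timestep per object, shared across its views (Sec. 3.4).
    t = torch.randint(0, 1000, (B,), device=x0.device)
    # Separate Gaussian noise per image (Eq. 2).
    eps = torch.randn_like(x0)
    a = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    # The shared network predicts per-image noise from the states of all images.
    eps_pred = model(x_t, t, poses=poses, intrinsics=K, text=text_emb)
    loss = torch.nn.functional.mse_loss(eps_pred, eps)  # Eq. 3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```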
3.2. Augmentation of the U-Net architecture

To model a 3D-consistent denoising process over all images, we predict per-image noise $\epsilon_\theta^n(x_t^{0:N}, t)$ through a neural network $\epsilon_\theta$. This neural network is initialized from the pretrained weights of existing text-to-image models and is usually defined as a U-Net architecture [29, 33]. We seek to leverage the previous states $x_t^{0:N}$ of all images to arrive at a 3D-consistent denoising step. To this end, we propose to add two layers into the U-Net architecture, namely a cross-frame-attention layer and a projection layer. We note that the predicted per-image noise needs to be image specific, since all images are generated starting from separate Gaussian noise. It is therefore important to keep 2D layers that act separately on each image, which we achieve by fine-tuning the existing ResNet [11] and ViT [9] blocks. We summarize our architecture in Fig. 2. In the following, we discuss our two proposed layers in more detail.

Cross-Frame Attention. Inspired by video diffusion [49, 51], we add cross-frame-attention layers into the U-Net architecture. Concretely, we modify the existing self-attention layers to calculate $\mathrm{CFAttn}(Q, K, V) = \mathrm{softmax}\big(\frac{QK^T}{\sqrt{d}}\big) V$ with

Q = W^Q h_i, \quad K = W^K [h_j]_{j \neq i}, \quad V = W^V [h_j]_{j \neq i},    (4)

where $W^Q, W^K, W^V$ are the pretrained weights for feature projection, and $h_i \in \mathbb{R}^{C \times H \times W}$ is the input spatial feature of each image $i \in [1, N]$. Intuitively, this matches features across all frames, which allows generating the same global style.
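The sketch below illustrates Eq. (4) with a single-head formulation over flattened per-view features; the tensor layout is an assumption for clarity, and the actual layers reuse the pretrained multi-head attention weights.

```python
# Illustrative single-head sketch of cross-frame-attention (Eq. 4):
# each view queries the spatial features of all other views.
import torch
import torch.nn.functional as F

def cross_frame_attention(h, W_Q, W_K, W_V):
    """h: (N, HW, C) spatial features of N views, flattened per view.
    W_Q, W_K, W_V: (C, C) pretrained projection matrices."""
    N, HW, C = h.shape
    out = torch.empty_like(h)
    for i in range(N):
        q = h[i] @ W_Q                                   # (HW, C)
        others = torch.cat([h[j] for j in range(N) if j != i], dim=0)
        k = others @ W_K                                 # ((N-1)*HW, C)
        v = others @ W_V
        attn = F.softmax(q @ k.T / C**0.5, dim=-1)       # (HW, (N-1)*HW)
        out[i] = attn @ v
    return out
```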
Additionally, we add a conditioning vector to all cross-frame and cross-attention layers to inform the network about the viewpoint of each image. First, we add pose information by encoding each image's camera matrix $p \in \mathbb{R}^{4 \times 4}$ into an embedding $z_1 \in \mathbb{R}^4$, similar to Zero-1-to-3 [23]. Additionally, we concatenate the focal length and principal point of each camera into an embedding $z_2 \in \mathbb{R}^4$. Finally, we provide an intensity encoding $z_3 \in \mathbb{R}^2$, which stores the mean and variance of the image RGB values. At training time, we set $z_3$ to the true values of each input image, and at test time we set $z_3 = [0.5, 0]$ for all images. This helps to reduce the view-dependent lighting differences contained in the dataset (e.g., due to different camera exposure). We construct the conditioning vector as $z = [z_1, z_2, z_3]$ and add it through a LoRA linear layer [16] $W'^Q$ to the feature projection matrix $Q$. Concretely, we compute the projected features as:

Q = W^Q h_i + s \cdot W'^Q [h_i; z],    (5)

where we set $s = 1$. Similarly, we add the condition via $W'^K$ to $K$, and $W'^V$ to $V$.
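A possible reading of Eq. (5) in code: a low-rank (LoRA-style) branch on the concatenated features and conditioning vector adds a residual to the frozen query projection. Module and dimension names are assumptions for illustration, with $z \in \mathbb{R}^{10}$ from $z_1, z_2, z_3$.

```python
# Sketch of the viewpoint conditioning in Eq. 5, under assumed shapes:
# a LoRA-style low-rank layer maps the concatenated [h; z] to a residual
# on the query projection. All names here are illustrative.
import torch
import torch.nn as nn

class ConditionedQueryProj(nn.Module):
    def __init__(self, feat_dim, cond_dim=10, rank=4, scale=1.0):
        super().__init__()
        self.W_Q = nn.Linear(feat_dim, feat_dim, bias=False)  # pretrained, kept frozen
        # Low-rank branch on [h; z]: (feat_dim + cond_dim) -> rank -> feat_dim.
        self.lora_down = nn.Linear(feat_dim + cond_dim, rank, bias=False)
        self.lora_up = nn.Linear(rank, feat_dim, bias=False)
        self.scale = scale  # s in Eq. 5

    def forward(self, h, z):
        """h: (HW, feat_dim) spatial features; z = [z1, z2, z3]: (cond_dim,)."""
        hz = torch.cat([h, z.expand(h.shape[0], -1)], dim=-1)
        return self.W_Q(h) + self.scale * self.lora_up(self.lora_down(hz))
```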
Projection Layer. Cross-frame-attention layers are helpful to produce globally 3D-consistent images. However, the objects do not precisely follow the specified poses, which leads to view-inconsistencies (see Fig. 5 and Tab. 3). To this end, we add a projection layer into the U-Net architecture (Fig. 3). The idea of this layer is to create 3D-consistent features that are then further processed by the next U-Net layers (e.g., ResNet blocks). By repeating this layer across all stages of the U-Net, we ensure that the per-image features are in a 3D-consistent space. We do not add the projection layer to the first and last U-Net blocks, as we saw no benefit from it at these locations. We reason that the network processes image-specific information at those stages and thus does not need a 3D-consistent feature space.

[Figure 3: Projection layer pipeline — input posed features → CompressNet (1×1 CNN) → aggregator MLP → 3D CNN → volume renderer → ScaleNet (1×1 CNN) → ExpandNet (1×1 CNN) → output features.]

Figure 3. Architecture of the projection layer. We produce 3D-consistent output features from posed input features. First, we unproject the compressed image features into 3D and aggregate them into a joint voxel grid with an MLP. Then we refine the voxel grid with a 3D CNN. A volume renderer similar to NeRF [25] renders 3D-consistent features from the grid. Finally, we apply a learned scale function and expand the feature dimension.

Inspired by multi-view stereo literature [3, 17, 37], we create a 3D feature voxel grid from all input spatial features $h_{\mathrm{in}}^{0:N} \in \mathbb{R}^{C \times H \times W}$ by projecting each voxel into each image plane. First, we compress $h_{\mathrm{in}}^{0:N}$ with a 1×1 convolution to a reduced feature dimension $C' = 16$. We then take the bilinearly interpolated feature at the image plane location and place it into the voxel. This way, we create a separate voxel grid per view and merge them into a single grid through an aggregator MLP. Inspired by IBRNet [46], the MLP predicts per-view weights followed by a weighted feature average. We then run a small 3D CNN on the voxel grid to refine the 3D feature space. Afterwards, we render the voxel grid into output features $h_{\mathrm{out}}^{0:N} \in \mathbb{R}^{C' \times H \times W}$ with volumetric rendering similar to NeRF [25]. We dedicate half of the voxel grid to the foreground and half to the background, and apply the background model from MERF [30] during ray-marching.

We found it necessary to add a scale function after the volume rendering output. The volume renderer typically uses a sigmoid activation function as the final layer during ray-marching [25]. However, the input features are defined in an arbitrary floating-point range. To convert $h_{\mathrm{out}}^{0:N}$ back into the same range, we non-linearly scale the features with 1×1 convolutions and ReLU activations. Finally, we expand $h_{\mathrm{out}}^{0:N}$ to the input feature dimension $C$. We refer to the supplemental material for details about each component's architecture.
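As an illustration of the aggregation step inside the projection layer, the sketch below shows an IBRNet-style weighted average over per-view voxel features; the unprojection, 3D CNN, volume rendering, and scaling stages are omitted, and all shapes and names are assumptions rather than the authors' exact design.

```python
# Sketch of merging per-view voxel grids with an IBRNet-style weighting MLP:
# each view contributes a feature to every voxel, and the MLP scores how much
# each view should contribute before a weighted average.
import torch
import torch.nn as nn

class VoxelAggregator(nn.Module):
    def __init__(self, C_prime=16):
        super().__init__()
        # Small MLP that predicts a per-view weight for every voxel.
        self.weight_mlp = nn.Sequential(
            nn.Linear(C_prime, C_prime), nn.ReLU(), nn.Linear(C_prime, 1))

    def forward(self, per_view_grids):
        """per_view_grids: (N, D, D, D, C') features unprojected from N views."""
        logits = self.weight_mlp(per_view_grids)       # (N, D, D, D, 1)
        weights = torch.softmax(logits, dim=0)         # normalize over views
        return (weights * per_view_grids).sum(dim=0)   # (D, D, D, C') joint grid

# Example usage with assumed sizes: N=5 views, a 32^3 grid, C'=16.
grids = torch.randn(5, 32, 32, 32, 16)
joint_grid = VoxelAggregator()(grids)                  # (32, 32, 32, 16)
```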
3.3. Autoregressive Generation

Our method takes as input multiple samples $x_t^{0:N}$ at once and denoises them 3D-consistently. During training, we set $N = 5$, but we can increase it at inference time up to memory constraints, e.g., $N = 30$. However, we want to render an object from any possible viewpoint directly with our network. To this end, we propose an autoregressive image generation scheme, i.e., we condition the generation of the next viewpoints on previously generated images. We provide the timesteps $t^{0:N}$ of each image as input to the U-Net. By varying $t^{0:N}$, we can achieve different types of conditioning.

Unconditional Generation. All samples are initialized to Gaussian noise and are denoised jointly. The timesteps $t^{0:N}$ are kept identical for all samples throughout the reverse process. We provide different cameras per image and a single text prompt. The generated images are 3D-consistent, showing the object from the desired viewpoints (Figs. 4 and 5).

Image-Conditional Generation. We divide the total number of samples $N = n_c + n_g$ into a conditional part $n_c$ and a generative part $n_g$. The first $n_c$ samples correspond to images and cameras that are provided as input. The other $n_g$ samples should generate novel views that are similar to the conditioning images. We start the generation from Gaussian noise for the $n_g$ samples and provide the un-noised images for the other samples. Similarly, we set $t^{0:n_c} = 0$ for all denoising steps, while gradually decreasing $t^{n_g:N}$.

When $n_c = 1$, our method performs single-image reconstruction (Fig. 6). Setting $n_c > 1$ allows us to autoregressively generate novel views from previous images (Fig. 1 bottom). In practice, we first generate one batch of images unconditionally and then condition the next batches on a subset of previous images. This allows us to render smooth trajectories around 3D objects (see the supplemental material).
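A sketch of image-conditional generation under a plain DDPM reverse process is shown below: conditioning views keep timestep 0 and their clean latents, while only the generative views are updated. Classifier-free guidance and the UniPC sampler used in practice are omitted; `model` is a hypothetical stand-in for the multi-view U-Net.

```python
# Sketch of image-conditional, autoregressive view generation (Sec. 3.3),
# assuming a standard DDPM reverse process. Camera handling is simplified.
import torch

@torch.no_grad()
def generate_views(model, cond_latents, cond_cams, new_cams, betas):
    """cond_latents: (n_c, C, H, W) clean latents of conditioning views."""
    n_c, n_g = cond_latents.shape[0], len(new_cams)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    noise_init = torch.randn(n_g, *cond_latents.shape[1:], device=cond_latents.device)
    x = torch.cat([cond_latents, noise_init], dim=0)
    cams = list(cond_cams) + list(new_cams)
    for t in reversed(range(len(betas))):
        # Per-image timesteps: 0 for conditioning views, t for generated ones.
        t_per_image = torch.tensor([0] * n_c + [t] * n_g)
        eps = model(x, t_per_image, cameras=cams)
        # Standard DDPM mean update, applied only to the generative part.
        mean = (x[n_c:] - betas[t] / (1 - alphas_bar[t]).sqrt() * eps[n_c:]) / alphas[t].sqrt()
        noise = torch.randn_like(mean) if t > 0 else 0.0
        x[n_c:] = mean + betas[t].sqrt() * noise
    return x[n_c:]  # the n_g generated novel views
```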
3.4. Implementation Details

Dataset. We train our method on the large-scale CO3Dv2 [31] dataset, which consists of posed multi-view images of real-world objects. Concretely, we choose the categories Teddybear, Hydrant, Apple, and Donut. Per category, we train on 500–1000 objects, each with 200 images at resolution 256×256. We generate text captions with the BLIP-2 model [20] and sample one of 5 proposals per object.

Training. We base our model on a pretrained latent-diffusion text-to-image model. We only fine-tune the U-Net and keep the VAE encoder and decoder frozen. In each iteration, we select $N = 5$ images and their poses. We sample one denoising timestep $t \sim \mathcal{U}[0, 1000]$, add noise to the images according to Eq. 2, and compute the loss according to Eq. 3. In the projection layers, we skip the last image when building the voxel grid, which enforces learning a 3D representation that can be rendered from novel views. We train our method by varying between unconditional and image-conditional generation (Sec. 3.3). Concretely, with probabilities $p_1 = 0.25$ and $p_2 = 0.25$ we provide the first and/or second image as input and set the respective timestep to zero. Similar to Ruiz et al. [32], we create a prior dataset and use it during training to maintain the 2D prior (see supplement for details).
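One illustrative way to realize this mixed training regime: sample a shared timestep, then with probability 0.25 each, clamp the timestep of the first and/or second view to zero so that it acts as a clean conditioning image. This is an assumed reading of the described procedure, not the authors' code.

```python
# Assumed sketch of mixing unconditional and image-conditional training:
# views whose timestep is clamped to 0 are given to the U-Net un-noised.
import torch

def sample_per_image_timesteps(n_images=5, t_max=1000, p1=0.25, p2=0.25):
    t = torch.randint(0, t_max, (1,)).repeat(n_images)  # shared timestep
    if torch.rand(1) < p1:
        t[0] = 0  # first view becomes a conditioning image
    if torch.rand(1) < p2:
        t[1] = 0  # second view becomes a conditioning image
    return t
```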
We fine-tune the model on 2× A100 GPUs for 60K iterations (7 days) with a total batch size of 64. We set the learning rate for the volume renderer to 0.005 and for all other layers to 5×10⁻⁵, and use the AdamW optimizer [33]. During inference, we can increase $N$ and generate up to 30 images per batch on an RTX 3090 GPU. We use the UniPC [55] sampler with 10 denoising steps, which takes 15 seconds.

4. Results

Baselines. We compare against recent state-of-the-art works for 3D generative modeling. Our goal is to create multi-view consistent images of real-world, realistic objects with authentic surroundings. Therefore, we consider methods that are trained on real-world datasets and select HoloFusion (HF) [18], ViewsetDiffusion (VD) [38], and DFM [41]. We show results on two tasks: unconditional generation (Sec. 4.1) and single-image reconstruction (Sec. 4.2).

Metrics. We report FID [12] and KID [2] as common metrics for 2D/3D generation and measure the multi-view consistency of generated images with peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and LPIPS [54]. To ensure comparability, we evaluate all metrics on images without backgrounds, as not every baseline models them.

4.1. Unconditional Generation

Our method can be used to generate 3D-consistent views of an object from any pose with only text as input by using our autoregressive generation (Sec. 3.3). Concretely, we sample an (unobserved) image caption from the test set for the first batch and generate N = 10 images with a guidance scale [13] of λ_cfg = 7.5. We then set λ_cfg = 0 for subsequent batches, and create a total of 100 images per object.

Table 1. Quantitative comparison of unconditional image generation. We report average FID [12] and KID [2] per category and improve by a significant margin. This signals that our images are more similar to the distribution of real images in the dataset. We mask away the background for our method and the real images to ensure comparability of numbers with the baselines.

Category     HF [18]            VD [38]            Ours
             FID↓    KID↓       FID↓    KID↓       FID↓    KID↓
Teddybear    81.93   0.072      201.71  0.169      49.39   0.036
Hydrant      61.19   0.042      138.45  0.118      46.45   0.033
Donut        105.97  0.091      199.14  0.136      68.86   0.054
Apple        62.19   0.056      183.67  0.149      56.85   0.043

We evaluate against HoloFusion (HF) [18] and ViewsetDiffusion (VD) [38]. We report quantitative results in Tab. 1 and qualitative results in Figs. 4 and 5. HF [18] creates diverse images that sometimes show view-dependent floating artifacts (see Fig. 5). VD [38] creates consistent but blurry
images. In contrast, our method produces images with backgrounds and higher-resolution object details.

[Figure 4: Unconditional generations per category (Teddybear, Hydrant, Apple, Donut) for HoloFusion (HF) [18], ViewsetDiffusion (VD) [38], and Ours.]

Figure 4. Unconditional image generation of our method and baselines. We show renderings from different viewpoints for multiple objects and categories. Our method produces consistent objects and backgrounds. Our textures are sharper in comparison to baselines. Please see the supplemental material for more examples and animations.

4.2. Single-Image Reconstruction

Our method can be conditioned on multiple images in order to render any novel view in an autoregressive fashion (Sec. 3.3). To measure the 3D-consistency of our generated images, we compare single-image reconstruction against ViewsetDiffusion (VD) [38] and DFM [41]. Concretely, we sample one image from the dataset and generate 20 images at novel views also sampled from the dataset. We follow Szymanowicz et al. [38] and report the per-view maximum PSNR/SSIM and average LPIPS across multiple objects and viewpoints for all methods. We report quantitative results in Tab. 2 and show qualitative results in Fig. 6. VD [38] creates plausible results without backgrounds. DFM [41] creates
consistent results with backgrounds at a lower image resolution (128×128). Our method produces higher-resolution images with similar reconstruction results and backgrounds.

[Figure 5: Renderings across viewpoints from 0° to 340° for HF [18], VD [38], Ours (no proj), Ours (no cfa), and Ours.]

Figure 5. Multi-view consistency of unconditional image generation. HoloFusion (HF) [18] has view-dependent floating artifacts (the base in the first row). ViewsetDiffusion (VD) [38] has blurrier renderings (second row). Without the projection layer, our method has no precise control over viewpoints (third row). Without cross-frame-attention, our method suffers from identity changes of the object (fourth row). Our full method produces detailed images that are 3D-consistent (fifth row).

Table 2. Quantitative comparison of single-image reconstruction. Given a single image as input, we measure the quality of novel views through average PSNR, SSIM, and LPIPS [54] per category. We mask away the generated backgrounds to ensure comparability across all methods. We improve over VD [38] while being on par with DFM [41].

             Teddybear              Hydrant                Donut                  Apple
Method       PSNR↑  SSIM↑  LPIPS↓   PSNR↑  SSIM↑  LPIPS↓   PSNR↑  SSIM↑  LPIPS↓   PSNR↑  SSIM↑  LPIPS↓
VD [38]      19.68  0.70   0.30     22.36  0.80   0.19     18.27  0.68   0.14     19.54  0.64   0.31
DFM [41]     21.81  0.82   0.16     22.67  0.83   0.12     23.91  0.86   0.10     25.79  0.91   0.07
Ours         21.98  0.84   0.13     22.49  0.85   0.11     21.50  0.85   0.18     25.94  0.91   0.11

4.3. Ablations

The key ingredients of our method are the cross-frame-attention and projection layers that we add to the U-Net (Sec. 3.2). We highlight their importance in Tab. 3 and Fig. 5.

How important are the projection layers? They are necessary to allow precise control over the image viewpoints (e.g., Fig. 5 row 3 does not follow the specified rotation). Our goal is to generate a consistent set of images from any viewpoint directly with our model (Sec. 3.3). Being able to control the pose of the object is therefore an essential part of our contribution. The projection layers build up a 3D representation of the object that is explicitly rendered into 3D-consistent features through volume rendering. This allows us to achieve viewpoint consistency, as also demonstrated through single-image reconstruction (Tab. 3).

How important are cross-frame-attention layers? They are necessary to create images of the same object. Without them, the teddybear in Fig. 5 (row 4) has the same general color scheme and follows the specified poses. However, differences in shape and texture lead to an inconsistent set of
images. We reason that the cross-frame-attention layers are essential for defining a consistent object identity.

Does the 2D prior help? We utilize a 2D prior in the form of the pretrained text-to-image model that we fine-tune in a 3D-consistent fashion (Sec. 3.1). This enables our method to produce sharp and detailed images of objects from different viewpoints. Also, we train our method on captioned images to retain the controllable generation through text descriptions (Sec. 3.4). We show the diversity and controllability of our generations in Fig. 7 with hand-crafted text prompts. This highlights that, after fine-tuning, our model is still faithful to text input and can combine attributes in a novel way, i.e., our model learns to extrapolate from the training set.

Table 3. Quantitative comparison of our method and ablations. We report average PSNR, SSIM, LPIPS [54], FID [12], and KID [2] over the Teddybear and Hydrant categories. We compare against dropping the projection layer ("no proj") and cross-frame-attention ("no cfa") from the U-Net (see Sec. 3.2). While still producing high-quality images with similar FID/KID, this shows that our proposed layers are necessary to obtain 3D-consistent images.

Method           PSNR↑   SSIM↑   LPIPS↓   FID↓    KID↓
Ours (no proj)   16.55   0.71    0.29     47.95   0.034
Ours (no cfa)    18.15   0.76    0.25     47.93   0.034
Ours             22.24   0.84    0.11     47.92   0.034

[Figure 6: Single-image reconstruction — columns: Input, VD [38], DFM [41], Ours, Real Image.]

Figure 6. Single-image reconstruction of our method and baselines. Given one image/pose as input, our method produces plausible novel views that are consistent with the real shape and texture. We can also produce detailed backgrounds that match the input.

[Figure 7: Two samples each for the prompts "[C] wearing a green hat", "[C] sits on a green blanket", "rusty [C] in the grass", and "yellow and white [C]".]

Figure 7. Diversity of generated results. We condition our method on text input, which allows us to create objects in a desired style. We show samples for hand-crafted text descriptions that combine attributes (e.g., color, shape, background) in a novel way. Each row shows a different generation proposal from our method and we denote the object category (Teddybear, Hydrant) as [C]. This showcases the diversity of generated results, i.e., multiple different objects are generated for the same description.

4.4. Limitations

Our method generates 3D-consistent, high-quality images of diverse objects according to text descriptions or input images. Nevertheless, there are several limitations. First, our method sometimes produces images with slight inconsistencies, as shown in the supplement. Since the model is fine-tuned on a real-world dataset containing view-dependent effects (e.g., exposure changes), our framework learns to generate such variations across different viewpoints. A potential solution is to add a lighting condition through a ControlNet [53]. Second, our work focuses on objects, but scene-scale generation on large datasets [6, 52] could similarly be explored.

5. Conclusion

We presented ViewDiff, a method that, given text or image input, generates 3D-consistent images of real-world objects placed in authentic surroundings. Our method leverages the expressivity of large 2D text-to-image models and fine-tunes this 2D prior on real-world 3D datasets to produce diverse multi-view images in a joint denoising process. The core insights of our work are two novel layers, namely cross-frame-attention and the projection layer (Sec. 3.2). Our autoregressive generation scheme (Sec. 3.3) allows us to directly render high-quality novel views of a generated 3D object.

6. Acknowledgements

This work was done during Lukas' internship at Meta Reality Labs Zurich as well as at TU Munich, funded by a Meta sponsored research agreement. Matthias Nießner was also supported by the ERC Starting Grant Scan2CAD (804724). We also thank Angela Dai for the video voice-over.
References

[1] Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. Renderdiffusion: Image diffusion for 3D reconstruction, inpainting and generation. In CVPR, 2023.
[2] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
[3] Aljaž Božič, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. TransformerFusion: Monocular RGB scene reconstruction using transformers. In NeurIPS, 2021.
[4] Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3D-aware diffusion models. arXiv:2304.02602, 2023.
[5] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In ICCV, 2023.
[6] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
[7] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
[8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Ayaan Haque, Matthew Tancik, Alexei Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In ICCV, 2023.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[13] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021.
[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[15] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. In ICCV, 2023.
[16] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
[17] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models. JMLR, 24(251):1–43, 2023.
[18] Animesh Karnewar, Niloy J. Mitra, Andrea Vedaldi, and David Novotny. HoloFusion: Towards photo-realistic 3D generative modeling. In ICCV, 2023.
[19] Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J. Mitra. HoloDiffusion: Training a 3D diffusion model using 2D images. In CVPR, 2023.
[20] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
[21] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In CVPR, 2023.
[22] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, Hao Su, et al. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. arXiv:2306.16928, 2023.
[23] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In ICCV, 2023.
[24] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. In ICLR, 2024.
[25] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[26] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner. DiffRF: Rendering-guided 3D radiance field diffusion. In CVPR, 2023.
[27] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
[28] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. arXiv:2306.17843, 2023.
[29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[30] Christian Reiser, Rick Szeliski, Dor Verbin, Pratul Srinivasan, Ben Mildenhall, Andreas Geiger, Jon Barron, and Peter Hedman. MERF: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023.
[31] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
[32] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
[33] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
[34] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2D diffusion model know 3D-consistency for robust text-to-3D generation. arXiv:2303.07937, 2023.
[35] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv:2308.16512, 2023.
[36] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[37] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. In CVPR, 2021.
[38] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. ViewSet diffusion: (0-)image-conditioned 3D generative models from 2D data. In ICCV, 2023.
[39] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3D: High-fidelity 3D creation from a single image with diffusion prior. In ICCV, 2023.
[40] Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. MVDiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. In NeurIPS, 2023.
[41] Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, and Vincent Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. In NeurIPS, 2023.
[42] Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. TextMesh: Generation of realistic 3D meshes from text prompts. In 3DV, 2024.
[43] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In CVPR, 2023.
[44] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In CVPR, 2023.
[45] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
[46] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021.
[47] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. arXiv:2305.16213, 2023.
[48] Daniel Watson, William Chan, Ricardo Martin Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. In ICLR, 2023.
[49] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023.
[50] Jianfeng Xiang, Jiaolong Yang, Binbin Huang, and Xin Tong. 3D-aware image generation using 2D diffusion models. In ICCV, 2023.
[51] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia, 2023.
[52] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023.
[53] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[54] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[55] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. In NeurIPS, 2023.
[56] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics, 37(4):65:1–12, 2018.
[57] Zhizhuo Zhou and Shubham Tulsiani. SparseFusion: Distilling view-conditioned diffusion for 3D reconstruction. In CVPR, 2023.
[58] Joseph Zhu and Peiye Zhuang. HiFA: High-fidelity text-to-3D with advanced diffusion guidance. arXiv:2305.18766, 2023.
