
Rodin: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion

Tengfei Wang1†*  Bo Zhang2*  Ting Zhang2  Shuyang Gu2  Jianmin Bao2
Tadas Baltrusaitis2  Jingjing Shen2  Dong Chen2  Fang Wen2  Qifeng Chen1  Baining Guo2
1HKUST  2Microsoft Research
*Equal contribution. †Intern at Microsoft Research.

arXiv:2212.06135v1 [cs.CV] 12 Dec 2022

Figure 1. Our diffusion model, Rodin, can produce high-fidelity 3D avatars, as shown in the first row. Our model also supports 3D avatar generation from a single portrait or text prompt, while permitting text-based semantic manipulation (second row), e.g., “In white suit”, “Smile with glasses”, “With red hair”, and “bearded man with curly hair in black leather jacket”. See the webpage for video demos.

Abstract

This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings the much-needed computational efficiency while preserving the integrity of diffusion in 3D by using a 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair like beards. We also demonstrate 3D avatar generation from image or text as well as text-guided editability.

1. Introduction

Generative models [2, 34] are one of the most promising ways to analyze and synthesize visual data, including 2D images and 3D models. At the forefront of generative modeling is the diffusion model [14, 24, 61], which has shown phenomenal generative power for images [19, 49, 52, 54] and videos [23, 59]. Indeed, we are witnessing a 2D content-creation revolution driven by the rapid advances of diffusion and generative modeling.

In this paper, we aim to expand the applicability of diffusion such that it can serve as a generative model for 3D digital avatars. We use "digital avatars" to refer to the traditional avatars manually created by 3D artists, as opposed to the recently emerging photorealistic avatars [8, 44]. The reason for focusing on digital avatars is twofold. On the one hand, digital avatars are widely used in movies, games, the metaverse, and the 3D industry in general. On the other hand, the available digital avatar data is very scarce, as each avatar has to be painstakingly created by a specialized 3D artist using a sophisticated creation pipeline [20, 35], especially for modeling hair and facial hair. All this leads to a compelling scenario for generative modeling.

We present a diffusion model for automatically producing digital avatars represented as neural radiance fields [39], with each point describing the color radiance and density of the 3D volume. The core challenge in generating neural volume avatars is the prohibitive memory and computational cost of the rich details required by high-quality avatars. Without rich details, our results will always look somewhat "toy-like". To tackle this challenge, we develop Rodin, the roll-out diffusion network. We take a neural volume represented as multiple 2D feature maps, roll out these maps into a single 2D feature plane, and perform 3D-aware diffusion within this plane. Specifically, we use the tri-plane representation [9], which represents a volume by three axis-aligned orthogonal feature planes. By simply rolling out feature maps, the Rodin model can perform 3D-aware diffusion using an efficient 2D architecture, drawing power from the model's three key ingredients described below.

The first is the 3D-aware convolution. The 2D CNN processing used in conventional 2D diffusion cannot handle well the feature maps originating from orthogonal planes. Rather than treating the features as plain 2D input, the 3D-aware convolution explicitly accounts for the fact that a 2D feature in one plane (of the tri-plane) is a projection of a piece of 3D data and is hence intrinsically associated with the same data's projected features in the other two planes. To encourage cross-plane communication, we involve all of these associated features in the convolution, bridging them together and synchronizing their detail synthesis according to their 3D relationship.

The second key ingredient is latent conditioning. We use a latent vector to orchestrate the feature generation so that it is globally coherent across the 3D volume, leading to better-quality avatars and enabling their semantic editing. We do this by using the avatars in the training dataset to train an additional image encoder which extracts a semantic latent vector serving as the conditional input to the diffusion model. This latent conditioning essentially acts as an autoencoder orchestrating the feature generation. For semantic editability, we adopt a frozen CLIP image encoder [48] that shares its latent space with text prompts.

The final key ingredient is hierarchical synthesis. We start by generating a low-resolution tri-plane (64 × 64), followed by a diffusion-based upsampling that yields a higher resolution (256 × 256). When training the diffusion upsampler, we find it instrumental to penalize an image-level loss that we compute in a patch-wise manner.

Taken together, the above ingredients work in concert to enable the Rodin model to coherently perform diffusion in 3D with an efficient 2D architecture. The Rodin model is trained with a multi-view image dataset of 100K avatars of diverse identities, hairstyles, and clothing created by 3D artists [69]. Several application scenarios are thus supported. We can use the model to generate an unlimited number of avatars from scratch, each avatar different from the others as well as from those in the training data. As shown in Figure 1, we can generate highly detailed avatars with realistic hairstyles and facial hair styled as beards, mustaches, goatees, and sideburns. Hairstyle and facial hair are essential for representing people's unique personal identities. Yet, these styles have been notoriously difficult to generate well with existing approaches. The Rodin model also allows avatar customization, with the resulting avatar capturing the visual characteristics of the person portrayed in the image or the textual description. Finally, our framework supports text-guided semantic editing. The strong generative power of diffusion shows great promise in 3D modeling.

2. Related Work

The state of generative modeling [5, 14, 15, 28, 50, 65, 75] has seen rapid progress in past years. Diffusion models [14, 24, 61, 73] have recently shown unprecedented generative ability and compositional power. The most remarkable success happens in text-to-image synthesis [19, 40, 49, 52, 54], which serves as a foundation model and enables various appealing applications [21, 53, 66] previously unattainable. While diffusion models have been successfully applied to different modalities [11, 23, 26, 32], their generative capability is much less explored in 3D generation, with only a few attempts at modeling 3D primitives [37, 74, 76].

Early 3D generation works [58] rely on either GANs [17] or VAEs [29] to model the distribution of 3D shape representations such as voxel grids [6, 70], point clouds [1, 7, 31, 72], meshes [33, 63], and implicit neural representations [43, 60]. However, existing works have not yet demonstrated the ability to produce complex 3D assets. Concurrent to this work, Bautista et al. [4] train a diffusion model to generate the latent vector that encodes the radiance field [39] of synthetic scenes, yet this work only produces coarse 3D geometry. In comparison, we propose a hierarchical 3D generation framework with effective 3D-aware operators, offering unprecedented 3D detail synthesis.

Another line of work learns 3D-aware generation by utilizing richly available 2D data. 3D-aware GANs [9, 10, 12, 16, 18, 42, 56, 57, 62, 71, 77] have recently attracted significant research interest; they are trained to produce radiance fields with image-level distribution matching. However, these methods suffer from the instabilities and mode collapse of GAN training, and it is still challenging to attain authentic avatars that can be viewed from large angles. Concurrently, there are a few attempts to use diffusion models for this problem. Watson et al. [68] propose to synthesize novel views with a pose-conditioned 2D diffusion model, yet the results are not intrinsically 3D.
Figure 2. An overview of our Rodin model. We derive the latent z via a mapping from an image, text, or random noise, which is used to control the base diffusion model to generate 64 × 64 tri-planes. We train another diffusion model to upsample this coarse result to 256 × 256 tri-planes that are used to render final multi-view images with volumetric rendering and convolutional refinement. The operators used in the diffusion models are designed to be 3D-aware.

Poole et al. [46] optimize a radiance field using the supervision of a pretrained text-to-image diffusion model and produce impressive 3D objects of diverse genres. Nonetheless, pretrained 2D generative networks only offer limited 3D knowledge and inevitably lead to blurry 3D results. A high-quality generation framework in 3D space is still highly desired.

3. Approach

Unlike prior methods that learn 3D-aware generation from a 2D image collection, we aim to learn 3D avatar generation from the multi-view renderings of the Blender synthetic pipeline [69]. Rather than treating the multi-view images of the same subject as individual training samples, we fit a volumetric neural representation for each avatar, which is used to explain all the observations from different viewpoints. Thereafter, we use diffusion models to characterize the distribution of these 3D instances. Our diffusion-based 3D generation is a hierarchical process: we first utilize a diffusion model to generate the coarse geometry, followed by a diffusion upsampler for detail synthesis. As illustrated in Figure 2, the whole 3D portrait generation comprises multiple training stages, which we detail in the following subsections.

3.1. Robust 3D Representation Fitting

To train a generative network with explicit 3D supervision, we need an expressive 3D representation that accounts for multi-view images, and it should meet the following requirements. First, we need an explicit representation that is amenable to generative network processing. Second, we require a compact representation that is memory efficient; otherwise, it would be too costly to store a myriad of such 3D instances for training. Furthermore, we expect fast representation fitting, since hours of optimization, as with vanilla NeRF [39], would make it unaffordable to generate the abundant 3D training data required for generative modeling.

Taking these into consideration, we adopt the tri-plane representation proposed by [9] to model the neural radiance field of 3D avatars. Specifically, the 3D volume is factorized into three axis-aligned orthogonal feature planes, denoted by y_uv, y_wu, y_vw ∈ R^{H×W×C}, each of which has spatial resolution H × W and C channels. Compared to voxel grids, the tri-plane representation offers a considerably smaller memory footprint without sacrificing expressivity. Rich 3D information is explicitly memorized in the tri-plane features, and one can query the feature of a 3D point p ∈ R^3 by projecting it onto each plane and aggregating the retrieved features, i.e., y_p = y_uv(p_uv) + y_wu(p_wu) + y_vw(p_vw). With such a positional feature, one can derive the density σ ∈ R+ and view-dependent color c ∈ R^3 of each 3D location given the viewing direction d ∈ S^2 with a lightweight MLP decoder G_θMLP, which can be formulated as

c(p, d), σ(p) = G_θMLP(y_p, ξ(y_p), d).   (1)

Here, we apply the Fourier embedding operator ξ(·) [64] on the queried feature rather than on the spatial coordinate. The tri-plane features and the MLP decoder are optimized such that the rendering of the neural radiance field matches the multi-view images {x}_{N_v} of the given subject, where x ∈ R^{H_0×W_0×3}. We enforce the rendered image given by volumetric rendering [38], i.e., x̂ = R(c, σ), to match the corresponding ground truth with a mean squared error loss. Besides, we introduce sparse, smooth, and compact regularizers to reduce the "floating" artifacts [3] in free space. For more tri-plane fitting details, please refer to the Appendix.

While prior per-scene reconstruction mainly concerns the fitting quality, our 3D fitting procedure should also consider several key aspects for generation purposes. First, the tri-plane features of different subjects should rigorously reside in the same domain. To achieve this, we adopt a shared MLP decoder when fitting distinct portraits, thus implicitly pushing the tri-plane features into a shared latent space recognizable by the decoder. Second, the MLP decoder has to possess some level of robustness.
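To make the tri-plane query and decoding of Eq. (1) concrete, the following is a minimal PyTorch sketch, not the authors' released code: it assumes points are already normalized to [-1, 1]^3, uses bilinear grid sampling, omits the Fourier embedding ξ(·) for brevity, and picks an arbitrary axis convention for the projections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_triplane(planes, points):
    """Sample and aggregate tri-plane features for a batch of 3D points.

    planes: (3, C, H, W) feature planes for the uv, wu, and vw planes.
    points: (N, 3) coordinates (u, v, w), each normalized to [-1, 1].
    Returns (N, C) features y_p = y_uv(p_uv) + y_wu(p_wu) + y_vw(p_vw).
    """
    u, v, w = points[:, 0], points[:, 1], points[:, 2]
    # 2D projections of each point onto the three axis-aligned planes.
    proj = torch.stack([
        torch.stack([u, v], dim=-1),   # uv plane
        torch.stack([w, u], dim=-1),   # wu plane
        torch.stack([v, w], dim=-1),   # vw plane
    ], dim=0)                          # (3, N, 2)
    grid = proj.unsqueeze(1)           # (3, 1, N, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode='bilinear', align_corners=True)
    # feats: (3, C, 1, N) -> sum over the three planes -> (N, C)
    return feats.squeeze(2).sum(dim=0).permute(1, 0)

class TriplaneMLPDecoder(nn.Module):
    """Lightweight MLP mapping aggregated tri-plane features (plus the viewing
    direction) to density and RGB color, in the spirit of Eq. (1)."""
    def __init__(self, feat_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (sigma, r, g, b)
        )

    def forward(self, y_p, view_dir):
        out = self.net(torch.cat([y_p, view_dir], dim=-1))
        sigma = F.softplus(out[:, :1])      # non-negative density
        color = torch.sigmoid(out[:, 1:])   # RGB in [0, 1]
        return sigma, color
```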
Figure 3. While 256 × 256 tri-planes give good renderings (a), the 64 × 64 variant without random scaling gives a much worse result (b). Hence, we introduce random scaling during fitting so as to obtain a robust representation that can be effectively rendered at continuous scales (c: 64 × 64 tri-plane with random scaling).

Figure 4. We propose two mechanisms to ensure coherent tri-plane generation. Our 3D-aware convolution considers the 3D relationship in (a) and correlates the associated elements from separate feature planes as shown in (b). In (b), we also visualize the usage of a shared latent code to orchestrate the feature generation.

That is, the decoder should be tolerant to slight perturbations of the tri-plane features, so that one can still obtain plausible results even if the tri-plane features are imperfectly generated. More importantly, the decoder should be robust to varied tri-plane sizes, because hierarchical 3D generation is trained on multi-resolution tri-plane features. As shown in Figure 3, when solely fitting 256 × 256 tri-planes, the 64 × 64 resolution variant cannot be effectively rendered. To address this, we randomly scale the tri-plane during fitting, which is instrumental in deriving multi-resolution tri-plane features simultaneously with a shared decoder.

3.2. Latent Conditioned 3D Diffusion Model

Now the 3D avatar generation is reduced to learning the distribution of tri-plane features, i.e., p(y), where y = (y_uv, y_wu, y_vw). Such generative modeling is non-trivial since y is highly dimensional. We leverage diffusion models for the task, which have shown compelling quality in complex image modeling.

On a high level, the diffusion model generates y by gradually reversing a Markov forward process. Starting from y_0 ∼ p(y), the forward process q yields a sequence of increasingly noisy latent codes {y_t | t ∈ [0, T]} according to y_t := α_t y_0 + σ_t ε, where ε ∼ N(0, I) is the added Gaussian noise; α_t and σ_t define a noise schedule whose log signal-to-noise ratio λ_t = log[α_t² / σ_t²] linearly decreases with the timestep t. With sufficient noising steps, we reach pure Gaussian noise, i.e., y_T ∼ N(0, I). The generative process corresponds to reversing the above noising process, where the diffusion model is trained to denoise y_t into y_0 for all t using a mean squared error loss. Following [24], better generation quality can be achieved by parameterizing the diffusion model ε̂_θ to predict the added noise:

L_simple = E_{t, y_0, ε} [ || ε̂_θ(α_t y_0 + σ_t ε, t) − ε ||²_2 ].   (2)

In practice, our diffusion model training also jointly optimizes the variational lower bound loss L_VLB as suggested in [41], which allows high-quality generation with fewer timesteps. During inference, a stochastic ancestral sampler [24] is used to generate the final samples; it starts from the Gaussian noise y_T ∼ N(0, I) and sequentially produces less noisy samples {y_T, y_{T−1}, . . .} until reaching y_0.

We first train a base diffusion model to generate the coarse-level tri-planes, e.g., at 64 × 64 resolution. A straightforward approach is to adopt the 2D network structure used in state-of-the-art image-based diffusion models for our tri-plane generation. Specifically, we can concatenate the tri-plane features in the channel dimension as in [9], which forms y = (y_uv ⊕ y_wu ⊕ y_vw) ∈ R^{H×W×3C}, and employ a well-designed 2D U-Net to model the data distribution through the denoising diffusion process. However, such a baseline model produces 3D avatars with severe artifacts. We conjecture that the generation artifacts come from the incompatibility between the tri-plane representation and the 2D U-Net. As shown in Figure 4(a), one can intuitively regard the tri-plane features as the projections of the neural volume towards the frontal, bottom, and side views, respectively. Hence, the channel-wise concatenation of these orthogonal planes for CNN processing is problematic because the planes are not spatially aligned. To better handle the tri-plane representation, we make the following efforts.

3D-aware convolution. Using a CNN to process channel-wise concatenated tri-planes will mix features that are theoretically uncorrelated in 3D. One simple yet effective way to address this is to spatially roll out the tri-plane features. As shown in Figure 4(b), we concatenate the tri-plane features horizontally, yielding ỹ = hstack(y_uv, y_wu, y_vw) ∈ R^{H×3W×C}. Such a feature roll-out allows independent processing of the feature planes. For simplicity, we subsequently use y to denote this input form by default. However, the tri-plane roll-out hampers cross-plane communication, while 3D generation requires the synergy of the tri-plane generation.

To better process the tri-plane features, we need an efficient 3D operator that operates on the tri-plane rather than treating it as a plain 2D input. To achieve this, we propose the 3D-aware convolution to effectively process the tri-plane features while respecting their 3D relationship.
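Below is a minimal sketch of the tri-plane roll-out and the noise-prediction objective of Eq. (2), written against a generic denoiser interface; the `denoiser` module and the (alpha, sigma) schedule arrays are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def roll_out(y_uv, y_wu, y_vw):
    """Horizontally concatenate the three feature planes:
    three (B, C, H, W) tensors -> (B, C, H, 3W), i.e. hstack(y_uv, y_wu, y_vw)."""
    return torch.cat([y_uv, y_wu, y_vw], dim=-1)

def split_planes(y):
    """Inverse of roll_out: (B, C, H, 3W) -> three (B, C, H, W) planes."""
    return torch.chunk(y, 3, dim=-1)

def diffusion_training_step(denoiser, y0, alphas, sigmas):
    """One training objective in the style of Eq. (2).

    denoiser: epsilon-prediction network taking (y_t, t).
    y0:       clean rolled-out tri-plane batch, (B, C, H, 3W).
    alphas, sigmas: 1-D tensors defining the noise schedule.
    """
    B = y0.shape[0]
    t = torch.randint(0, len(alphas), (B,), device=y0.device)
    eps = torch.randn_like(y0)
    a = alphas[t].view(B, 1, 1, 1)
    s = sigmas[t].view(B, 1, 1, 1)
    y_t = a * y0 + s * eps                 # forward noising
    eps_pred = denoiser(y_t, t)            # predict the added noise
    return F.mse_loss(eps_pred, eps)       # L_simple
```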
A point on a certain feature plane actually corresponds to an axis-aligned 3D line in the volume, which also has two corresponding line projections in the other planes, as shown in Figure 4(a). The features at these corresponding locations essentially describe the same 3D primitive and should be learned synchronously. However, such a 3D relationship is neglected when employing plain 2D convolution for tri-plane processing. As such, our 3D-aware convolution explicitly introduces this 3D inductive bias by attending the features of each plane to the corresponding row/column of the other planes. In this way, we enable 3D processing capability with 2D CNNs. This 3D-aware convolution applied to the tri-plane representation is, in fact, a generic way to simplify 3D convolutions that were previously too costly to compute when modeling high-resolution 3D volumes.

The 3D-aware convolution is depicted in Figure 4(b). Ideally, the computation for y_uv would attend to all elements of the corresponding row/column, i.e., of y_wu and y_vw, from the other planes. For parallel computing, we simplify this and aggregate the row/column elements. Specifically, we apply axis-wise pooling to y_wu and y_vw, yielding a row vector y_{wu→u} ∈ R^{1×W×C} and a column vector y_{vw→v} ∈ R^{H×1×C}, respectively. For each point of y_uv, we can easily access the corresponding element in the aggregated vectors. We expand the aggregated vectors to the original 2D dimension (i.e., replicating the column vector along the row dimension, and vice versa) and thus derive y_{(·)u}, y_{v(·)} ∈ R^{H×W×C}. We can then perform 2D convolution on the channel-wise concatenation of the feature maps, i.e., Conv2D(y_uv ⊕ y_{(·)u} ⊕ y_{v(·)}), because y_uv is now spatially aligned with the aggregation of the corresponding elements from the other planes. The computation for y_vw and y_wu is conducted likewise. The 3D-aware convolution greatly enhances cross-plane communication, and we empirically observe reduced artifacts and improved generation of thin structures like hair strands.

Latent conditioning. We further propose to learn a latent vector to orchestrate the tri-plane generation. As shown in Figure 2, we additionally train an image encoder E to extract a semantic latent vector serving as the conditional input of the base diffusion model, so essentially the whole framework is an autoencoder. To be specific, we extract the latent vector from the frontal view of each training subject, i.e., z = E_θ(x_front) ∈ R^512, and the diffusion model conditioned on z is trained to reconstruct the tri-plane of the same subject. We use adaptive group normalization (AdaGN) to modulate the activations of the diffusion model, where z is injected into every residual block; in this way, the features of the orthogonal planes are synchronously generated according to a shared latent.

The latent conditioning not only leads to higher generation quality but also permits a disentangled latent space, thus allowing semantic editing of generated results. To achieve better editability, we adopt a frozen CLIP image encoder [48] that shares its latent space with text prompts. We will show how the learned model produces controllable text-guided generation results.

Another notable benefit of latent conditioning is that it allows classifier-free guidance [25], a technique typically used to boost sampling quality in conditional generation. When training the diffusion model, we randomly zero the latent embedding with 20% probability, thus adapting the diffusion decoder to unconditional generation. During inference, we can steer the model toward better generation sampling according to

ε̂_θ(y, z) = λ ε_θ(y, z) + (1 − λ) ε_θ(y),   (3)

where ε_θ(y, z) and ε_θ(y) are the conditional and unconditional ε-predictions respectively, and λ > 0 specifies the guidance strength.

Our latent conditioned base model thus supports both unconditional generation and the conditional generation that is used for portrait inversion. To account for full diversity during unconditional sampling, we additionally train a diffusion model to model the distribution of the latent z, whereas the latent y_T describes the residual variation. We include this latent diffusion model in Figure 2.

3.3. Diffusion Tri-plane Upsampler

To generate high-fidelity 3D structures, we further train a diffusion super-resolution (SR) model to increase the tri-plane resolution from 64 × 64 to 256 × 256. At this stage, the diffusion upsampler is conditioned on the low-resolution (LR) tri-plane y^LR. Different from the base model training, we parameterize the diffusion upsampler y_θ^HR(y_t^HR, y^LR, t) to predict the high-resolution (HR) ground truth y_0^HR instead of the added noise ε. The 3D-aware convolution is utilized in each residual block to enhance detail synthesis.

Following prior cascaded image generation works, we apply condition augmentation to reduce the domain gap between the output of the base model and the LR conditional input used for SR training. We carefully tune the tri-plane augmentation with a combination of random downsampling, Gaussian blurring, and Gaussian noise, making the rendered augmented LR tri-plane resemble the base model's rendering output as much as possible.

Nonetheless, we find that a tri-plane restoration with a lower ℓ2 distance to the ground truth does not necessarily correspond to a satisfactory image rendering. Hence, we need to directly constrain the rendered image. Specifically, we obtain the rendered image x̂^HR ∈ R^{256×256×3} from the predicted tri-plane ŷ_0^HR with x̂^HR = R(G_θMLP(ŷ_0^HR)), and we further penalize the perceptual loss [27] between this rendered result and the ground truth:

L_perc = E_{t, x̂} [ Σ_l || Ψ_l(x̂) − Ψ_l(x) ||²_2 ],   (4)

where Ψ_l denotes the multi-level feature extraction of a pretrained VGG. Usually, volume rendering requires stratified sampling along each ray, which is computationally prohibitive for high-resolution rendering. Hence, we compute L_perc on random 112 × 112 image patches with higher sampling importance on the face region. Compared with prior 3D-aware GANs that require rendering full images, our 3D-aware SR easily scales to high resolutions because patch-wise training with direct supervision is possible.

Modeling high-frequency details and thin structures is particularly challenging in volumetric rendering. Thus, at this stage, we jointly train a convolutional refiner [67] on our data which complements the missing details of the NeRF rendering, ultimately producing compelling 1024 × 1024 image outputs.
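As a concrete illustration of the 3D-aware convolution from Section 3.2, here is a minimal PyTorch sketch of the y_uv branch: axis-wise pooling of the other two planes, expansion back to H × W, channel-wise concatenation, and a plain 2D convolution. The mean pooling, the assumed (w, u)/(v, w) axis layout, and the single output convolution are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Aware3DConvUV(nn.Module):
    """3D-aware convolution for the y_uv plane.

    The corresponding row/column information of y_wu and y_vw is aggregated by
    axis-wise (mean) pooling, broadcast back to H x W, concatenated with y_uv
    along channels, and processed with an ordinary 2D convolution. The y_wu and
    y_vw branches would be handled analogously.
    """
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(3 * channels, channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, y_uv, y_wu, y_vw):
        # Assumed layouts: y_wu has w along H and u along W; y_vw has v along H
        # and w along W, all shaped (B, C, H, W).
        B, C, H, W = y_uv.shape
        y_wu_to_u = y_wu.mean(dim=2, keepdim=True)   # (B, C, 1, W): row over u
        y_vw_to_v = y_vw.mean(dim=3, keepdim=True)   # (B, C, H, 1): column over v
        # Expand the aggregated vectors back to the full plane resolution.
        y_dot_u = y_wu_to_u.expand(B, C, H, W)
        y_v_dot = y_vw_to_v.expand(B, C, H, W)
        # Channel-wise concatenation of spatially aligned features, then Conv2D.
        return self.conv(torch.cat([y_uv, y_dot_u, y_v_dot], dim=1))
```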
Figure 5. Unconditional generation samples by our Rodin model. We visualize the mesh extracted from the generated density field.

Table 1. Quantitative comparison with baseline methods.

        Pi-GAN   GIRAFFE   EG3D   Autoencoder   Ours
FID ↓   78.3     64.6      40.5   67.4          26.1

Table 2. Ablation study of the proposed components.

Model configuration        FID ↓
A. Baseline                39.2
B. + Latent conditioning   37.4
C. + Tri-plane roll-out    28.4
D. + 3D-aware conv         26.1

Figure 6. Latent interpolation results for generated avatars.

4. Experiments

4.1. Implementation Details

To train our 3D diffusion, we obtain 100K 3D avatars with random combinations of identities, expressions, hairstyles, and accessories using a synthetic engine [69]. For each avatar, we render 300 multi-view images with known camera poses, which are sufficient for a high-quality radiance field reconstruction. The tri-planes for our generation have a dimension of 256 × 256 × 32 for each feature plane. We optimize a shared MLP decoder when fitting the first 1,000 subjects. This decoder consists of 4 fully connected layers and is fixed when fitting the following subjects. Thus, different subjects are fitted separately on distributed servers.

Both the base and upsampling diffusion networks adopt a U-Net architecture to process the rolled-out tri-plane features. We apply full attention at the 8², 16², and 32² scales within the network and adopt 3D-aware convolution at the higher scales to enhance the details. While we generate 256² tri-planes with the diffusion upsampler, we also render images and compute the image loss at 512² resolution, with a convolutional refinement further enhancing the details to 1024². For more details about the network architecture and training strategies, please refer to our Appendix.

4.2. Unconditional Generation Results

Figure 5 shows several samples generated by the Rodin model, demonstrating the capability to synthesize high-quality 3D renderings with impressive details, e.g., glasses and hairstyles. To reflect the geometry, we extract a mesh from the generated density field using marching cubes, which demonstrates high-fidelity geometry.
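To illustrate the mesh extraction mentioned above, the following is a minimal sketch using scikit-image's marching cubes on a density grid sampled from a generated radiance field; `query_density` is a hypothetical callable returning σ for a batch of points, and the resolution, bound, and iso-level values are illustrative only.

```python
import numpy as np
from skimage import measure

def extract_mesh(query_density, resolution=256, bound=1.0, level=10.0):
    """Run marching cubes on a dense sigma grid sampled from the avatar volume.

    query_density: callable mapping an (N, 3) float32 array of points in
                   [-bound, bound]^3 to an (N,) array of densities.
    Returns vertices (in world coordinates) and triangle faces.
    """
    xs = np.linspace(-bound, bound, resolution, dtype=np.float32)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    # For large resolutions, query the grid in chunks to limit memory usage.
    sigma = query_density(grid.reshape(-1, 3)).reshape(
        resolution, resolution, resolution)
    # Iso-surface at a chosen density level; tune `level` to the density scale.
    verts, faces, _, _ = measure.marching_cubes(sigma, level=level)
    # Map voxel indices back to world coordinates.
    verts = verts / (resolution - 1) * (2 * bound) - bound
    return verts, faces
```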
Figure 7. Qualitative comparison with state-of-the-art approaches: Pi-GAN [10], GIRAFFE [42], EG3D [9], and our Rodin model.

More uncurated samples are shown in the Appendix. We also explore the interpolation of the latent condition z between two generated avatars, as shown in Figure 6, where we observe consistently high-quality interpolation results with smooth appearance transitions.

4.3. Comparison

We compare our method with state-of-the-art 3D-aware GANs, e.g., Pi-GAN [10], GIRAFFE [42], and EG3D [9], which learn to produce neural radiance fields from 2D image supervision. Moreover, we implement an auto-encoder baseline, which leverages the multi-view supervision and reconstructs the radiance field from the latent. We differ from this baseline by using the powerful diffusion-based decoder with 3D-aware designs. We adapt the official implementations of the prior works to 360-degree generation and retrain them on the same dataset.

We use the FID score [22] to measure the quality of the image renderings. As per [30], we use features extracted from the CLIP model to compute FID, which we find correlates better with perceptual quality. Specifically, we compute the FID score using 5K generated samples. The quantitative comparison is shown in Table 1, where we see that the Rodin model achieves a significantly lower FID than the others.

The visual comparison in Figure 7 shows a clear quality superiority of our Rodin model over prior art. Our method gives visually pleasing multi-view renderings with high-quality geometry, e.g., for glasses and hair, whereas 3D-aware GANs produce more artifacts due to the geometry ambiguity caused by the simple use of image supervision.

4.4. Analysis of the Rodin model

Both 3D-aware convolution and latent conditioning are crucial for 3D synthesis. To prove this, we conduct the ablation study shown in Table 2. We start from a baseline that uses a plain 2D CNN to process channel-wise concatenated tri-plane features following [9]. With latent conditioning, we achieve a lower FID. Feeding the network with rolled-out tri-plane features significantly reduces the FID score because the tri-planes are no longer improperly mingled. The proposed 3D-aware convolution further improves the synthesis quality, especially for thin structures like hair and cloth texture. More visual results regarding these ablations can be found in the Appendix.

Figure 8. Hierarchical generation progressively improves results (base diffusion → diffusion upsampler → convolutional refiner).

Hierarchical generation is critical for high-fidelity results. One significant benefit of this approach is that we can train different diffusion models dedicated to different scales in a supervised manner, as opposed to end-to-end synthesis with an image loss. This also enables patch-wise training without the need to render full images. Thus, hierarchical training allows high-resolution avatar generation without suffering from prohibitive memory issues.
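Since the comparison in Section 4.3 computes FID on CLIP features, here is a minimal sketch of the Fréchet distance between two precomputed feature sets; the feature extraction itself (a CLIP image encoder run over the 5K generated renderings and over the reference set) is assumed to be done elsewhere and is not shown.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b, eps=1e-6):
    """FID-style Frechet distance between two sets of feature vectors.

    feats_a, feats_b: (N, D) arrays of image features (e.g., CLIP embeddings)
    for generated and reference renderings respectively.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    diff = mu_a - mu_b
    # Matrix square root of the covariance product; add jitter if it is
    # numerically ill-conditioned.
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(cov_a.shape[0]) * eps
        covmean, _ = linalg.sqrtm((cov_a + offset) @ (cov_b + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```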
Figure 9. Results of (a) portrait inversion and (b) text-to-avatar generation, with text prompts such as “Smiling young woman in glasses” and “A woman with afro hair in red wearing”.

Table 3. Ablation study of the tri-plane upsampling strategy.

Tri-plane-level loss   Image-level loss   Cond. augment   FID ↓
✓                                                          48.5
✓                      ✓                                   38.6
✓                      ✓                  ✓                26.1
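The "image-level loss" entry in Table 3 refers to the patch-wise perceptual loss of Eq. (4). A minimal sketch is given below; the multi-level feature extractors (e.g., truncated stages of a pretrained VGG) are passed in as callables, and uniform patch sampling, rather than the face-weighted sampling used in the paper, is a simplification.

```python
import torch
import torch.nn.functional as F

def patch_perceptual_loss(rendered, target, feature_layers,
                          patch_size=112, num_patches=4):
    """Patch-wise perceptual loss between a rendered image and its ground truth.

    rendered, target: (B, 3, H, W) image tensors in the value range expected by
                      the feature extractors.
    feature_layers:   list of callables, each mapping an image batch to a
                      feature map (e.g., truncated VGG stages).
    """
    B, _, H, W = rendered.shape
    loss = rendered.new_zeros(())
    for _ in range(num_patches):
        top = torch.randint(0, H - patch_size + 1, (1,)).item()
        left = torch.randint(0, W - patch_size + 1, (1,)).item()
        r_patch = rendered[:, :, top:top + patch_size, left:left + patch_size]
        t_patch = target[:, :, top:top + patch_size, left:left + patch_size]
        for phi in feature_layers:
            loss = loss + F.mse_loss(phi(r_patch), phi(t_patch))
    return loss / num_patches
```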

Figure 8 shows the progressive quality improvement after the base diffusion, the diffusion upsampler, and the convolutional refinement, respectively. It can be seen that the diffusion upsampler is critical, largely enhancing the synthesis quality, while the convolutional refinement further adds delicate details.

Diffusion upsampling training strategies. When training the tri-plane upsampler, we parameterize the model to predict the clean tri-plane ground truth at each diffusion reversion step. Meanwhile, conditioning augmentation is of great significance for letting the model generalize to the coarse-level tri-planes generated by the base model. Besides, we observe that enforcing an image-level loss is beneficial to the final perceptual quality. The effectiveness of these strategies is quantitatively justified in Table 3.

4.5. Applications

3D portrait from a single image. We can hallucinate a 3D avatar from a single portrait by conditioning the base generator on the CLIP image embedding of that input image. Note that our goal is different from face/head reconstruction [13, 51]; rather, it is to conveniently produce a personalized digital avatar for users. As shown in Figure 9(a), the generated avatars keep the main characteristics of the portrait, e.g., expression, hairstyle, glasses, etc., while being 360-degree renderable.

Text-to-avatar generation. Another natural way to customize avatars is to use language guidance. To do this, we train a text-conditioned diffusion model to generate the CLIP image embedding used to semantically control the avatar generation. We use a subset of the LAION-400M dataset [55] containing portrait-text pairs to train this model. As shown in Figure 9(b), one can finely customize the avatars using detailed text descriptions.

Text-based avatar customization. We can also semantically edit generated avatars using text prompts. For a generated avatar with the CLIP image embedding z_i, we can obtain a direction δ in CLIP's text embedding space based on prompt engineering [45]. We assume colinearity between CLIP's image and text embeddings, thus we obtain the manipulated embedding as z_i + δ, which is used to condition the generative process. As shown in Figure 10, we can achieve a wide variety of disentangled and meaningful control faithful to the text prompt.

Figure 10. Results of text-guided avatar manipulation with prompts such as “Blond hair”, “Smiling”, “With sunglasses”, and “Short hair”.
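A minimal sketch of the text-guided manipulation described above: the editing direction is the difference between target- and source-prompt text embeddings, added to the avatar's CLIP image embedding before re-running the conditional generator. The embeddings are assumed to be precomputed with a CLIP model, and `generate_avatar` is a hypothetical handle to the latent-conditioned diffusion pipeline, not an API defined in the paper.

```python
import torch

def edit_latent(z_image, text_embed_src, text_embed_tgt, strength=1.0):
    """Shift a CLIP image embedding along a text-derived direction.

    z_image:        (512,) CLIP image embedding of the generated avatar.
    text_embed_src: (512,) CLIP text embedding of a neutral prompt,
                    e.g. "a photo of a person".
    text_embed_tgt: (512,) CLIP text embedding of the edit prompt,
                    e.g. "a photo of a person with blond hair".
    """
    delta = text_embed_tgt - text_embed_src   # editing direction in CLIP space
    return z_image + strength * delta         # assumes image/text colinearity

# Hypothetical usage: condition the diffusion model on the edited embedding.
# z_edit = edit_latent(z_image, e_src, e_tgt)
# avatar = generate_avatar(latent=z_edit)     # placeholder pipeline call
```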
5. Conclusion

From our experiments, we observe that the Rodin model is a powerful generative model for 3D avatars. The model also allows users to customize avatars from a portrait or text, thus significantly lowering the barrier to personalized avatar creation. While this paper focuses only on avatars, the main ideas behind the Rodin model are applicable to diffusion models for general 3D scenes. Indeed, the prohibitive computational cost has been a challenge for 3D content creation. An efficient 2D architecture for performing coherent and 3D-aware diffusion in 3D is an important step toward tackling this challenge. For future work, it would be fruitful to improve the sampling speed of the 3D diffusion model and to study jointly leveraging the ample 2D data to mitigate the 3D data bottleneck.

References

[11] Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022. 2
[12] Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10673–10683, 2022. 2
[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and
Leonidas Guibas. Learning representations and generative [13] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde
models for 3d point clouds. In International conference on Jia, and Xin Tong. Accurate 3d face reconstruction with
machine learning, pages 40–49. PMLR, 2018. 2 weakly-supervised learning: From single image to image set.
In Proceedings of the IEEE/CVF Conference on Computer
[2] Alankrita Aggarwal, Mamta Mittal, and Gopi Battineni.
Vision and Pattern Recognition Workshops, pages 0–0, 2019.
Generative adversarial network: An overview of theory and
8
applications. International Journal of Information Manage-
ment Data Insights, 1(1):100004, 2021. 1 [14] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. Advances in Neural Informa-
[3] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P tion Processing Systems, 34:8780–8794, 2021. 1, 2, 12
Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded
[15] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming
anti-aliased neural radiance fields. In Proceedings of the
transformers for high-resolution image synthesis. In Pro-
IEEE/CVF Conference on Computer Vision and Pattern
ceedings of the IEEE/CVF conference on computer vision
Recognition, pages 5470–5479, 2022. 3, 13
and pattern recognition, pages 12873–12883, 2021. 2
[4] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Wal-
[16] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen,
ter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent
Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and
Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al.
Sanja Fidler. Get3d: A generative model of high quality
Gaudi: A neural architect for immersive 3d scene genera-
3d textured shapes learned from images. arXiv preprint
tion. arXiv preprint arXiv:2207.13751, 2022. 2
arXiv:2209.11163, 2022. 2
[5] Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Willcocks. Deep generative modelling: A comparative re- Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
view of vaes, gans, normalizing flows, energy-based and Yoshua Bengio. Generative adversarial networks. Commu-
autoregressive models. arXiv preprint arXiv:2103.04922, nications of the ACM, 63(11):139–144, 2020. 2
2021. 2
[18] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian
[6] Andrew Brock, Theodore Lim, James M Ritchie, and Theobalt. Stylenerf: A style-based 3d-aware genera-
Nick Weston. Generative and discriminative voxel mod- tor for high-resolution image synthesis. arXiv preprint
eling with convolutional neural networks. arXiv preprint arXiv:2110.08985, 2021. 2
arXiv:1608.04236, 2016. 2 [19] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo
[7] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec-
Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. tor quantized diffusion model for text-to-image synthesis. In
Learning gradient fields for shape generation. In European Proceedings of the IEEE/CVF Conference on Computer Vi-
Conference on Computer Vision, pages 364–381. Springer, sion and Pattern Recognition, pages 10696–10706, 2022. 1,
2020. 2 2
[8] Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, [20] Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch,
Michael Zollhoefer, Shun-Suke Saito, Stephen Lombardi, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-
Shih-En Wei, Danielle Belko, Shoou-I Yu, et al. Authen- Escolano, Rohit Pandey, Jason Dourgarian, et al. The re-
tic volumetric avatars from a phone scan. ACM Transactions lightables: Volumetric performance capture of humans with
on Graphics (TOG), 41(4):1–19, 2022. 1 realistic relighting. ACM Transactions on Graphics (ToG),
[9] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, 38(6):1–19, 2019. 2
Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J [21] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im-
geometry-aware 3d generative adversarial networks. In Pro- age editing with cross attention control. arXiv preprint
ceedings of the IEEE/CVF Conference on Computer Vision arXiv:2208.01626, 2022. 2
and Pattern Recognition, pages 16123–16133, 2022. 2, 3, 4, [22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
7, 13 Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
[10] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, two time-scale update rule converge to a local nash equilib-
and Gordon Wetzstein. pi-gan: Periodic implicit generative rium. Advances in neural information processing systems,
adversarial networks for 3d-aware image synthesis. In Pro- 30, 2017. 7
ceedings of the IEEE/CVF conference on computer vision [23] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang,
and pattern recognition, pages 5799–5809, 2021. 2, 7 Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben
Poole, Mohammad Norouzi, David J Fleet, et al. Imagen [39] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik,
video: High definition video generation with diffusion mod- Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
els. arXiv preprint arXiv:2210.02303, 2022. 1, 2 Representing scenes as neural radiance fields for view syn-
[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- thesis. Communications of the ACM, 65(1):99–106, 2021. 2,
sion probabilistic models. Advances in Neural Information 3
Processing Systems, 33:6840–6851, 2020. 1, 2, 4 [40] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav
[25] Jonathan Ho and Tim Salimans. Classifier-free diffusion Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and
guidance. arXiv preprint arXiv:2207.12598, 2022. 5 Mark Chen. Glide: Towards photorealistic image generation
[26] Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, By- and editing with text-guided diffusion models. arXiv preprint
oung Jin Choi, and Nam Soo Kim. Diff-tts: A denois- arXiv:2112.10741, 2021. 2
ing diffusion model for text-to-speech. arXiv preprint [41] Alexander Quinn Nichol and Prafulla Dhariwal. Improved
arXiv:2104.01409, 2021. 2 denoising diffusion probabilistic models. In International
[27] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Conference on Machine Learning, pages 8162–8171. PMLR,
losses for real-time style transfer and super-resolution. In 2021. 4
European conference on computer vision, pages 694–711. [42] Michael Niemeyer and Andreas Geiger. Giraffe: Represent-
Springer, 2016. 5 ing scenes as compositional generative neural feature fields.
In Proceedings of the IEEE/CVF Conference on Computer
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based
Vision and Pattern Recognition, pages 11453–11464, 2021.
generator architecture for generative adversarial networks.
2, 7
In Proceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 4401–4410, 2019. 2 [43] Jeong Joon Park, Peter Florence, Julian Straub, Richard
Newcombe, and Steven Lovegrove. Deepsdf: Learning con-
[29] Diederik P Kingma, Max Welling, et al. An introduction to
tinuous signed distance functions for shape representation.
variational autoencoders. Foundations and Trends® in Ma-
In Proceedings of the IEEE/CVF conference on computer vi-
chine Learning, 12(4):307–392, 2019. 2
sion and pattern recognition, pages 165–174, 2019. 2
[30] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo
[44] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien
Aila, and Jaakko Lehtinen. The role of imagenet
Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo
classes in fr\’echet inception distance. arXiv preprint
Martin-Brualla. Nerfies: Deformable neural radiance fields.
arXiv:2203.06026, 2022. 7
In Proceedings of the IEEE/CVF International Conference
[31] Ruihui Li, Xianzhi Li, Ka-Hei Hui, and Chi-Wing Fu. Sp- on Computer Vision, pages 5865–5874, 2021. 1
gan: Sphere-guided 3d shape generation and manipulation.
[45] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or,
ACM Transactions on Graphics (TOG), 40(4):1–12, 2021. 2
and Dani Lischinski. Styleclip: Text-driven manipulation of
[32] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy stylegan imagery. In Proceedings of the IEEE/CVF Inter-
Liang, and Tatsunori B Hashimoto. Diffusion-lm im- national Conference on Computer Vision, pages 2085–2094,
proves controllable text generation. arXiv preprint 2021. 8
arXiv:2205.14217, 2022. 2
[46] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden-
[33] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv
Geiger. Towards unsupervised learning of generative mod- preprint arXiv:2209.14988, 2022. 2
els for 3d controllable image synthesis. In Proceedings of
[47] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizad-
the IEEE/CVF conference on computer vision and pattern
wongsa, and Supasorn Suwajanakorn. Diffusion autoen-
recognition, pages 5871–5880, 2020. 2
coders: Toward a meaningful and decodable representation.
[34] Ming-Yu Liu, Xun Huang, Jiahui Yu, Ting-Chun Wang, and In IEEE Conference on Computer Vision and Pattern Recog-
Arun Mallya. Generative adversarial networks for image and nition (CVPR), 2022. 13
video synthesis: Algorithms and applications. Proceedings [48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
of the IEEE, 109(5):839–862, 2021. 1 Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
[35] Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
Sheikh. Deep appearance models for face rendering. ACM ing transferable visual models from natural language super-
Transactions on Graphics (ToG), 37(4):1–13, 2018. 2 vision. In International Conference on Machine Learning,
[36] Ilya Loshchilov and Frank Hutter. Decoupled weight de- pages 8748–8763. PMLR, 2021. 5
cay regularization. In International Conference on Learning [49] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
Representations, ICLR, 2019. 12 and Mark Chen. Hierarchical text-conditional image gen-
[37] Shitong Luo and Wei Hu. Diffusion probabilistic models for eration with clip latents. arXiv preprint arXiv:2204.06125,
3d point cloud generation. In Proceedings of the IEEE/CVF 2022. 1, 2, 13
Conference on Computer Vision and Pattern Recognition, [50] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,
pages 2837–2845, 2021. 2 Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
[38] Nelson Max. Optical models for direct volume rendering. Zero-shot text-to-image generation. In International Confer-
IEEE Transactions on Visualization and Computer Graphics, ence on Machine Learning, pages 8821–8831. PMLR, 2021.
1(2):99–108, 1995. 3 2
[51] Eduard Ramon, Gil Triginer, Janna Escur, Albert Pumarola, [64] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara
Jaime Garcia, Xavier Giro-i Nieto, and Francesc Moreno- Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ra-
Noguer. H3d-net: Few-shot high-fidelity 3d head reconstruc- mamoorthi, Jonathan Barron, and Ren Ng. Fourier features
tion. In Proceedings of the IEEE/CVF International Confer- let networks learn high frequency functions in low dimen-
ence on Computer Vision, pages 5620–5629, 2021. 8 sional domains. Advances in Neural Information Processing
[52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Systems, 33:7537–7547, 2020. 3
Patrick Esser, and Björn Ommer. High-resolution image [65] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical vari-
synthesis with latent diffusion models. In Proceedings of ational autoencoder. Advances in Neural Information Pro-
the IEEE/CVF Conference on Computer Vision and Pattern cessing Systems, 33:19667–19679, 2020. 2
Recognition, pages 10684–10695, 2022. 1, 2 [66] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong
[53] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Chen, Qifeng Chen, and Fang Wen. Pretraining is all
Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine you need for image-to-image translation. arXiv preprint
tuning text-to-image diffusion models for subject-driven arXiv:2205.12952, 2022. 2
generation. arXiv preprint arXiv:2208.12242, 2022. 2 [67] Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. To-
[54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala wards real-world blind face restoration with generative fa-
Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed cial prior. In Proceedings of the IEEE/CVF Conference
Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, on Computer Vision and Pattern Recognition, pages 9168–
Rapha Gontijo Lopes, et al. Photorealistic text-to-image 9178, 2021. 6
diffusion models with deep language understanding. arXiv [68] Daniel Watson, William Chan, Ricardo Martin-Brualla,
preprint arXiv:2205.11487, 2022. 1, 2 Jonathan Ho, Andrea Tagliasacchi, and Mohammad
[55] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Norouzi. Novel view synthesis with diffusion models. arXiv
Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo preprint arXiv:2210.04628, 2022. 2
Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: [69] Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian
Open dataset of clip-filtered 400 million image-text pairs. Dziadzio, Thomas J Cashman, and Jamie Shotton. Fake it
arXiv preprint arXiv:2111.02114, 2021. 8 till you make it: face analysis in the wild using synthetic
[56] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas data alone. In Proceedings of the IEEE/CVF international
Geiger. Graf: Generative radiance fields for 3d-aware im- conference on computer vision, pages 3681–3691, 2021. 2,
age synthesis. Advances in Neural Information Processing 3, 6, 12
Systems, 33:20154–20166, 2020. 2 [70] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and
Josh Tenenbaum. Learning a probabilistic latent space of
[57] Katja Schwarz, Axel Sauer, Michael Niemeyer, Yiyi Liao,
object shapes via 3d generative-adversarial modeling. Ad-
and Andreas Geiger. Voxgraf: Fast 3d-aware image synthesis
vances in neural information processing systems, 29, 2016.
with sparse voxel grids. arXiv preprint arXiv:2206.07695,
2
2022. 2
[71] Jianfeng Xiang, Jiaolong Yang, Yu Deng, and Xin Tong.
[58] Zifan Shi, Sida Peng, Yinghao Xu, Yiyi Liao, and Yujun
Gram-hd: 3d-consistent image generation at high resolu-
Shen. Deep generative models on 3d representations: A sur-
tion with generative radiance manifolds. arXiv preprint
vey. arXiv preprint arXiv:2210.15663, 2022. 2
arXiv:2206.07255, 2022. 2
[59] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An,
[72] Chulin Xie, Chuxin Wang, Bo Zhang, Hao Yang, Dong
Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual,
Chen, and Fang Wen. Style-based point generator with ad-
Oran Gafni, et al. Make-a-video: Text-to-video generation
versarial rendering for point cloud completion. In Proceed-
without text-video data. arXiv preprint arXiv:2209.14792,
ings of the IEEE/CVF Conference on Computer Vision and
2022. 1
Pattern Recognition, pages 4619–4628, 2021. 2
[60] Vincent Sitzmann, Julien Martel, Alexander Bergman, David [73] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Run-
Lindell, and Gordon Wetzstein. Implicit neural representa- sheng Xu, Yue Zhao, Yingxia Shao, Wentao Zhang, Bin
tions with periodic activation functions. Advances in Neural Cui, and Ming-Hsuan Yang. Diffusion models: A compre-
Information Processing Systems, 33:7462–7473, 2020. 2 hensive survey of methods and applications. arXiv preprint
[61] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- arXiv:2209.00796, 2022. 2
hishek Kumar, Stefano Ermon, and Ben Poole. Score-based [74] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Goj-
generative modeling through stochastic differential equa- cic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: La-
tions. arXiv preprint arXiv:2011.13456, 2020. 1, 2 tent point diffusion models for 3d shape generation. arXiv
[62] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue preprint arXiv:2210.06978, 2022. 2
Wang, and Yebin Liu. Ide-3d: Interactive disentangled edit- [75] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong
ing for high-resolution 3d-aware portrait synthesis. ACM Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin:
Transactions on Graphics, 2022. 2 Transformer-based gan for high-resolution image genera-
[63] Attila Szabó, Givi Meishvili, and Paolo Favaro. Unsu- tion. In Proceedings of the IEEE/CVF Conference on Com-
pervised generative 3d shape learning from natural images. puter Vision and Pattern Recognition, pages 11304–11314,
arXiv preprint arXiv:1910.00287, 2019. 2 2022. 2
[76] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation
and completion through point-voxel diffusion. In Proceed-
ings of the IEEE/CVF International Conference on Com-
puter Vision, pages 5826–5835, 2021. 2
[77] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. Cips-3d: A 3d-aware generator of gans based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021. 2

Appendix

A. Background of Diffusion Models

Diffusion models produce data by reversing a gradual noising process. The forward noising process is a Markov chain that corrupts the data by gradually adding random noise for steps t = 1, · · · , T. Each step in the forward process is a Gaussian transition q(x_t | x_{t−1}) := N(√(1 − β_t) x_{t−1}, β_t I), where {β_t}_{t=0}^T is a usually pre-defined variance schedule. Furthermore, the noisy latent variable x_t can be derived from x_0 directly as:

x_t = √α_t x_0 + √(1 − α_t) z,   z ∼ N(0, I),   (5)

where α_t := ∏_{s=1}^t (1 − β_s). When T is large enough, α_T gets close to 0 and the last latent variable x_T is nearly an isotropic Gaussian distribution.

To sample data from the given distribution, we can reverse the noising process by learning a denoising model ε_θ(x_t, t). The denoising model ε_θ(x_t, t) starts from the Gaussian noise x_T and iteratively reduces the noise for t = T − 1, · · · , 0. Specifically, it takes the noisy latent variable x_t at each timestep t and predicts the corresponding noise ε with a minimal mean squared error:

min_θ E_{x_0 ∼ p(x_0), z ∼ N(0,I), t} || ε_θ(x_t, t) − z ||²_2.   (6)

With the learned denoising model, the data can be sampled with the following reverse diffusion process:

x_{t−1} = (1 / √(1 − β_t)) ( x_t − (β_t / √(1 − α_t)) ε_θ(x_t, t) ) + σ_t z,   (7)

where z ∼ N(0, I) is a randomly sampled noise and σ_t is the variance of the added noise.

B. Implementation Details

B.1. Architectural Design and Training Details

Our base diffusion model adopts the U-Net architecture from [14] with a channel number of 192, while we make several modifications including the tri-plane roll-out and 3D-aware convolution, as discussed in Section 3.2. To orchestrate the tri-plane generation and enable semantic editing, we also introduce a condition encoder, a fixed CLIP ViT-B/32 image encoder, to map a reference image to a semantic latent vector. The upsample diffusion model is a U-Net-like model, but for efficiency we apply only one upsample layer that directly upscales the feature maps from 64 to 256, as shown in Figure 11. The tri-plane roll-out and 3D-aware convolution are utilized in each residual block. When training the upsample model, we apply condition augmentation on the tri-planes to reduce the domain gap as described in Section 3.3. Specifically, we degrade the ground-truth 256 × 256 tri-planes with a random combination of downscaling, Gaussian blur, and Gaussian noise.

Figure 11. Architecture of the upsample diffusion model.

We utilize the AdamW optimizer [36] with a batch size of 48 and a learning rate of 5e-5 for the base diffusion model, and with a batch size of 16 and a learning rate of 5e-5 for the upsample diffusion model. We also apply an exponential moving average (EMA) with a rate of 0.9999 during training. We set the diffusion steps to 1,000 for the base model and 100 for the upsample model, with a linear noise schedule. During inference, we sample 100 diffusion steps for both the base model and the upsample model. All experiments are performed on NVIDIA Tesla 32G-V100 GPUs.

B.2. Tri-plane Fitting

Our framework learns 3D avatar generation from explicit 3D representations obtained by fitting multi-view images. However, a multi-view consistent, diverse, high-quality, and large-scale dataset of face images is difficult and expensive to collect. Images collected from the Web have no guarantee of multi-view consistency and suffer from privacy and copyright risks. Regarding this, we turn to synthetic techniques that can render novel 3D portraits by randomly combining assets manually created by artists. We leverage the Blender synthetic pipeline [69] that generates human faces along with random sampling from a large collection of hair, clothing, expressions, and accessories. Hence, we create 100K synthetic individuals independently and for each render 300 multi-view images at a resolution of 256 × 256.

For tri-plane fitting, we learn 256 × 256 × 32 × 3 spatial features for each person along with a lightweight MLP decoder consisting of 4 fully connected layers, as shown in Figure 12. We randomly initialize the tri-plane features and MLP weights. During fitting, we apply random rescaling (downsampling to a resolution in [64, 256] followed by an upsampling back to 256) to ensure that the MLP decoder is robust to multi-resolution tri-plane features.
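The following is a minimal sketch of Eqs. (5)–(7) from Appendix A: forward noising of a clean sample and one ancestral reverse step given an ε-prediction network. The linear β schedule values and the choice σ_t = √β_t are illustrative assumptions rather than the paper's exact settings.

```python
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear variance schedule and the cumulative alpha_t = prod_s (1 - beta_s)."""
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas

def forward_noise(x0, t, alphas):
    """Eq. (5): x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) z.
    t is a (B,) tensor of per-sample timesteps."""
    z = torch.randn_like(x0)
    a = alphas[t].view(-1, *([1] * (x0.dim() - 1)))
    return a.sqrt() * x0 + (1.0 - a).sqrt() * z, z

@torch.no_grad()
def reverse_step(eps_model, x_t, t, betas, alphas):
    """Eq. (7): one ancestral sampling step from x_t to x_{t-1}.
    Here t is a single integer timestep applied to the whole batch."""
    beta_t = betas[t]
    alpha_t = alphas[t]
    eps = eps_model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device))
    mean = (x_t - beta_t / (1.0 - alpha_t).sqrt() * eps) / (1.0 - beta_t).sqrt()
    if t == 0:
        return mean
    sigma_t = beta_t.sqrt()              # a common choice for the step noise
    return mean + sigma_t * torch.randn_like(x_t)
```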
B.2. Tri-plane Fitting

Our framework learns 3D avatar generation from explicit 3D representations obtained by fitting multi-view images. However, a multi-view consistent, diverse, high-quality and large-scale dataset of face images is difficult and expensive to collect. Images collected from the Web have no guarantee of multi-view consistency and suffer privacy and copyright risks. Regarding this, we turn to synthetic techniques that can render novel 3D portraits by randomly combining assets manually created by artists. We leverage the Blender synthetic pipeline [69] that generates human faces along with random sampling from a large collection of hair, clothing, expressions and accessories. Hence, we create 100K synthetic individuals independently and for each render 300 multi-view images at a resolution of 256 × 256.

For tri-plane fitting, we learn 256 × 256 × 32 × 3 spatial features for each person along with a lightweight MLP decoder consisting of 4 fully connected layers, as shown in Figure 12. We randomly initialize the tri-plane features and MLP weights. During fitting, we apply random rescaling (downsampling to a resolution in [64, 256] followed by upsampling back to 256) to ensure that the MLP decoder is robust to multi-resolution tri-plane features. To enable scalable and efficient fitting, we first optimize the shared 4-layer MLP decoder when fitting the first 1,000 subjects, and this decoder is fixed when fitting the following subjects. Thus different subjects are fitted separately on distributed servers.

Figure 12. Architecture of the MLP decoder: point features pass through four fully connected layers to predict a density via a softplus activation and, together with the view direction, a color via a sigmoid activation.
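A sketch of the decoder in Figure 12 is given below; the hidden width and the exact way the view direction is injected are assumptions, not the released architecture.

```python
# Sketch of the lightweight tri-plane decoder in Figure 12: four fully connected
# layers map interpolated point features to a softplus density and, with the view
# direction, a sigmoid RGB color. Hidden width (64) and view-direction injection
# are assumptions.
import torch
import torch.nn as nn

class TriplaneDecoder(nn.Module):
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(               # the four FC layers of Figure 12
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Linear(hidden + 3, 3)   # hidden features + view direction

    def forward(self, point_feat, view_dir):
        h = self.backbone(point_feat)                                   # (N, hidden)
        sigma = nn.functional.softplus(self.density_head(h))            # density
        rgb = torch.sigmoid(self.color_head(torch.cat([h, view_dir], dim=-1)))
        return rgb, sigma
```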
For the multi-view images {x}^{Nv} of a given subject, where x ∈ R^{H0 × W0 × 3}, we minimize the mean squared error LMSE between the image rendered via volumetric rendering, i.e., x̂ = R(c, σ), and the corresponding ground-truth image. Moreover, we introduce additional regularizers to improve the fitting quality. To be specific, we manage to reduce the “floating” artifact by enforcing the sparsity loss Lsparse which penalizes the ℓ1 magnitude of the predicted density, the smoothness loss Lsmooth [9] that encourages a smooth density field, as well as the distortion loss Ldist [3] that encourages compact rays with localized weight distribution.
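Put together, the per-subject fitting objective could be sketched as follows; the loss weights are illustrative assumptions, and smoothness_loss / distortion_loss stand in for the regularizers of [9] and [3].

```python
# Sketch of the tri-plane fitting objective: L_MSE plus sparsity, smoothness, and
# distortion regularizers. Loss weights are illustrative assumptions, and
# smoothness_loss / distortion_loss are stand-ins for the losses of [9] and [3].
import torch

def fitting_loss(rendered, target, density, weights_along_ray,
                 smoothness_loss, distortion_loss,
                 w_sparse=1e-4, w_smooth=1e-3, w_dist=1e-3):
    l_mse = ((rendered - target) ** 2).mean()        # L_MSE on the rendered image
    l_sparse = density.abs().mean()                  # L_sparse: l1 on predicted density
    l_smooth = smoothness_loss(density)              # L_smooth, cf. [9]
    l_dist = distortion_loss(weights_along_ray)      # L_dist, cf. [3]
    return l_mse + w_sparse * l_sparse + w_smooth * l_smooth + w_dist * l_dist
```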
B.3. Text-based Avatar Customization

As shown in Section 4.5, the Rodin model can edit generated avatars with text prompts. For a generated avatar with a conditioned latent zi, we can obtain an editing direction δ = ETclip(Ttgt) − ETclip(Tsrc) in the text embedding space of CLIP based on prompt engineering. For instance, we can choose the source text Tsrc from general descriptions such as “a photo of a person” and “a portrait of a person”, and use a target text Ttgt such as “a photo of a person with blond hair” or “a photo of a smiling person”. As we assume collinearity between CLIP’s image and text embeddings, we can obtain the manipulated embedding as zi + δ, which is used to generate the edited avatar.
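The editing direction can be computed as in the sketch below, using the open-source OpenAI CLIP package for illustration; averaging over several prompt templates is an assumption about the prompt engineering.

```python
# Sketch of the text-based editing direction: delta is the difference of CLIP text
# embeddings for a target and a source prompt, added to the avatar's conditioned
# latent z_i. Averaging over prompt templates is an assumption.
import clip
import torch

model, _ = clip.load("ViT-B/32")

def text_direction(src_prompts, tgt_prompts):
    with torch.no_grad():
        e_src = model.encode_text(clip.tokenize(src_prompts)).float().mean(0)
        e_tgt = model.encode_text(clip.tokenize(tgt_prompts)).float().mean(0)
    return e_tgt - e_src

# Example usage (z_i is the avatar's conditioned latent):
# delta = text_direction(["a photo of a person"],
#                        ["a photo of a person with blond hair"])
# z_edit = z_i + delta   # condition the base diffusion on z_edit
```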
B.4. Latent Diffusion for Unconditional Sampling

As discussed in Section 3.2, our base diffusion model supports both unconditional generation and conditional generation. To account for the full diversity during unconditional sampling, we additionally train a diffusion model to model the distribution of the latent z. The latent diffusion adopts a 20-layer MLP network [47] with a hidden channel of 2048 that iteratively predicts the latent code z ∈ R^512 from random Gaussian noise. We set the diffusion steps as 1,000 with a linear noise schedule. We utilize the AdamW optimizer with a batch size of 96 and a learning rate of 4e-5, and also apply the exponential moving average (EMA) with a rate of 0.9999 during training.
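A sketch of such a latent denoiser is shown below; the paper only specifies 20 layers and a hidden width of 2048, so the activation, timestep embedding, and residual connections are assumptions.

```python
# Sketch of the latent denoising network: a deep MLP predicting the noise on the
# 512-d latent code z at each diffusion step. SiLU activation, additive timestep
# embedding, and residual blocks are assumptions beyond the stated 20 layers /
# hidden width 2048.
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    def __init__(self, z_dim=512, hidden=2048, depth=20, num_steps=1000):
        super().__init__()
        self.time_embed = nn.Embedding(num_steps, hidden)
        self.input = nn.Linear(z_dim, hidden)
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.SiLU(), nn.Linear(hidden, hidden)) for _ in range(depth)]
        )
        self.output = nn.Linear(hidden, z_dim)

    def forward(self, z_t, t):
        h = self.input(z_t) + self.time_embed(t)      # inject the timestep
        for block in self.blocks:
            h = h + block(h)                          # residual MLP blocks (assumed)
        return self.output(h)                         # predicted noise on z
```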
B.5. Text-to-avatar Generation

As shown in Section 4.5, we perform text-to-avatar generation by training a text-conditioned diffusion model that generates an image embedding from a text embedding in the CLIP space. We adopt the network architecture from [49] and train it on a subset of the LAION-400M dataset containing 100K portrait-text pairs. We set the diffusion steps as 1,000 with a linear noise schedule. We utilize the AdamW optimizer with a batch size of 96 and a learning rate of 4e-5, and also apply the exponential moving average (EMA) with a rate of 0.9999 during training.

C. Additional Ablation Study and Analysis

C.1. Tri-plane Settings

Choice of tri-plane resolution. To analyze the impact of the tri-plane resolution, we experiment with tri-plane resolutions from the set {64, 128, 256, 384, 512} to fit 1024 × 1024 images and show the results in Figure 13. Overall, the fitting quality increases with the tri-plane resolution. Empirically, we find that a 256 × 256 tri-plane is strong enough to represent a subject. Considering the memory cost, we thus choose to utilize 256 × 256 tri-planes in our experiments.

Figure 13. Effect of tri-plane resolution for tri-plane fitting (PSNR versus tri-plane resolution).

Number of images for fitting. We also explore how many images are needed to achieve a high-quality fitting. As shown in Figure 14, the fitting quality gets almost saturated when using 300 different views for the neural tri-plane reconstruction.

Figure 14. Effect of image numbers for tri-plane fitting (PSNR versus number of images).

C.2. Visualization of Different Diffusion Steps

Diffusion models generate samples by gradually removing noise for t ∈ [T, 0], and analyzing these intermediate results gives an in-depth understanding of the generation process. We thus demonstrate the generated results over the reverse process in Figure 15, where we render the predicted tri-plane of our base diffusion, x̂0, at each time step t. Notwithstanding that our diffusion is performed in the tri-plane feature space, the reverse process is similar to that in image space, where the coarse structure appears first and fine details appear in the last iterative steps.

Figure 15. Visualization of intermediate generation results at different time steps.
C.3. Effect of 3D-aware Convolution

By rolling out tri-plane feature maps and applying 3D-aware convolution, the Rodin model performs 3D-aware diffusion using an efficient 2D architecture. As analyzed in Section 3.2, tri-plane roll-out and 3D-aware convolution are crucial for high-fidelity results, especially for thin structures such as hair strands and clothing details, by enhancing cross-plane communication. To validate the impact of these designs on high-quality tri-plane generation, we modify the upsample diffusion model with different configurations and remove the convolution refinement while keeping the base diffusion fixed. Figure 16 demonstrates that, with roll-out and 3D-aware convolution, the full model shows a clear improvement over the base model.

Figure 16. Both tri-plane roll-out and 3D-aware convolution are crucial for high-fidelity results (Base, + Roll-out, + 3D-aware conv).
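One way to realize this cross-plane communication is sketched below: for each plane of the rolled-out tri-plane, the features of the other two planes are pooled along the shared axis and broadcast back before a standard 2D convolution. The mean-pooling aggregation and the single convolution layer are simplifying assumptions rather than the exact operator used in Rodin.

```python
# Simplified sketch of a 3D-aware convolution on a rolled-out tri-plane: each plane
# is convolved together with axis-pooled context from the other two planes, so a
# plain 2D kernel sees features that project to the same 3D location. Mean pooling
# and a single 3x3 conv are assumptions.
import torch
import torch.nn as nn

class Aware3DConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.ModuleList([nn.Conv2d(3 * channels, channels, 3, padding=1)
                                   for _ in range(3)])

    def forward(self, p_xy, p_xz, p_yz):
        # p_xy: (B, C, X, Y), p_xz: (B, C, X, Z), p_yz: (B, C, Y, Z)
        X, Y, Z = p_xy.shape[2], p_xy.shape[3], p_xz.shape[3]
        ctx_xy = torch.cat([p_xy,
                            p_xz.mean(3, keepdim=True).expand(-1, -1, X, Y),
                            p_yz.mean(3, keepdim=True).transpose(2, 3).expand(-1, -1, X, Y)], 1)
        ctx_xz = torch.cat([p_xz,
                            p_xy.mean(3, keepdim=True).expand(-1, -1, X, Z),
                            p_yz.mean(2, keepdim=True).expand(-1, -1, X, Z)], 1)
        ctx_yz = torch.cat([p_yz,
                            p_xy.mean(2, keepdim=True).transpose(2, 3).expand(-1, -1, Y, Z),
                            p_xz.mean(2, keepdim=True).expand(-1, -1, Y, Z)], 1)
        return self.conv[0](ctx_xy), self.conv[1](ctx_xz), self.conv[2](ctx_yz)
```

Because every 2D location in one plane then sees a summary of the 3D points that project onto it in the other two planes, a plain 2D kernel can propagate information across planes.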
C.4. Nearest Neighbors Analysis

The Rodin model enables a hassle-free creation experience of an unlimited number of avatars from scratch, each avatar being distinct. Figure 17 shows the nearest neighbors of some generated samples in the main paper, which indicates that the model does not simply memorize the training data.

C.5. Conditional Avatar Generation

Quantitative metrics. On top of unconditional generation, we can also hallucinate a 3D avatar from a single reference image by conditioning the base generator with the CLIP image embedding of that input image. We evaluate the conditional generation on 1K test subjects, each with 300 images from different views. Table 4 reports the metrics between the reconstructed images and the ground-truth synthetic images.

Scale    w/o CFG   1.2      1.5      3.0      6.0
PSNR     24.06     24.21    24.07    24.05    24.15
SSIM     0.795     0.794    0.792    0.782    0.775
LPIPS    0.128     0.121    0.133    0.141    0.146

Table 4. Quantitative results of conditional avatar reconstruction.

Classifier-free guidance. Our model supports classifier-free guidance (CFG) sampling at inference time, a technique typically used to boost the sampling quality in conditional generation. Table 4 evaluates the generation quality with different scales of classifier-free guidance in terms of PSNR, SSIM and LPIPS.
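Classifier-free guidance amounts to extrapolating the unconditional noise prediction towards the conditional one, as in the sketch below; null_cond stands for a learned unconditional embedding and the names are illustrative.

```python
# Sketch of classifier-free guidance at sampling time: the noise prediction is
# extrapolated from the unconditional prediction towards the conditional one with
# a guidance scale s. `null_cond` is a placeholder for the unconditional embedding.
def guided_eps(model, x_t, t, cond, null_cond, scale=1.5):
    eps_cond = model(x_t, t, cond)          # conditioned on the CLIP image embedding
    eps_uncond = model(x_t, t, null_cond)   # unconditional prediction
    return eps_uncond + scale * (eps_cond - eps_uncond)
```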
Figure 17. Nearest neighbors in the training data according to CLIP feature similarity: for each generated sample, the top-3 nearest neighbors are shown.

D. Additional Visual Results

Figure 18 and Figure 19 show more random samples generated by the Rodin model, demonstrating the capability to synthesize high-quality 3D renderings with impressive details. To reflect the geometry, we also extract the mesh from the generated density field using marching cubes, which demonstrates high-fidelity geometry. Figure 20 gives uncurated generation results, which possess visually pleasing quality and diversity. We also explore the interpolation of the latent condition z between two generated avatars, as shown in Figure 21, where we observe consistent interpolation results with a smooth appearance transition. Figure 22 shows additional results of creating 3D portraits from a single reference image.

E. Societal Impact

The Rodin model aims to enable a low-cost, fast and customizable creation experience of 3D digital avatars that refer to the traditional avatars manually created by 3D artists, as opposed to photorealistic avatars. The reason for focusing on digital avatars is twofold. On the one hand, digital avatars are widely used in movies, games, the metaverse, and the 3D industry in general. On the other hand, the available digital avatar data is very scarce, as each avatar has to be painstakingly created by a specialized 3D artist using a sophisticated creation pipeline, especially for modeling hair and facial hair.

Rather than collecting real photos, all our training images are rendered by Blender. Such synthetic data can mitigate the privacy and copyright concerns that exist in real face collection. Another advantage of using synthetic data is that we can control the variation and diversity of the rendered images, eliminating the data bias in existing face datasets. Also, digital avatars are easier to distinguish from real people compared with photo-realistic avatars, hindering misuse for impersonating real persons. Nonetheless, the 3D portrait reconstruction and text-based avatar customization could still be misused for maliciously spreading disinformation, like all other AI-based content generation models. We caution that the high-quality renderings produced by our model may potentially be misused, and viable solutions to avoid this include adding tags or watermarks when distributing the generated photos.

This work successfully generalizes the power of diffusion models from 2D to 3D and promises to offer a new design tool for 3D artists, which could significantly reduce the costs of the traditional 3D modeling and rendering pipeline. Next, we intend to explore the possibility of modeling general 3D scenes using the same technique and to investigate novel applications such as Lego and architectural designs.
Figure 18. Unconditional generation samples by our Rodin model. We visualize the mesh extracted from the generated density field.
Figure 19. Unconditional generation samples by our Rodin model.
Figure 20. Uncurated generation results by our Rodin model.

Figure 21. Latent interpolation results for generated avatars.


Figure 22. Additional results of portrait inversion.
