Rodin: A Generative Model For Sculpting 3D Digital Avatars Using Diffusion
Figure 1. Our diffusion model, Rodin, can produce high-fidelity 3D avatars, as shown in the first row. Our model also supports 3D avatar generation from a single portrait or text prompt, while permitting text-based semantic manipulation (second row; example prompts include "In white suit", "Smile with glasses", "With red hair", and "bearded man with curly hair in black leather jacket"). See the webpage for video demos.
Abstract

This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings the much-needed computational efficiency while preserving the integrity of diffusion in 3D by using 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair like beards. We also demonstrate 3D avatar generation from image or text as well as text-guided editability.

1. Introduction

Generative models [2, 34] are one of the most promising ways to analyze and synthesize visual data, including 2D images and 3D models. At the forefront of generative modeling is the diffusion model [14, 24, 61], which has shown phenomenal generative power for images [19, 49, 52, 54] and videos [23, 59]. Indeed, we are witnessing a 2D content-creation revolution driven by the rapid advances of diffusion and generative modeling.

In this paper, we aim to expand the applicability of diffusion such that it can serve as a generative model for 3D digital avatars. We use "digital avatars" to refer to the traditional avatars manually created by 3D artists, as opposed to the recently emerging photorealistic avatars [8, 44]. The reason for focusing on digital avatars is twofold. On the one hand, digital avatars are widely used in movies, games, the metaverse, and the 3D industry in general. On the other hand, the available digital avatar data is very scarce as each avatar has to be painstakingly created by a specialized 3D artist using a sophisticated creation pipeline [20, 35], especially for modeling hair and facial hair. All this leads to a compelling scenario for generative modeling.
We present a diffusion model for automatically producing digital avatars represented as neural radiance fields [39], with each point describing the color radiance and density of the 3D volume. The core challenge in generating neural volume avatars is the prohibitive memory and computational cost of the rich details required by high-quality avatars. Without rich details, our results will always be somewhat "toy-like". To tackle this challenge, we develop Rodin, the roll-out diffusion network. We take a neural volume represented as multiple 2D feature maps, roll out these maps into a single 2D feature plane, and perform 3D-aware diffusion within this plane. Specifically, we use the tri-plane representation [9], which represents a volume by three axis-aligned orthogonal feature planes. By simply rolling out the feature maps, the Rodin model can perform 3D-aware diffusion using an efficient 2D architecture, drawing power from the model's three key ingredients below.

The first is the 3D-aware convolution. The 2D CNN processing used in conventional 2D diffusion cannot handle well the feature maps originating from orthogonal planes. Rather than treating the features as plain 2D input, the 3D-aware convolution explicitly accounts for the fact that a 2D feature in one plane (of the tri-plane) is a projection from a piece of 3D data and is hence intrinsically associated with the same data's projected features in the other two planes. To encourage cross-plane communication, we involve all these associated features in the convolution, bridging them together and synchronizing their detail synthesis according to their 3D relationship.

The second key ingredient is latent conditioning. We use a latent vector to orchestrate the feature generation so that it is globally coherent across the 3D volume, leading to better-quality avatars and enabling their semantic editing. We do this by using the avatars in the training dataset to train an additional image encoder which extracts a semantic latent vector serving as the conditional input to the diffusion model. This latent conditioning essentially acts as an autoencoder in orchestrating the feature generation. For semantic editability, we adopt a frozen CLIP image encoder [48] that shares the latent space with text prompts.

The final key ingredient is hierarchical synthesis. We start by generating a low-resolution tri-plane (64 × 64), followed by a diffusion-based upsampling that yields a higher resolution (256 × 256). When training the diffusion upsampler, it is instrumental to penalize an image-level loss that we compute in a patch-wise manner.

Taken together, the above ingredients work in concert to enable the Rodin model to coherently perform diffusion in 3D with an efficient 2D architecture. The Rodin model is trained with a multi-view image dataset of 100K avatars of diverse identities, hairstyles, and clothing created by 3D artists [69]. Several application scenarios are thus supported. We can use the model to generate an unlimited number of avatars from scratch, each avatar being different from the others as well as from the ones in the training data. As shown in Figure 1, we can generate highly detailed avatars with realistic hairstyles and facial hair styled as beards, mustaches, goatees, and sideburns. Hairstyle and facial hair are essential for representing people's unique personal identities, yet these styles have been notoriously difficult to generate well with existing approaches. The Rodin model also allows avatar customization, with the resulting avatar capturing the visual characteristics of the person portrayed in an image or a textual description. Finally, our framework supports text-guided semantic editing. The strong generative power of diffusion shows great promise in 3D modeling.

2. Related Work

The state of generative modeling [5, 14, 15, 28, 50, 65, 75] has seen rapid progress in past years. Diffusion models [14, 24, 61, 73] have recently shown unprecedented generative ability and compositional power. The most remarkable success happens in text-to-image synthesis [19, 40, 49, 52, 54], which serves as a foundation model and enables various appealing applications [21, 53, 66] previously unattainable. While diffusion models have been successfully applied to different modalities [11, 23, 26, 32], their generative capability is much less explored in 3D generation, with only a few attempts at modeling 3D primitives [37, 74, 76].

Early 3D generation works [58] rely on either GANs [17] or VAEs [29] to model the distribution of 3D shape representations like voxel grids [6, 70], point clouds [1, 7, 31, 72], meshes [33, 63], and implicit neural representations [43, 60]. However, existing works have not demonstrated the ability to produce complex 3D assets yet. Concurrent to this work, Bautista et al. [4] train a diffusion model to generate the latent vector that encodes the radiance field [39] of synthetic scenes, yet this work only produces coarse 3D geometry. In comparison, we propose a hierarchical 3D generation framework with effective 3D-aware operators, offering unprecedented 3D detail synthesis.

Another line of work learns 3D-aware generation by utilizing richly available 2D data. 3D-aware GANs [9, 10, 12, 16, 18, 42, 56, 57, 62, 71, 77] have recently attracted significant research interest; they are trained to produce radiance fields with image-level distribution matching. However, these methods suffer from the instabilities and mode collapse of GAN training, and it is still challenging to attain authentic avatars that can be viewed from large angles. Concurrently, there are a few attempts to use diffusion models for this problem. Daniel et al. [68] propose to synthesize novel views with a pose-conditioned 2D diffusion model, yet the results are not intrinsically 3D. Ben et al. [46] optimize a radiance field using the supervision from a pretrained text-to-image diffusion model and produce impressive 3D objects of diverse genres.
Figure 2. An overview of our Rodin model. We derive the latent z via the mapping from image, text, or random noise, which is used to control the base diffusion model to generate 64 × 64 tri-planes. We train another diffusion model to upsample this coarse result to 256 × 256 tri-planes that are used to render the final multi-view images with volumetric rendering and convolutional refinement. The operators used in the diffusion models are designed to be 3D-aware.
Nonetheless, pretrained 2D generative networks only offer limited 3D knowledge and inevitably lead to blurry 3D results. A high-quality generation framework in 3D space is still highly desired.

3. Approach

Unlike prior methods that learn 3D-aware generation from a 2D image collection, we aim to learn 3D avatar generation using multi-view renderings from the Blender synthetic pipeline [69]. Rather than treating the multi-view images of the same subject as individual training samples, we fit a volumetric neural representation for each avatar, which is used to explain all the observations from different viewpoints. Thereafter we use diffusion models to characterize the distribution of these 3D instances. Our diffusion-based 3D generation is a hierarchical process: we first utilize a diffusion model to generate the coarse geometry, followed by a diffusion upsampler for detail synthesis. As illustrated in Figure 2, the whole 3D portrait generation comprises multiple training stages, which we detail in the following subsections.

3.1. Robust 3D Representation Fitting

To train a generative network with explicit 3D supervision, we need an expressive 3D representation that accounts for multi-view images and meets the following requirements. First, we need an explicit representation that is amenable to generative network processing. Second, we require a compact representation that is memory efficient; otherwise, it would be too costly to store a myriad of such 3D instances for training. Furthermore, we expect fast representation fitting, since hours of optimization as in vanilla NeRF [39] would make it unaffordable to generate the abundant 3D training data required for generative modeling.

Taking these into consideration, we adopt the tri-plane representation proposed by [9] to model the neural radiance field of 3D avatars. Specifically, the 3D volume is factorized into three axis-aligned orthogonal feature planes, denoted by y_uv, y_wu, y_vw ∈ R^{H×W×C}, each of which has spatial resolution H × W and C channels. Compared to voxel grids, the tri-plane representation offers a considerably smaller memory footprint without sacrificing expressivity. Rich 3D information is explicitly memorized in the tri-plane features, and one can query the feature of a 3D point p ∈ R^3 by projecting it onto each plane and aggregating the retrieved features, i.e., y_p = y_uv(p_uv) + y_wu(p_wu) + y_vw(p_vw). With such a positional feature, one can derive the density σ ∈ R^+ and the view-dependent color c ∈ R^3 of each 3D location given the viewing direction d ∈ S^2 with a lightweight MLP decoder G_{θ_MLP}, which can be formulated as

    c(p, d), σ(p) = G_{θ_MLP}(y_p, ξ(y_p), d).    (1)

Here, we apply the Fourier embedding operator ξ(·) [64] on the queried feature rather than on the spatial coordinate. The tri-plane features and the MLP decoder are optimized such that the rendering of the neural radiance field matches the multi-view images {x}_{N_v} of the given subject, where x ∈ R^{H_0×W_0×3}. We enforce the rendered image given by volumetric rendering [38], i.e., x̂ = R(c, σ), to match the corresponding ground truth with a mean squared error loss. Besides, we introduce sparse, smooth, and compact regularizers to reduce the "floating" artifacts [3] in free space. For more tri-plane fitting details, please refer to the Appendix.
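To make the tri-plane query and the decoding of Eq. (1) concrete, below is a minimal PyTorch sketch: it bilinearly samples the three feature planes at the projected coordinates, sums the results to obtain y_p, and decodes color and density with a small MLP (softplus density and sigmoid color, as in Figure 12). The tensor layout, the Fourier embedding details, and all helper names are our own illustrative assumptions, not the authors' code.

```python
# Minimal sketch (PyTorch) of the tri-plane feature query and MLP decoding of Eq. (1).
import torch
import torch.nn.functional as F

def query_triplane(planes, points):
    """planes: (3, C, H, W) feature planes [y_uv, y_wu, y_vw];
    points: (N, 3) coordinates normalized to [-1, 1]."""
    u, v, w = points[:, 0], points[:, 1], points[:, 2]
    # Project the 3D point onto the three orthogonal planes.
    coords = torch.stack([
        torch.stack([u, v], -1),   # p_uv
        torch.stack([w, u], -1),   # p_wu
        torch.stack([v, w], -1),   # p_vw
    ])                              # (3, N, 2)
    feats = F.grid_sample(planes, coords.unsqueeze(2),       # (3, C, N, 1)
                          mode='bilinear', align_corners=True)
    return feats.squeeze(-1).sum(0).t()                      # (N, C): y_p is the sum of the three samples

class TriplaneDecoder(torch.nn.Module):
    """Lightweight MLP G_theta mapping (y_p, xi(y_p), d) to (color, density);
    the paper uses a shared 4-layer decoder, this 3-layer version is a sketch."""
    def __init__(self, c=32, hidden=128, n_freq=4):
        super().__init__()
        self.n_freq = n_freq
        in_dim = c + 2 * n_freq * c + 3          # y_p, Fourier embedding of y_p, view direction
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4))          # RGB + density

    def fourier(self, x):                         # xi(.) applied to the queried feature
        freqs = 2.0 ** torch.arange(self.n_freq, device=x.device) * torch.pi
        ang = x[..., None] * freqs                # (N, C, n_freq)
        return torch.cat([ang.sin(), ang.cos()], -1).flatten(1)

    def forward(self, y_p, d):
        h = self.mlp(torch.cat([y_p, self.fourier(y_p), d], -1))
        color = torch.sigmoid(h[:, :3])           # view-dependent color c
        density = F.softplus(h[:, 3:])            # density sigma >= 0
        return color, density
```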
While prior per-scene reconstruction mainly concerns the fitting quality, our 3D fitting procedure should also consider several key aspects for generation purposes. First, the tri-plane features of different subjects should rigorously reside in the same domain. To achieve this, we adopt a shared MLP decoder when fitting distinct portraits, thus implicitly pushing the tri-plane features to the shared latent space recognizable by the decoder. Second, the MLP decoder has to possess some level of robustness.
Figure 3. (a) 256 × 256 tri-plane; (b) 64 × 64 tri-plane without random scaling; (c) 64 × 64 tri-plane with random scaling. While 256 × 256 tri-planes give good renderings (a), the 64 × 64 variant gives a much worse result (b). Hence, we introduce random scaling during fitting so as to obtain a robust representation that can be effectively rendered at continuous scales (c).

Figure 4. We propose two mechanisms to ensure coherent tri-plane generation. Our 3D-aware convolution considers the 3D relationship in (a) and correlates the associated elements from separate feature planes as shown in (b). In (b), we also visualize the usage of a shared latent code to orchestrate the feature generation.
That is, the decoder should be tolerant to slight perturbations of the tri-plane features, so that one can still obtain plausible results even if the tri-plane features are imperfectly generated. More importantly, the decoder should be robust to varied tri-plane sizes, because the hierarchical 3D generation is trained on multi-resolution tri-plane features. As shown in Figure 3, when solely fitting 256 × 256 tri-planes, the 64 × 64 resolution variant cannot be effectively rendered. To address this, we randomly scale the tri-plane during fitting, which is instrumental in deriving multi-resolution tri-plane features simultaneously with a shared decoder.

3.2. Latent Conditioned 3D Diffusion Model

Now the 3D avatar generation is reduced to learning the distribution of tri-plane features, i.e., p(y), where y = (y_uv, y_wu, y_vw). Such generative modeling is non-trivial since y is highly dimensional. We leverage diffusion models for the task, which have shown compelling quality in complex image modeling.

At a high level, the diffusion model generates y by gradually reversing a Markov forward process. Starting from y_0 ∼ p(y), the forward process q yields a sequence of increasingly noisy latent codes {y_t | t ∈ [0, T]} according to y_t := α_t y_0 + σ_t ε, where ε ∼ N(0, I) is the added Gaussian noise; α_t and σ_t define a noise schedule whose log signal-to-noise ratio λ_t = log[α_t² / σ_t²] decreases linearly with the timestep t. With sufficient noising steps, we reach pure Gaussian noise, i.e., y_T ∼ N(0, I). The generative process corresponds to reversing the above noising process, where the diffusion model is trained to denoise y_t into y_0 for all t using a mean squared error loss. Following [24], better generation quality can be achieved by parameterizing the diffusion model ε̂_θ to predict the added noise:

    L_simple = E_{t, y_0, ε} ‖ε̂_θ(α_t y_0 + σ_t ε, t) − ε‖².    (2)

In practice, our diffusion model training also jointly optimizes the variational lower bound loss L_VLB as suggested in [41], which allows high-quality generation with fewer timesteps. During inference, the stochastic ancestral sampler [24] is used to generate the final samples: it starts from Gaussian noise y_T ∼ N(0, I) and sequentially produces less noisy samples {y_T, y_{T−1}, . . .} until reaching y_0.
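As a concrete illustration of the ε-prediction objective in Eq. (2), a single training step could look like the sketch below. The model interface and the particular α_t/σ_t schedule are assumptions for illustration; the paper only specifies a linearly decreasing log signal-to-noise ratio.

```python
import torch

def diffusion_training_step(model, y0, T=1000):
    """One denoising-diffusion training step on rolled-out tri-planes y0: (B, C, H, 3W).
    `model(y_t, t)` is assumed to predict the added noise (epsilon-parameterization)."""
    B = y0.shape[0]
    t = torch.randint(0, T, (B,), device=y0.device)             # random timesteps
    # Cosine-like schedule as an illustrative choice, not the paper's exact schedule.
    alpha = torch.cos(0.5 * torch.pi * (t.float() + 0.5) / T).view(B, 1, 1, 1)
    sigma = (1.0 - alpha ** 2).sqrt()
    eps = torch.randn_like(y0)
    y_t = alpha * y0 + sigma * eps                               # forward process y_t = a_t y_0 + s_t eps
    eps_hat = model(y_t, t)                                      # predict the added noise
    return ((eps_hat - eps) ** 2).mean()                         # L_simple, Eq. (2)
```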
We first train a base diffusion model to generate the coarse-level tri-planes, e.g., at 64 × 64 resolution. A straightforward approach is to adopt the 2D network structure used in state-of-the-art image-based diffusion models for our tri-plane generation. Specifically, we can concatenate the tri-plane features in the channel dimension as in [9], which forms y = (y_uv ⊕ y_wu ⊕ y_vw) ∈ R^{H×W×3C}, and employ a well-designed 2D U-Net to model the data distribution through the denoising diffusion process. However, such a baseline model produces 3D avatars with severe artifacts. We conjecture that the generation artifacts come from the incompatibility between the tri-plane representation and the 2D U-Net. As shown in Figure 4(a), one can intuitively regard the tri-plane features as projections of the neural volume towards the frontal, bottom, and side views, respectively. Hence, the channel-wise concatenation of these orthogonal planes for CNN processing is problematic because these planes are not spatially aligned. To better handle the tri-plane representation, we make the following efforts.

3D-aware convolution. Using a CNN to process channel-wise concatenated tri-planes causes the mixing of features that are theoretically uncorrelated in 3D. One simple yet effective way to address this is to spatially roll out the tri-plane features. As shown in Figure 4(b), we concatenate the tri-plane features horizontally, yielding ỹ = hstack(y_uv, y_wu, y_vw) ∈ R^{H×3W×C}. Such a feature roll-out allows independent processing of the feature planes. For simplicity, we subsequently use y to denote this input form by default. However, the tri-plane roll-out hampers cross-plane communication, while 3D generation requires the synergy of the tri-plane generation.

To better process the tri-plane features, we need an efficient 3D operator that operates on the tri-plane rather than treating it as plain 2D input. To achieve this, we propose the 3D-aware convolution to effectively process the tri-plane features while respecting their 3D relationship.
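The difference between the two packing schemes is easiest to see in code; the following lines simply contrast channel-wise concatenation with the horizontal roll-out (variable names are ours).

```python
import torch

y_uv = torch.randn(1, 32, 64, 64)   # (B, C, H, W) feature planes of one avatar
y_wu = torch.randn(1, 32, 64, 64)
y_vw = torch.randn(1, 32, 64, 64)

# Channel-wise concatenation (the problematic baseline): the planes overlap
# spatially although they are not spatially aligned in 3D.
y_channel = torch.cat([y_uv, y_wu, y_vw], dim=1)   # (B, 3C, H, W)

# Roll-out: stack the planes side by side so each keeps its own spatial extent.
y_rollout = torch.cat([y_uv, y_wu, y_vw], dim=3)   # (B, C, H, 3W)
```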
A point on a certain feature plane actually corresponds to an axis-aligned 3D line in the volume, which also has two corresponding line projections in the other planes, as shown in Figure 4(a). The features at these corresponding locations essentially describe the same 3D primitive and should be learned synchronously. However, such a 3D relationship is neglected when employing plain 2D convolution for tri-plane processing. As such, our 3D-aware convolution explicitly introduces this 3D inductive bias by attending the features of each plane to the corresponding row/column of the other planes. In this way, we enable 3D processing capability with 2D CNNs. This 3D-aware convolution applied to the tri-plane representation is, in fact, a generic way to simplify 3D convolutions that were previously too costly to compute when modeling high-resolution 3D volumes.

The 3D-aware convolution is depicted in Figure 4(b). Ideally, the compute for y_uv would attend to all elements of the corresponding row/column, i.e., in y_wu and y_vw, from the other planes. For parallel computing, we simplify this and aggregate the row/column elements. Specifically, we apply axis-wise pooling to y_wu and y_vw, yielding a row vector y_{wu→u} ∈ R^{1×W×C} and a column vector y_{vw→v} ∈ R^{H×1×C}, respectively. For each point of y_uv, we can easily access the corresponding element in the aggregated vectors. We expand the aggregated vectors to the original 2D dimension (i.e., replicating the column vectors along the row dimension, and vice versa) and thus derive y_{(·)u}, y_{v(·)} ∈ R^{H×W×C}. Now we can perform a 2D convolution on the channel-wise concatenation of the feature maps, i.e., Conv2D(y_uv ⊕ y_{(·)u} ⊕ y_{v(·)}), because y_uv is now spatially aligned with the aggregation of the corresponding elements from the other planes. The compute for y_vw and y_wu is conducted likewise. The 3D-aware convolution greatly enhances cross-plane communication, and we empirically observe reduced artifacts and improved generation of thin structures like hair strands.
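A minimal sketch of the 3D-aware convolution for the y_uv branch is given below; mean pooling as the axis-wise aggregation, the axis ordering, and the module name are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TriplaneAwareConv(nn.Module):
    """Sketch of the 3D-aware convolution for the y_uv branch; the y_wu and y_vw
    branches are computed likewise. Axis conventions here are illustrative."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(3 * c_in, c_out, k, padding=k // 2)

    def forward(self, y_uv, y_wu, y_vw):
        B, C, H, W = y_uv.shape
        # Axis-wise pooling: aggregate y_wu and y_vw along their w-axis, giving one
        # vector per shared u (resp. v) coordinate.
        y_wu_to_u = y_wu.mean(dim=2, keepdim=True)          # (B, C, 1, W): row vector indexed by u
        y_vw_to_v = y_vw.mean(dim=3, keepdim=True)          # (B, C, H, 1): column vector indexed by v
        # Expand back to the full plane so every (u, v) location sees its aggregates.
        y_dot_u = y_wu_to_u.expand(B, C, H, W)
        y_v_dot = y_vw_to_v.expand(B, C, H, W)
        # 2D convolution over the spatially aligned concatenation.
        return self.conv(torch.cat([y_uv, y_dot_u, y_v_dot], dim=1))
```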
Latent conditioning. We further propose to learn a latent vector to orchestrate the tri-plane generation. As shown in Figure 2, we additionally train an image encoder E to extract a semantic latent vector serving as the conditional input of the base diffusion model, so essentially the whole framework is an autoencoder. To be specific, we extract the latent vector from the frontal view of each training subject, i.e., z = E_θ(x_front) ∈ R^512, and the diffusion model conditioned on z is trained to reconstruct the tri-plane of the same subject. We use adaptive group normalization (AdaGN) to modulate the activations of the diffusion model, where z is injected into every residual block; in this way, the features of the orthogonal planes are synchronously generated according to a shared latent.
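The AdaGN-style injection of z can be sketched as a GroupNorm whose scale and shift are predicted from the latent; this is a common conditioning pattern for diffusion U-Nets and is shown only as an assumed illustration of the mechanism described above.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization: modulate normalized activations with a
    scale/shift predicted from the latent z (illustrative sketch)."""
    def __init__(self, channels, z_dim=512, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_scale_shift = nn.Linear(z_dim, 2 * channels)

    def forward(self, h, z):
        # h: (B, C, H, W) activations inside a residual block; z: (B, z_dim)
        scale, shift = self.to_scale_shift(z).chunk(2, dim=1)
        return self.norm(h) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```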
The latent conditioning not only leads to higher generation quality but also permits a disentangled latent space, thus allowing semantic editing of the generated results. To achieve better editability, we adopt a frozen CLIP image encoder [48] that shares the latent space with text prompts. We will show how the learned model produces controllable text-guided generation results.

Another notable benefit of latent conditioning is that it allows classifier-free guidance [25], a technique typically used to boost the sampling quality in conditional generation. When training the diffusion model, we randomly zero the latent embedding with 20% probability, thus adapting the diffusion decoder to unconditional generation. During inference, we can steer the model toward better samples according to

    ε̂_θ(y, z) = λ ε_θ(y, z) + (1 − λ) ε_θ(y),    (3)

where ε_θ(y, z) and ε_θ(y) are the conditional and unconditional ε-predictions, respectively, and λ > 0 specifies the guidance strength.
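In code, the guided prediction of Eq. (3) amounts to blending one conditional and one unconditional forward pass; the model interface below is assumed for illustration.

```python
import torch

def guided_eps(model, y_t, t, z, guidance=1.5):
    """Classifier-free guidance (Eq. 3): extrapolate from the unconditional
    prediction toward the conditional one; guidance > 1 strengthens conditioning."""
    eps_cond = model(y_t, t, z)                       # conditional prediction
    eps_uncond = model(y_t, t, torch.zeros_like(z))   # latent zeroed, as in training dropout
    return guidance * eps_cond + (1 - guidance) * eps_uncond
```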
Our latent-conditioned base model thus supports both unconditional generation and the conditional generation that is used for portrait inversion. To account for full diversity during unconditional sampling, we additionally train a diffusion model to model the distribution of the latent z, whereas the latent y_T describes the residual variation. We include this latent diffusion model in Figure 2.

3.3. Diffusion Tri-plane Upsampler

To generate high-fidelity 3D structures, we further train a diffusion super-resolution (SR) model to increase the tri-plane resolution from 64 × 64 to 256 × 256. At this stage, the diffusion upsampler is conditioned on the low-resolution (LR) tri-plane y^LR. Different from the base model training, we parameterize the diffusion upsampler y_θ^HR(y_t^HR, y^LR, t) to predict the high-resolution (HR) ground truth y_0^HR instead of the added noise ε. The 3D-aware convolution is utilized in each residual block to enhance detail synthesis.

Following prior cascaded image generation works, we apply condition augmentation to reduce the domain gap between the output from the base model and the LR conditional input for SR training. We carefully tune the tri-plane augmentation with a combination of random downsampling, Gaussian blurring, and Gaussian noise, making the rendered augmented LR tri-plane resemble the base model's rendering output as much as possible.
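A possible sketch of this condition augmentation on the LR tri-plane is shown below; the scale range, kernel size, and noise level are illustrative guesses rather than the tuned values used in the paper.

```python
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def augment_lr_triplane(y_lr, max_sigma=0.1):
    """Condition augmentation for SR training: random downsample, Gaussian blur,
    and Gaussian noise applied to the LR tri-plane (B, C, H, 3W)."""
    B, C, H, W3 = y_lr.shape
    # Random downsample and back up, to mimic the coarser base-model output.
    scale = random.uniform(0.5, 1.0)
    y = F.interpolate(y_lr, scale_factor=scale, mode='bilinear', align_corners=False)
    y = F.interpolate(y, size=(H, W3), mode='bilinear', align_corners=False)
    # Gaussian blur with a random sigma.
    y = TF.gaussian_blur(y, kernel_size=3, sigma=random.uniform(0.1, 1.0))
    # Additive Gaussian noise.
    return y + max_sigma * random.random() * torch.randn_like(y)
```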
Nonetheless, we find that a tri-plane restoration with a lower ℓ2 distance to the ground truth does not necessarily correspond to a satisfactory image rendering. Hence, we need to directly constrain the rendered image. Specifically, we obtain the rendered image x̂^HR ∈ R^{256×256×3} from the predicted tri-plane ŷ_0^HR with x̂ = R(G_{θ_MLP}(ŷ_0^HR)), and we further penalize the perceptual loss [27] between this rendered result and the ground truth:

    L_perc = E_{t, x̂} Σ_l ‖Ψ_l(x̂) − Ψ_l(x)‖²₂,    (4)

where Ψ_l denotes multi-level feature extraction using a pretrained VGG.
Figure 5. Unconditional generation samples by our Rodin model. We visualize the mesh extracted from the generated density field.
Usually, volume rendering requires stratified sampling along each ray, which is computationally prohibitive for high-resolution rendering. Hence, we compute L_perc on random 112 × 112 image patches, with higher sampling importance on the face region. Compared with prior 3D-aware GANs that require rendering full images, our 3D-aware SR scales easily to high resolutions thanks to patch-wise training with direct supervision.
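The patch-wise perceptual loss can be sketched with VGG16 features from torchvision as follows; the chosen layers, the uniform patch sampling (the paper samples the face region with higher importance), and the hyperparameters are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PatchPerceptualLoss(nn.Module):
    """Perceptual loss (Eq. 4) computed on random 112x112 crops of the rendered image."""
    def __init__(self, layers=(3, 8, 15, 22), patch=112):
        super().__init__()
        self.features = vgg16(weights="DEFAULT").features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layers, self.patch = set(layers), patch

    def vgg_feats(self, x):
        feats, h = [], x
        for i, layer in enumerate(self.features):
            h = layer(h)
            if i in self.layers:
                feats.append(h)
        return feats

    def forward(self, render, target):
        # Random patch location; importance sampling on the face region would bias
        # these coordinates but is omitted here for brevity.
        B, C, H, W = render.shape
        top = torch.randint(0, H - self.patch + 1, (1,)).item()
        left = torch.randint(0, W - self.patch + 1, (1,)).item()
        r = render[:, :, top:top + self.patch, left:left + self.patch]
        t = target[:, :, top:top + self.patch, left:left + self.patch]
        return sum((a - b).pow(2).mean() for a, b in zip(self.vgg_feats(r), self.vgg_feats(t)))
```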
Modeling high-frequency detail and thin structures is particularly challenging in volumetric rendering. Thus, at this stage, we jointly train a convolutional refiner [67] on our data that complements the missing details of the NeRF rendering, ultimately producing compelling 1024 × 1024 image outputs.

4. Experiments

4.1. Implementation Details

To train our 3D diffusion, we obtain 100K 3D avatars with random combinations of identities, expressions, hairstyles, and accessories using a synthetic engine [69]. For each avatar, we render 300 multi-view images with known camera poses, which are sufficient for a high-quality radiance field reconstruction. The tri-planes for our generation have a dimension of 256 × 256 × 32 for each feature plane. We optimize a shared MLP decoder when fitting the first 1,000 subjects. This decoder consists of 4 fully connected layers and is fixed when fitting the following subjects, so different subjects can be fitted separately on distributed servers.

Both the base and upsampling diffusion networks adopt a U-Net architecture to process the rolled-out tri-plane features. We apply full attention at the 8², 16², and 32² scales within the network and adopt the 3D-aware convolution at higher scales to enhance the details. While we generate 256² tri-planes with the diffusion upsampler, we also render images and compute the image loss at 512² resolution, with a convolutional refinement further enhancing the details to 1024². For more details about the network architecture and training strategies, please refer to our Appendix.

4.2. Unconditional Generation Results

Figure 5 shows several samples generated by the Rodin model, demonstrating the capability to synthesize high-quality 3D renderings with impressive details, e.g., glasses and hairstyles. To reflect the geometry, we extract the mesh from the generated density field using marching cubes, which demonstrates high-fidelity geometry. More uncurated samples are shown in the appendix.
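Mesh extraction from the generated density field can be done with the standard marching-cubes routine in scikit-image, as in this hedged sketch that reuses the tri-plane query helpers from the earlier sketch; the grid resolution, density threshold, and dummy view direction are illustrative.

```python
import torch
from skimage import measure

def density_to_mesh(decoder, planes, resolution=128, level=10.0):
    """Sample the generated density field on a regular grid and run marching cubes.
    `decoder` and `planes` follow the tri-plane query sketch above."""
    axis = torch.linspace(-1, 1, resolution)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing='ij'), dim=-1)   # (R, R, R, 3)
    points = grid.reshape(-1, 3)
    with torch.no_grad():
        y_p = query_triplane(planes, points)
        # Density only; a dummy frontal view direction is used here.
        dirs = torch.tensor([[0.0, 0.0, 1.0]]).expand(points.shape[0], 3)
        _, sigma = decoder(y_p, dirs)
    sigma = sigma.reshape(resolution, resolution, resolution).cpu().numpy()
    verts, faces, normals, _ = measure.marching_cubes(sigma, level=level)
    return verts, faces, normals
```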
Figure 7. Qualitative comparison with state-of-the-art approaches: Pi-GAN [10], GIRAFFE [42], EG3D [9], and our Rodin model.
Figure 12. Architecture of the MLP decoder. (Diagram labels: point features, view direction, softplus density, sigmoid color.)

…multi-resolution tri-plane features. To enable scalable and efficient fitting, we first optimize the shared 4-layer MLP decoder when fitting the first 1,000 subjects, and this decoder is fixed when fitting the following subjects. Thus, different subjects are fitted separately on distributed servers.

For the multi-view images {x}_{N_v} of a given subject, where x ∈ R^{H_0×W_0×3}, we minimize the mean squared error L_MSE between the image rendered via volumetric rendering, i.e., x̂ = R(c, σ), and the corresponding ground-truth image. Moreover, we introduce additional regularizers to improve the fitting quality. Specifically, we reduce the "floating" artifacts by enforcing a sparsity loss L_sparse which penalizes the ℓ1 magnitude of the predicted density, a smoothness loss L_smooth [9] that encourages a smooth density field, and a distortion loss L_dist [3] that encourages compact rays with localized weight distributions.

B.3. Text-based Avatar Customization

As shown in Section 4.5, the Rodin model can edit generated avatars with text prompts. For a generated avatar with a conditioned latent z_i, we can obtain an editing direction δ = E_T^clip(T_tgt) − E_T^clip(T_src) in the text embedding space of CLIP based on prompt engineering. For instance, we can choose the source text T_src from general descriptions such as "a photo of a person" and "a portrait of a person", and use target texts T_tgt such as "a photo of a person with blond hair" and "a photo of a smiling person". As we assume colinearity between CLIP's image and text embeddings, we can obtain the manipulated embedding as z_i + δ, which is used to generate edited avatars.
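The editing direction can be sketched with the open-source CLIP package as below; the prompt templates and the plain addition z_i + δ follow the description above, while the specific CLIP backbone and the absence of any embedding normalization are our assumptions.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # frozen CLIP; ViT-B/32 is an assumed choice

def edit_direction(src_prompts, tgt_prompts):
    """delta = E_T(T_tgt) - E_T(T_src), averaged over prompt templates."""
    with torch.no_grad():
        src = model.encode_text(clip.tokenize(src_prompts).to(device)).float().mean(0)
        tgt = model.encode_text(clip.tokenize(tgt_prompts).to(device)).float().mean(0)
    return tgt - src

delta = edit_direction(
    ["a photo of a person", "a portrait of a person"],
    ["a photo of a person with blond hair"],
)
# z_i is the conditioned latent of a generated avatar; the edited avatar is
# produced by feeding z_i + delta back into the diffusion model:
# z_edited = z_i + delta
```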
B.4. Latent Diffusion for Unconditional Sampling

As discussed in Section 3.2, our base diffusion model supports both unconditional and conditional generation. To account for full diversity during unconditional sampling, we additionally train a diffusion model to model the distribution of the latent z. The latent diffusion adopts a 20-layer MLP network [47] with a hidden channel of 2048 that iteratively predicts the latent code z ∈ R^512 from random Gaussian noise. We set the number of diffusion steps to 1,000 with a linear noise schedule. We use the AdamW optimizer with a batch size of 96 and a learning rate of 4e−5, and also apply an exponential moving average (EMA) with a rate of 0.9999 during training.
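A hedged sketch of such a latent denoiser is given below: a deep residual MLP that takes the noisy latent and a timestep and predicts the denoising target. The depth and hidden width follow the text; the residual structure and timestep embedding are our assumptions.

```python
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    """20-layer MLP (hidden width 2048) that denoises the 512-d latent z."""
    def __init__(self, z_dim=512, hidden=2048, depth=20):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.input = nn.Linear(z_dim, hidden)
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.SiLU(), nn.Linear(hidden, hidden)) for _ in range(depth)]
        )
        self.output = nn.Linear(hidden, z_dim)

    def forward(self, z_t, t):
        # z_t: (B, 512) noisy latent; t: (B,) timestep normalized to [0, 1]
        h = self.input(z_t) + self.time_embed(t[:, None].float())
        for block in self.blocks:
            h = h + block(h)          # simple residual MLP blocks
        return self.output(h)
```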
Figure 13. Effect of tri-plane resolution for tri-plane fitting (PSNR vs. tri-plane resolution in {64, 128, 256, 384, 512}).

Figure 14. Effect of image numbers for tri-plane fitting (PSNR vs. number of images from 100 to 400).

B.5. Text-to-avatar Generation

As shown in Section 4.5, we perform text-to-avatar generation by training a text-conditioned diffusion model that generates an image embedding from a text embedding in the CLIP space. We adopt the network architecture from [49] and train it on a subset of the LAION-400M dataset containing 100K portrait-text pairs. We set the number of diffusion steps to 1,000 with a linear noise schedule. We use the AdamW optimizer with a batch size of 96 and a learning rate of 4e−5, and also apply an exponential moving average (EMA) with a rate of 0.9999 during training.

C. Additional Ablation Study and Analysis

C.1. Tri-plane Settings

Choices of tri-plane resolution. To analyze the impact of tri-plane resolution, we experiment with tri-planes of resolution {64, 128, 256, 384, 512} to fit 1024 × 1024 images and show the results in Figure 13. Overall, the fitting quality increases with the tri-plane resolution. Empirically, we find that a 256 × 256 tri-plane is strong enough to represent a subject. Considering the memory cost, we thus choose to use 256 × 256 tri-planes in our experiments.
Scale | w/o CFG | 1.2   | 1.5   | 3.0   | 6.0
PSNR  | 24.06   | 24.21 | 24.07 | 24.05 | 24.15
SSIM  | 0.795   | 0.794 | 0.792 | 0.782 | 0.775
LPIPS | 0.128   | 0.121 | 0.133 | 0.141 | 0.146
Table 4. Quantitative results of conditional avatar reconstruction.
Figure 17. Nearest neighbors in the training data according to CLIP feature similarity.
…shown in Figure 21, where we observe consistent interpolation results with smooth appearance transitions. Figure 21 also shows additional results of creating 3D portraits from a single reference image.

E. Societal Impact

The Rodin model aims to enable a low-cost, fast, and customizable creation experience for 3D digital avatars, referring to the traditional avatars manually created by 3D artists, as opposed to photorealistic avatars. The reason for focusing on digital avatars is twofold. On the one hand, digital avatars are widely used in movies, games, the metaverse, and the 3D industry in general. On the other hand, the available digital avatar data is very scarce, as each avatar has to be painstakingly created by a specialized 3D artist using a sophisticated creation pipeline, especially for modeling hair and facial hair.

Rather than collecting real photos, all our training images are rendered by Blender. Such synthetic data can mitigate the privacy and copyright concerns that exist in real face collection. Another advantage of using synthetic data is that we can control the variation and diversity of the rendered images, eliminating the data bias in existing face datasets. Also, digital avatars are easier to distinguish from real people compared with photorealistic avatars, hindering misuse for impersonating real persons. Nonetheless, the 3D portrait reconstruction and text-based avatar customization could still be misused for spreading disinformation maliciously, like all other AI-based content generation models. We caution that the high-quality renderings produced by our model may potentially be misused, and viable solutions to avoid this include adding tags or watermarks when distributing the generated photos.

This work successfully generalizes the power of diffusion models from 2D to 3D and promises to offer a new design tool for 3D artists, which could significantly reduce the costs of the traditional 3D modeling and rendering pipeline. Next, we intend to explore the possibility of modeling general 3D scenes using the same technique and to investigate novel applications such as Lego and architectural designs.
Figure 18. Unconditional generation samples by our Rodin model. We visualize the mesh extracted from the generated density field.
Figure 19. Unconditional generation samples by our Rodin model.
Figure 20. Uncurated generation results by our Rodin model.