CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets
LONGWEN ZHANG∗ , ShanghaiTech University and Deemos Technology Co., Ltd., China
ZIYU WANG∗ , ShanghaiTech University and Deemos Technology Co., Ltd., China
QIXUAN ZHANG† , ShanghaiTech University and Deemos Technology Co., Ltd., China
QIWEI QIU, ShanghaiTech University and Deemos Technology Co., Ltd., China
ANQI PANG, ShanghaiTech University, China
HAORAN JIANG, ShanghaiTech University and Deemos Technology Co., Ltd., China
Fig. 1. Against the backdrop of the great digital expanse, CLAY orchestrates a vibrant explosion of 3D creativity, unleashing unlimited imagination.
directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production-ready assets with intricate details. Even first-time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.

CCS Concepts: • Computing methodologies → Artificial intelligence.

Additional Key Words and Phrases: 3D Asset Generation, Multi-modal Control, Physically-based Rendering, Diffusion Transformer, Large-scale Model

1 INTRODUCTION
Three-dimensional (3D) imagination allows us humans to visualize and design structures, spaces, and systems before they are physically constructed. When we were kids, we learned to build objects using this imagination, with materials as simple as clay, stones, or wood sticks, and, for the lucky few, LEGO blocks. To us then, a building formed by a few simple blocks could imaginatively transform into a magnificent castle, and a wood stick attached to a stone into a lightsaber, Jedi's or Sith's. In fact, with a diverse range of pieces in different shapes, sizes, and colors in hand, we once imagined having virtually unlimited capabilities for creating objects. This boundless imagination has fundamentally transformed the entertainment industry, from feature films to computer games, and has led to significant advances in the field of computer graphics, from modeling to rendering. In contrast, our capabilities for producing creative content still fall far behind our imagination. For example, the current 3D creation workflow requires immense artistic expertise and tedious manual labor. An ideal 3D creation tool should conveniently convert our kid-like vibrant imagination into digital reality: it should effortlessly craft geometry and textures and support diverse controllable strategies for creation, translating abstract concepts into tangible, digital forms.

The latest progress on AI Generated Content (AIGC) [Po et al. 2023] has reignited the hope and enthusiasm to bridge imagination and creation, epitomized by text-based 2D image generation, which benefits from the consolidation of large image datasets, effective neural network architectures (e.g., Transformer [Vaswani et al. 2017], Diffusion Model [Ho et al. 2020]), adaptation schemes (e.g., LoRA [Hu et al. 2022], ControlNet [Zhang et al. 2023b]), etc. It is not an exaggeration that the 2D creation workflow has largely been revolutionized, perhaps symbolized by the controversial triumph of Midjourney's AI-generated "Théâtre D'opéra Spatial" at a digital arts competition. In a similar vein, we have also witnessed rapid progress in 3D asset generation. Yet compared with 2D generation, 3D generation has not reached the level of progress that can fundamentally reshape the 3D creation pipeline. Its model scalability and adaptation capabilities fall far behind mature 2D techniques. The challenges are multi-fold, stemming from the limited scale of quality 3D datasets as well as the inherent entanglement of geometry and appearance of 3D assets.

State-of-the-art 3D asset generation techniques largely build on two distinct strategies: either lifting 2D generation into 3D or embracing 3D native strategies. In a nutshell, the former line of work leverages 2D generative models [Rombach et al. 2022; Saharia et al. 2022] via intricate optimization techniques such as score distillation [Poole et al. 2023; Wang et al. 2023], or further refines 2D models for multi-view generation [Liu et al. 2023c; Shi et al. 2024]. They address the diverse appearance generation problem by employing pretrained 2D generative models. As 2D priors do not easily translate to coherent 3D ones, methods based on 2D generation generally lack concise 3D controls (preserving lines, angles, planes, etc.) that one would expect in a foundational model, and they consequently fail to maintain high geometric fidelity. In comparison, 3D native approaches attempt to train generative models directly from 3D datasets [Chang et al. 2015; Deitke et al. 2023], where 3D shapes can be represented in explicit forms such as point clouds [Nichol et al. 2022] and meshes [Nash et al. 2020; Siddiqui et al. 2024], or in implicit forms such as neural fields [Chen and Zhang 2019; Zhang et al. 2023c]. They can better "understand" and hence preserve geometric features, but have limited generation ability unless they employ much larger models, as shown in concurrent works [Ren et al. 2024; Yariv et al. 2024]. Yet larger models subsequently require training on larger datasets, which are expensive to obtain, the very problem that 3D generation aims to address in the first place.

In this paper, we aim to bring together the best of 2D-based and 3D-based generations by following the "pretrain-then-adaptation" paradigm adopted in text/image generation, effectively mitigating the 3D data scarcity issue. We present CLAY, a novel Controllable and Large-scale generative scheme to create 3D Assets with high-qualitY geometry and appearance. CLAY manages to scale up the foundation model for 3D native geometry generation at an unprecedented quality and variety, and at the same time it can generate appearance with rich multi-view physically-based textures. The 3D assets generated by CLAY contain not only geometric meshes but also material properties (diffuse, roughness, metallic, etc.), directly deployable to existing 3D asset production pipelines. As a versatile foundation model, CLAY also supports a rich class of controllable adaptations and creations (from text prompts to 2D images, and to diverse 3D primitives), to help conveniently convert a user's imagination to creation.

The core of CLAY is a large-scale generative model that extracts rich 3D priors directly from a diverse range of 3D geometries. Specifically, we adopt the neural field design from 3DShape2VecSet [Zhang et al. 2023c] to depict continuous and complete surfaces, along with a tailored multi-resolution geometry Variational Autoencoder (VAE). We customize the geometry generative module in latent space with an adaptive latent size. To conveniently scale up the model, we adopt a minimalistic latent diffusion transformer (DiT) with pure transformer blocks to accommodate the adaptive latent size. We further propose a progressive training scheme to carefully increase both the latent size and model parameters, resulting in a 3D native geometry generator with 1.5 billion parameters. The quality of training samples is crucial for fine-grained geometry generation, especially considering the limited size of available 3D datasets.
We hence present a new data processing pipeline to standardize the diverse 3D data and enhance the data quality. Specifically, it includes a remeshing process that converts various 3D surfaces into occupancy fields, preserving essential geometric features such as sharp edges and flat surfaces. At the same time, we harness the capabilities of GPT-4V [OpenAI 2023] to produce robust annotations that accentuate these geometric characteristics.

The combination of new architecture, training scheme, and training data in CLAY leads to a novel 3D native generative model that can create high-quality geometry, serving as the foundation for downstream model adaptations. For appearance generation, the scarcity of abundant data poses a significant challenge for synthesizing material texture maps. To tackle this issue, CLAY sets out to generate multi-view physically-based rendering (PBR) textures and subsequently project them onto the geometry. We construct a multi-view material diffusion model analogous to the 2D diffusion model [Rombach et al. 2022] but trained on high-quality PBR textures from Objaverse [Deitke et al. 2023], to efficiently generate diffuse, roughness, and metallic modalities while avoiding tedious distillation. We further extend the diffusion model to support super-resolution as well as to accurately map the multi-view textures onto the generated geometry. The modified model allows for much faster high-quality texture generation than traditional optimization methods, producing 2K resolution in the UV space for realistic rendering.

We further explore various adaptation schemes, including LoRA-like fine-tuning and cross-attention-based conditioning, to support classic text- or image-based creations as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc.). These extensive adaptation capabilities of CLAY hence enable controllable 3D asset creation ranging from sketchy conceptual designs to more sophisticated ones with intricate details. Even first-time users can use CLAY to bring their vivid 3D imaginations to life with our tailored interactive controls: a bustling village can be generated from scattered bounding boxes across a barren landscape, a spacecraft with futuristic wings and propulsion system from craft blocks with textual descriptions, and ultimately creations from imaginations.

2 RELATED WORK
3D generation is undoubtedly the fastest-growing research arena in AIGC. Efficient and high-quality 3D asset creation via generation benefits the entertainment and gaming industry as well as film and animation productions. Previous practices have explored different routes, ranging from directly training on 3D datasets, to imposing generated 2D images as priors, and to imposing 3D priors on top of 2D generation.

Imposing 2D Images as Priors. 3D generation methods in this category attempt to exploit the significant strides made in 2D image generation, exemplified by the latest advances such as DALL·E [Ramesh et al. 2021], Imagen [Saharia et al. 2022] and Stable Diffusion [Rombach et al. 2022]. Extending this prowess to 3D generation, many approaches have adopted image-based techniques, focusing on transforming 2D images into 3D structures or imposing 2D images as priors. DreamFusion [Poole et al. 2023] pioneered this practice by introducing Score Distillation Sampling (SDS) and employed 2D image generation with viewpoint prompts to produce 3D shapes via NeRF [Mildenhall et al. 2021] optimization. Although the idea is intriguing, earlier attempts struggled to consistently produce high-quality and diverse results. Often, generating satisfactory results requires repeated parameter adjustments and long optimization waits. Subsequent enhancements in SDS have explored the possibility of extending the idea to various neural fields [Chen et al. 2024; Huang et al. 2024; Lin et al. 2023; Wu et al. 2024; Yu et al. 2023b; Zhu et al. 2024], ranging from DMTet [Shen et al. 2021] to the most recent 3D Gaussian splatting [Kerbl et al. 2023]. Various modifications managed to elevate the performance [Chen et al. 2023a; Li et al. 2024; Metzer et al. 2023; Seo et al. 2024; Wang et al. 2023; Zhang et al. 2023a]. Yet a critical challenge remains: 2D image diffusion models utilized in SDS still lack an explicit understanding of either geometry or viewpoint. The lack of perspective information and explicit 3D supervision can lead to the multi-head Janus problem, where realistic 3D renderings do not translate to view consistency and every rendered view can be deemed the front view.

To mitigate the problem, Zero-1-to-3 [Liu et al. 2023c] proposes to integrate view information into the image generation process. This can be achieved by training an additional mapping from the transformation matrix to the pretrained Stable Diffusion model, enabling the network to obtain some prior knowledge on view position and distribution. Alternative solutions attempt to employ SDS to optimize a coherent neural field [Qian et al. 2024; Sun et al. 2024; Tang et al. 2024; Zhang et al. 2023d], but they generally require long optimization times. The latest developments [Blattmann et al. 2023; Li et al. 2023; Liu et al. 2024a; Long et al. 2024; Qiu et al. 2024; Shi et al. 2023, 2024] have focused on directly generating multi-view images with view consistency by employing enhanced attention mechanisms. These approaches have significantly improved multi-view image generation, achieving a higher level of consistency.

The downside there is the need to fine-tune Stable Diffusion using additional images, either by conducting multi-view rendering [Deitke et al. 2023] or using auxiliary multi-view datasets [Reizenstein et al. 2021; Wu et al. 2023; Yu et al. 2023a]. Since the multi-view results can already be used to extract 3D shapes (e.g., via multi-view stereo or neural methods), techniques such as SyncDreamer [Liu et al. 2024a] and Wonder3D [Long et al. 2024] employed NeuS [Wang et al. 2021a] to accelerate generation. One-2-3-45 [Liu et al. 2023d] has gone one step further to train a generalizable NeuS [Long et al. 2022] on 3D datasets, to tackle sparse view inputs. Since the starting point of all these approaches is 2D images, they unanimously focus on the quality of the generated images without attempting to preserve geometric fidelity. As a result, the generated geometry often suffers from incompleteness and lacks details.

Imposing 3D Geometry as Priors. To address challenges in 2D-based techniques, an emerging class of solutions attempts to impose 3D shapes as priors. Even though One-2-3-45 [Liu et al. 2023d] is viewed as using 2D image priors, the clever use of NeuS as a geometry proxy reveals the possibility of imposing 3D shape priors. For example, Instant3D [Li et al. 2023], LRM [Hong et al. 2024; Wang et al. 2024], DMV3D [Xu et al. 2024] and TGS [Zou et al. 2024] further utilized sparse-view or single-view reconstructors that leverage a Vision Transformer (ViT) as the vision backbone, coupled with a deep
transformer architecture to directly reconstruct NeRF with both color and density attributes. They are hence commonly referred to as Large Reconstruction Models (LRMs). Yet these techniques still focus on minimizing the volume rendering loss rather than explicitly generating surfaces, resulting in coarse or noisy geometry.

Apparently, the most straightforward practice to generate 3D would be to train on 3D datasets, rather than on 2D images or image-induced 3D shapes. Early approaches [Choy et al. 2016; Fan et al. 2017; Groueix et al. 2018; Mescheder et al. 2019; Tang et al. 2019, 2021a] primarily utilized 3D convolutional networks to understand the 3D grid structure. Point-E [Nichol et al. 2022] took a pioneering step by leveraging a pure transformer-based diffusion model for denoising directly on point clouds. This method is notable for its simplicity and efficiency, yet it faces great difficulties in transforming the generated point clouds into precise, common mesh surfaces.

Polygen [Nash et al. 2020] and MeshGPT [Siddiqui et al. 2024] take a different approach by natively representing meshes through points and surface sequences. These models are capable of producing extremely high-quality meshes, but their dependence on small, high-quality datasets restricts their broader applicability. XCube [Ren et al. 2024] introduces a strategy that simplifies geometry into multi-resolution voxels before diffusion. It streamlines the process but faces challenges in managing complex prompts and supporting a broad range of downstream tasks, limiting its overall flexibility.

It is worth mentioning that different 3D generation techniques have relied on different datasets. This is not surprising, as they are based on different geometric representations, but it is problematic, as it is essential to have a unified dataset that includes all available shapes. One such attempt is to represent geometry uniformly in terms of Signed Distance Fields (SDF) [Park et al. 2019; Yariv et al. 2024], occupancy fields [Peng et al. 2020; Tang et al. 2021b], or both [Liu et al. 2024b; Zheng et al. 2023], and train directly on 3D datasets. Such approaches provide a more explicit mechanism than NeRF for learning and extracting surfaces but require the latent encoding of watertight meshes for generation. Models such as DeepSDF [Park et al. 2019] and Mosaic-SDF [Yariv et al. 2024] utilize optimization techniques to create unique representations for each geometry in the training dataset, which is not efficient during training as they do not benefit from autoencoders. Other models such as SDFusion [Cheng et al. 2023] and ShapeGPT [Yin et al. 2023] adopt an intuitive 3D VAE (Variational Autoencoder) for encoding geometries and reconstructing SDF fields. These methods, primarily trained or tested on the ShapeNet [Chang et al. 2015] dataset, are limited in the diversity and variety of shapes they can generate. 3DGen [Gupta et al. 2023] employs a triplane VAE for both encoding and decoding SDF fields, whereas Shap-E [Jun and Nichol 2023], 3DShape2VecSet [Zhang et al. 2023c], and Michelangelo [Zhao et al. 2023] adopt a different trajectory by utilizing transformers to encode the input point clouds into parameters for the decoding networks, signifying a shift towards more sophisticated neural network architectures in 3D generative models.

So far, methods that aim to learn directly from 3D datasets, while capable of producing better geometries than 2D-based generation, still cannot match the hand-crafted ones by artists, in either detail or complexity. We observe, through the development of CLAY, that this is mainly because they have not sufficiently explored the rich geometric features embedded in the datasets. In addition, their small model size limits the capability of generalization and diversification. In CLAY, we resort to tailored geometry processing to mine a variety of 3D datasets as well as discuss effective techniques to scale up the generation model.

Fig. 2. An overview of our CLAY framework for 3D generation. Central to the framework is a large generative model trained on extensive 3D data, capable of transforming textual descriptions into detailed 3D geometries. The model is further enhanced by physically-based material generation and versatile modal adaptation, to enable the creation of 3D assets from diverse concepts and ensure their realistic rendering in digital environments.

3 LARGE-SCALE 3D GENERATIVE MODEL
An effective 3D generative model should be able to generate 3D contents from different conditional inputs such as text, images, point clouds, and voxels. As aforementioned, the task is challenging in how to define a 3D model: should a 3D asset be viewed in terms of geometry with per-vertex color or geometry with a texture map? Should the 3D geometry be inferred from the generated appearance data or be directly generated? In CLAY, we adopt a minimalist approach, i.e., we separate the geometry and texture generation
Table 1. Specifications of the five DiT model sizes and the per-stage progressive training settings (latent length, batch size, learning rate).

Model size   n_params   n_layers   d_model   n_heads   d_head   Latent length   Batch size   Learning rate
Tiny         227M       24         768       12        64       512             1024         1e-4
Small        392M       24         1024      16        64       512             16384        1e-5
                                                                 1024            8192         5e-6
Medium       600M       24         1280      16        80       512             16384        1e-4
                                                                 1024            8192         5e-5
Large        853M       24         1536      16        96       512             8192         1e-4
                                                                 1024            4096         1e-5
                                                                 2048            2048         5e-6
XL           1.5B       24         2048      16        128      512             4096         1e-4
                                                                 1024            2048         1e-5
                                                                 2048            1024         5e-6
the text prompt into textual features c. The DiT's role, defined as 𝜖(·), is to predict the noise in Z𝑡 at timestep 𝑡:

    𝜖(Z𝑡, 𝑡, c) = {CrossAttn(SelfAttn(Z𝑡 ## 𝑡), c)} ×24,    (3)

where the symbol ## signifies concatenation, ×24 denotes the 24 stacked blocks, and, for clarity, certain elements like projection and feed-forward layers are omitted from this description. To efficiently capture fine geometric details, we optimize the DiT on high-dimensional latent sets. Specifically, we employ a progressive training scheme, varying the latent code length for quicker convergence and time efficiency. Starting with a latent code length of 𝐿 = 512 at a higher learning rate, we gradually increase to 1024, then to 2048, each time reducing the learning rate based on empirical observations. This progressive scaling method ensures robust and efficient training of our DiT.
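As a rough illustration of Eqn. (3), the per-block computation can be sketched in PyTorch as follows. The names (LatentDiTBlock, predict_noise) and details such as the attention implementation are assumptions for illustration, not the released training code:

```python
# Illustrative sketch of one DiT block from Eqn. (3); names and defaults are
# assumptions rather than the exact CLAY implementation.
import torch
import torch.nn as nn

class LatentDiTBlock(nn.Module):
    def __init__(self, d_model=2048, n_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)                 # pre-normalization
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, z, c):
        # z: latent set with the timestep token concatenated, (B, L+1, d)
        # c: text features from the prompt encoder, (B, T, d)
        h = self.norm1(z)
        z = z + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(z)
        z = z + self.cross_attn(h, c, c, need_weights=False)[0]
        return z + self.ff(self.norm3(z))

def predict_noise(blocks, z_t, t_token, c):
    # Eqn. (3): stack the 24 blocks over Z_t ## t, then read the latent
    # positions back out as the predicted noise.
    z = torch.cat([z_t, t_token], dim=1)
    for blk in blocks:
        z = blk(z, c)
    return z[:, : z_t.shape[1]]                            # drop the timestep token
```

The progressive schedule only changes the length of z_t (512, then 1024, then 2048); the block itself is length-agnostic, which is what makes the adaptive latent size convenient to scale.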
Scaling-up Scheme. Scaling up CLAY requires enhancing both the VAE and DiT architectures with pre-normalization and GeLU activation, to facilitate faster computation of the attention mechanism. The feed-forward dimension is four times the model dimension. For noise scheduling, a discrete scheduler with 1000 timesteps is employed, and a cosine beta schedule is utilized during training. Following the latest practice on diffusion training [Lin et al. 2024], we implement zero terminal SNR by rescaling betas and opt for "v-prediction" as our training objective, a strategy that promotes stable inference. To evaluate the impact of model size on performance, we train five DiTs with sizes varying from 227 million to 1.5 billion parameters, as outlined in Table 1. Our smallest model, designed for verification, can be trained on a single node with 8 NVidia A800 GPUs due to its smaller batch size, to support preliminary experiments. For larger models, we employed larger batch sizes, resulting in improved training stability and faster convergence rates. Our largest model, the XL, was trained on a cluster of 256 NVidia A800 GPUs for approximately 15 days, with progressive training.
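The noise-schedule choices above can be made concrete with a short sketch following Lin et al. [2024]. The helper names are ours and the exact scheduler constants used in CLAY may differ:

```python
# Sketch of the schedule adjustments described above (assumed, after Lin et al.
# [2024]): rescale betas to a zero terminal SNR and use the v-prediction target.
import torch

def cosine_betas(T=1000, s=0.008):
    t = torch.linspace(0, T, T + 1) / T
    f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
    alphas_bar = f / f[0]
    return (1 - alphas_bar[1:] / alphas_bar[:-1]).clamp(1e-8, 0.999)

def rescale_zero_terminal_snr(betas):
    alphas_bar_sqrt = torch.cumprod(1.0 - betas, dim=0).sqrt()
    a0, aT = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt = (alphas_bar_sqrt - aT) * a0 / (a0 - aT)   # force sqrt(alpha_bar_T) = 0
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas

def v_target(x0, noise, alphas_bar, t):
    # v-prediction target: v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0
    a = alphas_bar[t].sqrt().view(-1, 1, 1)
    s = (1 - alphas_bar[t]).sqrt().view(-1, 1, 1)
    return a * noise - s * x0
```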
Following the insights of Gesmundo and Maile [2023] on head addition, heads expansion, and hidden dimension expansion, we progressively scale up the DiT during training. This approach offers benefits such as enhanced time efficiency, improved knowledge retention, and a reduced risk of the model being trapped in local optima. This scaling-up process in DiT training, leveraging the suggested training techniques, is designed to optimize the model's learning trajectory and overall performance.

Our model, once trained on our expanded dataset (Sec. 3.2), demonstrates strong capabilities to generate 3D objects from text prompts at a high quality and accuracy. During inference, we utilize a 100-timestep denoising process with linear-space timestep spacing for efficient 3D geometry generation. The model then engages in dense sampling at a 512³ grid resolution with our VAE's geometry decoder, precisely determining occupancy values for detailed geometry capture, which are then converted to a mesh using Marching Cubes.
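A rough sketch of this inference-time surface extraction is shown below; decode_occupancy stands in for the VAE geometry decoder, and the chunk size is an assumption made so the 512³ query fits in memory:

```python
# Sketch of the dense occupancy sampling and meshing step described above.
import torch
from skimage import measure

@torch.no_grad()
def latent_to_mesh(decode_occupancy, latents, res=512, chunk=2 ** 18):
    axis = torch.linspace(-1.0, 1.0, res)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    pts = grid.reshape(-1, 3)
    occ = torch.cat([decode_occupancy(latents, pts[i : i + chunk])
                     for i in range(0, pts.shape[0], chunk)])
    occ = occ.reshape(res, res, res).cpu().numpy()
    # Extract the 0.5 iso-surface of the occupancy field with Marching Cubes.
    verts, faces, _, _ = measure.marching_cubes(occ, level=0.5)
    verts = verts / (res - 1) * 2.0 - 1.0                  # back to [-1, 1]^3
    return verts, faces
```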
3.2 Data Standardization for Pretraining
The effectiveness and robustness of large-scale 3D generative models rely on the quality and the scale of 3D datasets. Unlike text and 2D images, which are abundant and hence can support Stable Diffusion, 3D datasets such as ShapeNet [Chang et al. 2015] and Objaverse [Deitke et al. 2023] are limited in size or quality. To obtain large-scale, high-quality 3D data, it is essential to overcome challenges such as non-watertight meshes, inconsistent orientations, and inaccurate annotation. Our solution is to apply a remeshing method for geometry unification and GPT-4V [OpenAI 2023] for precise automatic annotation. Our standardization starts with filtering out unsuitable data, such as complex scenes and fragmented scans, resulting in a refined collection of 527K objects from ShapeNet and Objaverse, laying a robust groundwork for enhanced model performance through tailored unification and annotation techniques.

Geometry Unification. To address the challenge of predicting a 3D shape's occupancy field in the presence of non-watertight meshes after data filtration, we propose a standardized geometry remeshing protocol to ensure watertightness while avoiding discarding useful data in the training set. Popular remeshing tools such as Manifold [Huang et al. 2018a], while efficient, tend to smooth edges and corners, with its updated version, ManifoldPlus [Huang et al. 2020], showing improved but inconsistent results. Alternatives such as "mesh-to-sdf" [Marian 2021] and Dual Octree Graph Networks (DOGN) [Wang 2022; Wang et al. 2022] set out to compute Signed and Unsigned Distance Fields, but they are computationally costly. As depicted in Fig. 4, the quality of training data for advanced 3D models is affected by these remeshing techniques, underscoring the need for a strategy that balances precision and efficiency. Specific criteria for effective remeshing include: (1) Geometric Preservation - maintaining essential geometric features with minimal alteration; (2) Volume Conservation - ensuring the integrity of all structural
Fig. 5. Our Material Diffusion architecture and Asset Enhancement pipeline. Our Material Diffusion network, derived from existing diffusion models, facilitates
efficient fine-tuning. Following mesh quadrification and atlasing, it generates textures through a multi-view approach and subsequently back-projects
them onto UV maps. The resultant materials, closely aligned with geometries and user inputs (text/image), faithfully respond to diverse lighting conditions,
culminating in realistic renderings.
introduced in Text2Tex [Chen et al. 2023b], and integrate advanced super-resolution techniques Real-ESRGAN [Wang et al. 2021b] and MultiDiffusion [Bar-Tal et al. 2023], achieving 2K texture resolution sufficient for most realistic rendering tasks. Our Material Diffusion scheme enables the creation of high-quality textures, resulting in production-quality rendering. Our generation results are of a much higher quality and visual pleasantness than previous 3D generation schemes, enhancing the engagement and realism of the generated 3D assets.

5 MODEL ADAPTATION
CLAY, when pretrained, also serves as a versatile foundation model. For example, CLAY directly supports Low-Rank Adaptation (LoRA) on the attention layers of our DiT. This allows for efficient fine-tuning, enabling the generation of 3D content targeted to specific styles, as illustrated in Fig. 6. Further, the minimalistic architecture enables us to efficiently support various conditional modalities for conditioned generation. We implement several exemplary conditions that can be easily provided by a user, including text, which is natively supported, as well as image/sketch, voxel, multi-view images, point cloud, bounding box, and partial point cloud with an extension box. These conditions, which can be applied individually or in combination, enable the model to either faithfully generate content based on a single condition or create 3D content with styles and user controls blended from multiple conditions, offering a wide range of creative possibilities.

Fig. 6. Generation after LoRA fine-tuning on different specific datasets including the rock dataset and the pocket monster dataset. After generating a LEGO duck (center), which was one of the first toys designed by LEGO founder Ole Kirk Kristiansen, CLAY can further generate variants in stone styles (left) and pocket monster styles (right).
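As a minimal sketch of what such LoRA fine-tuning looks like on a single attention projection (the rank, scaling, and choice of projections are assumptions, not the settings behind Fig. 6):

```python
# Illustrative LoRA adapter for an attention projection; rank and placement
# are assumptions rather than the exact fine-tuning configuration.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # frozen pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                     # start as an identity adapter
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```

Only the low-rank down/up matrices are optimized, so a style adapter stays small while the pretrained DiT weights remain untouched.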
5.1 Conditioning Scheme
Building upon our existing text prompt conditioning, we extend the model to incorporate additional conditions in parallel. Our use of pre-normalization [Xiong et al. 2020] converts the attention results into residuals, enabling the addition of extra conditions as parallel residuals alongside the text condition, which can be expressed as:

    Z ← Z + CrossAttn(Z, c) + ∑_{𝑖=1}^{𝑛} 𝛼𝑖 CrossAttn𝑖(Z, c𝑖),    (4)

where CrossAttn denotes the original text conditioning, CrossAttn𝑖 denotes the 𝑖-th additional trainable module, and c𝑖 is the 𝑖-th condition. The inclusion of the scalar 𝛼𝑖 in this residual framework allows
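The residual form of Eqn. (4) maps directly to code. In the sketch below the cross-attention modules are placeholders with a (Z, c) to Z signature; only the added CrossAttn_i branches and the per-condition scales would be trainable:

```python
# Sketch of the parallel conditioning residuals in Eqn. (4). The attention
# modules are placeholders; shapes and naming are assumptions.
import torch.nn as nn

class MultiConditionCrossAttn(nn.Module):
    def __init__(self, text_attn: nn.Module, extra_attns: list):
        super().__init__()
        self.text_attn = text_attn                         # pretrained text conditioning
        self.extra_attns = nn.ModuleList(extra_attns)      # one CrossAttn_i per condition

    def forward(self, z, c_text, extra_conds, alphas):
        out = z + self.text_attn(z, c_text)
        for attn_i, c_i, a_i in zip(self.extra_attns, extra_conds, alphas):
            out = out + a_i * attn_i(z, c_i)               # parallel residuals, Eqn. (4)
        return out
```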
Fig. 7. Illustration of our network’s conditioning design across various modalities. When used together, they support the creation of cinematic scenes with
lifelike renderings.
Spatial Control. Our 3D geometry generative model incorporates conditions in 3D modalities, a unique feature absent in previous approaches. This allows for spatial controls similar to those in 2D diffusion models. However, different from 3D UNet structures with convolutional backbones that naturally maintain spatial resolution, our approach uses a VAE that dynamically generates latent codes interwoven with spatial coordinates, imposing a new set of challenges for achieving precise spatial controls.

To address the integration of 3D conditions, we set out to learn additional positional embeddings for spatial features. This allows our attention layer to differentiate point coordinates from their features effectively. We start by associating the feature embedding f ∈ R^{𝑀×𝐶}, learned during fine-tuning or extracted from a backbone network, with sparse 3D points p ∈ R^{𝑀×3} sampled based on the type of condition being used, where 𝑀 and 𝐶 are the length and channels of the specific conditioning embedding. The exact sampling strategy is tailored to each condition type and will be detailed subsequently. We then apply the cross-attention more specifically as:

    CrossAttn𝑖(Z, f + PosEmb(p)),    (5)
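Eqn. (5) can be sketched as follows, with an assumed Fourier-feature form for PosEmb; in the sparse point cloud case f is simply zero:

```python
# Sketch of building the conditioning tokens f + PosEmb(p) from Eqn. (5).
# The Fourier-feature positional embedding is an assumption about PosEmb.
import torch
import torch.nn as nn

class PointPosEmb(nn.Module):
    def __init__(self, d_model=2048, n_freqs=8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * torch.pi)
        self.proj = nn.Linear(3 * 2 * n_freqs, d_model)

    def forward(self, p):                                  # p: (B, M, 3) in [-1, 1]
        x = p.unsqueeze(-1) * self.freqs                   # (B, M, 3, n_freqs)
        emb = torch.cat([x.sin(), x.cos()], dim=-1).flatten(-2)
        return self.proj(emb)                              # (B, M, d_model)

def condition_tokens(f, p, pos_emb: PointPosEmb):
    # f: (B, M, d_model) feature embedding (f = 0 for the sparse point cloud case)
    return f + pos_emb(p)                                  # fed to CrossAttn_i(Z, .)
```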
5.2 Implementation
We discuss how to implement a variety of conditions for controlled 3D content generation. Each condition involves independently training an additional CrossAttn𝑖(·) while keeping other parameters fixed. Fig. 7 and Table 2 showcase the specifications and hyperparameters of training for each condition. The base model and training data are described in Sec. 6.

Images and Sketches. For image and sketch conditions, we use the pretrained Vision Transformer (ViT) DINOv2 to extract both patch and global features. These features are integrated into CLAY via cross-attention, as indicated in Eqn. 4. This module is trained using rendered RGB images and corresponding sketches from our dataset, ensuring alignment between the generated 3D models and the visual characteristics of the conditioning images or sketches.

Voxel. Voxels represent spatial cubes and provide an intuitive medium for 3D construction. To integrate voxel-based guidance, we initially construct a 16³ voxel grid for each 3D object in our dataset, marking each cell as occupied or vacant. These voxel grids are down-sampled to an 8³ feature volume using 3D convolution. The volume features f ∈ R^{8³×𝐶}, added with positional embeddings of the volume centers PosEmb(p), are then flattened and integrated into the DiT through cross-attention. After training, CLAY can generate 3D geometries that correspond to user-defined voxel structures, effectively translating abstract voxel designs into intricate 3D forms.
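A sketch of this voxel branch, reusing the PointPosEmb sketch above; the single-convolution encoder and its channel width are assumptions:

```python
# Sketch of the voxel conditioning path: a 16^3 occupancy grid is reduced to an
# 8^3 feature volume with a 3D convolution, flattened to 512 tokens, and summed
# with positional embeddings of the cell centers before cross-attention.
import torch
import torch.nn as nn

def voxel_centers(res, device=None):
    axis = (torch.arange(res, device=device) + 0.5) / res * 2.0 - 1.0
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    return grid.reshape(1, -1, 3)                          # (1, res^3, 3)

class VoxelCondEncoder(nn.Module):
    def __init__(self, d_model=2048):
        super().__init__()
        self.conv = nn.Conv3d(1, d_model, kernel_size=2, stride=2)  # 16^3 -> 8^3

    def forward(self, vox, pos_emb):
        # vox: (B, 1, 16, 16, 16) occupancy in {0, 1}
        feat = self.conv(vox)                              # (B, d, 8, 8, 8)
        tokens = feat.flatten(2).transpose(1, 2)           # (B, 512, d)
        centers = voxel_centers(8, device=vox.device)      # 8^3 cell centers
        return tokens + pos_emb(centers)                   # condition c_i in Eqn. (4)
```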
Bounding Boxes. Bounding boxes provide a straightforward way for users to control the aspect ratio and position of 3D objects, essential in interactive generation applications. The bounding box features f ∈ R^{8×𝐶}, added with positional embeddings PosEmb(p), are learned during condition fine-tuning, enabling precise spatial control.

Sparse Point Cloud. Point clouds offer an easily accessible abstraction for 3D shapes. CLAY can use sparse point clouds as conditions to generate variants from input meshes or points. For this, we set the feature embeddings f = 0, which indicates no feature embedding, sample 512 points as p, and learn the corresponding positional embedding PosEmb(p). This allows CLAY to generate detailed 3D geometries based on sparse surface point clouds while maintaining the overall shape and appearance.

Multi-view Images. CLAY also supports multi-view images or multi-view normal maps as conditions, offering spatial control through projected views of 3D geometries. As a demonstration, we use DINOv2 to extract features from the various views' images generated by Wonder3D. These features are back-projected into a 3D volume similar to the previous method [Liu et al. 2024a], then down-sampled and flattened for integration into the DiT using cross-attention, a procedure similar to the voxel condition.

Partial Point Cloud with Extension Box. This condition specifically aims to address the point cloud completion task, where a certain bounding box indicates the generation region of missing parts. We merge the input point cloud with the corner points of an extension box, applying a similar approach to learning bounding box conditioning and sparse point cloud conditioning by concatenating these two sets of features. This integration is instrumental in the effective reconstruction of incomplete geometries, precisely within the specified extension areas.

6 RESULTS
We have trained five base models of different model sizes using our full training data with a latent code length of 𝐿 = 1024, ranging from Tiny-base to XL-base. Based on Large-base and XL-base, we have trained Large-P and XL-P on a high-quality subset of our training data including 300K objects, using a latent code length of 𝐿 = 1024. Based on Large-P and XL-P, we have further trained using the same subset data but with a longer latent code length of 𝐿 = 2048. For adaptations including LoRA fine-tuning and conditioning, we have trained these modules based on XL-P using the same high-quality subset data, with each module independently trained for 8 hours.

Next, we demonstrate the generation results with various conditioning using the XL-P model of CLAY. Fig. 8 illustrates a sample collection of 3D models generated by CLAY, demonstrating its versatility in producing a wide range of objects with intricate details and textures. From ancient tools to futuristic spacecraft, the collection traces through a fascinating human history of imagination, celebrating the fusion of art, tech, and human ingenuity as well as embracing our rich cultural heritage. The array also includes technologically advanced vehicles, cultural artifacts, everyday items, and imaginative elements, all of which highlight the model's capacity for high-fidelity and varied 3D creations suitable for applications in gaming, film, and virtual simulations.

Fig. 9 showcases CLAY's conditioning capabilities across different modalities. With image conditioning, CLAY generates geometric entities that faithfully resemble the input images, be they real-world photos, AI-generated concepts, or hand-drawn sketches. CLAY also allows for the creation of entire towns or bedrooms from scattered bounding boxes. Using multi-view images, it reliably reconstructs 3D geometries from multiple perspectives and normal maps. CLAY further manages to generate from sparse point clouds, indicating it can also serve as an effective surface reconstruction tool, analogous to but outperforming GCNO [Xu et al. 2023] from as few as 512 points in the "knot" case. Additionally, CLAY can be used to further improve 3D geometries generated by existing techniques while maintaining sharp edges and flat surfaces largely missing in prior art. Diversity-wise, CLAY excels in generating rich varieties in shapes from the same voxel input, transforming the same coarse shape into anything from a futuristic monument to a medieval castle, from an SUV to a space shuttle, resembling our unlimited imagination. Finally, CLAY can be used to complete missing parts from partially available geometry and therefore serves as both a geometry completion tool and an editing tool. For example, it allows us to alter a monster's body or turn a companion robot into a battle-ready counterpart, a Star Wars fantasy for many.

6.1 Evaluations
We have conducted comprehensive evaluations on CLAY, focusing on various aspects including model sizes, conditioning types, prompt engineering, multi-view conditioning, and geometry diversity.

Quantitative Evaluations. Here we evaluate nine versions of CLAY, as illustrated in Table 3. The text-to-shape evaluation employs metrics including render-FID, render-KID, P-FID, P-KID, CLIP, and ULIP-T, using a 16K text-shape pair validation set. We apply FID and KID to both 2D (image rendering) and 3D (point cloud) feature spaces. For render-FID and render-KID, images are rendered from eight views, and PointNet++ [Qi et al. 2017] is used to extract 3D features for P-FID and P-KID assessments. Additionally, we utilize CLIP-ViT-L/14 [Radford et al. 2021] for evaluating text-rendering similarity and ULIP-2 [Xue et al. 2023] for text-shape alignment. Specifically, ULIP-T is defined as ULIP-T(𝑇, 𝑆) = ⟨E𝑇, E𝑆⟩, corresponding to the inner product of the normalized ULIP features of caption 𝑇 and generated geometry 𝑆. Table 3 reveals the apparent trend that larger models excel over the smaller ones in text-to-shape generation tasks, demonstrated by higher scores and more accurate text-shape alignment.
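ULIP-T (and ULIP-I below) reduce to an inner product of L2-normalized embeddings; a small sketch, with the encoder calls standing in for ULIP-2:

```python
# Sketch of the ULIP-T / ULIP-I alignment scores: the inner product of
# normalized ULIP features. `encode_a` / `encode_b` stand in for ULIP-2 encoders.
import torch
import torch.nn.functional as F

def ulip_alignment(encode_a, encode_b, a, b):
    ea = F.normalize(encode_a(a), dim=-1)                  # E_T (caption) or E_I (image)
    eb = F.normalize(encode_b(b), dim=-1)                  # E_S (generated geometry)
    return (ea * eb).sum(dim=-1)                           # ULIP-T(T, S) = <E_T, E_S>
```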
We have also evaluated various conditioning modules, including image, multi-view normal, bounding box, and voxel, using XL-P as the base model. Additional metrics such as Chamfer Distance (CD), Earth Mover's Distance (EMD), Voxel-IoU, and F-score are employed to assess conditioned shape generation accuracy. We further introduce ULIP-I to evaluate alignment between the condition image and
Fig. 8. Evolution of human innovation, from primitive tools and cultural artifacts to modern electronics and futuristic imaginings, generated by CLAY.
Fig. 9. Sample creations using CLAY, with conditions marked in sky blue and input geometries for respective conditioning (if available) in sandy brown.
Table 3. Quantitative evaluation of text-to-shape generation across the nine CLAY model versions.

Model name    Latent length   render-FID↓   render-KID (×10³)↓   P-FID↓   P-KID (×10³)↓   CLIP (I-T)↑   ULIP-T↑
Tiny-base     1024            12.2241       3.4861               2.3905   4.1187          0.2242        0.1321
Small-base    1024            11.2982       4.2074               1.9332   4.1386          0.2319        0.1509
Medium-base   1024            13.0596       5.4561               1.4714   2.7708          0.2311        0.1511
Large-base    1024            6.5732        2.3617               0.8650   1.6377          0.2358        0.1559
XL-base       1024            5.2961        1.8640               0.7825   1.3805          0.2366        0.1554
Large-P       1024            5.7080        1.9997               0.7148   1.2202          0.2360        0.1565
XL-P          1024            4.0196        1.2773               0.6360   1.0761          0.2371        0.1564
Large-P-HD    2048            5.5634        1.8234               0.6394   0.9170          0.2374        0.1578
XL-P-HD       2048            4.4779        1.4486               0.5072   0.5180          0.2372        0.1569
Table 4. Quantitative evaluation of Multi-modal-to-3D for different conditions and their combinations.
the generated shapes. Both ULIP-T and ULIP-I are assessed across all conditions, except a few, such as voxel, that do not utilize text or image inputs. Table 4 shows that with as few as a single condition, CLAY already manages to generate geometry of very high fidelity. Applying additional conditions further improves geometric details while maintaining high alignment with the ground truth text or image at the feature level. It is worth mentioning that among all settings, our multi-view normal (MVN) conditioning model exhibits one of the most outstanding performances. Therefore, CLAY can also be deemed a reliable reconstruction back-end for other multi-view generation models [Long et al. 2024; Shi et al. 2024].

Prompt Engineering. We further explore the effects of varied prompt tags on geometry generation, as illustrated in Fig. 10. For example, by incorporating "asymmetric geometry" into our prompts, CLAY successfully generates an asymmetric table and church. Similarly, the transition from "sharp edges" to "smooth edges" prompts manages to modify Pikachu and a dog into more rounded shapes. Interestingly, typical 3D models composed of high-polygon meshes such as aircraft and tanks can be transformed into low-polygon variants using CLAY. In contrast, the "complex geometry" tag prompts the generation of intricate details in a chandelier and a sofa. Adding "character" will transform inanimate objects such as a fireplug and a mailbox into anthropomorphic figures, reminiscent of magic taught at Hogwarts. This experiment further indicates that specific annotated tags applied during training can effectively steer the model to produce geometries with desired complexities and styles, enhancing the quality and specificity of the generated shapes.

Geometry Diversity. CLAY also excels at generating high-quality geometries with rich diversity. In Fig. 11, we showcase the results generated by CLAY conditioned on either text or image inputs, alongside the most relevant samples retrieved from the dataset. To perform geometry retrieval, we utilize cosine similarity to compare the normalized ULIP feature of the generated geometry with that of the geometries in the dataset. With text inputs, CLAY manages to generate novel shapes that differ from any existing ones in the dataset. When presented with image inputs, CLAY faithfully reconstructs the content of the image while introducing novel structural combinations that are absent from the dataset. For instance, the airplane depicted at the bottom of Fig. 11 represents a novel concept art piece generated by AI. It features the fuselage of a passenger airplane, uniquely merged with square air intakes and tail fins reminiscent of a fighter jet — a design composite that is never seen in the training data. Nevertheless, CLAY accurately generates its 3D geometry, capturing a high degree of resemblance to the provided image.

Effectiveness of MVN Conditioning. While single image conditioning tends to allow for more liberty in creation, multi-view conditioning harnesses multiple perspectives to deliver more detailed and precise control over the targeted generation, akin to a pixel-aligned sparse-view reconstruction approach. Fig. 12 shows an example where we use an initial image of a panther's head (top left) as a starting point. This image, when processed through our single image conditioning, yields a solid 3D geometry (left column). In contrast, when the concept is further solidified using Wonder3D to generate multi-view images and corresponding normal maps, it results in a panther face mask with a notably thin surface (top right). Based on
Fig. 10. Effects of prompt tags on geometry generation. Panel labels: symmetric ↔ asymmetric, sharp ↔ smooth, low-poly ↔ high-poly, simple ↔ complex, original ↔ character.
whole community, developing strategies to ensure the responsible use of CLAY.

Limitations and Future Work. It is important to note that CLAY is not yet a complete end-to-end system, as it entails distinct stages for generating geometry and materials, and requires additional steps such as remeshing and UV unwrapping. An immediate future step is to explore integrated model architectures that unify geometry and PBR materials. This will require implementing automatic schemes to produce geometry with consistent topology. So far, CLAY has been trained on a substantially large dataset. However, there is still room for improvement in terms of both the quantity and quality of the training data, especially compared with the 2D image datasets used to train Stable Diffusion. Further, we observe that CLAY shows robustness in generating assets composed of single objects but tends to be vulnerable when dealing with complex "composed objects", such as "a tiger riding a motorcycle", particularly with text-only inputs. The issue is largely attributed to insufficient training data of composed objects and the lack of detailed textual descriptions of these objects. The issue can potentially be mitigated through a text-to-image-to-3D workflow, akin to the approaches employed by Wonder3D [Long et al. 2024] and One-2-3-45++ [Liu et al. 2023d]. As the community augments the training dataset with a larger and more diverse collection of 3D shapes along with corresponding text descriptions, we expect CLAY as well as its concurrent works to reach a new level of geometry generation, in both quality and complexity. Finally, we intend to explore extensions of CLAY to dynamic object generation. The generated results from CLAY indicate that it may be possible to semantically partition the geometry into meaningful parts, further facilitating motion and interaction, as in Singer et al. [2023] and Ling et al. [2024].
REFERENCES
Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. 2023. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. 16 pages.
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. 2023. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv:2311.15127 [cs.CV]
Blender Online Community. 2024. Blender - a 3D modelling and rendering package. http://www.blender.org.
Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. 2015. ShapeNet: An Information-Rich 3D Model Repository. arXiv:1512.03012 [cs.GR]
Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. 2023b. Text2Tex: Text-driven Texture Synthesis via Diffusion Models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 18512–18522. https://doi.org/10.1109/ICCV51070.2023.01701
R. Chen, Y. Chen, N. Jiao, and K. Jia. 2023a. Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 22189–22199. https://doi.org/10.1109/ICCV51070.2023.02033
Zilong Chen, Feng Wang, and Huaping Liu. 2024. Text-to-3D using Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Zhiqin Chen and Hao Zhang. 2019. Learning Implicit Fields for Generative Shape Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5939–5948.
Y. Cheng, H. Lee, S. Tulyakov, A. Schwing, and L. Gui. 2023. SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 4456–4465. https://doi.org/10.1109/CVPR52729.2023.00433
Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII. Springer, 628–644.
M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsanit, A. Kembhavi, and A. Farhadi. 2023. Objaverse: A Universe of Annotated 3D Objects. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 13142–13153. https://doi.org/10.1109/CVPR52729.2023.01263
Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 605–613.
Andrea Gesmundo and Kaitlin Maile. 2023. Composable Function-preserving Expansions for Transformer Architectures. arXiv:2308.06103 [cs.LG]
Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. 2018. A Papier-Mâché Approach to Learning 3D Surface Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 216–224.
Yuan-Chen Guo, Ying-Tian Liu, Ruizhi Shao, Christian Laforte, Vikram Voleti, Guan Luo, Chia-Hao Chen, Zi-Xin Zou, Chen Wang, Yan-Pei Cao, and Song-Hai Zhang. 2023. threestudio: A unified framework for 3D content generation. https://github.com/threestudio-project/threestudio.
Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 2023. 3DGen: Triplane Latent Diffusion for Textured Mesh Generation. arXiv:2303.05371 [cs.CV]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 6840–6851. https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. 2024. LRM: Large Reconstruction Model for Single Image to 3D. In The Twelfth International Conference on Learning Representations.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
Jingwei Huang, Hao Su, and Leonidas J. Guibas. 2018a. Robust Watertight Manifold Surface Generation Method for ShapeNet Models. arXiv:1802.01698 http://arxiv.org/abs/1802.01698
Jingwei Huang, Yichao Zhou, and Leonidas Guibas. 2020. ManifoldPlus: A Robust and Scalable Watertight Manifold Surface Generation Method for Triangle Soups. arXiv:2005.11621 [cs.GR]
Jingwei Huang, Yichao Zhou, Matthias Niessner, Jonathan Richard Shewchuk, and Leonidas J. Guibas. 2018b. QuadriFlow: A Scalable and Robust Method for Quadrangulation. Computer Graphics Forum 37 (2018). https://doi.org/10.1111/cgf.13498
Yukun Huang, Jianan Wang, Yukai Shi, Boshi Tang, Xianbiao Qi, and Lei Zhang. 2024. DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation. In The Twelfth International Conference on Learning Representations.
Heewoo Jun and Alex Nichol. 2023. Shap-E: Generating Conditional 3D Implicit Functions. arXiv:2305.02463 [cs.CV]
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
Sixu Li, Chaojian Li, Wenbo Zhu, Boyang (Tony) Yu, Yang (Katie) Zhao, Cheng Wan, Haoran You, Huihong Shi, and Yingyan (Celine) Lin. 2023. Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA '23). Association for Computing Machinery, New York, NY, USA, Article 6, 13 pages. https://doi.org/10.1145/3579371.3589115
Weiyu Li, Rui Chen, Xuelin Chen, and Ping Tan. 2024. SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D. In The Twelfth International Conference on Learning Representations.
C. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M. Liu, and T. Lin. 2023. Magic3D: High-Resolution Text-to-3D Content Creation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 300–309. https://doi.org/10.1109/CVPR52729.2023.00037
Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. 2024. Common Diffusion Noise Schedules and Sample Steps Are Flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5404–5411.
Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. 2024. Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. 2024b. One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. 2023d. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. In Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 22226–22246. https://proceedings.neurips.cc/paper_files/paper/2023/file/4683beb6bab325650db13afd05d1a14a-Paper-Conference.pdf
R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick. 2023c. Zero-1-to-3: Zero-shot One Image to 3D Object. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 9264–9275. https://doi.org/10.1109/ICCV51070.2023.00853
Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, and Sergey Tulyakov. 2023b. HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion. arXiv:2310.08579 [cs.CV]
Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. 2024a. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. In The Twelfth International Conference on Learning Representations.
Zexiang Liu, Yangguang Li, Youtian Lin, Xin Yu, Sida Peng, Yan-Pei Cao, Xiaojuan Qi, Xiaoshui Huang, Ding Liang, and Wanli Ouyang. 2023a. UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation. arXiv:2312.08754 [cs.CV]
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. 2024. Wonder3D: Single Image to 3D using Cross-Domain Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. 2022. SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views. In Computer Vision – ECCV 2022, Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer Nature Switzerland, Cham, 210–227.
Kleineberg Marian. 2021. mesh_to_sdf: Calculate signed distance fields for arbitrary meshes. https://github.com/marian42/mesh_to_sdf.
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy Networks: Learning 3D Reconstruction in Function Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4460–4470.
G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or. 2023. Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 12663–12673. https://doi.org/10.1109/CVPR52729.2023.01218
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 65, 1 (2021), 99–106.
Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. 2020. PolyGen: An Autoregressive Generative Model of 3D Meshes. In International Conference on Machine Learning. PMLR, 7220–7229.
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. 2022. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. arXiv:2212.08751 [cs.CV]
OpenAI. 2023. GPT-4V: Generative Pre-trained Transformer 4 for Vision. https://www.openai.com/.
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations.
Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. 30 (2017), 5105–5114.
Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. 2024. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors. In The Twelfth International Conference on Learning Representations.
Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, and Xiaoguang Han. 2024. RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8821–8831. http://proceedings.mlr.press/v139/ramesh21a.html
J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny. 2021. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 10881–10891. https://doi.org/10.1109/ICCV48922.2021.01072
Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. 2024. XCube (X³): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. 2023. TEXTure: Text-Guided Texturing of 3D Shapes. In ACM SIGGRAPH 2023 Conference Proceedings (Los Angeles, CA, USA) (SIGGRAPH '23). Association for Computing Machinery, New York, NY, USA, Article 54, 11 pages. https://doi.org/10.1145/3588432.3591503
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 10674–10685. https://doi.org/10.1109/CVPR52688.2022.01042
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 36479–36494. https://proceedings.neurips.cc/paper_files/paper/2022/file/ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf
Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. 2024. Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation. In The Twelfth International Conference on Learning Representations.
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Conference on Learning Representations.
Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. 2021. Deep
Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang- Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Syn-
Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve thesis. In Advances in Neural Information Processing Systems, A. Beygelzimer,
Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. Y. Dauphin, P. Liang, and J. Wortman Vaughan (Eds.).
DINOv2: Learning Robust Visual Features without Supervision. Transactions on Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei,
Machine Learning Research (2024). Linghao Chen, Chong Zeng, and Hao Su. 2023. Zero123++: a Single Image to
J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. 2019. DeepSDF: Learning Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]
Continuous Signed Distance Functions for Shape Representation. In 2019 IEEE/CVF Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. 2024. MV-
Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Dream: Multi-view Diffusion for 3D Generation. In The Twelfth International
Society, Los Alamitos, CA, USA, 165–174. https://doi.org/10.1109/CVPR.2019.00025 Conference on Learning Representations.
Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti,
2020. Convolutional occupancy networks. In Computer Vision–ECCV 2020: 16th Vladislav Rosov, Angela Dai, and Matthias Nießner. 2024. MeshGPT: Generating
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Triangle Meshes with Decoder-Only Transformers. In Proceedings of the IEEE/CVF
Springer, 523–540. conference on computer vision and pattern recognition.
Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T Barron, Amit H Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filip-
Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, pos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, and
et al. 2023. State of the art on diffusion models for visual computing. arXiv preprint Yaniv Taigman. 2023. Text-To-4D Dynamic Scene Generation. In International
arXiv:2310.07204 (2023). Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii,
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma
Müller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving Latent Diffusion Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett
Models for High-Resolution Image Synthesis. arXiv:2307.01952 [cs.CV] (Eds.). PMLR, 31915–31929. https://proceedings.mlr.press/v202/singer23a.html
Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. 2024. DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior. In The Twelfth International Conference on Learning Representations.
Jiapeng Tang, Xiaoguang Han, Junyi Pan, Kui Jia, and Xin Tong. 2019. A skeleton-bridged deep learning approach for generating meshes of complex topologies from single rgb images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4541–4550.
Jiapeng Tang, Xiaoguang Han, Mingkui Tan, Xin Tong, and Kui Jia. 2021a. SkeletonNet: A topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images. IEEE transactions on pattern analysis and machine intelligence 44, 10 (2021), 6454–6471.
Jiapeng Tang, Jiabao Lei, Dan Xu, Feiying Ma, Kui Jia, and Lei Zhang. 2021b. SA-ConvONet: Sign-agnostic optimization of convolutional occupancy networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6504–6513.
Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2024. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. In The Twelfth International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. 2021a. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. Advances in Neural Information Processing Systems 34 (2021), 27171–27183.
Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. 2024. PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction. In The Twelfth International Conference on Learning Representations.
Peng-Shuai Wang. 2022. mesh2sdf: Converts an input mesh to a signed distance field (SDF). https://github.com/wang-ps/mesh2sdf.
Peng-Shuai Wang, Yang Liu, and Xin Tong. 2022. Dual octree graph networks for learning adaptive volumetric shape representations. ACM Trans. Graph. 41, 4, Article 103 (jul 2022), 15 pages. https://doi.org/10.1145/3528223.3530087
X. Wang, L. Xie, C. Dong, and Y. Shan. 2021b. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). IEEE Computer Society, Los Alamitos, CA, USA, 1905–1914. https://doi.org/10.1109/ICCVW54120.2021.00217
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. In Thirty-seventh Conference on Neural Information Processing Systems.
Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, and Errui Ding. 2024. HD-Fusion: Detailed text-to-3d generation leveraging multiple noise estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3202–3211.
T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian, D. Lin, and Z. Liu. 2023. OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 803–814. https://doi.org/10.1109/CVPR52729.2023.00084
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In International Conference on Machine Learning. PMLR, 10524–10533.
Rui Xu, Zhiyang Dou, Ningna Wang, Shiqing Xin, Shuangmin Chen, Mingyan Jiang, Xiaohu Guo, Wenping Wang, and Changhe Tu. 2023. Globally Consistent Normal Orientation for Point Clouds by Regularizing the Winding-Number Field. ACM Trans. Graph. 42, 4, Article 111 (jul 2023), 15 pages. https://doi.org/10.1145/3592129
Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. 2024. DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model. In The Twelfth International Conference on Learning Representations.
Le Xue, Ning Yu, Shu Zhang, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. 2023. ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding. arXiv:2305.08275 [cs.CV]
Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, and Yaron Lipman. 2024. Mosaic-SDF for 3D Generative Models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 [cs.CV]
Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, and Tao Chen. 2023. ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model. arXiv:2311.17618 [cs.CV]
Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, and Fan Wang. 2023b. Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation. In Proceedings of the 31st ACM International Conference on Multimedia (Ottawa, ON, Canada) (MM ’23). Association for Computing Machinery, New York, NY, USA, 6841–6850. https://doi.org/10.1145/3581783.3612232
X. Yu, M. Xu, Y. Zhang, H. Liu, C. Ye, Y. Wu, Z. Yan, C. Zhu, Z. Xiong, T. Liang, G. Chen, S. Cui, and X. Han. 2023a. MVImgNet: A Large-scale Dataset of Multi-view Images. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 9150–9161. https://doi.org/10.1109/CVPR52729.2023.00883
Biao Zhang, Jiapeng Tang, Matthias Nießner, and Peter Wonka. 2023c. 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models. ACM Trans. Graph. 42, 4, Article 92 (jul 2023), 16 pages. https://doi.org/10.1145/3592442
Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. 2023a. DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance. ACM Trans. Graph. 42, 4, Article 138 (jul 2023), 16 pages. https://doi.org/10.1145/3592094
L. Zhang, A. Rao, and M. Agrawala. 2023b. Adding Conditional Control to Text-to-Image Diffusion Models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 3813–3824. https://doi.org/10.1109/ICCV51070.2023.00355
Youjia Zhang, Junqing Yu, Zikai Song, and Wei Yang. 2023d. Optimized View and Geometry Distillation from Multi-view Diffuser. arXiv:2312.06198 [cs.CV]
Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. 2023. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in neural information processing systems (2023).
Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. 2023. Locally Attentional SDF Diffusion for Controllable 3D Shape Generation. ACM Trans. Graph. 42, 4, Article 91 (jul 2023), 13 pages. https://doi.org/10.1145/3592103
Junzhe Zhu, Peiye Zhuang, and Sanmi Koyejo. 2024. HIFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance. In The Twelfth International Conference on Learning Representations.
Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and Song-Hai Zhang. 2024. Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.