
CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets
LONGWEN ZHANG∗ , ShanghaiTech University and Deemos Technology Co., Ltd., China
ZIYU WANG∗ , ShanghaiTech University and Deemos Technology Co., Ltd., China
QIXUAN ZHANG† , ShanghaiTech University and Deemos Technology Co., Ltd., China
QIWEI QIU, ShanghaiTech University and Deemos Technology Co., Ltd., China
ANQI PANG, ShanghaiTech University, China
HAORAN JIANG, ShanghaiTech University and Deemos Technology Co., Ltd., China

WEI YANG, Huazhong University of Science and Technology, China


LAN XU‡ , ShanghaiTech University, China
JINGYI YU‡ , ShanghaiTech University, China

Fig. 1. Against the backdrop of the great digital expanse, CLAY orchestrates a vibrant explosion of 3D creativity, unleashing unlimited imagination.

In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and effort. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc.). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), to extract rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra-large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K-resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production-ready assets with intricate details. Even first-time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.

∗ Equal contributions.
† Project leader.
‡ Corresponding author.

Authors' addresses: Longwen Zhang, ShanghaiTech University and Deemos Technology Co., Ltd., Shanghai, China, [email protected]; Ziyu Wang, ShanghaiTech University and Deemos Technology Co., Ltd., Shanghai, China, wangzy6@shanghaitech.edu.cn; Qixuan Zhang, ShanghaiTech University and Deemos Technology Co., Ltd., Shanghai, China, [email protected]; Qiwei Qiu, ShanghaiTech University and Deemos Technology Co., Ltd., Shanghai, China, [email protected]; Anqi Pang, ShanghaiTech University, Shanghai, China, [email protected]; Haoran Jiang, ShanghaiTech University and Deemos Technology Co., Ltd., Shanghai, China, [email protected]; Wei Yang, Huazhong University of Science and Technology, Wuhan, China, [email protected]; Lan Xu, ShanghaiTech University, Shanghai, China, [email protected]; Jingyi Yu, ShanghaiTech University, Shanghai, China, [email protected].

CCS Concepts: • Computing methodologies → Artificial intelligence.

Additional Key Words and Phrases: 3D Asset Generation, Multi-modal Control, Physically-based Rendering, Diffusion Transformer, Large-scale Model

1 INTRODUCTION

Three-dimensional (3D) imagination allows us humans to visualize and design structures, spaces, and systems before they are physically constructed. When we were kids, we learned to build objects using this imagination, with materials as simple as clay, stones, or wood sticks, and for the lucky few, LEGO blocks. To us then, a building formed by a few simple blocks could imaginatively transform into a magnificent castle, and a wood stick attached to a stone into a LightSaber, Jedi's or Sith's. In fact, with a diverse range of pieces in different shapes, sizes, and colors in hand, we once imagined having virtually unlimited capabilities for creating objects. This boundless imagination has fundamentally transformed the entertainment industry, from feature films to computer games, and has led to significant advances in the field of computer graphics, from modeling to rendering. In contrast, the capabilities of producing creative content by far fall far behind our imagination. For example, the current 3D creation workflow still requires immense artistic expertise and tedious manual labor. An ideal 3D creation tool should conveniently convert our kid-like vibrant imagination into digital reality: it should effortlessly craft geometry and textures and support diverse controllable strategies for creation, translating abstract concepts into tangible, digital forms.

Latest progresses on AI Generated Content (AIGC) [Po et al. 2023] reignite the hope and enthusiasm to bridge imagination and creation, epitomized by text-based 2D image generation that benefits from the consolidation of large image datasets, effective neural network architectures (e.g., Transformer [Vaswani et al. 2017], Diffusion Model [Ho et al. 2020]), adaptation schemes (e.g., LoRA [Hu et al. 2022], ControlNet [Zhang et al. 2023b]), etc. It is not an exaggeration that the 2D creation workflow has largely been revolutionized, perhaps symbolized by the controversial triumph of Midjourney's AI-generated "Théâtre D'opéra Spatial" at a digital arts competition. In a similar vein, we have also witnessed rapid progress in 3D asset generation. Yet compared with 2D generation, 3D generation has not yet reached the same level of progress that can fundamentally reshape the 3D creation pipeline. Its model scalability and adaptation capabilities fall far behind mature 2D techniques. The challenges are multi-fold, stemming from the limited scale of quality 3D datasets as well as the inherent entanglement of geometry and appearance of 3D assets.

State-of-the-art 3D asset generation techniques largely build on two distinct strategies: either lifting 2D generation into 3D or embracing 3D native strategies. In a nutshell, the former line of work leverages 2D generative models [Rombach et al. 2022; Saharia et al. 2022] via intricate optimization techniques such as score distillation [Poole et al. 2023; Wang et al. 2023], or further refines 2D models for multi-view generation [Liu et al. 2023c; Shi et al. 2024]. They address the diverse appearance generation problem by employing pretrained 2D generative models. As 2D priors do not easily translate to coherent 3D ones, methods based on 2D generation generally lack the concise 3D controls (preserving lines, angles, planes, etc.) that one would expect in a foundational model, and they consequently fail to maintain high geometric fidelity. In comparison, 3D native approaches attempt to train generative models directly from 3D datasets [Chang et al. 2015; Deitke et al. 2023] where 3D shapes can be represented in explicit forms such as point clouds [Nichol et al. 2022], meshes [Nash et al. 2020; Siddiqui et al. 2024] or implicit forms such as neural fields [Chen and Zhang 2019; Zhang et al. 2023c]. They can better "understand" and hence preserve geometric features, but have limited generation ability unless they employ much larger models, as shown in concurrent works [Ren et al. 2024; Yariv et al. 2024]. Yet larger models subsequently require training on larger datasets, which are expensive to obtain, the very problem that 3D generation aims to address in the first place.

In this paper, we aim to bring together the best of 2D-based and 3D-based generations by following the "pretrain-then-adaptation" paradigm adopted in text/image generation, effectively mitigating the 3D data scarcity issue. We present CLAY, a novel Controllable and Large-scale generative scheme to create 3D Assets with high-qualitY geometry and appearance. CLAY manages to scale up the foundation model for 3D native geometry generation at an unprecedented quality and variety, and at the same time it can generate appearance with rich multi-view physically-based textures. The 3D assets generated by CLAY contain not only geometric meshes but also material properties (diffuse, roughness, metallic, etc.), directly deployable to existing 3D asset production pipelines. As a versatile foundation model, CLAY also supports a rich class of controllable adaptations and creations (from text prompts to 2D images, and to diverse 3D primitives), to help conveniently convert a user's imagination to creation.

The core of CLAY is a large-scale generative model that extracts rich 3D priors directly from a diverse range of 3D geometries. Specifically, we adopt the neural field design from 3DShape2VecSet [Zhang et al. 2023c] to depict continuous and complete surfaces along with a tailored multi-resolution geometry Variational Autoencoder (VAE). We customize the geometry generative module in latent space with an adaptive latent size. To conveniently scale up the model, we adopt a minimalistic latent diffusion transformer (DiT) with pure transformer blocks to accommodate the adaptive latent size. We further propose a progressive training scheme to carefully increase both the latent size and model parameters, resulting in a 3D native geometry generator with 1.5 billion parameters. The quality of training samples is crucial for fine-grained geometry generation, especially considering the limited size of available 3D datasets. We hence present a new data processing pipeline to standardize the diverse 3D data and enhance the data quality. Specifically, it includes a remeshing process that converts various 3D surfaces into occupancy fields, preserving essential geometric features such as sharp edges and flat surfaces. At the same time, we harness the capabilities of GPT-4V [OpenAI 2023] to produce robust annotations that accentuate these geometric characteristics.
The combination of new architecture, training scheme, and training data in CLAY leads to a novel 3D native generative model that can create high-quality geometry, serving as the foundation for downstream model adaptations. For appearance generation, the scarcity of abundant data poses a significant challenge for synthesizing material texture maps. To tackle this issue, CLAY sets out to generate multi-view physically-based rendering (PBR) textures and subsequently project them onto the geometry. We construct a multi-view material diffusion model analogous to 2D diffusion models [Rombach et al. 2022] but trained on high-quality PBR textures from Objaverse [Deitke et al. 2023], to efficiently generate diffuse, roughness, and metallic modalities while avoiding tedious distillation. We further extend the diffusion model to support super-resolution as well as to accurately map the multi-view textures onto the generated geometry. The modified model allows for much faster high-quality texture generation than traditional optimization methods, producing 2K resolution in the UV space for realistic rendering.

We further explore various adaptation schemes including LoRA-like fine-tuning and cross-attention-based conditioning, to support classic text- or image-based creations as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc.). These extensive adaptation capabilities of CLAY hence enable controllable 3D asset creation ranging from sketchy conceptual designs to more sophisticated ones with intricate details. Even first-time users can use CLAY to bring their vivid 3D imaginations to life with our tailored interactive controls: a bustling village can be generated from scattered bounding boxes across a barren landscape, a spacecraft with futuristic wings and propulsion system from craft blocks with textual descriptions, and ultimately creations from imaginations.

2 RELATED WORK

3D generation is undoubtedly the fastest-growing research arena in AIGC. Efficient and high-quality 3D asset creation via generation benefits the entertainment and gaming industry as well as film and animation productions. Previous practices have explored different routes, ranging from directly training on 3D datasets, to imposing generated 2D images as priors, and to imposing 3D priors on top of 2D generation.

Imposing 2D Images as Prior. 3D generation methods in this category attempt to exploit the significant strides made in 2D image generation, exemplified by latest advances such as DALL·E [Ramesh et al. 2021], Imagen [Saharia et al. 2022] and Stable Diffusion [Rombach et al. 2022]. Extending this prowess to 3D generation, many approaches have adopted image-based techniques, focusing on transforming 2D images into 3D structures or imposing 2D images as priors. DreamFusion [Poole et al. 2023] pioneered this practice by introducing Score Distillation Sampling (SDS) and employed 2D image generation with viewpoint prompts to produce 3D shapes via NeRF [Mildenhall et al. 2021] optimization. Although the idea is intriguing, earlier attempts struggled to consistently produce high-quality and diverse results. Often, generating satisfactory results requires repeated adjustments to parameters and long waits for optimization. Subsequent enhancements of SDS have explored the possibility of extending the idea to various neural fields [Chen et al. 2024; Huang et al. 2024; Lin et al. 2023; Wu et al. 2024; Yu et al. 2023b; Zhu et al. 2024], ranging from DMTet [Shen et al. 2021] to the most recent 3D Gaussian splatting [Kerbl et al. 2023]. Various modifications managed to elevate the performance [Chen et al. 2023a; Li et al. 2024; Metzer et al. 2023; Seo et al. 2024; Wang et al. 2023; Zhang et al. 2023a]. Yet a critical challenge remains: 2D image diffusion models utilized in SDS still lack an explicit understanding of either geometry or viewpoint. The lack of perspective information and explicit 3D supervision can lead to the multi-head Janus problem, where realistic 3D renderings do not translate to view consistency and every rendered view can be deemed as the front view.

To mitigate the problem, Zero-1-to-3 [Liu et al. 2023c] proposes to integrate view information into the image generation process. This can be achieved by training an additional mapping from the transformation matrix to the pretrained Stable Diffusion model, enabling the network to obtain some prior knowledge on view position and distribution. Alternative solutions attempt to employ SDS to optimize a coherent neural field [Qian et al. 2024; Sun et al. 2024; Tang et al. 2024; Zhang et al. 2023d], but they generally require long optimization time. Latest developments [Blattmann et al. 2023; Li et al. 2023; Liu et al. 2024a; Long et al. 2024; Qiu et al. 2024; Shi et al. 2023, 2024] have focused on directly generating multi-view images with view consistency by employing enhanced attention mechanisms. These approaches have significantly improved multi-view image generation, achieving a higher level of consistency.

The downside there is the need to fine-tune Stable Diffusion using additional images, either by conducting multi-view rendering [Deitke et al. 2023] or using auxiliary multi-view datasets [Reizenstein et al. 2021; Wu et al. 2023; Yu et al. 2023a]. Since the multi-view results can already be used to extract 3D shapes (e.g., via multi-view stereo or neural methods), techniques such as SyncDreamer [Liu et al. 2024a] and Wonder3D [Long et al. 2024] employed NeuS [Wang et al. 2021a] to accelerate generation. One-2-3-45 [Liu et al. 2023d] has gone one step further to train a generalizable NeuS [Long et al. 2022] on 3D datasets, to tackle sparse-view inputs. Since the starting point of all these approaches is 2D images, they unanimously focus on the quality of generated images without attempting to preserve geometric fidelity. As a result, the generated geometry often suffers from incompleteness and lacks details.

Imposing 3D Geometry as Priors. To address challenges in 2D-based techniques, an emerging class of solutions attempts to impose 3D shapes as priors. Even though One-2-3-45 [Liu et al. 2023d] is viewed as using 2D image priors, its clever use of NeuS as a geometry proxy reveals the possibility of imposing 3D shape priors. For example, Instant3D [Li et al. 2023], LRM [Hong et al. 2024; Wang et al. 2024], DMV3D [Xu et al. 2024] and TGS [Zou et al. 2024] further utilized sparse-view or single-view reconstructors that leverage a Vision Transformer (ViT) as the vision backbone, coupled with a deep transformer architecture to directly reconstruct NeRF with both color and density attributes. They are hence commonly referred to as Large Reconstruction Models (LRMs). Yet these techniques still focus on minimizing the volume rendering loss rather than explicitly generating surfaces, resulting in coarse or noisy geometry.
Fig. 2. An overview of our CLAY framework for 3D generation. Central to the framework is a large generative model trained on extensive 3D data, capable of transforming textual descriptions into detailed 3D geometries. The model is further enhanced by physically-based material generation and versatile modal adaptation, to enable the creation of 3D assets from diverse concepts and ensure their realistic rendering in digital environments.

Apparently, the most straightforward practice to generate 3D would be to train on 3D datasets, rather than on 2D images or image-induced 3D shapes. Early approaches [Choy et al. 2016; Fan et al. 2017; Groueix et al. 2018; Mescheder et al. 2019; Tang et al. 2019, 2021a] primarily utilized 3D convolutional networks to understand the 3D grid structure. Point-E [Nichol et al. 2022] took a pioneering step by leveraging a pure transformer-based diffusion model for denoising directly on point clouds. This method is notable for its simplicity and efficiency, yet it faces great difficulties in transforming the generated point clouds into precise, common mesh surfaces.

Polygen [Nash et al. 2020] and MeshGPT [Siddiqui et al. 2024] take a different approach by natively representing meshes through points and surface sequences. These models are capable of producing extremely high-quality meshes, but their dependence on small, high-quality datasets restricts their broader applicability. XCube [Ren et al. 2024] introduces a strategy that simplifies geometry into multi-resolution voxels before diffusion. It streamlines the process but faces challenges in managing complex prompts and supporting a broad range of downstream tasks, limiting its overall flexibility.

It is worth mentioning that different 3D generation techniques have relied on different datasets. This is not surprising as they are based on different geometric representations, but it is problematic as it is essential to have a unified dataset that includes all available shapes. One such attempt is to represent geometry uniformly in terms of Signed Distance Fields (SDF) [Park et al. 2019; Yariv et al. 2024], occupancy fields [Peng et al. 2020; Tang et al. 2021b], or both [Liu et al. 2024b; Zheng et al. 2023], and train directly on 3D datasets. Such approaches provide a more explicit mechanism than NeRF for learning and extracting surfaces but require the latent encoding of watertight meshes for generation. Models such as DeepSDF [Park et al. 2019] and Mosaic-SDF [Yariv et al. 2024] utilize optimization techniques to create unique representations for each geometry in the training dataset, which is not efficient during training as they do not benefit from autoencoders. Other models such as SDFusion [Cheng et al. 2023] and ShapeGPT [Yin et al. 2023] adopt an intuitive 3D VAE (Variational Autoencoder) for encoding geometries and reconstructing SDF fields. These methods, primarily trained or tested on the ShapeNet [Chang et al. 2015] dataset, are limited in the diversity and variety of shapes they can generate. 3DGen [Gupta et al. 2023] employs a triplane VAE for both encoding and decoding SDF fields, whereas Shap-E [Jun and Nichol 2023], 3DShape2VecSet [Zhang et al. 2023c], and Michelangelo [Zhao et al. 2023] adopt a different trajectory by utilizing transformers to encode the input point clouds into parameters for the decoding networks, signifying a shift towards more sophisticated neural network architectures in 3D generative models.

By far, methods that aim at direct learning from 3D datasets, while capable of producing better geometries than 2D-based generation, still cannot match the hand-crafted ones by artists, in either detail or complexity. We observe, through the development of CLAY, that this is mainly because they have not sufficiently explored the rich geometric features embedded in the datasets. In addition, their small model size limits the capability of generalization and diversification. In CLAY, we resort to tailored geometry processing to mine a variety of groups of 3D datasets as well as discuss effective techniques to scale up the generation model.

3 LARGE-SCALE 3D GENERATIVE MODEL

An effective 3D generative model should be able to generate 3D contents from different conditional inputs such as text, images, point clouds, and voxels. As aforementioned, the task is challenging in how to define a 3D model: should a 3D asset be viewed in terms of geometry with per-vertex color or geometry with a texture map? Should the 3D geometry be inferred from the generated appearance data or be directly generated? In CLAY, we adopt a minimalist approach, i.e., we separate the geometry and texture generation processes. This indicates that we choose not to use 2D generation techniques which could potentially help 3D geometry generation (e.g., through reconstruction). In our experiments, we find that once we manage to scale up the 3D generation model and train it with a sufficiently large amount of high-quality data, the 3D geometry directly generated by CLAY exceeds previous 2D-generation-based/assisted techniques by a large margin, in both diversity and quality (e.g., geometric details).
In a nutshell, CLAY is a large 3D generative model with 1.5 billion parameters, pretrained on high-quality 3D data. The significant upscaling from prior art is key to improving its capabilities in generation diversity and quality. Architecture-wise, CLAY extends the generative model in 3DShape2VecSet [Zhang et al. 2023c] with a new multi-resolution Variational Autoencoder (VAE). This extension enables more efficient geometric data encoding and decoding. In addition, we complement CLAY with an advanced latent Diffusion Transformer (DiT) for probabilistic geometry generation. Dataset-wise, we have developed a remeshing pipeline, along with annotation schemes powered by GPT-4V [OpenAI 2023], to standardize and unify existing 3D datasets. These datasets historically have not been used together for training a 3D generation model as they are in different formats and lack consistency. Our combined dataset after processing maintains a consistent representation and coherent annotations. We show that putting the model architecture and training dataset together greatly improves 3D generation.

3.1 Representation and Model Architecture

Fig. 3. Network design of our VAE and DiT. With a minimalist design, our DiT supports scalable training and our VAE operates effectively across various geometric resolutions.

Our approach for a 3D generative model emphasizes learning to denoise 3D data in a compressed latent space, analogous to the foundation 2D generative models. This strategy significantly reduces the complexity and is computationally much more efficient than directly working in 3D space. We adopt the representation and architecture from 3DShape2VecSet but augment it with new scaling-up strategies. Specifically, we encode a 3D geometry into latent space by sampling a point cloud X from a 3D mesh surface M. This point cloud is encoded into a latent code with dynamic shape Z ∈ R^{L×64}, with a length L and channel size 64, using the encoder E of a transformer-based VAE, expressed as Z = E(X). We then learn a DiT to denoise the latent code Z_t with noise at step t. Finally, the VAE decoder D decodes the generated latent codes from the DiT into a neural field, as D(Z_0, p) → [0, 1], where p is a testing coordinate in space, and D determines whether p is inside or outside the 3D shape. Recall that our objective is to achieve substantial scaling-up of this architectural model. To maintain robust scale-up while facilitating effective training, we develop a new scheme based on multi-resolution encoding. Such an extension not only enhances the model's capacity to manage large-scale data but also ensures refined training outcomes, underpinning the model's performance, scalability, and adaptability.

Multi-resolution VAE. In the design of our VAE module, we follow the structure outlined in 3DShape2VecSet. This involves embedding the input point cloud X ∈ R^{N×3}, sampled from a mesh M, into a latent code using a learnable embedding function and a cross-attention encoding module:

    Z = E(X) = CrossAttn(PosEmb(X̃), PosEmb(X)),    (1)

where X̃ denotes a down-sampled version of X at 1/4 scale, effectively reducing the latent code's length L to a quarter of the input point cloud size N. The VAE's decoder, consisting of 24 self-attention layers and a cross-attention layer, processes these latent codes and a list of query points p, outputting occupancy logits:

    D(Z, p) = CrossAttn(PosEmb(p), SelfAttn^{24}(Z)).    (2)

Our VAE is dimensioned at 512 with 8 attention heads, culminating in a total of 82 million parameters. The latent code size is configured as L × 64, with L varying based on the input point cloud size.

In 3DShape2VecSet, the point clouds are generally of small sizes and are therefore insufficient to capture fine geometric details. We instead adopt a multi-resolution approach. At each iteration, we first randomly choose a sampling size N from 2048, 4096, or 8192, to ensure variability. Next, we sample the corresponding number of surface points from the input mesh M.
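The encoder and decoder of Eqs. (1)-(2) can be summarized in a short PyTorch sketch. This is a minimal illustration rather than the released implementation: the learnable PosEmb is stood in by a small MLP, the 1/4-scale down-sampling by a strided slice (the paper does not specify the sampler), and the VAE mean/log-variance heads are omitted; widths follow the text (model width 512, 8 heads, 64 latent channels, 24 decoder self-attention layers).

```python
import torch
import torch.nn as nn

class PosEmb(nn.Module):
    """Stand-in for the learnable positional embedding of 3D points
    (the paper does not detail it; a small MLP is assumed here)."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, p):                      # p: (B, N, 3)
        return self.mlp(p)                     # (B, N, dim)

class AttnBlock(nn.Module):
    """Pre-norm (cross-)attention block with a residual connection."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, q, kv):
        out, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        return q + out

class ShapeVAE(nn.Module):
    """Sketch of Eqs. (1)-(2): surface samples -> latent set Z (L x 64) -> occupancy logits."""
    def __init__(self, dim=512, heads=8, latent_ch=64, dec_layers=24):
        super().__init__()
        self.pos = PosEmb(dim)
        self.enc_attn = AttnBlock(dim, heads)
        self.to_latent = nn.Linear(dim, latent_ch)      # VAE mean/log-variance heads omitted
        self.from_latent = nn.Linear(latent_ch, dim)
        self.dec_self = nn.ModuleList([AttnBlock(dim, heads) for _ in range(dec_layers)])
        self.dec_cross = AttnBlock(dim, heads)
        self.occ_head = nn.Linear(dim, 1)

    def encode(self, x):                       # x: (B, N, 3) points sampled from the mesh
        x_tilde = x[:, ::4]                    # 1/4-scale down-sampling (stand-in sampler)
        z = self.enc_attn(self.pos(x_tilde), self.pos(x))   # Eq. (1)
        return self.to_latent(z)               # Z: (B, N/4, 64)

    def decode(self, z, p):                    # p: (B, Q, 3) query coordinates
        h = self.from_latent(z)
        for blk in self.dec_self:              # SelfAttn^24 over the latent set
            h = blk(h, h)
        h = self.dec_cross(self.pos(p), h)     # Eq. (2)
        return self.occ_head(h).squeeze(-1)    # occupancy logits: (B, Q)

# Usage: occupancy logits at 4096 query points for one shape sampled with N = 8192.
vae = ShapeVAE()
logits = vae.decode(vae.encode(torch.rand(1, 8192, 3)), torch.rand(1, 4096, 3))
```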
Table 1. DiT specifications and training hyper parameters.

Model size | n_params | n_layers | d_model | n_heads | d_head | Latent length / Batch size / Learning rate
Tiny | 227M | 24 | 768 | 12 | 64 | 512 / 1024 / 1e-4
Small | 392M | 24 | 1024 | 16 | 64 | 512 / 16384 / 1e-5; 1024 / 8192 / 5e-6
Medium | 600M | 24 | 1280 | 16 | 80 | 512 / 16384 / 1e-4; 1024 / 8192 / 5e-5
Large | 853M | 24 | 1536 | 16 | 96 | 512 / 8192 / 1e-4; 1024 / 4096 / 1e-5; 2048 / 2048 / 5e-6
XL | 1.5B | 24 | 2048 | 16 | 128 | 512 / 4096 / 1e-4; 1024 / 2048 / 1e-5; 2048 / 1024 / 5e-6
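For reference, the per-size transformer settings of Table 1 can be collected into a small configuration table; the 4x feed-forward width follows the Scaling-up Scheme paragraph below. A sketch:

```python
from dataclasses import dataclass

@dataclass
class DiTSize:
    n_params: str
    n_layers: int
    d_model: int
    n_heads: int
    d_head: int

    @property
    def d_ff(self):
        return 4 * self.d_model   # feed-forward width is 4x the model width

# Transformer sizes as listed in Table 1.
DIT_SIZES = {
    "Tiny":   DiTSize("227M", 24, 768,  12, 64),
    "Small":  DiTSize("392M", 24, 1024, 16, 64),
    "Medium": DiTSize("600M", 24, 1280, 16, 80),
    "Large":  DiTSize("853M", 24, 1536, 16, 96),
    "XL":     DiTSize("1.5B", 24, 2048, 16, 128),
}
```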
Coarse-to-fine DiT. Our DiT employs a minimalistic yet effective structure, consisting of a 24-layer pure transformer, with added cross-attention mechanisms for accommodating text prompt conditions. The encoding process involves sampling N = 4L surface points from a 3D mesh, which are subsequently encoded into a latent code Z ∈ R^{L×64} using E(·). In parallel, a pretrained language model, specifically CLIP-ViT-L/14 [Radford et al. 2021], processes the text prompt into textual features c. The DiT's role, defined as ε(·), is to predict the noise in Z_t at timestep t:

    ε(Z_t, t, c) = {CrossAttn(SelfAttn(Z_t ## t), c)}^{24},    (3)

where the symbol ## signifies concatenation, and for clarity, certain elements like projection and feed-forward layers are omitted from this description. To efficiently capture fine geometric details, we optimize the DiT on high-dimensional latent sets. Specifically, we employ a progressive training scheme, varying the latent code length for quicker convergence and time efficiency. Starting with a latent code length of L = 512 at a higher learning rate, we gradually increase it to 1024, then to 2048, each time reducing the learning rate based on empirical observations. This progressive scaling method ensures robust and efficient training of our DiT.

Scaling-up Scheme. Scaling up CLAY requires enhancing both the VAE and DiT architectures with pre-normalization and GeLU activation, to facilitate faster computation of the attention mechanism. The feed-forward dimension is four times the model dimension. For noise scheduling, a discrete scheduler with 1000 timesteps is employed, and a cosine beta schedule is utilized during training. Following the latest practice on diffusion training [Lin et al. 2024], we implement zero terminal SNR by rescaling betas and opt for "v-prediction" as our training objective, a strategy that promotes stable inference. To evaluate the impact of model size on performance, we train five DiTs with sizes varying from 227 million to 1.5 billion parameters, as outlined in Table 1. Our smallest model, designed for verification, can be trained on a single node with 8 NVidia A800 GPUs due to its smaller batch size, to support preliminary experiments. For larger models, we employed larger batch sizes, resulting in improved training stability and faster convergence rates. Our largest model, the XL, was trained on a cluster of 256 NVidia A800 GPUs for approximately 15 days, with progressive training.

Following the insights of Gesmundo and Maile [2023] on Head addition, Heads expansion and Hidden dimension expansion, we progressively scale up the DiT during training. This approach offers benefits such as enhanced time efficiency, improved knowledge retention, and a reduced risk of the model being trapped in local optima. This scaling-up process in DiT training, leveraging the suggested training techniques, is designed to optimize the model's learning trajectory and overall performance.

Our model, once trained on our expanded dataset (Sec. 3.2), demonstrates strong capabilities to generate 3D objects from text prompts at high quality and accuracy. During inference, we utilize a 100-timestep denoising process with linear-space timestep spacing for efficient 3D geometry generation. The model then engages in dense sampling at a 512³ grid resolution with our VAE's geometry decoder, precisely determining occupancy values for detailed geometry capture, which are then converted to a mesh using Marching Cubes.

3.2 Data Standardization for Pretraining

The effectiveness and robustness of large-scale 3D generative models rely on the quality and the scale of 3D datasets. Unlike text and 2D images, which are abundant and hence can support Stable Diffusion, 3D datasets such as ShapeNet [Chang et al. 2015] and Objaverse [Deitke et al. 2023] are limited in size or quality. To obtain large-scale, high-quality 3D data, it is essential to overcome challenges such as non-watertight meshes, inconsistent orientations and inaccurate annotation. Our solution is to apply a remeshing method for geometry unification and GPT-4V [OpenAI 2023] for precise automatic annotation. Our standardization starts with filtering out unsuitable data, such as complex scenes and fragmented scans, resulting in a refined collection of 527K objects from ShapeNet and Objaverse, laying a robust groundwork for enhanced model performance through tailored unification and annotation techniques.
Fig. 4. Comparison against existing mesh preprocessing methods (Input, Manifold [2018a], ManifoldPlus [2020], mesh-to-sdf [2021], DOGN [2022], and Ours) using cross-sectional analysis. The input is a non-watertight chair whose surface is not closed. Red lines correspond to the faces of the meshes; light gray indicates "outside" and dark gray indicates "inside". Our method maximizes positive volume while faithfully preserving geometric features. This robustness extends to non-watertight input meshes, ensuring consistent and reliable results.

Geometry Unification. To address the challenge of predicting a 3D shape's occupancy field in the presence of non-watertight meshes after data filtration, we propose a standardized geometry remeshing protocol to ensure watertightness while avoiding discarding useful data in the training set. Popular remeshing tools such as Manifold [Huang et al. 2018a], while efficient, tend to smooth edges and corners, with its updated version, ManifoldPlus [Huang et al. 2020], showing improved but inconsistent results. Alternatives such as "mesh-to-sdf" [Marian 2021] and Dual Octree Graph Networks (DOGN) [Wang 2022; Wang et al. 2022] set out to compute Signed and Unsigned Distance Fields but are computationally costly. As depicted in Fig. 4, the quality of training data for advanced 3D models is affected by these remeshing techniques, underscoring the need for a strategy that balances precision and efficiency. Specific criteria for effective remeshing include: (1) Geometric Preservation - maintaining essential geometric features with minimal alteration; (2) Volume Conservation - ensuring the integrity of all structural elements; and (3) Adaptability to Non-Watertight Meshes - proficiently managing non-watertight models to preserve the volumetric accuracy essential for model training.

Inspired by DOGN [Wang 2022; Wang et al. 2022], we adopt the Unsigned Distance Field (UDF) representation because of its seamless conversion capabilities between mesh formats and its correction of inconsistencies in vertex and face density. In addition, the traditional Marching Cubes algorithm for isosurface extraction can produce a mere thin shell in scenarios involving mesh holes. To address this, we employ a grid-based visibility computation before isosurface extraction. Specifically, we label a grid point as "inside" when it is completely obscured from all angles, maximizing volume for stable VAE training.

Geometry Annotation. The impact of text prompts on 2D image generation by models such as Stable Diffusion [Rombach et al. 2022] and SDXL [Podell et al. 2023] reveals the importance of precise prompts in any successful 3D generative model. Previous studies have demonstrated how "magic prompts" guide specific content and style. Recognizing this, we emphasize accurate textual prompts in our 3D model to capture geometric and stylistic details of objects. We have developed unique prompt tags and utilized GPT-4V [OpenAI 2023] for producing detailed annotations, enhancing the model's capability to interpret and generate complex 3D geometries with nuanced details and diverse styles.

4 ASSET ENHANCEMENT

To make the generated digital assets directly usable in existing CG pipelines, we further adopt a two-stage scheme: post-generation geometry optimization and material synthesis. Geometry optimization ensures structural integrity and compatibility as well as refines the model's form aesthetically and functionally. Material synthesis is crucial for adding lifelike qualities through realistic textures and materials. Together, these steps transform coarse meshes into more engaging assets in digital environments.

Mesh Quadrification and Atlasing. In CLAY, the initial geometric meshes obtained via the Marching Cubes algorithm typically consist of millions of uneven triangles. While suitable for early stages, such a structure poses challenges in editing and application, notably when exported to mesh editing tools or game engines. In addition, it would require complicated automatic UV unwrapping, a crucial step in texture mapping and material synthesis. To overcome these challenges, we transform these triangle-faced meshes into quad-faced ones using off-the-shelf tools [Blender Online Community 2024; Huang et al. 2018b], preserving key geometric features such as sharp edges and flat surfaces. This quadrification process is highly crucial for yielding high-quality final meshes, facilitating the effective conversion from coarse 3D models to refined assets.

Material Synthesis. In addition to geometry generation, it is equally important to produce high-quality textures in 3D generation. Physically-based rendering (PBR) materials, typically consisting of diffuse, metallic, and roughness textures, are essential for conveying convincing visual experiences in digital environments. Existing methods in PBR texture generation have by far focused on creating a very small subset of these materials. In addition, these approaches lack supervision on specific material attributes, limiting the rendering quality. For example, RichDreamer [Qiu et al. 2024] generates diffuse maps without roughness and metallic predictions. Fantasia3D [Chen et al. 2023a] and UniDream [Liu et al. 2023a] can produce roughness and metallic attributes but do not consider richer attributes; therefore they cannot generate richer material types.

We aim to synthesize a wide range of PBR materials including diffuse, roughness, and metallic textures. From Objaverse [Deitke et al. 2023], we carefully choose over 40,000 objects, each characterized by high-quality PBR materials. Utilizing this dataset, we developed a multi-view Material Diffusion to synthesize textures with a significant speed-up over existing methods, which are then accurately mapped onto the geometries' UV space in a way similar to TEXTure [Richardson et al. 2023].

We modify MVDream [Shi et al. 2024], originally designed for image-space generation, to suit the need for generating texture attributes with additional channels and modalities. Inspired by HyperHuman [Liu et al. 2023b], we integrate three branches into its UNet's outermost convolutional layers, each with skip connections, allowing concurrent denoising across various texture modalities and ensuring view consistency. Similar to MVDream, our training process includes selecting orthogonal-view rendered texture images for each 3D object in the training data, and applying both full-parameter fine-tuning for the add-on layers and LoRA-based fine-tuning for the inner layers, focusing on generating high-quality, view-consistent PBR materials. Following the same training regimen, our model capably synthesizes texture images from four camera viewpoints, aligned precisely with the input geometry. This is achieved by applying the pretrained ControlNet [Zhang et al. 2023b], with each target view's rendered normal map as input. Such an approach not only ensures geometric accuracy but also allows for image-based input customization via IPAdapter [Ye et al. 2023]. To further enhance texture detail, we employ a targeted inpainting approach as introduced in Text2Tex [Chen et al. 2023b], and integrate the advanced super-resolution techniques Real-ESRGAN [Wang et al. 2021b] and MultiDiffusion [Bar-Tal et al. 2023], achieving 2K texture resolution sufficient for most realistic rendering tasks. Our Material Diffusion scheme enables the creation of high-quality textures, resulting in production-quality rendering. Our generation results are of a much higher quality and visual pleasantness than previous 3D generation schemes, enhancing the engagement and realism of the generated 3D assets.
Fig. 5. Our Material Diffusion architecture and Asset Enhancement pipeline. Our Material Diffusion network, derived from existing diffusion models, facilitates efficient fine-tuning. Following mesh quadrification and atlasing, it generates textures through a multi-view approach and subsequently back-projects them onto UV maps. The resultant materials, closely aligned with geometries and user inputs (text/image), faithfully respond to diverse lighting conditions, culminating in realistic renderings.

Fig. 6. Generation after LoRA fine-tuning on different specific datasets, including the rock dataset and the pocket monster dataset. After generating a LEGO duck (center), which was one of the first toys designed by LEGO founder Ole Kirk Kristiansen, CLAY can further generate variants in stone styles (left) and pocket monster styles (right).

5 MODEL ADAPTATION

CLAY, when pretrained, also serves as a versatile foundation model. For example, CLAY directly supports Low-Rank Adaptation (LoRA) on the attention layers of our DiT. This allows for efficient fine-tuning, enabling the generation of 3D content targeted to specific styles, as illustrated in Fig. 6. Further, the minimalistic architecture enables us to efficiently support various conditional modalities for conditioned generation. We implement several exemplary conditions that can be easily provided by a user, including text, which is natively supported, as well as image/sketch, voxel, multi-view images, point cloud, bounding box, and partial point cloud with an extension box. These conditions, which can be applied individually or in combination, enable the model to either faithfully generate content based on a single condition or create 3D content with styles and user controls blended from multiple conditions, offering a wide range of creative possibilities.

5.1 Conditioning Scheme

Building upon our existing text prompt conditioning, we extend the model to incorporate additional conditions in parallel. Our use of pre-normalization [Xiong et al. 2020] converts the attention results into residuals, enabling the addition of extra conditions as parallel residuals alongside the text condition, which can be expressed as:

    Z ← Z + CrossAttn(Z, c) + Σ_{i=1}^{n} α_i · CrossAttn_i(Z, c_i),    (4)

where CrossAttn denotes the original text conditioning, CrossAttn_i denotes the i-th additional trainable module, and c_i is the i-th condition. The scalar α_i in this residual formulation allows for direct manipulation of the influence exerted by each additional condition.
Fig. 7. Illustration of our network's conditioning design across various modalities. When used together, they support the creation of cinematic scenes with lifelike renderings.

Table 2. Conditioning module specifications.

Conditioning | n_params | M | C | Backbone
Image/Sketch | 352M | 257 | 1536 | DINOv2-Giant
Voxel | 260M | 8³ | 512 | /
Multi-view images | 358M | 8³ | 768 | DINOv2-Small
Point cloud | 252M | 512 | 512 | /
Bounding box | 252M | 8 | 512 | /
Partial point cloud | 252M | 2048+8 | 512 | /

While this conditioning scheme is general, obtaining the embedded condition c_i requires careful calibration. For image/sketch conditions, we utilize the pretrained DINOv2 [Oquab et al. 2024] model to extract features as conditions and directly integrate them using the cross-attention in the above equation. However, for spatially related modalities such as voxel, multi-view images, point cloud, bounding box, and partial point cloud with an extension box, directly applying cross-attention on features does not guarantee preservation of the spatial information pertaining to those conditions. To maintain spatial integrity, we have devised a specific learning strategy.

Spatial Control. Our 3D geometry generative model incorporates conditions in 3D modalities, a unique feature absent in previous approaches. This allows for spatial controls similar to those in 2D diffusion models. However, different from 3D UNet structures with convolutional backbones that naturally maintain spatial resolution, our approach uses a VAE that dynamically generates latent codes interwoven with spatial coordinates, imposing a new set of challenges for achieving precise spatial controls.

To address the integration of 3D conditions, we set out to learn additional positional embeddings for spatial features. This allows our attention layer to differentiate point coordinates from their features effectively. We start by associating the feature embedding f ∈ R^{M×C}, learned during fine-tuning or extracted from a backbone network, with sparse 3D points p ∈ R^{M×3} sampled based on the type of condition being used, where M and C are the length and channels of the specific conditioning embedding. The exact sampling strategy is tailored to each condition type and will be detailed subsequently. We then apply cross-attention more specifically as:

    CrossAttn_i(Z, f + PosEmb(p)),    (5)

where PosEmb(·) is the learnable positional embedding. This method allows for the effective integration of various 3D modalities into our model.

5.2 Implementation

We discuss how to implement a variety of conditions for controlled 3D content generation. Each condition involves independently training an additional CrossAttn_i(·) while keeping other parameters fixed. Fig. 7 and Table 2 showcase the specifications and hyperparameters of training for each condition. The base model and training data are described in Sec. 6.

Images and Sketches. For image and sketch conditions, we use the pretrained Vision Transformer (ViT) DINOv2 to extract both patch and global features. These features are integrated into CLAY via cross-attention, as indicated in Eqn. 4. This module is trained using rendered RGB images and corresponding sketches from our dataset, ensuring alignment between the generated 3D models and the visual characteristics of the conditioning images or sketches.

Voxel. Voxels represent spatial cubes and provide an intuitive medium for 3D construction. To integrate voxel-based guidance, we initially construct a 16³ voxel grid for each 3D object in our dataset, marking each cell as occupied or vacant. These voxel grids are down-sampled to an 8³ feature volume using 3D convolution. The volume features f ∈ R^{8³×C}, added with the positional embeddings of the volume centers PosEmb(p), are then flattened and integrated into the DiT through cross-attention. After training, CLAY can generate 3D geometries that correspond to user-defined voxel structures, effectively translating abstract voxel designs into intricate 3D forms.
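A minimal sketch of such a voxel branch is given below: the 16³ occupancy grid is reduced to an 8³ feature volume, summed with a learnable positional embedding of the voxel centers, and flattened into M = 512 tokens that feed CrossAttn_i in Eq. (5). The channel width C = 512 follows Table 2; the specific convolution stack is an assumption.

```python
import torch
import torch.nn as nn

class VoxelCondition(nn.Module):
    """Voxel conditioning sketch: 16^3 occupancy -> 8^3 feature volume -> 512 tokens."""
    def __init__(self, c=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, c // 2, kernel_size=3, stride=2, padding=1),   # 16^3 -> 8^3
            nn.GELU(),
            nn.Conv3d(c // 2, c, kernel_size=3, padding=1))
        self.pos_emb = nn.Parameter(torch.zeros(8 ** 3, c))             # PosEmb(p), learnable

    def forward(self, voxels):                  # voxels: (B, 16, 16, 16) occupancy in {0, 1}
        f = self.conv(voxels.unsqueeze(1).float())      # (B, C, 8, 8, 8)
        f = f.flatten(2).transpose(1, 2)                # (B, 512, C) tokens
        return f + self.pos_emb                         # f + PosEmb(p), fed to CrossAttn_i

# Usage: condition tokens for one user-drawn voxel shape.
tokens = VoxelCondition()(torch.randint(0, 2, (1, 16, 16, 16)))
```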
Bounding Boxes. Bounding boxes provide a straightforward way for users to control the aspect ratio and position of 3D objects, essential in interactive generation applications. The bounding box features f ∈ R^{8×C}, added with the positional embeddings PosEmb(p) of the eight box corners, are learned during condition fine-tuning, enabling precise spatial control.

Sparse Point Cloud. Point clouds offer an easily accessible abstraction for 3D shapes. CLAY can use sparse point clouds as conditions to generate variants from input meshes or points. For this, we set the feature embeddings f = 0, which indicates no feature embedding, sample 512 points as p, and learn the corresponding positional embedding PosEmb(p). This allows CLAY to generate detailed 3D geometries based on sparse surface point clouds while maintaining the overall shape and appearance.

Multi-view Images. CLAY also supports multi-view images or multi-view normal maps as conditions, offering spatial control through projected views of 3D geometries. As a demonstration, we use DINOv2 to extract features from the images of various views generated by Wonder3D. These features are back-projected into a 3D volume similar to the previous method [Liu et al. 2024a], then down-sampled and flattened for integration into the DiT using cross-attention, a procedure similar to the voxel condition.

Partial Point Cloud with Extension Box. This condition specifically aims to address the point cloud completion task, where a certain bounding box indicates the generation region of missing parts. We merge the input point cloud with the corner points of an extension box, applying an approach similar to learning bounding box conditioning and sparse point cloud conditioning by concatenating these two sets of features. This integration is instrumental in the effective reconstruction of incomplete geometries, precisely within the specified extension areas.

6 RESULTS

We have trained five base models of different model sizes using our full training data with a latent code length of L = 1024, ranging from Tiny-base to XL-base. Based on Large-base and XL-base, we have trained Large-P and XL-P on a high-quality subset of our training data including 300K objects, using a latent code length of L = 1024. Based on Large-P and XL-P, we have further trained using the same subset data but with a longer latent code length of L = 2048 (Large-P-HD and XL-P-HD in Table 3). For adaptations including LoRA fine-tuning and conditioning, we have trained these modules based on XL-P using the same high-quality subset data, with each module independently trained for 8 hours.

Next, we demonstrate the generation results with various conditioning using the XL-P model of CLAY. Fig. 8 illustrates a sample collection of 3D models generated by CLAY, demonstrating its versatility in producing a wide range of objects with intricate details and textures. From ancient tools to futuristic spacecraft, the collection traces through a fascinating human history of imagination, celebrating the fusion of art, tech, and human ingenuity as well as embracing our rich cultural heritage. The array also includes technologically advanced vehicles, cultural artifacts, everyday items, and imaginative elements, all of which highlight the model's capacity for high-fidelity and varied 3D creations suitable for applications in gaming, film, and virtual simulations.

Fig. 9 showcases CLAY's conditioning capabilities across different modalities. With image conditioning, CLAY generates geometric entities that faithfully resemble the input images, be they real-world photos, AI-generated concepts, or hand-drawn sketches. CLAY also allows for the creation of entire towns or bedrooms from scattered bounding boxes. Using multi-view images, it reliably reconstructs 3D geometries from multiple perspectives and normal maps. CLAY further manages to generate from sparse point clouds, indicating it can also serve as an effective surface reconstruction tool, analogous to but outperforming GCNO [Xu et al. 2023] from as few as 512 points in the "knot" case. Additionally, CLAY can be used to further improve 3D geometries generated by existing techniques while maintaining sharp edges and flat surfaces largely missing in prior art. Diversity-wise, CLAY excels in generating rich varieties in shapes from the same voxel input, transforming the same coarse shape into anything from a futuristic monument to a Medieval castle, from an SUV to a space shuttle, resembling our unlimited imagination. Finally, CLAY can be used to complete missing parts from partially available geometry and therefore serves as both a geometry completion tool and an editing tool. For example, it allows us to alter a monster's body or turn a companion robot into a battle-ready counterpart, a Star Wars fantasy for many.

6.1 Evaluations

We have conducted comprehensive evaluations of CLAY, focusing on various aspects including model sizes, conditioning types, prompt engineering, multi-view conditioning, and geometry diversity.

Quantitative Evaluations. Here we evaluate nine versions of CLAY as illustrated in Table 3. The text-to-shape evaluation employs metrics including render-FID, render-KID, P-FID, P-KID, CLIP, and ULIP-T, using a 16K text-shape pair validation set. We apply FID and KID to both 2D (image rendering) and 3D (point cloud) feature spaces. For render-FID and render-KID, images are rendered from eight views, and PointNet++ [Qi et al. 2017] is used to extract 3D features for P-FID and P-KID assessments. Additionally, we utilize CLIP-ViT-L/14 [Radford et al. 2021] for evaluating text-rendering similarity and ULIP-2 [Xue et al. 2023] for text-shape alignment. Specifically, ULIP-T is defined as ULIP-T(T, S) = ⟨E_T, E_S⟩, corresponding to the inner product of the normalized ULIP features of caption T and generated geometry S. Table 3 reveals the apparent trend that larger models excel over smaller ones in text-to-shape generation tasks, demonstrated by higher scores and more accurate text-shape alignment.

We have also evaluated various conditioning modules, including image, multi-view normal, bounding box, and voxel, using XL-P as the base model. Additional metrics such as Chamfer Distance (CD), Earth Mover's Distance (EMD), Voxel-IoU, and F-score are employed to assess conditioned shape generation accuracy. We further introduce ULIP-I to evaluate the alignment between the condition image and the generated shapes. Both ULIP-T and ULIP-I are assessed across all conditions, except a few, such as voxel, that do not utilize text or image inputs. Table 4 shows that with just a single condition, CLAY already manages to generate geometry of very high fidelity. Applying additional conditions further improves geometric details while maintaining high alignment with the ground-truth text or image at the feature level. It is worth mentioning that among all settings, our multi-view normal (MVN) conditioning model exhibits one of the most outstanding performances. Therefore, CLAY can also be deemed a reliable reconstruction back-end for other multi-view generation models [Long et al. 2024; Shi et al. 2024].

Fig. 8. Evolution of human innovation, from primitive tools and cultural artifacts to modern electronics and futuristic imaginings, generated by CLAY.

Fig. 9. Sample creations using CLAY, with conditions marked in sky blue and input geometries for respective conditioning (if available) in sandy brown.
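The image-conditioned creations in Fig. 9 rely on the image/sketch branch of Sec. 5.2, which cross-attends to DINOv2 tokens (257 tokens of width 1536 per Table 2). A minimal sketch of extracting such condition tokens is given below; the torch.hub entry point and the forward_features keys follow the public DINOv2 release and should be treated as assumptions if that API changes.

```python
import torch

# Pretrained DINOv2 ViT-g/14 backbone (1536-dim features), as listed in Table 2.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14").eval()

@torch.no_grad()
def image_condition(image):
    """image: (B, 3, 224, 224) normalized RGB render or sketch.
    Returns (B, 257, 1536): 256 patch tokens plus 1 global token, i.e. the
    M = 257, C = 1536 condition embedding of the Image/Sketch row in Table 2."""
    feats = backbone.forward_features(image)
    patch = feats["x_norm_patchtokens"]           # (B, 256, 1536) local features
    cls = feats["x_norm_clstoken"].unsqueeze(1)   # (B, 1, 1536) global feature
    return torch.cat([cls, patch], dim=1)         # condition tokens c_i for CrossAttn_i

# tokens = image_condition(torch.randn(1, 3, 224, 224))
```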
Table 3. Quantitative evaluation of Text-to-3D for models of different sizes.

Model name | Latent length | render-FID↓ | render-KID(×10³)↓ | P-FID↓ | P-KID(×10³)↓ | CLIP(I-T)↑ | ULIP-T↑
Tiny-base | 1024 | 12.2241 | 3.4861 | 2.3905 | 4.1187 | 0.2242 | 0.1321
Small-base | 1024 | 11.2982 | 4.2074 | 1.9332 | 4.1386 | 0.2319 | 0.1509
Medium-base | 1024 | 13.0596 | 5.4561 | 1.4714 | 2.7708 | 0.2311 | 0.1511
Large-base | 1024 | 6.5732 | 2.3617 | 0.8650 | 1.6377 | 0.2358 | 0.1559
XL-base | 1024 | 5.2961 | 1.8640 | 0.7825 | 1.3805 | 0.2366 | 0.1554
Large-P | 1024 | 5.7080 | 1.9997 | 0.7148 | 1.2202 | 0.2360 | 0.1565
XL-P | 1024 | 4.0196 | 1.2773 | 0.6360 | 1.0761 | 0.2371 | 0.1564
Large-P-HD | 2048 | 5.5634 | 1.8234 | 0.6394 | 0.9170 | 0.2374 | 0.1578
XL-P-HD | 2048 | 4.4779 | 1.4486 | 0.5072 | 0.5180 | 0.2372 | 0.1569

Table 4. Quantitative evaluation of Multi-modal-to-3D for different conditions and their combinations.

Condition | CD(×10³)↓ | EMD(×10²)↓ | Voxel-IoU↑ | F-Score↑ | P-FID↓ | P-KID(×10³)↓ | ULIP-T↑ | ULIP-I↑
Image | 12.4092 | 17.6155 | 0.4513 | 0.4070 | 0.9946 | 1.9889 | 0.1329 | 0.2066
MVN | 0.9924 | 5.7283 | 0.7697 | 0.8218 | 0.3038 | 0.2420 | 0.1393 | 0.2220
Voxel | 0.5676 | 8.4254 | 0.6273 | 0.6049 | 2.6963 | 5.0008 | 0.1186 | 0.1837
Image-Bbox | 5.4733 | 14.0811 | 0.5122 | 0.4909 | 1.5884 | 3.2994 | 0.1275 | 0.2028
Image-Voxel | 0.7491 | 8.1174 | 0.6514 | 0.6541 | 2.4866 | 6.8767 | 0.1262 | 0.2017
Text-Image | 7.7198 | 14.5489 | 0.4980 | 0.4609 | 0.7996 | 1.4489 | 0.1407 | 0.2122
Text-MVN | 0.7301 | 5.4034 | 0.7842 | 0.8358 | 0.2184 | 0.1233 | 0.1424 | 0.2240
Text-Bbox | 5.6421 | 14.6170 | 0.4921 | 0.4659 | 2.0074 | 4.0355 | 0.1417 | 0.1838
Text-Voxel | 0.6090 | 7.4981 | 0.6737 | 0.6689 | 1.0427 | 1.0903 | 0.1397 | 0.2036
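The ULIP-T and ULIP-I columns above reduce to an inner product of L2-normalized ULIP-2 embeddings (Sec. 6.1), i.e. a cosine similarity. A minimal sketch, with the ULIP encoders themselves left out:

```python
import torch

def ulip_alignment(feat_a, feat_b):
    """ULIP-T / ULIP-I as used in Tables 3-4: the inner product of L2-normalized
    ULIP embeddings. feat_a / feat_b: (D,) embeddings of a caption (or condition
    image) and of the generated geometry, produced by a ULIP-2 encoder."""
    a = feat_a / feat_a.norm()
    b = feat_b / feat_b.norm()
    return torch.dot(a, b).item()

# e.g. ulip_alignment(text_feat, shape_feat) -> ULIP-T; image vs. shape -> ULIP-I
```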

We further introduce ULIP-I to evaluate alignment between the condition image and the generated shapes. Both ULIP-T and ULIP-I are assessed across all conditions, except a few, such as voxel, that do not utilize text or image inputs. Table 4 shows that with as few as a single condition, CLAY already manages to generate geometry of very high fidelity. Applying additional conditions further improves geometric details while maintaining high alignment with the ground-truth text or image at the feature level. It is worth mentioning that among all settings, our multi-view normal (MVN) conditioning model exhibits one of the most outstanding performances. Therefore, CLAY can also be deemed a reliable reconstruction back-end for other multi-view generation models [Long et al. 2024; Shi et al. 2024].

Prompt engineering. We further explore the effects of varied prompt tags on geometry generation, as illustrated in Fig. 10. For example, by incorporating "asymmetric geometry" into our prompts, CLAY successfully generates an asymmetric table and church. Similarly, the transition from "sharp edges" to "smooth edges" manages to modify Pikachu and a dog into more rounded shapes. Interestingly, typical 3D models composed of high-polygon meshes, such as aircraft and tanks, can be transformed into low-polygon variants using CLAY. In contrast, the "complex geometry" tag prompts the generation of intricate details in a chandelier and a sofa. Adding "character" transforms inanimate objects such as a fireplug and a mailbox into anthropomorphic figures, reminiscent of the magic taught at Hogwarts. This experiment further indicates that specific annotated tags applied during training can effectively steer the model to produce geometries with desired complexities and styles, enhancing the quality and specificity of the generated shapes.

Geometry Diversity. CLAY also excels at generating high-quality geometries with rich diversity. In Fig. 11, we showcase the results generated by CLAY conditioned on either text or image inputs, alongside the most relevant samples retrieved from the dataset. To perform geometry retrieval, we utilize cosine similarity to compare the normalized ULIP feature of the generated geometry with that of geometries in the dataset. With text inputs, CLAY manages to generate novel shapes that differ from any existing ones in the dataset. When presented with image inputs, CLAY faithfully reconstructs the content of the image while introducing novel structural combinations that are absent from the dataset. For instance, the airplane depicted at the bottom of Fig. 11 represents a novel concept art piece generated by AI. It features the fuselage of a passenger airplane, uniquely merged with square air intakes and tail fins reminiscent of a fighter jet, a design composite that is never seen in the training data. Nevertheless, CLAY accurately generates its 3D geometry, capturing a high degree of resemblance to the provided image.

Effectiveness of MVN Conditioning. While single-image conditioning tends to allow for more liberty in creation, multi-view conditioning harnesses multiple perspectives to deliver more detailed and precise control over the targeted generation, akin to a pixel-aligned sparse-view reconstruction approach. Fig. 12 shows an example where we use an initial image of a panther's head (top left) as a starting point. This image, when processed through our single-image conditioning, yields a solid 3D geometry (left column). In contrast, when the concept is further solidified using Wonder3D to generate multi-view images and corresponding normal maps, it results in a panther face mask with a notably thin surface (top right). Based on these multi-view images, our multi-view image conditioning using normal maps successfully harnesses these multiple views, leading to a faithful yet efficient synthesis of the thin surface (center column), distinct from the traditional NeuS method applied to Wonder3D's outputs (right column). This comparison underscores the precision and efficiency of our multi-view image conditioning in guiding the generation of detailed 3D geometries.
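The dataset retrieval behind Fig. 11 reduces to a cosine-similarity search over L2-normalized ULIP features. The sketch below illustrates that step; ulip_encode_shape is a stand-in name for the ULIP shape encoder and not an actual API.

```python
# Hedged sketch of the Fig. 11 retrieval: top-k nearest dataset samples by cosine
# similarity of L2-normalized ULIP shape features. `ulip_encode_shape` is a placeholder.
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve_nearest(generated_feat, dataset_feats, k=3):
    """generated_feat: (D,); dataset_feats: (N, D); returns indices of the top-k matches."""
    q = l2_normalize(generated_feat[None, :])   # (1, D)
    db = l2_normalize(dataset_feats)            # (N, D)
    sims = (db @ q.T).squeeze(-1)               # cosine similarities, (N,)
    return np.argsort(-sims)[:k]

# Usage, assuming features are precomputed with a ULIP encoder:
# top3 = retrieve_nearest(ulip_encode_shape(generated_mesh), dataset_feature_bank, k=3)
```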

[Fig. 10 panel labels, one tag pair per row: symmetric ↔ asymmetric, sharp ↔ smooth, low-poly ↔ high-poly, simple ↔ complex, original ↔ character]

Fig. 10. Evaluation of CLAY's ability to alter generated content by incorporating different geometric feature tags in the prompt. We showcase precise controls over the geometry style, in the extreme case transforming a fireplug into a T-pose character.

[Fig. 11 columns: Input, CLAY, Nearest dataset samples]
Fig. 11. Evaluation of the geometry diversity. We present the top-3 nearest samples retrieved from the dataset. CLAY generates high-quality geometries that match the description but are distinct from the ones in the dataset.
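To make the tag mechanism of Fig. 10 concrete, the snippet below shows how such annotated tags might be appended to a base prompt. The tag vocabulary mirrors the figure and the surrounding text, while generate_3d is a hypothetical sampling call rather than CLAY's actual interface.

```python
# Hedged illustration of tag-augmented prompting; `generate_3d` is hypothetical.
GEOMETRY_TAGS = {
    "symmetry": ("symmetric geometry", "asymmetric geometry"),
    "edges": ("sharp edges", "smooth edges"),
    "polycount": ("low-poly", "high-poly"),
    "complexity": ("simple geometry", "complex geometry"),
    "persona": ("original", "character"),
}

def build_prompt(subject: str, **choices: int) -> str:
    """Append one tag per chosen category, e.g. build_prompt('a fireplug', persona=1)."""
    tags = [GEOMETRY_TAGS[name][idx] for name, idx in choices.items()]
    return ", ".join([subject] + tags)

prompt = build_prompt("a wooden church", symmetry=1, complexity=1)
# -> "a wooden church, asymmetric geometry, complex geometry"
# shape = generate_3d(prompt)  # hypothetical sampling call
```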
Running Time. Regarding the inference timing breakdown, on a single Nvidia A100 GPU it takes CLAY about 4 seconds for shape latent generation, 1 second to decode the latent thanks to the efficient adaptive sampling, 8 seconds for mesh processing, and 32 seconds for PBR generation, cumulatively resulting in a total generation time of 45 seconds.

Table 5. Quantitative comparison with state-of-the-art methods.

Text-to-3D     CLIP(N-T)  CLIP(I-T)  ULIP-T  ULIP-I  Time
Shap-E         0.1761     0.2081     0.1160  /       ∼10s
DreamFusion    0.1549     0.1781     0.0566  /       ∼1.5h
Magic3D        0.1553     0.2034     0.0661  /       ∼1.5h
MVDream        0.1786     0.2237     0.1351  /       ∼1.5h
RichDreamer    0.1891     0.2281     0.1503  /       ∼2h
CLAY           0.1948     0.2324     0.1705  /       ∼45s

Image-to-3D    CLIP(N-I)  CLIP(I-I)  ULIP-T  ULIP-I  Time
Shap-E         0.6315     0.6971     /       0.1307  ∼10s
Wonder3D       0.6489     0.7220     /       0.1520  ∼4min
DreamCraft3D   0.6641     0.7718     /       0.1706  ∼4h
One-2-3-45++   0.6271     0.7574     /       0.1743  ∼90s
Michelangelo   0.6726     /          /       0.1899  ∼10s
CLAY           0.6848     0.7769     /       0.2140  ∼45s

6.2 Comparisons with SOTA
We compare our method with leading text-to-3D approaches, namely Shap-E [Jun and Nichol 2023], DreamFusion [Poole et al. 2023], Magic3D [Lin et al. 2023], MVDream [Shi et al. 2024], and RichDreamer [Qiu et al. 2024]. We utilize the open-source code for Shap-E, MVDream, and RichDreamer, while for DreamFusion and Magic3D we employ a third-party implementation [Guo et al. 2023].

[Fig. 12 columns: CLAY-single image, CLAY-MVN (from 4 views), Wonder3D-NeuS (from 6 views)]
Fig. 12. Geometry generation via single image and multi-view image conditioning with multi-view RGB and normal images generated by Wonder3D.

[Fig. 13 columns: Shap-E (∼10s), DreamFusion (∼1.5h), Magic3D (∼1.5h), MVDream (∼1.5h), RichDreamer (∼2h), CLAY (∼45s)]
Fig. 13. Comparisons of CLAY vs. state-of-the-art methods on text-conditioned generation. From top to bottom: "Mythical creature dragon", "Stag deer", "Interstellar warship", "Space rocket", and "Eagle, wooden statue".
Qualitative Comparison. On Text-to-3D tasks, Fig. 13 illustrates the comparison using normal maps, with text inputs such as "Mythical creature dragon", "Stag deer", "Interstellar warship", "Space rocket", and "Eagle, wooden statue". Shap-E exhibits faster generation but lacks complete geometry structures. Pure SDS optimization methods like DreamFusion and Magic3D exhibit the multi-face Janus artifacts. MVDream and RichDreamer, which generate multi-view images for SDS, produce consistent geometries but exhibit a deficiency in surface smoothness and require long optimization times. In contrast, CLAY manages to produce high-quality 3D assets in roughly 45 seconds (5 seconds for geometry and 40 seconds for texture). The generated geometries exhibit smooth surfaces without compromising intricate details, better matching the text prompts.

We have further compared the image-to-3D generation quality of CLAY against SOTA methods (Shap-E [Jun and Nichol 2023], Wonder3D [Long et al. 2024], One-2-3-45++ [Liu et al. 2024b], DreamCraft3D [Sun et al. 2024], and Michelangelo [Zhao et al. 2023]). We use the official code of the respective techniques, except for One-2-3-45++, where only its online demo is available. Our evaluations include inputs like Chair, Car, Dragon Head, and Sword, detailed in Fig. 14. Note that Michelangelo produces only geometries and we manually assign a similar color for rendering. Shap-E, while fast, fails to accurately reconstruct the input images, resulting in incomplete geometries. Wonder3D, which relies on multi-view image and normal prediction followed by NeuS [Wang et al. 2021a] reconstruction, produces coarse and incomplete geometries due to inconsistencies among the multi-view output. One-2-3-45++ is efficient in creating smooth geometries but lacks details and does not fully maintain symmetry, especially on complex objects such as Chairs and Dragons. DreamCraft3D is an SDS optimization method that produces high-quality output, but it is time-consuming and still results in uneven surfaces. CLAY, in contrast, manages to quickly generate detailed and high-quality geometries along with high-quality PBR textures.

Quantitative Comparisons. We perform additional quantitative comparison using a GPT-4 generated test dataset that includes 50 images and 50 text prompts tailored for text-to-3D and image-to-3D evaluations, respectively. In addition to ULIP-T and ULIP-I, we render 30 views of RGB images and normal maps for each generated 3D asset. We apply four CLIP-based metrics to these views, calculating the average to provide a comprehensive assessment. CLIP(N-I) and CLIP(N-T) gauge the geometric alignment of the normal map with the input image and text, respectively, whereas CLIP(I-I) and CLIP(I-T) evaluate the appearance by measuring the similarity of rendered images with the input images and text. As shown in Table 5, CLAY outperforms SOTA techniques in all metrics.
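As a reference for the protocol above, the sketch below shows how a view-averaged score such as CLIP(I-T) could be computed with an off-the-shelf CLIP model. The checkpoint choice and the pre-rendered 30 views are assumptions; CLIP(N-T), CLIP(N-I), and CLIP(I-I) follow the same recipe with normal-map renders or an input-image embedding in place of the text embedding.

```python
# Hedged sketch of a view-averaged CLIP(I-T) score: cosine similarity between each
# rendered view and the text prompt, averaged over views. Checkpoint and views assumed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(view_paths, prompt):
    images = [Image.open(p).convert("RGB") for p in view_paths]
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()  # average cosine similarity over all views

# score = clip_score([f"views/{i:02d}.png" for i in range(30)], "Space rocket")
```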

[Fig. 14 columns: Input, Shap-E (∼10s), Wonder3D (∼4min), One-2-3-45++ (∼90s), DreamCraft3D (∼4h), Michelangelo (∼10s), CLAY (∼45s)]
Fig. 14. Comparison with state-of-the-art methods on image-conditioned generation. Even without performing optimization using the target view, CLAY still generates high-quality and detailed geometries that faithfully resemble the input image, preserving essential geometric features, including straight lines and matching surface curvatures. Note that all input images are generated by Stable Diffusion. Colors of Michelangelo are manually set.

[Fig. 15 columns: MVDream, RichDreamer, CLAY]
Fig. 15. Comparison of rendering results under two distinct lighting conditions. The light probes are displayed at the top-right corner. Our method showcases high-quality rendering with accurate specular highlights, whereas MVDream lacks matching highlights and RichDreamer misses view dependency by modeling highlights as fixed surface textures.

Fig. 16. User studies of CLAY vs. state-of-the-art methods indicate a strong preference for CLAY in both geometry and appearance.

PBR Material Comparison. Another key component of CLAY is material generation. Here we show visual comparisons between CLAY and two leading methods, MVDream [Shi et al. 2024] and RichDreamer [Qiu et al. 2024], using the text prompt "Space rocket". Fig. 15 illustrates that, under varying lighting conditions, MVDream without PBR materials cannot fully reproduce specular highlights. RichDreamer, employing an albedo diffusion model, attempts to distinguish the albedo from complex lighting effects. In this case, though, the highlights are modeled as fixed surface textures under changing environment lighting, e.g., on the rocket's head. In contrast, CLAY faithfully models PBR materials, where the rocket's metallic surfaces exhibit realistic highlights that move consistently with the moving environment lighting. This also showcases the potential advantages of separating the generation of geometry and texture.

User studies. We have conducted a comprehensive user study, structured around two primary evaluations: appearance quality for visualization and geometry quality for modeling. We have created a test set consisting of 5 text prompts generated by GPT-4 and 15 images generated by Stable Diffusion. A total of 150 volunteers participated in the study, each evaluating 15 randomly chosen questions to determine their preferred method. We compare CLAY with leading approaches on Text-to-3D and Image-to-3D tasks, respectively. Fig. 16 shows that CLAY outperforms others in both appearance and geometry on text-to-3D and image-to-3D tasks. Specifically, CLAY secured 67.4% of the votes for appearance and 78.9% for geometry in text-to-3D, surpassing the second-ranked RichDreamer, which had a notably longer optimization time of ∼2 hours compared to our ∼45 seconds. In Image-to-3D, CLAY further garnered 85.4% and 91.2% of the votes for appearance and geometry, respectively.

7 DISCUSSIONS AND CONCLUSIONS
We have presented CLAY, a large-scale 3D generative model that supports multi-modal controls for high-quality 3D asset generation, further bridging the gap between the vivid realms of human imagination and the tangible world of digital creation. By enabling users to effortlessly craft and manipulate digital geometry and textures, CLAY empowers experts and novices alike, facilitating the seamless transformation of abstract concepts into detailed and realistic 3D models and expanding the horizons of digital artistry and design. At CLAY's core is a large-scale generative framework enabled by a multi-resolution VAE and a DiT to accurately depict continuous surfaces and complex shapes. We have shown how to scale up CLAY efficiently through a progressive training scheme to become a large 3D generative model. Its success is also largely attributed to our elaborately designed geometric data processing pipeline, including a standardized geometry remeshing protocol to ensure consistency in training and the automatic annotation capabilities provided by GPT-4V. Comprehensive experimental evaluations and user studies have demonstrated CLAY's efficacy and adaptability. Its high geometry quality, rich diversity, and material richness position CLAY as one of the leading 3D generators in the field.

Ethics Statement. As with 2D content, 3D generative models have the potential to produce deceptive content. Although we have implemented rigorous scrutiny processes for our training data, the utilization of pretrained feature encoders (CLIP [Radford et al. 2021] for text encoding and DINO [Oquab et al. 2024] for image encoding) in CLAY introduces a high level of generalization capability that carries the risk of potential misuse. This means there is a possibility that our model could be used to generate virtual assets or scenes that violate regulations and propagate false information.

We are committed to addressing these ethical issues and, together with the whole community, to developing strategies that ensure the responsible use of CLAY.

Limitations and Future Work. It is important to note that CLAY is not yet complete end-to-end, as it entails distinct stages for generating geometry and materials, and requires additional steps such as remeshing and UV unwrapping. An immediate future step is to explore integrated model architectures that unify geometry and PBR materials. This will require implementing automatic schemes to produce geometry with consistent topology. So far, CLAY has been trained on a substantially large dataset. However, there is still room for improvement in terms of both the quantity and quality of the training data, especially compared with the 2D image datasets used to train Stable Diffusion. Further, we observe that CLAY shows robustness in generating assets composed of single objects but tends to be vulnerable when dealing with complex "composed objects", such as "a tiger riding a motorcycle", particularly with text-only inputs. The issue is largely attributed to insufficient training data of composed objects and the lack of detailed textual descriptions of these objects. The issue can potentially be mitigated through a text-to-image-to-3D workflow, akin to the approaches employed by Wonder3D [Long et al. 2024] and One-2-3-45++ [Liu et al. 2023d]. As the community augments the training dataset with a larger and more diverse collection of 3D shapes along with corresponding text descriptions, we expect CLAY, as well as its concurrent works, to reach a new level of geometry generation, in both quality and complexity. Finally, we intend to explore extensions of CLAY to dynamic object generation. The generated results from CLAY indicate that it may be possible to semantically partition the geometry into meaningful parts, further facilitating motion and interaction, as in Singer et al. [2023] and Ling et al. [2024].

REFERENCES
Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII. Springer, 628–644.
M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. 2023. Objaverse: A Universe of Annotated 3D Objects. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 13142–13153. https://doi.org/10.1109/CVPR52729.2023.01263
Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 605–613.
Andrea Gesmundo and Kaitlin Maile. 2023. Composable Function-preserving Expansions for Transformer Architectures. arXiv:2308.06103 [cs.LG]
Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. 2018. A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 216–224.
Yuan-Chen Guo, Ying-Tian Liu, Ruizhi Shao, Christian Laforte, Vikram Voleti, Guan Luo, Chia-Hao Chen, Zi-Xin Zou, Chen Wang, Yan-Pei Cao, and Song-Hai Zhang. 2023. threestudio: A unified framework for 3D content generation. https://github.com/threestudio-project/threestudio.
Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 2023. 3DGen: Triplane Latent Diffusion for Textured Mesh Generation. arXiv:2303.05371 [cs.CV]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 6840–6851. https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. 2024. LRM: Large Reconstruction Model for Single Image to 3D. In The Twelfth International Conference on Learning Representations.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
Jingwei Huang, Hao Su, and Leonidas J. Guibas. 2018a. Robust Watertight Manifold Surface Generation Method for ShapeNet Models. arXiv:1802.01698 http://arxiv.org/abs/1802.01698
Jingwei Huang, Yichao Zhou, and Leonidas Guibas. 2020. ManifoldPlus: A Robust and Scalable Watertight Manifold Surface Generation Method for Triangle Soups. arXiv:2005.11621 [cs.GR]
Jingwei Huang, Yichao Zhou, Matthias Niessner, Jonathan Richard Shewchuk, and Leonidas J. Guibas. 2018b. QuadriFlow: A Scalable and Robust Method for Quadrangulation. Computer Graphics Forum 37 (2018). https://doi.org/10.1111/cgf.13498
Yukun Huang, Jianan Wang, Yukai Shi, Boshi Tang, Xianbiao Qi, and Lei Zhang. 2024. DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation. In The Twelfth International Conference on Learning Representations.
Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. 2023. MultiDiffusion: fusing Heewoo Jun and Alex Nichol. 2023. Shap-E: Generating Conditional 3D Implicit
diffusion paths for controlled image generation. , 16 pages. Functions. arXiv:2305.02463 [cs.CV]
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023.
Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions
and Robin Rombach. 2023. Stable Video Diffusion: Scaling Latent Video Diffusion on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-
Models to Large Datasets. arXiv:2311.15127 [cs.CV] splatting/
Blender Online Community. 2024. Blender - a 3D modelling and rendering package. Sixu Li, Chaojian Li, Wenbo Zhu, Boyang (Tony) Yu, Yang (Katie) Zhao, Cheng Wan, Hao-
http://www.blender.org. ran You, Huihong Shi, and Yingyan (Celine) Lin. 2023. Instant-3D: Instant Neural Ra-
Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, diance Field Training Towards On-Device AR/VR 3D Reconstruction. In Proceedings
Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, of the 50th Annual International Symposium on Computer Architecture (Orlando,
Li Yi, and Fisher Yu. 2015. ShapeNet: An Information-Rich 3D Model Repository. FL, USA) (ISCA ’23). Association for Computing Machinery, New York, NY, USA,
arXiv:1512.03012 [cs.GR] Article 6, 13 pages. https://doi.org/10.1145/3579371.3589115
Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Weiyu Li, Rui Chen, Xuelin Chen, and Ping Tan. 2024. SweetDreamer: Aligning Geomet-
Nießner. 2023b. Text2Tex: Text-driven Texture Synthesis via Diffusion Models. In ric Priors in 2D diffusion for Consistent Text-to-3D. In The Twelfth International
2023 IEEE/CVF International Conference on Computer Vision (ICCV). 18512–18522. Conference on Learning Representations.
https://doi.org/10.1109/ICCV51070.2023.01701 C. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M. Liu,
R. Chen, Y. Chen, N. Jiao, and K. Jia. 2023a. Fantasia3D: Disentangling Geometry and T. Lin. 2023. Magic3D: High-Resolution Text-to-3D Content Creation. In
and Appearance for High-quality Text-to-3D Content Creation. In 2023 IEEE/CVF 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
International Conference on Computer Vision (ICCV). IEEE Computer Society, Los IEEE Computer Society, Los Alamitos, CA, USA, 300–309. https://doi.org/10.1109/
Alamitos, CA, USA, 22189–22199. https://doi.org/10.1109/ICCV51070.2023.02033 CVPR52729.2023.00037
Zilong Chen, Feng Wang, and Huaping Liu. 2024. Text-to-3D using Gaussian Splat- Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. 2024. Common diffusion noise
ting. In Proceedings of the IEEE/CVF conference on computer vision and pattern schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter
recognition. Conference on Applications of Computer Vision. 5404–5411.
Zhiqin Chen and Hao Zhang. 2019. Learning implicit fields for generative shape mod- Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. 2024.
eling. In Proceedings of the IEEE/CVF conference on computer vision and pattern Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed
recognition. 5939–5948. Diffusion Models. In Proceedings of the IEEE/CVF conference on computer vision
Y. Cheng, H. Lee, S. Tulyakov, A. Schwing, and L. Gui. 2023. SDFusion: Multimodal 3D and pattern recognition.
Shape Completion, Reconstruction, and Generation. In 2023 IEEE/CVF Conference Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Han-
on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los sheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. 2024b. One-2-3-45++: Fast
Alamitos, CA, USA, 4456–4465. https://doi.org/10.1109/CVPR52729.2023.00433

Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffu- Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-
sion. In Proceedings of the IEEE/CVF conference on computer vision and pattern to-3D using 2D Diffusion. In The Eleventh International Conference on Learning
recognition. Representations.
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017. PointNet++: deep hierarchical
Hao Su. 2023d. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without feature learning on point sets in a metric space. 30 (2017), 5105–5114.
Per-Shape Optimization. In Advances in Neural Information Processing Systems, Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li,
A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard
Curran Associates, Inc., 22226–22246. https://proceedings.neurips.cc/paper_files/ Ghanem. 2024. Magic123: One Image to High-Quality 3D Object Generation Us-
paper/2023/file/4683beb6bab325650db13afd05d1a14a-Paper-Conference.pdf ing Both 2D and 3D Diffusion Priors. In The Twelfth International Conference on
R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick. 2023c. Learning Representations.
Zero-1-to-3: Zero-shot One Image to 3D Object. In 2023 IEEE/CVF International Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Wei-
Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, hao Yuan, Zilong Dong, Liefeng Bo, and Xiaoguang Han. 2024. RichDreamer:
USA, 9264–9275. https://doi.org/10.1109/ICCV51070.2023.00853 A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-
Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui 3D. In Proceedings of the IEEE/CVF conference on computer vision and pattern
Liu, Ziwei Liu, and Sergey Tulyakov. 2023b. HyperHuman: Hyper-Realistic Human recognition.
Generation with Latent Structural Diffusion. arXiv:2310.08579 [cs.CV] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sand-
Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and hini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.
Wenping Wang. 2024a. SyncDreamer: Generating Multiview-consistent Images 2021. Learning transferable visual models from natural language supervision. In
from a Single-view Image. In The Twelfth International Conference on Learning International conference on machine learning. PMLR, 8748–8763.
Representations. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Rad-
Zexiang Liu, Yangguang Li, Youtian Lin, Xin Yu, Sida Peng, Yan-Pei Cao, Xiaojuan ford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation.
Qi, Xiaoshui Huang, Ding Liang, and Wanli Ouyang. 2023a. UniDream: Unifying In Proceedings of the 38th International Conference on Machine Learning, ICML
Diffusion Priors for Relightable Text-to-3D Generation. arXiv:2312.08754 [cs.CV] 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research,
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8821–8831. http://
Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. proceedings.mlr.press/v139/ramesh21a.html
2024. Wonder3D: Single Image to 3D using Cross-Domain Diffusion. In Proceedings J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny.
of the IEEE/CVF conference on computer vision and pattern recognition. 2021. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D
Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. 2022. Category Reconstruction. In 2021 IEEE/CVF International Conference on Computer
SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views. Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 10881–10891. https:
In Computer Vision – ECCV 2022, Shai Avidan, Gabriel Brostow, Moustapha Cissé, //doi.org/10.1109/ICCV48922.2021.01072
Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer Nature Switzerland, Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis
Cham, 210–227. Williams. 2024. XCube ( X 3 ): Large-Scale 3D Generative Modeling using Sparse
Kleineberg Marian. 2021. mesh_to_sdf: Calculate signed distance fields for arbitrary Voxel Hierarchies. In Proceedings of the IEEE/CVF conference on computer vision
meshes. https://github.com/marian42/mesh_to_sdf. and pattern recognition.
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and An- Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. 2023. TEX-
dreas Geiger. 2019. Occupancy networks: Learning 3d reconstruction in function Ture: Text-Guided Texturing of 3D Shapes. In ACM SIGGRAPH 2023 Conference
space. In Proceedings of the IEEE/CVF conference on computer vision and pattern Proceedings (Los Angeles, CA, USA) (SIGGRAPH ’23). Association for Computing
recognition. 4460–4470. Machinery, New York, NY, USA, Article 54, 11 pages. https://doi.org/10.1145/
G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or. 2023. Latent- 3588432.3591503
NeRF for Shape-Guided Generation of 3D Shapes and Textures. In 2023 IEEE/CVF R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. 2022. High-Resolution
Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Image Synthesis with Latent Diffusion Models. In 2022 IEEE/CVF Conference on
Society, Los Alamitos, CA, USA, 12663–12673. https://doi.org/10.1109/CVPR52729. Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los
2023.01218 Alamitos, CA, USA, 10674–10685. https://doi.org/10.1109/CVPR52688.2022.01042
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ra- Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Den-
mamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields ton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim
for view synthesis. Commun. ACM 65, 1 (2021), 99–106. Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. 2022. Photo-
Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. 2020. Polygen: realistic Text-to-Image Diffusion Models with Deep Language Understanding.
An autoregressive generative model of 3d meshes. In International conference on In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed,
machine learning. PMLR, 7220–7229. A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates,
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Inc., 36479–36494. https://proceedings.neurips.cc/paper_files/paper/2022/file/
2022. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf
arXiv:2212.08751 [cs.CV] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim,
OpenAI. 2023. GPT-4V: Generative Pre-trained Transformer 4 for Vision. https: Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. 2024. Let 2D Diffusion Model Know
//www.openai.com/. 3D-Consistency for Robust Text-to-3D Generation. In The Twelfth International
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Conference on Learning Representations.
Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. 2021. Deep
Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang- Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Syn-
Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve thesis. In Advances in Neural Information Processing Systems, A. Beygelzimer,
Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. Y. Dauphin, P. Liang, and J. Wortman Vaughan (Eds.).
DINOv2: Learning Robust Visual Features without Supervision. Transactions on Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei,
Machine Learning Research (2024). Linghao Chen, Chong Zeng, and Hao Su. 2023. Zero123++: a Single Image to
J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. 2019. DeepSDF: Learning Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]
Continuous Signed Distance Functions for Shape Representation. In 2019 IEEE/CVF Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. 2024. MV-
Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Dream: Multi-view Diffusion for 3D Generation. In The Twelfth International
Society, Los Alamitos, CA, USA, 165–174. https://doi.org/10.1109/CVPR.2019.00025 Conference on Learning Representations.
Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti,
2020. Convolutional occupancy networks. In Computer Vision–ECCV 2020: 16th Vladislav Rosov, Angela Dai, and Matthias Nießner. 2024. MeshGPT: Generating
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Triangle Meshes with Decoder-Only Transformers. In Proceedings of the IEEE/CVF
Springer, 523–540. conference on computer vision and pattern recognition.
Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T Barron, Amit H Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filip-
Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, pos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, and
et al. 2023. State of the art on diffusion models for visual computing. arXiv preprint Yaniv Taigman. 2023. Text-To-4D Dynamic Scene Generation. In International
arXiv:2310.07204 (2023). Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii,
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma
Müller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving Latent Diffusion Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett
Models for High-Resolution Image Synthesis. arXiv:2307.01952 [cs.CV] (Eds.). PMLR, 31915–31929. https://proceedings.mlr.press/v202/singer23a.html

Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Language Model. arXiv:2311.17618 [cs.CV]
Liu. 2024. DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, and Fan Wang. 2023b.
Prior. In The Twelfth International Conference on Learning Representations. Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable
Jiapeng Tang, Xiaoguang Han, Junyi Pan, Kui Jia, and Xin Tong. 2019. A skeleton- Text-to-3D Generation. In Proceedings of the 31st ACM International Conference
bridged deep learning approach for generating meshes of complex topologies from on Multimedia (, Ottawa ON, Canada,) (MM ’23). Association for Computing Ma-
single rgb images. In Proceedings of the ieee/cvf conference on computer vision chinery, New York, NY, USA, 6841–6850. https://doi.org/10.1145/3581783.3612232
and pattern recognition. 4541–4550. X. Yu, M. Xu, Y. Zhang, H. Liu, C. Ye, Y. Wu, Z. Yan, C. Zhu, Z. Xiong, T. Liang, G. Chen, S.
Jiapeng Tang, Xiaoguang Han, Mingkui Tan, Xin Tong, and Kui Jia. 2021a. Skeletonnet: Cui, and X. Han. 2023a. MVImgNet: A Large-scale Dataset of Multi-view Images. In
A topology-preserving solution for learning mesh reconstruction of object surfaces 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
from rgb images. IEEE transactions on pattern analysis and machine intelligence IEEE Computer Society, Los Alamitos, CA, USA, 9150–9161. https://doi.org/10.
44, 10 (2021), 6454–6471. 1109/CVPR52729.2023.00883
Jiapeng Tang, Jiabao Lei, Dan Xu, Feiying Ma, Kui Jia, and Lei Zhang. 2021b. Sa-convonet: Biao Zhang, Jiapeng Tang, Matthias Nießner, and Peter Wonka. 2023c. 3DShape2VecSet:
Sign-agnostic optimization of convolutional occupancy networks. In Proceedings A 3D Shape Representation for Neural Fields and Generative Diffusion Models. ACM
of the IEEE/CVF International Conference on Computer Vision. 6504–6513. Trans. Graph. 42, 4, Article 92 (jul 2023), 16 pages. https://doi.org/10.1145/3592442
Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2024. DreamGaussian: Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye
Generative Gaussian Splatting for Efficient 3D Content Creation. In The Twelfth Shi, Sibei Yang, Lan Xu, and Jingyi Yu. 2023a. DreamFace: Progressive Generation
International Conference on Learning Representations. of Animatable 3D Faces under Text Guidance. ACM Trans. Graph. 42, 4, Article 138
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N (jul 2023), 16 pages. https://doi.org/10.1145/3592094
Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In L. Zhang, A. Rao, and M. Agrawala. 2023b. Adding Conditional Control to Text-to-
Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, Image Diffusion Models. In 2023 IEEE/CVF International Conference on Computer
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 3813–3824. https:
Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/ //doi.org/10.1109/ICCV51070.2023.00355
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Youjia Zhang, Junqing Yu, Zikai Song, and Wei Yang. 2023d. Optimized View and
Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Geometry Distillation from Multi-view Diffuser. arXiv:2312.06198 [cs.CV]
Wang. 2021a. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao
Multi-view Reconstruction. Advances in Neural Information Processing Systems Chen, Gang Yu, and Shenghua Gao. 2023. Michelangelo: Conditional 3d shape
34 (2021), 27171–27183. generation based on shape-image-text aligned latent representation. Advances in
Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping neural information processing systems (2023).
Wang, Zexiang Xu, and Kai Zhang. 2024. PF-LRM: Pose-Free Large Reconstruction Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung
Model for Joint Pose and Shape Prediction. In The Twelfth International Conference Shum. 2023. Locally Attentional SDF Diffusion for Controllable 3D Shape Generation.
on Learning Representations. ACM Trans. Graph. 42, 4, Article 91 (jul 2023), 13 pages. https://doi.org/10.1145/
Peng-Shuai Wang. 2022. mesh2sdf. https://github.com/wang-ps/mesh2sdf. Converts 3592103
an input mesh to a signed distance field (SDF). Junzhe Zhu, Peiye Zhuang, and Sanmi Koyejo. 2024. HIFA: High-fidelity Text-to-
Peng-Shuai Wang, Yang Liu, and Xin Tong. 2022. Dual octree graph networks for 3D Generation with Advanced Diffusion Guidance. In The Twelfth International
learning adaptive volumetric shape representations. ACM Trans. Graph. 41, 4, Conference on Learning Representations.
Article 103 (jul 2022), 15 pages. https://doi.org/10.1145/3528223.3530087 Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and
X. Wang, L. Xie, C. Dong, and Y. Shan. 2021b. Real-ESRGAN: Training Real-World Song-Hai Zhang. 2024. Triplane Meets Gaussian Splatting: Fast and Generalizable
Blind Super-Resolution with Pure Synthetic Data. In 2021 IEEE/CVF International Single-View 3D Reconstruction with Transformers. In Proceedings of the IEEE/CVF
Conference on Computer Vision Workshops (ICCVW). IEEE Computer Society, Los conference on computer vision and pattern recognition.
Alamitos, CA, USA, 1905–1914. https://doi.org/10.1109/ICCVW54120.2021.00217
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun
Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with
Variational Score Distillation. In Thirty-seventh Conference on Neural Information
Processing Systems.
Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo
Liu, and Errui Ding. 2024. Hd-fusion: Detailed text-to-3d generation leveraging
multiple noise estimation. In Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision. 3202–3211.
T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian,
D. Lin, and Z. Liu. 2023. OmniObject3D: Large-Vocabulary 3D Object Dataset for
Realistic Perception, Reconstruction and Generation. In 2023 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los
Alamitos, CA, USA, 803–814. https://doi.org/10.1109/CVPR52729.2023.00084
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai
Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization
in the transformer architecture. In International Conference on Machine Learning.
PMLR, 10524–10533.
Rui Xu, Zhiyang Dou, Ningna Wang, Shiqing Xin, Shuangmin Chen, Mingyan Jiang,
Xiaohu Guo, Wenping Wang, and Changhe Tu. 2023. Globally Consistent Normal
Orientation for Point Clouds by Regularizing the Winding-Number Field. ACM
Trans. Graph. 42, 4, Article 111 (jul 2023), 15 pages. https://doi.org/10.1145/3592129
Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan
Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. 2024. DMV3D: Denois-
ing Multi-view Diffusion Using 3D Large Reconstruction Model. In The Twelfth
International Conference on Learning Representations.
Le Xue, Ning Yu, Shu Zhang, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming
Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. 2023. ULIP-2: Towards
Scalable Multimodal Pre-training for 3D Understanding. arXiv:2305.08275 [cs.CV]
Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, and Yaron Lipman. 2024. Mosaic-
SDF for 3D Generative Models. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition.
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter:
Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models.
arXiv:2308.06721 [cs.CV]
Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li,
and Tao Chen. 2023. ShapeGPT: 3D Shape Generation with A Unified Multi-modal
