
3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models

BIAO ZHANG, KAUST, Saudi Arabia
JIAPENG TANG, TU Munich, Germany
MATTHIAS NIESSNER, TU Munich, Germany
PETER WONKA, KAUST, Saudi Arabia

arXiv:2301.11445v3 [cs.CV] 1 May 2023

[Fig. 1 panel labels: Input / Reconstruction pairs (left) and Condition / Generation pairs (right); example conditions include the category "car" and the text prompt "the tallest chair".]
Fig. 1. Left: Shape autoencoding results (surface reconstruction from point clouds). Right: various downstream applications of 3DShape2VecSet (from top to bottom): (a) category-conditioned generation; (b) point-cloud-conditioned generation (shape completion from partial point clouds); (c) image-conditioned generation (shape reconstruction from single-view images); (d) text-conditioned generation.

We introduce 3DShape2VecSet, a novel shape representation for neural fields designed for generative diffusion models. Our shape representation can encode 3D shapes given as surface models or point clouds, and represents them as neural fields. The concept of neural fields has previously been combined with a global latent vector, a regular grid of latent vectors, or an irregular grid of latent vectors. Our new representation encodes neural fields on top of a set of vectors. We draw from multiple concepts, such as the radial basis function representation and the cross attention and self-attention function, to design a learnable representation that is especially suitable for processing with transformers. Our results show improved performance in 3D shape encoding and 3D shape generative modeling tasks. We demonstrate a wide variety of generative applications: unconditioned generation, category-conditioned generation, text-conditioned generation, point-cloud completion, and image-conditioned generation. Code: https://1zb.github.io/3DShape2VecSet/.

Additional Key Words and Phrases: 3D Shape Generation, 3D Shape Representation, Diffusion Models, Shape Reconstruction, Generative Models

Authors' addresses: Biao Zhang, KAUST, Saudi Arabia, [email protected]; Jiapeng Tang, TU Munich, Germany, [email protected]; Matthias Nießner, TU Munich, Germany, [email protected]; Peter Wonka, KAUST, Saudi Arabia, peter.wonka@kaust.edu.sa.

1 INTRODUCTION
The ability to generate realistic and diverse 3D content has many potential applications, including computer graphics, gaming, and virtual reality. To this end, many generative models have been explored, e.g., generative adversarial networks, variational autoencoders, normalizing flows, and autoregressive models. Recently, diffusion models have emerged as one of the most popular methods, with fantastic results in the 2D image domain [Ho et al. 2020; Rombach et al. 2022], and have shown their superiority over other generative methods. For instance, it is possible to do unconditional generation [Karras et al. 2022; Rombach et al. 2022], text-conditioned generation [Rombach et al. 2022; Saharia et al. 2022], and generative image inpainting [Lugmayr et al. 2022]. However, the success in the 2D domain has not yet been matched in the 3D domain.

In this work, we will study diffusion models for 3D shape generation. One major challenge in adapting 2D diffusion models to 3D is the design of a suitable shape representation. The design of such a shape representation is the major focus of our work, and we will discuss several design choices that lead to the development of our proposed representation.
Different from 2D images, there are several predominant ways to represent 3D data, e.g., voxels, point clouds, meshes, and neural fields. In general, we believe that surface-based representations are more suitable for downstream applications than point clouds. Among the available choices, we choose to build on neural fields as they have many advantages: they are continuous, represent complete surfaces and not only point samples, and they enable many interesting combinations of traditional data structure design and representation learning using neural networks.

Two major approaches for 2D diffusion models are to either use a compressed latent space, e.g., latent diffusion [Rombach et al. 2022], or to use a sequence of diffusion models of increasing resolution, e.g., [Ramesh 2022; Saharia et al. 2022]. While both of these approaches seem viable in 3D, our initial experiments indicated that it is much easier to work with a compressed latent space. We therefore follow the latent diffusion approach.

A subsequent design choice for a latent diffusion approach is to decide between a learned representation or a manually designed representation. A manually designed representation such as wavelets [Hui et al. 2022] is easier to design and more lightweight, but in many contexts learned representations have been shown to outperform manually designed ones. We therefore opt to explore learned representations. This requires a two-stage training strategy. The first stage is an autoencoder (variational autoencoder) to encode 3D shapes into a latent space. The second stage is training a diffusion model in the learned latent space.

In the case of training diffusion models for 3D neural fields, it is even more necessary to generate in latent space. First, diffusion models often work with data of fixed size (e.g., images of a given fixed resolution). Second, a neural field is a continuous real-valued function that can be seen as an infinite-dimensional vector. For both reasons, we decide to find a way to encode shapes into latent space before all else (as well as a decoding method for reverting latents back to shapes).

Finally, we have to design a suitable learned neural field representation that provides a good trade-off between compression and reconstruction quality. Such a design typically requires three components: a spatial data structure to store the latent information, a spatial interpolation method, and a neural network architecture. There are multiple options proposed in the literature, shown in Fig. 2. Early methods used a single global latent vector in combination with an MLP network [Mescheder et al. 2019; Park et al. 2019]. This concept is simple and fast but generally struggles to reconstruct high-quality shapes. Better shape details can be achieved by using a 3D regular grid of latents [Peng et al. 2020] together with tri-linear interpolation and an MLP. However, such a representation is too large for generative models and it is only possible to use grids of very low resolution (e.g., 8×8×8). By introducing sparsity, e.g., [Yan et al. 2022; Zhang et al. 2022], latents are arranged in an irregular grid. The latent size is largely reduced, but there is still a lot of room for improvement, which we capitalize on in the design of 3DShape2VecSet.

The design of 3DShape2VecSet combines ideas from neural fields, radial basis functions, and the network architecture of attention layers. Similar to the radial basis function representation for continuous functions, we can also re-write existing methods in a similar form (linear combination). Inspired by cross attention in the transformer network [Vaswani et al. 2017], we derived the proposed latent representation, which is a fixed-size set of latent vectors. There are two main reasons that we believe contribute to the success of the representation. First, the representation is well-suited for use with transformer-based networks. As transformer-based networks tend to outperform current alternatives, we can better benefit from this network architecture. Instead of only using MLPs to process latent information, we use a linear layer and cross-attention. Second, the representation no longer uses explicitly designed positional features, but only gives the network the option to encode positional information in any form it considers suitable. This is in line with our design principle of favoring learned representations over manually designed ones. See Fig. 2 (e) for the proposed latent representation.

Using our novel shape representation, we can train diffusion models in the learned 3D shape latent space. Our results demonstrate an improved shape encoding quality and generation quality compared to the current state of the art. While pioneering work in 3D shape generation using diffusion models already showed unconditional 3D shape generation, we show multiple novel applications of 3D diffusion models: category-conditioned generation, text-conditioned shape generation, shape reconstruction from single-view images, and shape reconstruction from partial point clouds.

To sum up, our contributions are as follows:
(1) We propose a new representation for 3D shapes. Any shape can be represented by a fixed-length array of latents and processed with cross-attention and linear layers to yield a neural field.
(2) We propose a new network architecture to process shapes in the proposed representation, including a building block to aggregate information from a large point cloud using cross-attention.
(3) We improve the state of the art in 3D shape autoencoding to yield high-fidelity reconstructions including local details.
(4) We propose a latent set diffusion framework that improves the state of the art in 3D shape generation as measured by FID, KID, FPD, and KPD.
(5) We show 3D shape diffusion for category-conditioned generation, text-conditioned generation, point-cloud completion, and image-conditioned generation.

2 RELATED WORK
In this section, we briefly review the literature on 3D shape learning with various data representations and 3D shape generative models.

2.1 3D Shape Representations
We mainly discuss the following representations for 3D shapes: voxels, point clouds, and neural fields.

[Fig. 2 panel labels: (a) RBF, (b) Global Latent, (c) Latent Grid, (d) Irregular Latent Grid, (e) Latent Set (Ours); panel (e) uses the similarity \phi(x, f_i) = \exp(q(x)^\top k(f_i) / \sqrt{d}).]

Fig. 2. Continuous function representations. Scalars are represented with spheres while vectors are cubes. The arrows show how spatial interpolation is computed. x_i and x are the coordinates of an anchor and a querying point, respectively. \lambda_i is the SDF value of the anchor point x_i in (a). f_i is the associated feature vector located at x_i in (c)(d). The queried SDF/feature of x is based on the distance function \phi(x, x_i) in (a)(c)(d), while our proposed latent set representation (e) utilizes the similarity \phi(x, f_i) between the querying coordinate and the anchored features via a cross attention mechanism.

Table 1. Neural fields for 3D shapes. We categorize methods according to the position of the latents.

# Latents | Latent Position | Methods
Single    | Global          | OccNet [Mescheder et al. 2019], DeepSDF [Park et al. 2019], IM-Net [Chen and Zhang 2019]
Multiple  | Regular Grid    | ConvOccNet [Peng et al. 2020], IF-Net [Chibane et al. 2020], LIG [Jiang et al. 2020], DeepLS [Chabra et al. 2020], SA-ConvOccNet [Tang et al. 2021], NKF [Williams et al. 2022]
Multiple  | Irregular Grid  | LDIF [Genova et al. 2020], Point2Surf [Erler et al. 2020], DCC-DIF [Li et al. 2022], 3DILG [Zhang et al. 2022], POCO [Boulch and Marlet 2022]
Multiple  | Global          | Ours

Table 2. Generative models for 3D shapes.

Method                                  | Generative Model | 3D Representation
3D-GAN [Wu et al. 2016]                 | GAN              | Voxels
l-GAN [Achlioptas et al. 2018]          | GAN★             | Point Clouds
IM-GAN [Chen and Zhang 2019]            | GAN★             | Fields
PointFlow [Yang et al. 2019]            | NF               | Point Clouds
GenVoxelNet [Xie et al. 2020]           | EBM              | Voxels
PointGrow [Sun et al. 2020]             | AR               | Point Clouds
PolyGen [Nash et al. 2020]              | AR               | Meshes
GenPointNet [Xie et al. 2021]           | EBM              | Point Clouds
3DShapeGen [Ibing et al. 2021]          | GAN★             | Fields
DPM [Luo and Hu 2021]                   | DM               | Point Clouds
PVD [Zhou et al. 2021]                  | DM               | Point Clouds
AutoSDF [Mittal et al. 2022]            | AR★              | Voxels
CanMap [Cheng et al. 2022]              | AR★              | Point Clouds
ShapeFormer [Yan et al. 2022]           | AR★              | Fields
3DILG [Zhang et al. 2022]               | AR★              | Fields
LION [Zeng et al. 2022]                 | DM★              | Point Clouds
SDF-StyleGAN [Zheng et al. 2022]        | GAN              | Fields
NeuralWavelet [Hui et al. 2022]         | DM★              | Fields
TriplaneDiffusion [Shue et al. 2022]⋄   | DM★              | Fields
DiffusionSDF [Chou et al. 2022]⋄        | DM★              | Fields
Ours                                    | DM★              | Fields

★ Generative models in latent space. ⋄ Works in submission.

Voxels. Voxel grids, extended from 2D pixel grids, simply represent a 3D shape as a discrete volumetric grid. Due to their regular structure, early works take advantage of 3D transposed convolution operators for shape prediction [Brock et al. 2016; Choy et al. 2016; Dai et al. 2017; Girdhar et al. 2016; Wu et al. 2016, 2015]. A drawback of voxel-based decoders is that the computational and memory costs of neural networks increase cubically with respect to the grid resolution. Thus, most voxel-based methods are limited to low resolutions. Octree-based decoders [Häne et al. 2017; Meagher 1980; Riegler et al. 2017b,a; Tatarchenko et al. 2017; Wang et al. 2017, 2018] and sparse hash-based decoders [Dai et al. 2020] take 3D space sparsity into account, alleviating the efficiency issues and supporting high-resolution outputs.

Point Clouds. Early works on neural-network-based point cloud processing include PointNet [Qi et al. 2017a,b] and DGCNN [Wang et al. 2019]. These works are built upon per-point fully connected layers. More recently, transformers [Vaswani et al. 2017] were proposed for point cloud processing, e.g., [Guo et al. 2021; Zhang et al. 2022; Zhao et al. 2021]. These works are inspired by Vision Transformers (ViT) [Dosovitskiy et al. 2021] in the image domain. Points are first grouped into patches to form tokens and then fed into a transformer with self-attention. In this work, we also introduce a network for processing point clouds. Improving upon previous works, we compress a given point cloud to a small representation that is more suitable for generative modeling.

Neural Fields. A recent trend is to use neural fields as a 3D data representation. The key building block is a neural network which accepts a 3D coordinate as input, and outputs a scalar [Chen and Zhang 2019; Mescheder et al. 2019; Michalkiewicz et al. 2019; Park et al. 2019] or a vector [Chan et al. 2022; Mildenhall et al. 2020]. A 3D object is then implicitly defined by this neural network. Neural fields have gained a lot of popularity as they can generate objects with arbitrary topologies and infinite resolution. The methods are also called neural implicit representations or coordinate-based networks. For neural fields for 3D shape modeling, we can categorize methods into global methods and local methods. 1) The global methods encode a shape with a single global latent vector [Mescheder et al. 2019; Park et al. 2019]. Usually the capacity of these kinds of methods is limited and they are unable to encode shape details.
2) The local methods use localized latent vectors which are defined for 3D positions on either a regular grid [Chibane et al. 2020; Jiang et al. 2020; Peng et al. 2020; Tang et al. 2021] or an irregular grid [Boulch and Marlet 2022; Genova et al. 2020; Li et al. 2022; Zhang et al. 2022]. In contrast, we propose a latent representation where latent vectors do not have associated 3D positions. Instead, we learn to represent a shape as a list of latent vectors. See Tab. 1.

2.2 Generative Models
We have seen great success in different 2D image generative models in the past decade. Popular deep generative methods include generative adversarial networks (GANs) [Goodfellow et al. 2014], variational autoencoders (VAEs) [Kingma and Welling 2014], normalizing flows (NFs) [Rezende and Mohamed 2015], energy-based models [LeCun et al. 2006; Xie et al. 2016], autoregressive models (ARs) [Esser et al. 2021; Van Den Oord et al. 2017] and, more recently, diffusion models (DMs) [Ho et al. 2020], which are the chosen generative model in our work.

In the 3D domain, GANs have been popular for 3D generation [Achlioptas et al. 2018; Chen and Zhang 2019; Ibing et al. 2021; Wu et al. 2016; Zheng et al. 2022], while only a few works use NFs [Yang et al. 2019] and VAEs [Mo et al. 2019]. A lot of recent work employs ARs [Cheng et al. 2022; Mittal et al. 2022; Nash et al. 2020; Sun et al. 2020; Yan et al. 2022; Zhang et al. 2022]. DMs for 3D shapes are relatively unexplored compared to other generative methods. There are several DMs dealing with point cloud data [Luo and Hu 2021; Zeng et al. 2022; Zhou et al. 2021]. Due to the high degree of freedom of the regressed coordinates, it is always difficult to obtain clean manifold surfaces via post-processing. As mentioned before, we believe that neural fields are generally more suitable than point clouds for 3D shape generation. The area of combining DMs and neural fields is still underexplored.

DreamFusion [Poole et al. 2022] explores how to extract 3D information from a pretrained 2D image diffusion model. The recent NeuralWavelet [Hui et al. 2022] first encodes shapes (represented as signed distance fields) into the frequency domain with the wavelet transform, and then trains DMs on the frequency coefficients. While this formulation is elegant, generative models generally work better on learned representations. Some concurrent works [Chou et al. 2022; Shue et al. 2022] in submission also utilize DMs in a latent space for neural field generation. TriplaneDiffusion [Shue et al. 2022] first trains an autodecoder for each shape. DiffusionSDF [Chou et al. 2022] runs a shape autoencoder based on triplane features [Peng et al. 2020].

Summary of 3D generation methods. We list several 3D generation methods in Tab. 2, highlighting the choice of generative model (GAN, DM, EBM, NF, or AR) and the choice of data structure to represent 3D shapes (point clouds, meshes, voxels, or fields).

3 PRELIMINARIES
An attention layer [Vaswani et al. 2017] has three types of inputs: queries, keys, and values. Queries Q = [q_1, q_2, \ldots, q_{N_q}] \in \mathbb{R}^{d \times N_q} and keys K = [k_1, k_2, \ldots, k_{N_k}] \in \mathbb{R}^{d \times N_k} are first compared to produce coefficients q_j^\top k_i / \sqrt{d} (they need to be normalized with the softmax function),

A_{i,j} = \frac{\exp(q_j^\top k_i / \sqrt{d})}{\sum_{i=1}^{N_k} \exp(q_j^\top k_i / \sqrt{d})}.    (1)

The coefficients are then used to (linearly) combine values V = [v_1, v_2, \ldots, v_{N_k}] \in \mathbb{R}^{d_v \times N_k}. We can write the output of an attention layer as follows,

\mathrm{Attention}(Q, K, V) = [\, o_1 \; o_2 \; \cdots \; o_{N_q} \,] = \Big[\, \sum_{i=1}^{N_k} A_{i,1} v_i \;\; \sum_{i=1}^{N_k} A_{i,2} v_i \;\; \cdots \;\; \sum_{i=1}^{N_k} A_{i,N_q} v_i \,\Big] \in \mathbb{R}^{d_v \times N_q}.    (2)

Cross Attention. Given two sets A = [a_1, a_2, \ldots, a_{N_a}] \in \mathbb{R}^{d_a \times N_a} and B = [b_1, b_2, \ldots, b_{N_b}] \in \mathbb{R}^{d_b \times N_b}, the query vectors Q are constructed with a linear function q(\cdot): \mathbb{R}^{d_a} \to \mathbb{R}^{d} by taking elements of A as input. Similarly, we construct K and V with k(\cdot): \mathbb{R}^{d_b} \to \mathbb{R}^{d} and v(\cdot): \mathbb{R}^{d_b} \to \mathbb{R}^{d}, respectively. The inputs of both k(\cdot) and v(\cdot) are from B. Each column in the output of Eq. (2) can be written as,

o(a_j, B) = \sum_{i=1}^{N_b} v(b_i) \cdot \frac{1}{Z(a_j, B)} \exp\big(q(a_j)^\top k(b_i) / \sqrt{d}\big),    (3)

where Z(a_j, B) = \sum_{i=1}^{N_b} \exp\big(q(a_j)^\top k(b_i) / \sqrt{d}\big) is a normalizing factor. The cross attention operator between two sets is,

\mathrm{CrossAttn}(A, B) = [\, o(a_1, B) \;\; o(a_2, B) \;\; \cdots \;\; o(a_{N_a}, B) \,] \in \mathbb{R}^{d \times N_a}.    (4)

Self Attention. In the case of self attention, we let the two sets be the same, A = B:

\mathrm{SelfAttn}(A) = \mathrm{CrossAttn}(A, A).    (5)
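For concreteness, here is a minimal NumPy sketch of the operators in Eqs. (1)-(5). It is an illustration only: the projection matrices Wq, Wk, Wv stand in for the learned linear maps q(·), k(·), v(·), and multi-head attention, biases, and output projections are omitted.

```python
import numpy as np

def cross_attention(A, B, Wq, Wk, Wv):
    """CrossAttn(A, B) of Eq. (4). A: (d_a, N_a), B: (d_b, N_b); columns are set elements."""
    d = Wq.shape[0]
    Q = Wq @ A                                     # q(a_j), shape (d, N_a)
    K = Wk @ B                                     # k(b_i), shape (d, N_b)
    V = Wv @ B                                     # v(b_i), shape (d, N_b)
    logits = (Q.T @ K) / np.sqrt(d)                # q(a_j)^T k(b_i) / sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability (softmax is shift-invariant)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # normalization 1/Z(a_j, B), Eqs. (1)/(3)
    return (weights @ V.T).T                       # output columns o(a_j, B), shape (d, N_a)

def self_attention(A, Wq, Wk, Wv):
    """SelfAttn(A) = CrossAttn(A, A), Eq. (5)."""
    return cross_attention(A, A, Wq, Wk, Wv)

# toy usage with random matrices standing in for trained projections
rng = np.random.default_rng(0)
A = rng.normal(size=(16, 4))     # N_a = 4 queries of dimension d_a = 16
B = rng.normal(size=(32, 10))    # N_b = 10 elements of dimension d_b = 32
Wq, Wk, Wv = rng.normal(size=(8, 16)), rng.normal(size=(8, 32)), rng.normal(size=(8, 32))
out = cross_attention(A, B, Wq, Wk, Wv)   # shape (8, 4)
```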
4 LATENT REPRESENTATION FOR NEURAL FIELDS
Our representation is inspired by radial basis functions (RBFs). We will therefore describe our surface representation design using RBFs as a starting point, and how we extend them using concepts from neural fields and the transformer architecture. A continuous function can be represented with a set of weighted points in 3D using RBFs:

\hat{O}_{RBF}(x) = \sum_{i=1}^{M} \lambda_i \cdot \phi(x, x_i),    (6)

where \phi(x, x_i) is a radial basis function (RBF) and typically represents the similarity (or dissimilarity) between two inputs,

\phi(x, x_i) = \phi(\lVert x - x_i \rVert).    (7)

Given ground-truth occupancies of x_i, the values of \lambda_i can be obtained by solving a system of linear equations. In this way, we can represent the continuous function O(\cdot) as a set of M points including their corresponding weights,

\{\lambda_i \in \mathbb{R},\; x_i \in \mathbb{R}^3\}_{i=1}^{M}.    (8)
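As an illustration of the classical RBF fitting step described above, the following sketch solves the linear system for the weights λ_i of Eq. (6); the Gaussian kernel and the bandwidth eps are assumptions made only for this example, since φ is left generic in Eq. (7).

```python
import numpy as np

def gaussian_rbf(r, eps=2.0):
    # one possible choice of phi(||x - x_i||); Eq. (7) leaves phi generic
    return np.exp(-(eps * r) ** 2)

def fit_rbf_weights(anchors, values, eps=2.0):
    """Solve Phi @ lam = values for the weights lambda_i of Eq. (6)/(8)."""
    diff = anchors[:, None, :] - anchors[None, :, :]           # (M, M, 3)
    Phi = gaussian_rbf(np.linalg.norm(diff, axis=-1), eps)     # pairwise phi(x_i, x_j)
    return np.linalg.solve(Phi, values)

def eval_rbf(x, anchors, lam, eps=2.0):
    """O_RBF(x) = sum_i lambda_i * phi(x, x_i), Eq. (6)."""
    r = np.linalg.norm(anchors - x[None, :], axis=-1)
    return gaussian_rbf(r, eps) @ lam

# toy example: M anchor points carrying occupancy-like values of a sphere
rng = np.random.default_rng(0)
anchors = rng.uniform(-1.0, 1.0, size=(64, 3))
values = (np.linalg.norm(anchors, axis=-1) < 0.5).astype(float)
lam = fit_rbf_weights(anchors, values)
occ_at_origin = eval_rbf(np.zeros(3), anchors, lam)   # interpolated value at the query point
```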
[Fig. 3 pipeline diagram: point cloud → positional embeddings → cross attention → latent set → KL regularization → self-attention blocks → cross attention with query-point embeddings → occupancy / isosurface.]

Fig. 3. Shape autoencoding pipeline. Given a 3D ground-truth surface mesh as the input, we first sample a point cloud that is mapped to positional embeddings and encode them into a set of latent codes through a cross-attention module (Sec. 5.1). Next, we perform (optional) compression and KL-regularization in the latent space to obtain structured and compact latent shape representations (Sec. 5.2). Finally, self-attention is carried out to aggregate and exchange the information within the latent set, and a cross-attention module is designed to calculate the interpolation weights of query points. The interpolated feature vectors are fed into a fully connected layer for occupancy prediction (Sec. 5.3).

However, in order to retain the details of a 3D shape, we often need a very large number of points (e.g., M = 80,000 in [Carr et al. 2001]). This representation does not benefit from recent advances in representation learning and cannot compete with more compact learned representations. We therefore want to modify the representation to change it into a neural field.

One approach to neural fields is to represent each shape as a separate neural network (making the network weights of a fixed-size network the representation of a shape) and train a diffusion process as a hypernetwork. A second approach is to have a shared encoder-decoder network for all shapes and represent each shape as a latent computed by the encoder. We opt for the second approach, as it leads to more compact representations because it is jointly learned from all shapes in the data set and the network weights themselves do not count towards the latent representation. Such a neural field takes a tuple of coordinates x and a C-dimensional latent f as input and outputs occupancy,

\hat{O}_{NN}(x) = NN(x, f),    (9)

where NN: \mathbb{R}^3 \times \mathbb{R}^C \to [0, 1] is a neural network. A first approach was to use a single global latent f, but a major limitation is the ability to encode shape details [Mescheder et al. 2019]. Some follow-up works study coordinate-dependent latents [Chibane et al. 2020; Peng et al. 2020; Sajjadi et al. 2022] that combine traditional data structures such as regular grids with the neural field concept. Latent vectors are arranged in a spatial data structure and then interpolated (trilinearly) to obtain the coordinate-dependent latent f_x. A recent work, 3DILG [Zhang et al. 2022], proposed a sparse representation for 3D shapes, using latents f_i arranged in an irregular grid at point locations x_i. The final coordinate-dependent latent f_x is then estimated by kernel regression,

f_x = \hat{F}_{KN}(x) = \sum_{i=1}^{M} f_i \cdot \frac{1}{Z(x, \{x_i\}_{i=1}^{M})} \phi(x, x_i),    (10)

where Z(x, \{x_i\}_{i=1}^{M}) = \sum_{i=1}^{M} \phi(x, x_i) is a normalizing factor. Thus the representation for a 3D shape can be written as

\{f_i \in \mathbb{R}^C,\; x_i \in \mathbb{R}^3\}_{i=1}^{M}.    (11)

After that, an MLP: \mathbb{R}^C \to [0, 1] is applied to project the approximated feature \hat{F}_{KN}(x) to occupancy,

\hat{O}_{3DILG}(x) = MLP(\hat{F}_{KN}(x)).    (12)

Neural networks with latent sets (proposed). We initially explored many variations for 3D shape representation based on irregular and regular grids as well as tri-planes, frequency compositions, and other factored representations. Ultimately, we could not improve on existing irregular grids. However, we were able to achieve a significant improvement with the following change. We aim to keep the structure of an irregular grid and the interpolation, but without representing the actual spatial position explicitly. We let the network encode spatial information. Both representations (the RBF in Eq. (6) and 3DILG in Eq. (10)) are composed of two parts, values and similarities. We keep the structure of the interpolation, but eliminate explicit point coordinates and integrate cross attention from Eq. (3). The result is the following learnable function approximator,

\hat{F}(x) = \sum_{i=1}^{M} v(f_i) \cdot \frac{1}{Z(x, \{f_i\}_{i=1}^{M})} \exp\big(q(x)^\top k(f_i) / \sqrt{d}\big),    (13)

where Z(x, \{f_i\}_{i=1}^{M}) = \sum_{i=1}^{M} \exp\big(q(x)^\top k(f_i) / \sqrt{d}\big) is a normalizing factor. Similar to the MLP in Eq. (12), we apply a single fully connected layer to get the desired occupancy values,

\hat{O}(x) = FC(\hat{F}(x)).    (14)

Compared to 3DILG and all other coordinate-latent-based methods, we dropped the dependency on the coordinate set \{x_i\}_{i=1}^{M}; the new representation only contains a set of latents,

\{f_i \in \mathbb{R}^C\}_{i=1}^{M}.    (15)

An alternative view of our proposed function approximator is to see it as cross attention between query points x and a set of latents.
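To make Eqs. (13)-(15) concrete, the sketch below evaluates the latent-set field at a single query point. The projection matrices and the sigmoid on the output are illustrative stand-ins for the learned q, k, v maps and the FC head; they are not the exact layers of our network.

```python
import numpy as np

def latent_set_field(x, latents, Wq, Wk, Wv, w_fc, b_fc):
    """Occupancy at query x from a latent set (Eqs. (13)/(14)).

    x: (3,) query coordinate; latents: (M, C) set of latent vectors f_i."""
    d = Wq.shape[0]
    q = Wq @ x                                  # q(x); a learned embedding in the real network
    keys = latents @ Wk.T                       # k(f_i), shape (M, d)
    vals = latents @ Wv.T                       # v(f_i), shape (M, d)
    logits = keys @ q / np.sqrt(d)              # q(x)^T k(f_i) / sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                # 1/Z(x, {f_i}) normalization of Eq. (13)
    feat = w @ vals                             # interpolated feature F_hat(x)
    logit = w_fc @ feat + b_fc                  # FC head of Eq. (14)
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid keeps the output in [0, 1]

# toy usage; all parameters are random placeholders rather than trained weights
rng = np.random.default_rng(0)
M, C, d = 512, 512, 64
latents = rng.normal(size=(M, C)) * 0.1
Wq = rng.normal(size=(d, 3))
Wk, Wv = rng.normal(size=(d, C)) * 0.05, rng.normal(size=(d, C)) * 0.05
occ = latent_set_field(np.array([0.1, -0.2, 0.3]), latents, Wq, Wk, Wv, rng.normal(size=d), 0.0)
```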

5 NETWORK ARCHITECTURE FOR SHAPE REPRESENTATION LEARNING
In this section, we will discuss how we design a variational autoencoder based on the latent representation proposed in Sec. 4. The architecture has three components discussed in the following: a 3D shape encoder, a KL regularization block, and a 3D shape decoder.

5.1 Shape encoding
We sample the surfaces of the 3D input shapes in a 3D shape dataset. This results in a point cloud of size N for each shape, \{x_i \in \mathbb{R}^3\}_{i=1}^{N}, or in matrix form X \in \mathbb{R}^{3 \times N}. While the dataset used in the paper originally represents shapes as triangle meshes, our framework is directly compatible with other surface representations, such as scanned point clouds, spline surfaces, or implicit surfaces.

In order to learn representations in the form of Eq. (15), the first challenge is to aggregate the information contained in a possibly large point cloud \{x_i\}_{i=1}^{N} into a smaller set of latent vectors \{f_i\}_{i=1}^{M}. We design a set-to-set network to this effect. A popular solution to this problem in previous work is to divide the large point cloud into a smaller set of patches and to learn one latent vector per patch. Although this is a very well researched and standard component in many networks, we discovered a more successful way to aggregate features from a large point cloud that is better compatible with the transformer architecture. We considered two options.

One way is to define a learnable query set. Inspired by DETR [Carion et al. 2020] and Perceiver [Jaegle et al. 2021], we use cross attention to encode X,

Enc_{learnable}(X) = CrossAttn(L, PosEmb(X)) \in \mathbb{R}^{C \times M},    (16)

where L \in \mathbb{R}^{C \times M} is a learnable query set where each entry is C-dimensional, and PosEmb: \mathbb{R}^3 \to \mathbb{R}^C is a column-wise positional embedding function.

Another way is to utilize the point cloud itself. We first subsample the point cloud X to a smaller one with furthest point sampling, X_0 = FPS(X) \in \mathbb{R}^{3 \times M}. The cross attention is applied to X_0 and X,

Enc_{points}(X) = CrossAttn(PosEmb(X_0), PosEmb(X)),    (17)

which can also be seen as a "partial" self attention. See Fig. 4 for an illustration of both design choices. Intuitively, the number M affects the reconstruction performance: the larger M is, the better the reconstruction. However, M strongly affects the training time due to the transformer architecture, so it should not be too large. In our final model, the number of latents M is set to 512, and the number of channels C is 512, to provide a trade-off between reconstruction quality and training time.

Fig. 4. Two ways to encode a point cloud. (a) uses a learnable query set; (b) uses a downsampled version of the input point embeddings as the query set.
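Eq. (17) relies on furthest point sampling (FPS) to choose the M point queries. A simple greedy version is sketched below; it is assumed behavior given for illustration, not the exact implementation used in our experiments.

```python
import numpy as np

def furthest_point_sampling(points, M, seed=0):
    """Greedy FPS: select M points, each pick being furthest from those already chosen
    (the X0 = FPS(X) step of Eq. (17))."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(points.shape[0]))]           # arbitrary first point
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(M - 1):
        idx = int(dist.argmax())                               # furthest from the current set
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[selected]                                    # (M, 3) subsampled cloud

X = np.random.default_rng(1).uniform(-1.0, 1.0, size=(2048, 3))   # input point cloud
X0 = furthest_point_sampling(X, 512)                               # point queries for Eq. (17)
```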
5.2 KL regularization block
Latent diffusion [Rombach et al. 2022] proposed to use a variational autoencoder (VAE) [Kingma and Welling 2014] to compress images. We adapt this design idea for our 3D shape representation and also regularize the latents with a KL-divergence. We should note that the KL regularization is optional and only necessary for the second-stage diffusion model training. If we just want a method for surface reconstruction from point clouds, we do not need the KL regularization.

We first linearly project the latents to means and variances with two network branches, respectively,

FC_\mu(f_i) = \{\mu_{i,j}\}_{j \in [1, 2, \cdots, C_0]}, \quad FC_\sigma(f_i) = \{\log \sigma_{i,j}^2\}_{j \in [1, 2, \cdots, C_0]},    (18)

where FC_\mu: \mathbb{R}^C \to \mathbb{R}^{C_0} and FC_\sigma: \mathbb{R}^C \to \mathbb{R}^{C_0} are two linear projection layers. We use a different size of output channels C_0, where C_0 \ll C. This compression enables us to train diffusion models on smaller latents of total size M \cdot C_0 \ll M \cdot C. We can write the bottleneck of the VAE formally: \forall i \in [1, 2, \cdots, M], j \in [1, 2, \cdots, C_0],

z_{i,j} = \mu_{i,j} + \sigma_{i,j} \cdot \epsilon,    (19)

where \epsilon \sim \mathcal{N}(0, 1). The KL regularization can be written as,

L_{reg}(\{f_i\}_{i=1}^{M}) = \frac{1}{M \cdot C_0} \sum_{i=1}^{M} \sum_{j=1}^{C_0} \frac{1}{2}\big(\mu_{i,j}^2 + \sigma_{i,j}^2 - \log \sigma_{i,j}^2\big).    (20)

In practice, we set the weight of the KL loss to 0.001 and report the performance for different values of C_0 in Sec. 8.1. Our recommended setting is C_0 = 32.
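The bottleneck of Eqs. (18)-(20) can be sketched as follows; the projection weights below are placeholders standing in for the learned FC_mu and FC_sigma layers.

```python
import numpy as np

def kl_bottleneck(f, W_mu, b_mu, W_sig, b_sig, rng):
    """f: (M, C) latent set. Returns sampled z of shape (M, C0) and the KL loss of Eq. (20)."""
    mu = f @ W_mu.T + b_mu                           # FC_mu, Eq. (18)
    log_var = f @ W_sig.T + b_sig                    # FC_sigma predicts log sigma^2, Eq. (18)
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal(mu.shape)   # reparametrization, Eq. (19)
    kl = 0.5 * (mu ** 2 + sigma ** 2 - log_var)      # per-entry KL term of Eq. (20)
    return z, kl.mean()                              # mean over the M * C0 entries

rng = np.random.default_rng(0)
M, C, C0 = 512, 512, 32
f = rng.normal(size=(M, C))
W_mu, b_mu = rng.normal(size=(C0, C)) * 0.01, np.zeros(C0)
W_sig, b_sig = rng.normal(size=(C0, C)) * 0.01, np.zeros(C0)
z, kl_loss = kl_bottleneck(f, W_mu, b_mu, W_sig, b_sig, rng)
loss_contribution = 0.001 * kl_loss                  # KL weight used in practice (Sec. 5.2)
```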
5.3 Shape decoding
To increase the expressivity of the network, we add a latent learning network between the two parts. Because our latents are a set of vectors, it is natural to use transformer networks here. Thus, the proposed network here is a series of self attention blocks,

\{f_i\}_{i=1}^{M} \leftarrow SelfAttn^{(l)}(\{f_i\}_{i=1}^{M}), \quad \text{for } l = 1, \cdots, L.    (21)

The SelfAttn(\cdot) with a superscript (l) here means the l-th block. The latents \{f_i\}_{i=1}^{M} obtained using either Eq. (16) or Eq. (17) are fed into the self attention blocks. Given a query x, the corresponding latent is interpolated using Eq. (13), and the occupancy is obtained with a fully connected layer as shown in Eq. (14).
Encpoints (X) = CrossAttn(PosEmb(X0 ), PosEmb(X)), (17) fully connected layer as shown in Eq. (14).

Fig. 5. KL regularization. Given a set of latents \{f_i \in \mathbb{R}^C\}_{i=1}^{M} obtained from the shape encoding in Sec. 5.1, we employ two linear projection layers FC_\mu, FC_\sigma to predict the mean and variance of a low-dimensional latent space, where a KL regularization commonly used in VAE training is applied to constrain the feature diversity. Then, we obtain smaller latents \{z_i \in \mathbb{R}^{C_0}\} of size M \cdot C_0 \ll M \cdot C via reparametrization sampling. Finally, the compressed latents are mapped back to the original space by FC_{up} to obtain a higher dimensionality for the shape decoding in Sec. 5.3.

Loss. We optimize the binary cross entropy loss between our approximated function and the ground-truth indicator function as in prior works [Mescheder et al. 2019],

L_{recon}(\{f_i\}_{i=1}^{M}, O) = \mathbb{E}_{x \in \mathbb{R}^3}\big[\mathrm{BCE}\big(\hat{O}(x), O(x)\big)\big].    (22)

Surface reconstruction. We sample query points in a grid of resolution 128^3. The final surface is reconstructed with Marching Cubes [Lorensen and Cline 1987].
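The surface extraction step can be sketched as below, assuming a hypothetical predict_occupancy function that wraps the trained decoder; scikit-image's marching_cubes stands in for the Marching Cubes step.

```python
import numpy as np
from skimage.measure import marching_cubes

def extract_mesh(predict_occupancy, resolution=128, threshold=0.5):
    """Query the field on a regular grid over [-1, 1]^3 and run Marching Cubes.

    predict_occupancy: callable mapping (N, 3) query points to (N,) occupancy values."""
    lin = np.linspace(-1.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)
    occ = predict_occupancy(grid).reshape(resolution, resolution, resolution)
    verts, faces, _, _ = marching_cubes(occ, level=threshold)
    verts = verts / (resolution - 1) * 2.0 - 1.0      # voxel indices back to [-1, 1]^3
    return verts, faces

# usage sketch: verts, faces = extract_mesh(decoder_fn), where decoder_fn wraps Eq. (14)
```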
6 SHAPE GENERATION
Our proposed diffusion model combines design decisions from latent diffusion (the idea of the compressed latent space), EDM [Karras et al. 2022] (most of the training details), and our shape representation design (the architecture is based on attention and self-attention instead of convolution).

We train diffusion models in the latent space, i.e., the bottleneck in Eq. (19). Following the diffusion formulation in EDM [Karras et al. 2022], our denoising objective is

\mathbb{E}_{n_i \sim \mathcal{N}(0, \sigma^2 I)}\; \frac{1}{M} \sum_{i=1}^{M} \Big\lVert \mathrm{Denoiser}\big(\{z_i + n_i\}_{i=1}^{M}, \sigma, C\big)_i - z_i \Big\rVert_2^2,    (23)

where Denoiser(\cdot, \cdot, \cdot) is our denoising neural network, \sigma is the noise level, and C is the optional conditional information (e.g., categories, images, partial point clouds, and texts). We denote the corresponding output of z_i + n_i with the subscript i, i.e., Denoiser(\cdot, \cdot, \cdot)_i. We should minimize the loss for every noise level \sigma. The sampling is done by solving ordinary/stochastic differential equations (ODE/SDE). See Fig. 6 for an illustration and EDM [Karras et al. 2022] for a detailed description of both the forward (training) and reverse (sampling) processes.

Fig. 6. Latent set diffusion models. The diffusion model operates on compressed 3D shapes in the form of a regularized set of latent vectors \{z_i\}_{i=1}^{M}.

Fig. 7. Denoising network. Our denoising network is composed of several denoising layers (a box in the figure denotes a layer). The denoising layer for unconditional generation contains two sequential self attention blocks. The denoising layer for conditional generation contains a self attention and a cross attention block. The cross attention is for injecting condition information such as categories, images or partial point clouds.

The function Denoiser(\cdot, \cdot, \cdot) is a set denoising network (set-to-set function). The network can be easily modeled by a self-attention transformer. Each layer consists of two attention blocks. The first one is a self attention for attentive learning of the latent set. The second one is for injecting the condition information C (Fig. 7 (b)), as in prior works [Rombach et al. 2022]. For simple information like categories, C is a learnable embedding vector (e.g., 55 different embedding vectors for 55 categories). For a single-view image, we use ResNet-18 [He et al. 2016] as the context encoder to extract a global feature vector as condition C. For text conditioning, we use BERT [Devlin et al. 2018] to learn a global feature vector as C. For partial point clouds, we use the shape encoder introduced in Sec. 5.1 to obtain a set of latent embeddings as C. In the case of unconditional generation, the cross attention degrades to self attention (Fig. 7 (a)).
minimize the loss for every noise level 𝜎. The sampling is done by 2016], where each shape is rendered into RGB images of size of
solving ordinary/stochastic differential equations (ODE/SDE). See 224 × 224 from 24 random viewpoints. For text-driven shape gener-
Fig. 6 for an illustration and EDM [Karras et al. 2022] for a detailed ation, we use the text prompts of ShapeGlot [Achlioptas et al. 2019].
description for both the forward (training) and reverse (sampling) For data preprocess of shape completion training, we create partial
process. point clouds by sampling point cloud patches.

Table 3. Shape autoencoding (surface reconstruction from point clouds) on ShapeNet. We show averaged metrics on all 55 categories and individual
metrics for the 7 largest categories. We compare with existing representative methods, OccNet (global latent), ConvOccNet (local latent grid), IF-Net
(multiscale local latent grid), and 3DILG (irregular latent grid). For our method, we show two different designs. The column Learned Queries shows results of
using Eq. (16), while the column Point Queries means we are using a subsampled point set as queries in Eq. (17). The results of Point Queries are generally
better than Learned Queries. This is expected because input-dependent queries (Point Queries) are better than fixed queries (Learned Queries).
                    OccNet   ConvOccNet   IF-Net   3DILG   Ours (Learned Queries)   Ours (Point Queries)
IoU ↑
  table             0.823    0.847        0.901    0.963   0.965                    0.971
  car               0.911    0.921        0.952    0.961   0.966                    0.969
  chair             0.803    0.856        0.927    0.950   0.957                    0.964
  airplane          0.835    0.881        0.937    0.952   0.962                    0.969
  sofa              0.894    0.930        0.960    0.975   0.975                    0.982
  rifle             0.755    0.871        0.914    0.938   0.947                    0.960
  lamp              0.735    0.859        0.914    0.926   0.931                    0.956
  mean (selected)   0.822    0.881        0.929    0.952   0.957                    0.967
  mean (all)        0.825    0.888        0.934    0.953   0.955                    0.965
Chamfer ↓
  table             0.041    0.036        0.029    0.026   0.026                    0.026
  car               0.082    0.083        0.067    0.066   0.062                    0.062
  chair             0.058    0.044        0.031    0.029   0.028                    0.027
  airplane          0.037    0.028        0.020    0.019   0.018                    0.017
  sofa              0.051    0.042        0.032    0.030   0.030                    0.029
  rifle             0.046    0.025        0.018    0.017   0.016                    0.014
  lamp              0.090    0.050        0.038    0.036   0.035                    0.032
  mean (selected)   0.058    0.040        0.034    0.032   0.031                    0.030
  mean (all)        0.072    0.052        0.041    0.040   0.039                    0.038
F-Score ↑
  table             0.961    0.982        0.998    0.999   0.999                    0.999
  car               0.830    0.852        0.888    0.892   0.898                    0.899
  chair             0.890    0.943        0.990    0.992   0.994                    0.997
  airplane          0.948    0.982        0.994    0.993   0.994                    0.995
  sofa              0.918    0.967        0.988    0.986   0.986                    0.990
  rifle             0.922    0.987        0.998    0.997   0.998                    0.999
  lamp              0.820    0.945        0.970    0.971   0.970                    0.975
  mean (selected)   0.898    0.951        0.975    0.976   0.977                    0.979
  mean (all)        0.858    0.933        0.967    0.966   0.966                    0.970

7.1 Baselines
For shape auto-encoding, we conduct experiments against state-of-the-art methods for implicit surface reconstruction from point clouds. We use OccNet [Mescheder et al. 2019], ConvOccNet [Peng et al. 2020], IF-Net [Chibane et al. 2020], and 3DILG [Zhang et al. 2022] as baselines. OccNet is the first work to learn neural fields from a single global latent vector. ConvOccNet and IF-Net learn local neural fields based on latent vectors arranged in a regular grid, while 3DILG uses latent vectors on an irregular grid.

For 3D shape generation, we compare against recent state-of-the-art generative models, including PVD [Zhou et al. 2021], 3DILG [Zhang et al. 2022], and NeuralWavelet [Hui et al. 2022]. PVD is a diffusion model for 3D point cloud generation, and 3DILG utilizes autoregressive models. NeuralWavelet utilizes diffusion models in the frequency domain of shapes.

7.2 Evaluation metrics
To evaluate the reconstruction accuracy of shape auto-encoding from point clouds, we adopt Chamfer distance, volumetric Intersection-over-Union (IoU), and F-score as primary evaluation metrics. IoU is computed based on the occupancy predictions of 50k query points sampled in 3D space. Chamfer distance and F-score are calculated between two point clouds of size 50k sampled from the reconstructed and ground-truth surfaces, respectively. For IoU and F-score, higher is better, while for Chamfer, lower is better.

To measure the mesh quality of unconditional and conditional shape generation, we follow [Ibing et al. 2021; Shue et al. 2022; Zhang et al. 2022] and adapt the Fréchet Inception Distance (FID) and Kernel Inception Distance (KID), commonly used to assess image generative models, to rendered images of 3D shapes. To calculate FID and KID of rendered images, we render each shape from 10 viewpoints. The metrics are named Rendering-FID and Rendering-KID. The Rendering-FID is defined as,

\text{Rendering-FID} = \lVert \mu_g - \mu_r \rVert + \mathrm{Tr}\big(\Sigma_g + \Sigma_r - 2(\Sigma_g \Sigma_r)^{1/2}\big),    (24)
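Eq. (24) is computed from the feature statistics of the generated and reference sets; a standard implementation using SciPy's matrix square root is sketched below (the squared mean difference follows the usual Fréchet distance definition).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_g, feats_r):
    """Fréchet distance of Eq. (24) between two (N, D) feature matrices."""
    mu_g, mu_r = feats_g.mean(axis=0), feats_r.mean(axis=0)
    cov_g = np.cov(feats_g, rowvar=False)
    cov_r = np.cov(feats_r, rowvar=False)
    covmean = sqrtm(cov_g @ cov_r)
    if np.iscomplexobj(covmean):                 # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```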

[Fig. 8 columns: Input, GT, OccNet, ConvONet, IF-Net, 3DILG, and the proposed method with Learnable Queries and Point Queries.]
Fig. 8. Visualization of shape autoencoding results (surface reconstruction from point clouds from ShapeNet).
Fig. 9. Unconditional generation. All models are trained on full ShapeNet. [Rows from top to bottom: PVD, Grid-8³, 3DILG, Ours.]

where g and r denote the generated and training datasets, respectively, and \mu and \Sigma are the statistical mean and covariance matrix of the feature distribution extracted by the Inception network. The Rendering-KID is defined as,

\text{Rendering-KID} = \mathrm{MMD}\Big(\frac{1}{|\mathcal{R}|} \sum_{x \in \mathcal{R}} \max_{y \in \mathcal{G}} D(x, y)\Big)^2,    (25)

where D(x, y) is a polynomial kernel function to evaluate the similarity of two samples, G and R are the feature distributions of the generated set and the reference set, respectively, and the function MMD(\cdot) is the Maximum Mean Discrepancy. However, the rendering-based FID and KID are essentially designed to understand 3D shapes from 2D images. Thus, they have the inherent issue of not accurately capturing shape compositions in the 3D world. To compensate for their drawbacks, we also adapt the FID and KID to 3D shapes directly. For each generated or ground-truth shape, we sample 4096 points (with normals) from the surface mesh and then feed them into a pre-trained PointNet++ [Qi et al. 2017b] to extract a global latent vector representing the global structure of the 3D shape. The PointNet++ is first pretrained on shape classification on ShapeNet-55. As we use point clouds, we call the FID and KID for 3D shapes the Fréchet PointNet++ Distance (FPD) and Kernel PointNet++ Distance (KPD). The two metrics are defined similarly to Eq. (24) and Eq. (25), except that the features are extracted from a PointNet++ network.

7.3 Implementation
For the shape auto-encoder, we use a point cloud of size 2048 as input. At each iteration, we individually sample 1024 query points from the bounding volume ([−1, 1]^3) and another 1024 points from the near-surface region for the occupancy value prediction. The shape auto-encoder is trained on 8 A100 GPUs with a batch size of 512 for T = 1,600 epochs. The learning rate is linearly increased to lr_max = 5e−5 in the first t_0 = 80 epochs, and then gradually decreased using the cosine decay schedule lr_max \cdot 0.5\,(1 + \cos(\pi \frac{t - t_0}{T - t_0})) until reaching the minimum value of 1e−6. The diffusion models are trained on 4 A100 GPUs with a batch size of 256 for T = 8,000 epochs. The learning rate is linearly increased to lr_max = 1e−4 in the first t_0 = 800 epochs, and then gradually decreased using the above-mentioned decay schedule until reaching 1e−6. We use the default settings for the hyperparameters of EDM [Karras et al. 2022]. During sampling, we obtain the final latent set via only 18 denoising steps.
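The warmup-plus-cosine schedule described above can be written as a small helper; the clamped half-cosine below is our reading of the formula and is given only as a sketch.

```python
import math

def learning_rate(t, lr_max, t0, T, lr_min=1e-6):
    """Linear warmup for the first t0 epochs, then cosine decay clamped at lr_min."""
    if t < t0:
        return lr_max * (t + 1) / t0
    lr = lr_max * 0.5 * (1.0 + math.cos(math.pi * (t - t0) / (T - t0)))
    return max(lr, lr_min)

# autoencoder schedule (Sec. 7.3): learning_rate(t, 5e-5, 80, 1600)
# diffusion schedule:              learning_rate(t, 1e-4, 800, 8000)
```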
8 RESULTS
We present our results for multiple applications: 1) shape auto-encoding, 2) unconditional generation, 3) category-conditioned generation, 4) text-conditioned generation, 5) shape completion, and 6) image-conditioned generation. Finally, we perform a shape novelty analysis to validate that we are not overfitting to the dataset.

8.1 Shape Auto-Encoding
We show the quantitative results in Tab. 3 for a deterministic autoencoder without the KL block described in Sec. 5.2. In particular, we show results for the 7 largest categories as well as results averaged over the categories. The two design choices of shape encoding described in Sec. 5.1 are also investigated. Using the subsampled point cloud as queries is better than using learnable queries in all categories. Thus we use subsampled point clouds in our later experiments. The visualization of reconstruction results can be found in Fig. 8. We visualize some extremely difficult shapes from the dataset (test split). These shapes often contain thin structures. However, our method still performs well.

Both our method and the competitor 3DILG use a transformer as the main backbone. However, we differ in nature. 1) For encoding, 3DILG uses KNN to aggregate local information while we use cross attention. KNN manually selects neighboring points according to spatial similarities (distances), while cross attention learns the similarities on the go. 2) 3DILG uses a set of points and one latent per point. Our representation only contains a set of latents. This simplification makes the second-stage generative model training easier. 3) For decoding, 3DILG applies spatial interpolation while we use interpolation in feature space. The used cross attention can be seen as learnable interpolation. This gives us more flexibility.
Fig. 10. Category-conditional generation. From top to bottom, we show category (airplane, chair, table) conditioned generation results. [Rows within each category: Grid-8³, 3DILG, NW, Ours.]

The numerical results for the reconstruction are significant. The maximum achievable number for the metrics IoU and F1 is 1. The improvement has to be interpreted in terms of how much closer we get to 1. The visualizations also highlight the improvement.

Ablation study of the number of latents. The number M is the number of latent vectors used in the network. Intuitively, a larger M leads to a better reconstruction. We show results for different M in Tab. 4. Thus, in all of our experiments, M is set to 512. We are limited by computation time to work with larger M.
Ablation study of the KL block. We described the KL block in Sec. 5.2 that leads to additional compression. In addition, this block changes the deterministic shape encoding into a variational autoencoder. The introduced hyperparameter is C_0. A smaller C_0 leads to a higher compression rate. The choice of C_0 is ablated in Tab. 5. Clearly, a larger C_0 gives better results. The reconstruction results of C_0 = 8, 16, 32, 64 are very close. However, they differ significantly in the second stage, because a larger latent size could make the training of diffusion models more difficult. This result is very encouraging for our model, because it indicates that aggressively increasing the compression in the KL block does not decrease reconstruction performance too much. We can also see that compressing with the KL block by decreasing C_0 is much better than compressing using fewer latent vectors M.

8.2 Unconditional Shape Generation
Comparison with surface generation. We evaluate the task of unconditional shape generation with the proposed metrics in Tab. 6. We also compare our method with a baseline method proposed in [Zhang et al. 2022]. The method is called Grid-8³ because the latent grid size is 8³, which is exactly the same as in AutoSDF [Mittal et al. 2022]. The table also shows the results for different C_0. Our results are best when C_0 = 32 in all metrics. When C_0 = 64 the results become worse. This also aligns with our conjecture that a larger latent size makes the training more difficult.

Comparison with point cloud generation. Additionally, we compare our method with PVD [Zhou et al. 2021], which is a point cloud diffusion method. We re-train PVD using the officially released code on our preprocessed dataset and splits. We use the same evaluation protocol as before but with one major difference. Since PVD can only generate point clouds without normals, we use another pretrained PointNet++ (without normals) as the feature extractor to calculate Surface-FPD and Surface-KPD. Tab. 7 shows we can beat PVD by a large margin. Additionally, we also show the metrics calculated on rendered images. Visualizations of generated results can be found in Fig. 9.

8.3 Category-conditioned generation
We train a category-conditioned generation model using our method. We evaluate our models in Tab. 8. We should note that the competitor method NeuralWavelet [Hui et al. 2022] trains models for categories separately; thus, NeuralWavelet is not a true category-conditioned model. We also visualize some results (airplane, chair, and table) in Fig. 10. Our training is more challenging, as we train on a dataset that is an order of magnitude larger and we train for all classes jointly. While NeuralWavelet already has good results, the joint training is necessary / beneficial for many subsequent applications.

Additionally, we show evaluation metrics and more competitor methods in Tab. 9. First, we use precision and recall (P&R) [Sajjadi et al. 2018] to quantify the percentage of generated samples that are similar to the training data and the percentage of training data that can be generated, respectively. 3DILG, NeuralWavelet, and our method achieve high precision, which means they can generate shapes similar to the training data; however, our method also shows significantly better recall, which means our method can generate a higher percentage of the training data. For 3DShapeGen and AutoSDF, both precision and recall are low compared to other methods. Second, we show other metrics based on point cloud distances (CD and EMD) [Achlioptas et al. 2018]. The smaller the better for MMD, and the larger the better for COV. These metrics are often used to evaluate point cloud generation.

Fig. 11. Text conditioned generation. For each text prompt ("horizontal slats on top of back", "one big hole between back and seat", "this chair has wheels", "vertical back ribs"), we generate 3 shapes. Our results (right) are compared with AutoSDF (left).

8.4 Text-conditioned generation
The results of our text-conditioned generation model can be found in Fig. 11. Since the model is a probabilistic model, we can sample shapes given a text prompt. The results are very encouraging and they constitute the first demonstration of text-conditioned 3D shape generation using diffusion models. To the best of our knowledge, there are no published competing methods at the point of submitting this work.

8.5 Probabilistic shape completion
We also extend our diffusion model to probabilistic shape completion by using a partial point cloud as conditioning input. The comparison against ShapeFormer [Yan et al. 2022] is depicted in Fig. 12. As seen, our latent set diffusion can predict more accurate completions, and we also have the ability to achieve more diverse generations.
Table 4. Ablation study for different numbers of latents M for shape autoencoding.

            M = 512   M = 256   M = 128   M = 64
IoU ↑       0.965     0.956     0.940     0.916
Chamfer ↓   0.038     0.039     0.043     0.049
F-Score ↑   0.970     0.965     0.953     0.929

Table 5. Ablation study for different numbers of channels C_0 for shape (variational) autoencoding.

            C_0 = 1   C_0 = 2   C_0 = 4   C_0 = 8   C_0 = 16   C_0 = 32   C_0 = 64
IoU ↑       0.727     0.816     0.957     0.960     0.962      0.963      0.964
Chamfer ↓   0.133     0.087     0.038     0.038     0.038      0.038      0.038
F-Score ↑   0.703     0.815     0.967     0.967     0.970      0.969      0.970

Table 6. Unconditional generation on full ShapeNet.

                          Grid-8³   3DILG   Ours (C_0=8)   Ours (C_0=16)   Ours (C_0=32)   Ours (C_0=64)
Surface-FPD ↓             4.03      1.89    2.71           1.87            0.76            0.97
Surface-KPD (×10³) ↓      6.15      2.17    3.48           2.42            0.66            1.11
Rendering-FID ↓           32.78     24.83   28.25          27.26           17.08           24.24
Rendering-KID (×10³) ↓    14.12     10.51   14.60          19.37           6.75            11.76

Table 7. Unconditional generation on full ShapeNet.

                          PVD      Ours
Surface-FPD ↓             2.33     0.63
Surface-KPD (×10³) ↓      2.65     0.53
Rendering-FID ↓           270.64   17.08
Rendering-KID (×10³) ↓    281.54   6.75

Table 8. Category conditioned generation. NW is short for NeuralWavelet. The dash sign "-" means the method NeuralWavelet does not release models trained on these categories.

                      airplane             chair                table                car                  sofa
                      3DILG  NW    Ours    3DILG  NW    Ours    3DILG  NW    Ours    3DILG  NW    Ours    3DILG  NW    Ours
Surface-FID           0.71   0.38  0.62    0.96   1.14  0.76    2.10   1.12  1.19    2.93   -     2.04    1.83   -     0.77
Surface-KID (×10³)    0.81   0.53  0.83    1.21   1.50  0.70    3.84   1.55  1.87    7.35   -     3.90    3.36   -     0.70

Table 9. Category conditioned generation II. We show results for additional metrics and additional methods for category conditioned generation.

                      chair                                           table
                      3DILG   3DShapeGen   AutoSDF   NW      Ours     3DILG   3DShapeGen   AutoSDF   NW      Ours
Precision ↑           0.87    0.56         0.42      0.89    0.86     0.85    0.64         0.64      0.83    0.83
Recall ↑              0.65    0.45         0.23      0.57    0.86     0.59    0.52         0.69      0.68    0.89
MMD-CD (×10²) ↓       1.78    2.14         7.27      2.14    1.78     2.85    2.65         2.77      2.68    2.38
MMD-EMD (×10²) ↓      9.43    10.55        19.57     11.15   9.41     11.02   9.53         9.63      9.60    8.81
COV-CD (×10²) ↑       31.95   28.01        6.31      29.19   37.48    18.54   23.61        21.55     21.71   25.83
COV-EMD (×10²) ↑      36.29   36.69        18.34     34.91   45.36    27.73   43.26        29.16     30.74   43.58

8.6 Image-conditioned shape generation
We also provide comparisons on the task of single-view 3D object reconstruction in Fig. 13. Compared to other deterministic methods, including OccNet [Mescheder et al. 2019] and IM-Net [Chen and Zhang 2019], our latent set diffusion can not only reconstruct more accurate surface details, but also support multi-modal prediction, which is a desired property to deal with severe occlusions.

8.7 Shape novelty analysis
We use shape retrieval to demonstrate that we are not simply overfitting to the training set. Given a generated shape, we measure the Chamfer distance between it and the training shapes. The visualization of retrieved shapes can be found in Fig. 14. Clearly, the model can synthesize new shapes with realistic structures.
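The retrieval used here can be sketched as a symmetric Chamfer distance between sampled surface point clouds, as in the following illustrative helper.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (N, 3) and Q (K, 3)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)   # (N, K) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def retrieve_nearest(generated, training_clouds):
    """Index of the training shape closest to a generated shape (top-1 retrieval)."""
    return int(np.argmin([chamfer_distance(generated, t) for t in training_clouds]))
```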
8.8 Limitations
While our method shows convincing results on a variety of tasks, our design choices also have drawbacks that we would like to discuss. For instance, we require a two-stage training strategy. While this leads to improved performance in terms of generation quality, training the first stage is more time consuming than relying on manually-designed features such as wavelets [Hui et al. 2022]. In addition, the first stage might require retraining if the shape data in consideration changes, and for the second stage – the core of our diffusion architecture – training time is also relatively high. Overall, we believe that there is significant potential for future research avenues to speed up training, in particular in the context of diffusion models.

9 CONCLUSION
We have introduced 3DShape2VecSet, a novel shape representation for neural fields that is tailored to generative diffusion models. To this end, we combine ideas from radial basis functions, previous neural field architectures, variational autoencoding, as well as cross attention and self-attention to design a learnable representation.
GT Condition ShapeFormer Ours shape generative modeling tasks, including unconditioned genera-
tion, category-conditioned generation, text-conditioned generation,
point-cloud completion, and image-conditioned generation.
In future work, we see many exciting possibilities. Most impor-
tantly, we believe that our model further advances the state of the
art in point cloud and shape processing on a large variety of tasks.
In particular, we would like to employ the network architecture of
3DShape2VecSet to tackle the problem of surface reconstruction
from scanned point clouds. In addition, we can see many applica-
tions for content-creation tasks, for example 3D shape generation
of textured models along with their material properties. Finally, we
would like to explore editing and manipulation tasks leveraging
pretrained diffusion models for prompt to prompt shape editing,
leveraging the recent advances in image diffusion models.

ACKNOWLEDGMENTS
We would like to acknowledge Anna Frühstück for helping with
Fig. 12. Point cloud conditioned generation. We show three generated
figures and the video voiceover. This work was supported by the
results given a partial cloud. The ground-truth point cloud and the partial
point cloud used as condition are shown in Left. We compare our results SDAIA-KAUST Center of Excellence in Data Science and Artificial
(Right) with ShapeFormer (Middle). Intelligence (SDAIA-KAUST AI) as well as the ERC Starting Grant
Scan2CAD (804724).
Condition IM-Net OccNet Ours
REFERENCES
Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. 2018. Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning. PMLR, 40–49.
Panos Achlioptas, Judy Fan, Robert Hawkins, Noah Goodman, and Leonidas J Guibas. 2019. ShapeGlot: Learning language for shape differentiation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8938–8947.
Alexandre Boulch and Renaud Marlet. 2022. Poco: Point convolution for surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6302–6314.
Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. 2016. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236 (2016).
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213–229.
Jonathan C Carr, Richard K Beatson, Jon B Cherrie, Tim J Mitchell, W Richard Fright, Bruce C McCallum, and Tim R Evans. 2001. Reconstruction and representation of 3D objects with radial basis functions. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 67–76.
Rohan Chabra, Jan E Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. 2020. Deep local shapes: Learning local sdf priors for detailed 3d reconstruction. In European Conference on Computer Vision. Springer, 608–625.
Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. 2022. Efficient Geometry-aware 3D Generative Adversarial Networks. In CVPR.
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015).
Zhiqin Chen and Hao Zhang. 2019. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5939–5948.
An-Chieh Cheng, Xueting Li, Sifei Liu, Min Sun, and Ming-Hsuan Yang. 2022. Autoregressive 3d shape generation via canonical mapping. arXiv preprint arXiv:2204.01955 (2022).
Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. 2020. Implicit functions in feature space for 3d shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6970–6981.
Gene Chou, Yuval Bahat, and Felix Heide. 2022. DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions. arXiv preprint arXiv:2211.13757 (2022).
Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In European Conference on Computer Vision (2016), 628–644.

Angela Dai, Christian Diller, and Matthias Nießner. 2020. Sg-nn: Sparse generative neural networks for self-supervised scene completion of rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 849–858.
Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. 2017. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5868–5877.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021).
Philipp Erler, Paul Guerrero, Stefan Ohrhallinger, Niloy J Mitra, and Michael Wimmer. 2020. Points2surf: Learning implicit surfaces from point clouds. In European Conference on Computer Vision. Springer, 108–124.
Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12873–12883.
Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, and Thomas Funkhouser. 2020. Local deep implicit functions for 3d shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4857–4866.
Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. 2016. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision. Springer, 484–499.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014), 2672–2680.
Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. 2021. Pct: Point cloud transformer. Computational Visual Media 7, 2 (2021), 187–199.
Christian Häne, Shubham Tulsiani, and Jitendra Malik. 2017. Hierarchical surface prediction for 3d object reconstruction. In 2017 International Conference on 3D Vision (3DV). IEEE, 412–420.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. 2022. Neural wavelet-domain diffusion for 3d shape generation. In SIGGRAPH Asia 2022 Conference Papers. 1–9.
Moritz Ibing, Isaak Lim, and Leif Kobbelt. 2021. 3d shape generation with grid-based implicit functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13559–13568.
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International Conference on Machine Learning. PMLR, 4651–4664.
Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, Thomas Funkhouser, et al. 2020. Local implicit grid representations for 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6001–6010.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. Elucidating the Design Space of Diffusion-Based Generative Models. In Proc. NeurIPS.
Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), Yoshua Bengio and Yann LeCun (Eds.).
Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. 2006. A tutorial on energy-based learning. Predicting Structured Data 1, 0 (2006).
Tianyang Li, Xin Wen, Yu-Shen Liu, Hua Su, and Zhizhong Han. 2022. Learning deep implicit functions for 3D shapes with dynamic code clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12840–12850.
William E Lorensen and Harvey E Cline. 1987. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics 21, 4 (1987), 163–169.
Andreas Lugmayr, Martin Danelljan, Andrés Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. ArXiv abs/2201.09865 (2022).
Shitong Luo and Wei Hu. 2021. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2837–2845.
Donald JR Meagher. 1980. Octree encoding: A new technique for the representation, manipulation and display of arbitrary 3-d objects by computer. Electrical and Systems Engineering Department, Rensselaer Polytechnic . . .
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4460–4470.
Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. 2019. Deep level sets: Implicit surface representations for 3d shape inference. arXiv preprint arXiv:1901.06802 (2019).
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shubham Tulsiani. 2022. Autosdf: Shape priors for 3d completion, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 306–315.
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas Guibas. 2019. StructureNet: Hierarchical Graph Networks for 3D Shape Generation. ACM Transactions on Graphics (TOG), SIGGRAPH Asia 2019 38, 6 (2019), Article 242.
Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. 2020. Polygen: An autoregressive generative model of 3d meshes. In International Conference on Machine Learning. PMLR, 7220–7229.
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 165–174.
Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020. Convolutional occupancy networks. In European Conference on Computer Vision. Springer, 523–540.
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022).
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 652–660.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30 (2017).
Aditya Ramesh. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents.
Danilo Rezende and Shakir Mohamed. 2015. Variational Inference with Normalizing Flows. In International Conference on Machine Learning. 1530–1538.
Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. 2017b. Octnetfusion: Learning depth fusion from data. In 2017 International Conference on 3D Vision (3DV). IEEE, 57–66.
Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. 2017a. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 3.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. Assessing generative models via precision and recall. Advances in Neural Information Processing Systems 31 (2018).
Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. 2022. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6229–6238.
J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 2022. 3D Neural Field Generation using Triplane Diffusion. arXiv preprint arXiv:2211.16677 (2022).
Yongbin Sun, Yue Wang, Ziwei Liu, Joshua Siegel, and Sanjay Sarma. 2020. Pointgrow: Autoregressively learned point cloud generation with self-attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 61–70.
Jiapeng Tang, Jiabao Lei, Dan Xu, Feiying Ma, Kui Jia, and Lei Zhang. 2021. Sa-convonet: Sign-agnostic optimization of convolutional occupancy networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6504–6513.
Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. 2017. Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs. In 2017 IEEE International Conference on Computer Vision (ICCV). 2107–2115.
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36, 4 (2017), 72.

Peng-Shuai Wang, Chun-Yu Sun, Yang Liu, and Xin Tong. 2018. Adaptive O-CNN: A patch-based deep representation of 3D shapes. In SIGGRAPH Asia 2018 Technical Papers. ACM, 217.
Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38, 5 (2019), 1–12.
Francis Williams, Zan Gojcic, Sameh Khamis, Denis Zorin, Joan Bruna, Sanja Fidler, and Or Litany. 2022. Neural fields as learnable kernels for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18500–18510.
Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems. 82–90.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1912–1920.
Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. 2016. A theory of generative convnet. In International Conference on Machine Learning. PMLR, 2635–2644.
Jianwen Xie, Yifei Xu, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu. 2021. Generative pointnet: Deep energy-based learning on unordered point sets for 3d generation, reconstruction and classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14976–14985.
Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu. 2020. Generative VoxelNet: Learning energy-based models for 3D shape synthesis and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
Xingguang Yan, Liqiang Lin, Niloy J Mitra, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. 2022. Shapeformer: Transformer-based shape completion via sparse representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6239–6249.
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. 2019. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4541–4550.
Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. 2022. LION: Latent Point Diffusion Models for 3D Shape Generation. arXiv preprint arXiv:2210.06978 (2022).
Biao Zhang, Matthias Nießner, and Peter Wonka. 2022. 3DILG: Irregular Latent Grids for 3D Generative Modeling. In Advances in Neural Information Processing Systems. https://openreview.net/forum?id=RO0wSr3R7y-
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. 2021. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16259–16268.
Xin-Yang Zheng, Yang Liu, Peng-Shuai Wang, and Xin Tong. 2022. SDF-StyleGAN: Implicit SDF-Based StyleGAN for 3D Shape Generation. In Comput. Graph. Forum (SGP).
Linqi Zhou, Yilun Du, and Jiajun Wu. 2021. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5826–5835.
