Fig. 1. Left: shape autoencoding results (surface reconstruction from point clouds). Right: downstream applications of 3DShape2VecSet (from top to bottom): (a) category-conditioned generation; (b) point-cloud-conditioned generation (shape completion from partial point clouds); (c) image-conditioned generation (shape reconstruction from single-view images); (d) text-conditioned generation.
Fig. 2. Continuous function representations: (a) RBF, (b) Global Latent, (c) Latent Grid, (d) Irregular Latent Grid, (e) Latent Set (Ours). Scalars are represented with spheres, vectors with cubes. The arrows show how spatial interpolation is computed. $\mathbf{x}_i$ and $\mathbf{x}$ are the coordinates of an anchor and a query point, respectively. $\lambda_i$ is the SDF value of the anchor point $\mathbf{x}_i$ in (a). $\mathbf{f}_i$ is the associated feature vector located at $\mathbf{x}_i$ in (c)(d). The queried SDF/feature of $\mathbf{x}$ is based on the distance function $\phi(\mathbf{x}, \mathbf{x}_i)$ in (a)(c)(d), while our proposed latent set representation (e) uses the similarity $\phi(\mathbf{x}, \mathbf{f}_i) = \exp\left(\mathbf{q}(\mathbf{x})^\top \mathbf{k}(\mathbf{f}_i)/\sqrt{d}\right)$ between the query coordinate and the anchored features via a cross attention mechanism.
Table 1. Neural fields for 3D shapes. We categorize methods according to the position of the latents.

# Latents   Latent Position   Methods
Single      Global            OccNet [Mescheder et al. 2019]; DeepSDF [Park et al. 2019]; IM-Net [Chen and Zhang 2019]
Multiple    Regular Grid      ConvOccNet [Peng et al. 2020]; IF-Net [Chibane et al. 2020]; LIG [Jiang et al. 2020]; DeepLS [Chabra et al. 2020]; SA-ConvOccNet [Tang et al. 2021]; NKF [Williams et al. 2022]
Multiple    Irregular Grid    LDIF [Genova et al. 2020]; Point2Surf [Erler et al. 2020]; DCC-DIF [Li et al. 2022]; 3DILG [Zhang et al. 2022]; POCO [Boulch and Marlet 2022]
Multiple    Global            Ours

Table 2. Generative models for 3D shapes.

Method                                  Generative Model   3D Representation
3D-GAN [Wu et al. 2016]                 GAN                Voxels
l-GAN [Achlioptas et al. 2018]          GAN★               Point Clouds
IM-GAN [Chen and Zhang 2019]            GAN★               Fields
PointFlow [Yang et al. 2019]            NF                 Point Clouds
GenVoxelNet [Xie et al. 2020]           EBM                Voxels
PointGrow [Sun et al. 2020]             AR                 Point Clouds
PolyGen [Nash et al. 2020]              AR                 Meshes
GenPointNet [Xie et al. 2021]           EBM                Point Clouds
3DShapeGen [Ibing et al. 2021]          GAN★               Fields
DPM [Luo and Hu 2021]                   DM                 Point Clouds
PVD [Zhou et al. 2021]                  DM                 Point Clouds
AutoSDF [Mittal et al. 2022]            AR★                Voxels
CanMap [Cheng et al. 2022]              AR★                Point Clouds
ShapeFormer [Yan et al. 2022]           AR★                Fields
3DILG [Zhang et al. 2022]               AR★                Fields
LION [Zeng et al. 2022]                 DM★                Point Clouds
SDF-StyleGAN [Zheng et al. 2022]        GAN                Fields
NeuralWavelet [Hui et al. 2022]         DM★                Fields
TriplaneDiffusion [Shue et al. 2022]⋄   DM★                Fields
DiffusionSDF [Chou et al. 2022]⋄        DM★                Fields
Ours                                    DM★                Fields

★ Generative models in latent space.
⋄ Works in submission.

the grid resolution. Thus, most voxel-based methods are limited to low resolution. Octree-based decoders [Häne et al. 2017; Meagher 1980; Riegler et al. 2017b,a; Tatarchenko et al. 2017; Wang et al. 2017, 2018] and sparse hash-based decoders [Dai et al. 2020] take 3D space sparsity into account, alleviating the efficiency issues and supporting high-resolution outputs.

Point Clouds. Early works on neural-network-based point cloud processing include PointNet [Qi et al. 2017a,b] and DGCNN [Wang et al. 2019]. These works are built upon per-point fully connected layers. More recently, transformers [Vaswani et al. 2017] were proposed for point cloud processing, e.g., [Guo et al. 2021; Zhang et al. 2022; Zhao et al. 2021]. These works are inspired by Vision Transformers (ViT) [Dosovitskiy et al. 2021] in the image domain. Points are first grouped into patches to form tokens and then fed into a transformer with self-attention. In this work, we also introduce a network for processing point clouds. Improving upon previous works, we compress a given point cloud to a small representation that is more suitable for generative modeling.

Neural Fields. A recent trend is to use neural fields as a 3D data representation. The key building block is a neural network which accepts a 3D coordinate as input, and outputs a scalar [Chen and Zhang 2019; Mescheder et al. 2019; Michalkiewicz et al. 2019; Park et al. 2019] or a vector [Chan et al. 2022; Mildenhall et al. 2020]. A 3D object is then implicitly defined by this neural network. Neural fields have gained considerable popularity as they can generate objects with arbitrary topologies and infinite resolution. The methods are also called neural implicit representations or coordinate-based networks. Neural fields for 3D shape modeling can be categorized into global methods and local methods. 1) The global methods encode a shape with a single global latent vector [Mescheder et al. 2019; Park et al. 2019]. Usually the capacity of these kinds of
methods is limited and they are unable to encode shape details. 2) The local methods use localized latent vectors which are defined for 3D positions on either a regular grid [Chibane et al. 2020; Jiang et al. 2020; Peng et al. 2020; Tang et al. 2021] or an irregular grid [Boulch and Marlet 2022; Genova et al. 2020; Li et al. 2022; Zhang et al. 2022]. In contrast, we propose a latent representation where latent vectors do not have associated 3D positions. Instead, we learn to represent a shape as a list of latent vectors. See Tab. 1.

2.2 Generative models.
We have seen great success in different 2D image generative models in the past decade. Popular deep generative methods include generative adversarial networks (GANs) [Goodfellow et al. 2014], variational autoencoders (VAEs) [Kingma and Welling 2014], normalizing flows (NFs) [Rezende and Mohamed 2015], energy-based models [LeCun et al. 2006; Xie et al. 2016], autoregressive models (ARs) [Esser et al. 2021; Van Den Oord et al. 2017] and, more recently, diffusion models (DMs) [Ho et al. 2020], which are the chosen generative model in our work.

In the 3D domain, GANs have been popular for 3D generation [Achlioptas et al. 2018; Chen and Zhang 2019; Ibing et al. 2021; Wu et al. 2016; Zheng et al. 2022], while only a few works use NFs [Yang et al. 2019] and VAEs [Mo et al. 2019]. A lot of recent work employs ARs [Cheng et al. 2022; Mittal et al. 2022; Nash et al. 2020; Sun et al. 2020; Yan et al. 2022; Zhang et al. 2022]. DMs for 3D shapes are relatively unexplored compared to other generative methods.

There are several DMs dealing with point cloud data [Luo and Hu 2021; Zeng et al. 2022; Zhou et al. 2021]. Due to the high degrees of freedom of regressed coordinates, it is always difficult to obtain clean manifold surfaces via post-processing. As mentioned before, we believe that neural fields are generally more suitable than point clouds for 3D shape generation. The area of combining DMs and neural fields is still underexplored.

DreamFusion [Poole et al. 2022] explores how to extract 3D information from a pretrained 2D image diffusion model. The recent NeuralWavelet [Hui et al. 2022] first encodes shapes (represented as signed distance fields) into the frequency domain with the wavelet transform, and then trains DMs on the frequency coefficients. While this formulation is elegant, generative models generally work better on learned representations. Some concurrent works [Chou et al. 2022; Shue et al. 2022] in submission also utilize DMs in a latent space for neural field generation. TriplaneDiffusion [Shue et al. 2022] trains an autodecoder first for each shape. DiffusionSDF [Chou et al. 2022] runs a shape autoencoder based on triplane features [Peng et al. 2020].

Summary of 3D generation methods. We list several 3D generation methods in Tab. 2, highlighting the choice of generative model (GAN, DM, EBM, NF, or AR) and the choice of data structure to represent 3D shapes (point clouds, meshes, voxels or fields).

3 PRELIMINARIES
An attention layer [Vaswani et al. 2017] has three types of inputs: queries, keys, and values. Queries $\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_{N_q}] \in \mathbb{R}^{d \times N_q}$ and keys $\mathbf{K} = [\mathbf{k}_1, \mathbf{k}_2, \ldots, \mathbf{k}_{N_k}] \in \mathbb{R}^{d \times N_k}$ are first compared to produce coefficients $\mathbf{q}_j^\top \mathbf{k}_i / \sqrt{d}$ (they need to be normalized with the softmax function),

$$A_{i,j} = \frac{\exp\left(\mathbf{q}_j^\top \mathbf{k}_i / \sqrt{d}\right)}{\sum_{i=1}^{N_k} \exp\left(\mathbf{q}_j^\top \mathbf{k}_i / \sqrt{d}\right)}.$$ (1)

The coefficients are then used to (linearly) combine values $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{N_k}] \in \mathbb{R}^{d_v \times N_k}$. We can write the output of an attention layer as follows,

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \begin{bmatrix} \mathbf{o}_1 & \mathbf{o}_2 & \cdots & \mathbf{o}_{N_q} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{N_k} A_{i,1} \mathbf{v}_i & \sum_{i=1}^{N_k} A_{i,2} \mathbf{v}_i & \cdots & \sum_{i=1}^{N_k} A_{i,N_q} \mathbf{v}_i \end{bmatrix} \in \mathbb{R}^{d_v \times N_q}$$ (2)

Cross Attention. Given two sets $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_{N_a}] \in \mathbb{R}^{d_a \times N_a}$ and $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_{N_b}] \in \mathbb{R}^{d_b \times N_b}$, the query vectors $\mathbf{Q}$ are constructed with a linear function $\mathbf{q}(\cdot) : \mathbb{R}^{d_a} \to \mathbb{R}^{d}$ by taking elements of $\mathbf{A}$ as input. Similarly, we construct $\mathbf{K}$ and $\mathbf{V}$ with $\mathbf{k}(\cdot) : \mathbb{R}^{d_b} \to \mathbb{R}^{d}$ and $\mathbf{v}(\cdot) : \mathbb{R}^{d_b} \to \mathbb{R}^{d}$, respectively. The inputs of both $\mathbf{k}(\cdot)$ and $\mathbf{v}(\cdot)$ are from $\mathbf{B}$. Each column in the output of Eq. (2) can be written as,

$$\mathbf{o}(\mathbf{a}_j, \mathbf{B}) = \sum_{i=1}^{N_b} \mathbf{v}(\mathbf{b}_i) \cdot \frac{1}{Z(\mathbf{a}_j, \mathbf{B})} \exp\left(\mathbf{q}(\mathbf{a}_j)^\top \mathbf{k}(\mathbf{b}_i) / \sqrt{d}\right),$$ (3)

where $Z(\mathbf{a}_j, \mathbf{B}) = \sum_{i=1}^{N_b} \exp\left(\mathbf{q}(\mathbf{a}_j)^\top \mathbf{k}(\mathbf{b}_i) / \sqrt{d}\right)$ is a normalizing factor. The cross attention operator between two sets is,

$$\text{CrossAttn}(\mathbf{A}, \mathbf{B}) = \begin{bmatrix} \mathbf{o}(\mathbf{a}_1, \mathbf{B}) & \mathbf{o}(\mathbf{a}_2, \mathbf{B}) & \cdots & \mathbf{o}(\mathbf{a}_{N_a}, \mathbf{B}) \end{bmatrix} \in \mathbb{R}^{d \times N_a}$$ (4)

Self Attention. In the case of self attention, we let the two sets be the same, $\mathbf{A} = \mathbf{B}$,

$$\text{SelfAttn}(\mathbf{A}) = \text{CrossAttn}(\mathbf{A}, \mathbf{A}).$$ (5)

4 LATENT REPRESENTATION FOR NEURAL FIELDS
Our representation is inspired by radial basis functions (RBFs). We will therefore describe our surface representation design using RBFs as a starting point, and how we extended them using concepts from neural fields and the transformer architecture. A continuous function can be represented with a set of weighted points in 3D using RBFs:

$$\hat{\mathcal{O}}_{\text{RBF}}(\mathbf{x}) = \sum_{i=1}^{M} \lambda_i \cdot \phi(\mathbf{x}, \mathbf{x}_i)$$ (6)

where $\phi(\mathbf{x}, \mathbf{x}_i)$ is a radial basis function (RBF) and typically represents the similarity (or dissimilarity) between two inputs,

$$\phi(\mathbf{x}, \mathbf{x}_i) = \phi(\|\mathbf{x} - \mathbf{x}_i\|).$$ (7)

Given ground-truth occupancies of $\mathbf{x}_i$, the values of $\lambda_i$ can be obtained by solving a system of linear equations. In this way, we can represent the continuous function $\mathcal{O}(\cdot)$ as a set of $M$ points including their corresponding weights,

$$\left\{ \lambda_i \in \mathbb{R},\ \mathbf{x}_i \in \mathbb{R}^3 \right\}_{i=1}^{M}.$$ (8)
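As a concrete illustration of Eqs. (6)-(8), the following minimal sketch fits the weights $\lambda_i$ for a handful of anchor points by solving the linear system and then evaluates the interpolant at a query point. The Gaussian kernel, the shape parameter `eps`, the toy anchors, and all function names are illustrative assumptions; the text does not prescribe a specific $\phi$.

```python
import math


def gaussian_rbf(x, xi, eps=1.0):
    """phi(x, x_i) = phi(||x - x_i||), here with a Gaussian profile (assumed)."""
    r2 = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-eps * r2)


def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_rbf(anchors, values):
    """Find lambda_i such that sum_i lambda_i * phi(x_j, x_i) = values[j]."""
    A = [[gaussian_rbf(xj, xi) for xi in anchors] for xj in anchors]
    return solve(A, values)


def eval_rbf(x, anchors, weights):
    """Eq. (6): O_RBF(x) = sum_i lambda_i * phi(x, x_i)."""
    return sum(w * gaussian_rbf(x, xi) for w, xi in zip(weights, anchors))


# Toy data: four anchors with ground-truth occupancies (illustrative only).
anchors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
occupancy = [1.0, 0.0, 0.0, 1.0]
weights = fit_rbf(anchors, occupancy)
query_value = eval_rbf((0.2, 0.1, 0.4), anchors, weights)
```

The stored representation is exactly the set in Eq. (8): one scalar weight and one 3D position per anchor. The dense linear solve also hints at why this approach scales poorly when many anchors are needed.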
Fig. 3. Shape autoencoding pipeline, consisting of shape encoding (Sec. 5.1), KL regularization (Sec. 5.2), and shape decoding (Sec. 5.3). Given a 3D ground-truth surface mesh as the input, we first sample a point cloud that is mapped to positional embeddings and encode them into a set of latent codes through a cross-attention module (Sec. 5.1). Next, we perform (optional) compression and KL-regularization in the latent space to obtain structured and compact latent shape representations (Sec. 5.2). Finally, self-attention is carried out to aggregate and exchange information within the latent set, and a cross-attention module calculates the interpolation weights of query points. The interpolated feature vectors are fed into a fully connected layer for occupancy prediction (Sec. 5.3).
However, in order to retain the details of a 3D shape, we often need a very large number of points (e.g., $M = 80{,}000$ in [Carr et al. 2001]). This representation does not benefit from recent advances in representation learning and cannot compete with more compact learned representations. We therefore want to modify the representation to change it into a neural field.

One approach to neural fields is to represent each shape as a separate neural network (making the network weights of a fixed size network the representation of a shape) and train a diffusion process as a hypernetwork. A second approach is to have a shared encoder-decoder network for all shapes and represent each shape as a latent computed by the encoder. We opt for the second approach, as it leads to more compact representations because it is jointly learned from all shapes in the data set and the network weights themselves do not count towards the latent representation. Such a neural field takes a tuple of coordinates $\mathbf{x}$ and a $C$-dimensional latent $\mathbf{f}$ as input and outputs occupancy,

$$\hat{\mathcal{O}}_{\text{NN}}(\mathbf{x}) = \text{NN}(\mathbf{x}, \mathbf{f}),$$ (9)

where $\text{NN} : \mathbb{R}^3 \times \mathbb{R}^C \to [0, 1]$ is a neural network. A first approach was to use a single global latent $\mathbf{f}$, but a major limitation is its inability to encode shape details [Mescheder et al. 2019]. Some follow-up works study coordinate-dependent latents [Chibane et al. 2020; Peng et al. 2020; Sajjadi et al. 2022] that combine traditional data structures such as regular grids with the neural field concept. Latent vectors are arranged in a spatial data structure and then interpolated (trilinearly) to obtain the coordinate-dependent latent $\mathbf{f}_{\mathbf{x}}$. A recent work, 3DILG [Zhang et al. 2022], proposed a sparse representation for 3D shapes, using latents $\mathbf{f}_i$ arranged in an irregular grid at point locations $\mathbf{x}_i$. The final coordinate-dependent latent $\mathbf{f}_{\mathbf{x}}$ is then estimated by kernel regression,

$$\mathbf{f}_{\mathbf{x}} = \hat{\mathcal{F}}_{\text{KN}}(\mathbf{x}) = \sum_{i=1}^{M} \mathbf{f}_i \cdot \frac{1}{Z\left(\mathbf{x}, \{\mathbf{x}_i\}_{i=1}^{M}\right)} \phi(\mathbf{x}, \mathbf{x}_i),$$ (10)

where $Z\left(\mathbf{x}, \{\mathbf{x}_i\}_{i=1}^{M}\right) = \sum_{i=1}^{M} \phi(\mathbf{x}, \mathbf{x}_i)$ is a normalizing factor. Thus the representation for a 3D shape can be written as

$$\left\{ \mathbf{f}_i \in \mathbb{R}^C,\ \mathbf{x}_i \in \mathbb{R}^3 \right\}_{i=1}^{M}.$$ (11)

After that, an $\text{MLP} : \mathbb{R}^C \to [0, 1]$ is applied to project the approximated feature $\hat{\mathcal{F}}_{\text{KN}}(\mathbf{x})$ to occupancy,

$$\hat{\mathcal{O}}_{\text{3DILG}}(\mathbf{x}) = \text{MLP}\left(\hat{\mathcal{F}}_{\text{KN}}(\mathbf{x})\right).$$ (12)

Neural networks with latent sets (proposed). We initially explored many variations for 3D shape representation based on irregular and regular grids as well as tri-planes, frequency compositions, and other factored representations. Ultimately, we could not improve on existing irregular grids. However, we were able to achieve a significant improvement with the following change. We aim to keep the structure of an irregular grid and the interpolation, but without representing the actual spatial position explicitly. We let the network encode spatial information. Both representations (RBF in Eq. (6) and 3DILG in Eq. (10)) are composed of two parts, values and similarities. We keep the structure of the interpolation, but eliminate explicit point coordinates and integrate cross attention from Eq. (3). The result is the following learnable function approximator,

$$\hat{\mathcal{F}}(\mathbf{x}) = \sum_{i=1}^{M} \mathbf{v}(\mathbf{f}_i) \cdot \frac{1}{Z\left(\mathbf{x}, \{\mathbf{f}_i\}_{i=1}^{M}\right)} e^{\mathbf{q}(\mathbf{x})^\top \mathbf{k}(\mathbf{f}_i) / \sqrt{d}},$$ (13)

where $Z\left(\mathbf{x}, \{\mathbf{f}_i\}_{i=1}^{M}\right) = \sum_{i=1}^{M} e^{\mathbf{q}(\mathbf{x})^\top \mathbf{k}(\mathbf{f}_i) / \sqrt{d}}$ is a normalizing factor. Similar to the MLP in Eq. (12), we apply a single fully connected layer to get the desired occupancy values,

$$\hat{\mathcal{O}}(\mathbf{x}) = \text{FC}\left(\hat{\mathcal{F}}(\mathbf{x})\right).$$ (14)

Compared to 3DILG and all other coordinate-latent-based methods, we dropped the dependency on the coordinate set $\{\mathbf{x}_i\}_{i=1}^{M}$; the new
representation only contains a set of latents,

$$\left\{ \mathbf{f}_i \in \mathbb{R}^C \right\}_{i=1}^{M}.$$ (15)

An alternative view of our proposed function approximator is to see it as cross attention between query points $\mathbf{x}$ and a set of latents.

5 NETWORK ARCHITECTURE FOR SHAPE REPRESENTATION LEARNING
In this section, we will discuss how we design a variational autoencoder based on the latent representation proposed in Sec. 4. The architecture has three components discussed in the following: a 3D shape encoder, a KL regularization block, and a 3D shape decoder.

5.1 Shape encoding
We sample the surfaces of 3D input shapes in a 3D shape dataset. This results in a point cloud of size $N$ for each shape, $\{\mathbf{x}_i \in \mathbb{R}^3\}_{i=1}^{N}$, or in matrix form $\mathbf{X} \in \mathbb{R}^{3 \times N}$. While the dataset used in the paper originally represents shapes as triangle meshes, our framework is directly compatible with other surface representations, such as scanned point clouds, spline surfaces, or implicit surfaces.

In order to learn representations in the form of Eq. (15), the first challenge is to aggregate the information contained in a possibly large point cloud $\{\mathbf{x}_i\}_{i=1}^{N}$ into a smaller set of latent vectors $\{\mathbf{f}_i\}_{i=1}^{M}$. We design a set-to-set network to this effect.

A popular solution to this problem in previous work is to divide the large point cloud into a smaller set of patches and to learn one latent vector per patch. Although this is a very well researched and standard component in many networks, we discovered a more successful way to aggregate features from a large point cloud that is more compatible with the transformer architecture. We considered two options.

One way is to define a learnable query set. Inspired by DETR [Carion et al. 2020] and Perceiver [Jaegle et al. 2021], we use cross attention to encode $\mathbf{X}$,

$$\text{Enc}_{\text{learnable}}(\mathbf{X}) = \text{CrossAttn}(\mathbf{L}, \text{PosEmb}(\mathbf{X})) \in \mathbb{R}^{C \times M},$$ (16)

where $\mathbf{L} \in \mathbb{R}^{C \times M}$ is a learnable query set in which each entry is $C$-dimensional, and $\text{PosEmb} : \mathbb{R}^3 \to \mathbb{R}^C$ is a column-wise positional embedding function.

Another way is to utilize the point cloud itself. We first subsample the point cloud $\mathbf{X}$ to a smaller one with furthest point sampling, $\mathbf{X}_0 = \text{FPS}(\mathbf{X}) \in \mathbb{R}^{3 \times M}$. The cross attention is applied to $\mathbf{X}_0$ and $\mathbf{X}$,

$$\text{Enc}_{\text{points}}(\mathbf{X}) = \text{CrossAttn}(\text{PosEmb}(\mathbf{X}_0), \text{PosEmb}(\mathbf{X})),$$ (17)

which can also be seen as a "partial" self attention. See Fig. 4 for an illustration of both design choices. Intuitively, the number $M$ affects the reconstruction performance: the larger $M$ is, the better the reconstruction. However, $M$ strongly affects the training time due to the transformer architecture, so it should not be too large. In our final model, the number of latents $M$ is set to 512, and the number of channels $C$ is 512, to provide a trade-off between reconstruction quality and training time.

Fig. 4. Two ways to encode a point cloud. (a) Learnable Queries: uses a learnable query set; (b) Point Queries: uses a downsampled version of the input point embeddings (subsample and copy) as the query set.

5.2 KL regularization block
Latent diffusion [Rombach et al. 2022] proposed to use a variational autoencoder (VAE) [Kingma and Welling 2014] to compress images. We adapt this design idea for our 3D shape representation and also regularize the latents with KL-divergence. We should note that the KL regularization is optional and only necessary for the second-stage diffusion model training. If we just want a method for surface reconstruction from point clouds, we do not need the KL regularization.

We first linearly project the latents to mean and variance with two network branches, respectively,

$$\text{FC}_\mu(\mathbf{f}_i) = \left(\mu_{i,j}\right)_{j \in [1, 2, \cdots, C_0]}, \qquad \text{FC}_\sigma(\mathbf{f}_i) = \left(\log \sigma_{i,j}^2\right)_{j \in [1, 2, \cdots, C_0]}$$ (18)

where $\text{FC}_\mu : \mathbb{R}^C \to \mathbb{R}^{C_0}$ and $\text{FC}_\sigma : \mathbb{R}^C \to \mathbb{R}^{C_0}$ are two linear projection layers. We use a different size of output channels $C_0$, where $C_0 \ll C$. This compression enables us to train diffusion models on smaller latents of total size $M \cdot C_0 \ll M \cdot C$. We can write the bottleneck of the VAE formally, $\forall i \in [1, 2, \cdots, M],\ j \in [1, 2, \cdots, C_0]$,

$$z_{i,j} = \mu_{i,j} + \sigma_{i,j} \cdot \epsilon,$$ (19)

where $\epsilon \sim \mathcal{N}(0, 1)$. The KL regularization can be written as,

$$\mathcal{L}_{\text{reg}}\left(\{\mathbf{f}_i\}_{i=1}^{M}\right) = \frac{1}{M \cdot C_0} \sum_{i=1}^{M} \sum_{j=1}^{C_0} \frac{1}{2}\left(\mu_{i,j}^2 + \sigma_{i,j}^2 - \log \sigma_{i,j}^2\right).$$ (20)

In practice, we set the weight for the KL loss to 0.001 and report the performance for different values of $C_0$ in Sec. 8.1. Our recommended setting is $C_0 = 32$.

5.3 Shape decoding
To increase the expressivity of the network, we add a latent learning network between the two parts. Because our latents are a set of vectors, it is natural to use transformer networks here. Thus, the proposed network here is a series of self attention blocks,

$$\{\mathbf{f}_i\}_{i=1}^{M} \leftarrow \text{SelfAttn}^{(l)}\left(\{\mathbf{f}_i\}_{i=1}^{M}\right), \quad \text{for } l = 1, \cdots, L.$$ (21)

The superscript $(l)$ on $\text{SelfAttn}(\cdot)$ denotes the $l$-th block. The latents $\{\mathbf{f}_i\}_{i=1}^{M}$ obtained using either Eq. (16) or Eq. (17) are fed into the self attention blocks. Given a query $\mathbf{x}$, the corresponding latent is interpolated using Eq. (13), and the occupancy is obtained with a fully connected layer as shown in Eq. (14).
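The encoder, KL block, and decoder described in Secs. 5.1 to 5.3 can be sketched end to end in a few lines. The following pure-Python toy uses random, untrained weights and makes several simplifying assumptions that are not taken from the text: a plain linear map stands in for PosEmb, attention is single-head without residual connections, normalization, or feed-forward layers, only one self attention block is used, and a sigmoid squashes the final FC output into (0, 1). All function and variable names are hypothetical.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def linear(W, x):
    """Apply a bias-free linear map W (rows x cols) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_attention(A, B, Wq, Wk, Wv):
    """Eqs. (3)-(4): every a_j in A attends over all b_i in B."""
    d = len(Wq)
    keys = [linear(Wk, b) for b in B]
    vals = [linear(Wv, b) for b in B]
    out = []
    for a in A:
        q = linear(Wq, a)
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        w = [math.exp(s - top) for s in logits]
        Z = sum(w)  # normalizing factor Z(a_j, B)
        out.append([sum(wi * v[c] for wi, v in zip(w, vals)) / Z
                    for c in range(len(vals[0]))])
    return out


def farthest_point_sampling(points, m):
    """Greedy FPS: repeatedly pick the point farthest from those chosen so far."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return [points[i] for i in chosen]


def kl_bottleneck(latents, W_mu, W_logvar):
    """Eqs. (18)-(20): project to mean and log-variance, sample z, compute L_reg."""
    zs, reg, count = [], 0.0, 0
    for f in latents:
        mu, logvar = linear(W_mu, f), linear(W_logvar, f)
        zs.append([m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
                   for m, lv in zip(mu, logvar)])
        reg += sum(0.5 * (m * m + math.exp(lv) - lv) for m, lv in zip(mu, logvar))
        count += len(mu)
    return zs, reg / count


N, M, C, C0 = 64, 8, 16, 4  # toy sizes; the text uses N=2048, M=512, C=512, C0=32
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(N)]
W_pos = rand_matrix(C, 3)  # stand-in for PosEmb: R^3 -> R^C (assumption)


def embed(points):
    return [linear(W_pos, p) for p in points]


# Shape encoding with point queries, Eq. (17).
X0 = farthest_point_sampling(X, M)
latents = cross_attention(embed(X0), embed(X), *[rand_matrix(C, C) for _ in range(3)])

# KL block, followed by FC_up back to C channels (Fig. 5).
z, l_reg = kl_bottleneck(latents, rand_matrix(C0, C), rand_matrix(C0, C))
W_up = rand_matrix(C, C0)
latents = [linear(W_up, zi) for zi in z]

# One self attention block, SelfAttn(A) = CrossAttn(A, A), Eqs. (5) and (21).
latents = cross_attention(latents, latents, *[rand_matrix(C, C) for _ in range(3)])

# Decoding, Eqs. (13)-(14): embedded queries attend to the latent set, then FC.
queries = [[0.0, 0.0, 0.0], [0.5, -0.5, 0.2]]
feats = cross_attention(embed(queries), latents, *[rand_matrix(C, C) for _ in range(3)])
w_fc = [random.gauss(0.0, 0.3) for _ in range(C)]
occupancy = [1.0 / (1.0 + math.exp(-sum(w * f for w, f in zip(w_fc, ft))))
             for ft in feats]
```

Note that the decoder never touches the coordinates of the input point cloud: after encoding, the shape is carried entirely by the M latent vectors, and a query point interacts with them only through attention weights.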
Fig. 5. KL regularization. Given a set of latents $\{\mathbf{f}_i \in \mathbb{R}^C\}_{i=1}^{M}$ obtained from the shape encoding in Sec. 5.1, we employ two linear projection layers $\text{FC}_\mu$, $\text{FC}_\sigma$ to predict the mean and variance of a low-dimensional latent space, where a KL regularization commonly used in VAE training is applied to constrain the feature diversity. Then, we obtain smaller latents $\{\mathbf{z}_i \in \mathbb{R}^{C_0}\}$ of size $M \cdot C_0 \ll M \cdot C$ via reparametrization sampling. Finally, the compressed latents are mapped back to the original space by $\text{FC}_{\text{up}}$ to obtain a higher dimensionality for the shape decoding in Sec. 5.3.

[Figure panel labels: Forward Diffusion Process (Add Noise); Reverse Diffusion Process (Denoise); Condition.]

Fig. 7. Denoising network: (a) Unconditional Denoising Network; (b) Conditional Denoising Network. Our denoising network is composed of several denoising layers (a box in the figure denotes a layer). The denoising layer for unconditional generation contains two sequential self attention blocks. The denoising layer for conditional generation contains a self attention and a cross attention block. The cross attention is for injecting condition information such as categories, images or partial point clouds.
Table 3. Shape autoencoding (surface reconstruction from point clouds) on ShapeNet. We show averaged metrics on all 55 categories and individual metrics for the 7 largest categories. We compare with existing representative methods, OccNet (global latent), ConvOccNet (local latent grid), IF-Net (multiscale local latent grid), and 3DILG (irregular latent grid). For our method, we show two different designs. The column Learned Queries shows results of using Eq. (16), while the column Point Queries means we are using a subsampled point set as queries in Eq. (17). The results of Point Queries are generally better than Learned Queries. This is expected because input-dependent queries (Point Queries) are better than fixed queries (Learned Queries).

Metric      Category          OccNet   ConvOccNet   IF-Net   3DILG   Ours (Learned Queries)   Ours (Point Queries)
IoU ↑       table             0.823    0.847        0.901    0.963   0.965                    0.971
            car               0.911    0.921        0.952    0.961   0.966                    0.969
            chair             0.803    0.856        0.927    0.950   0.957                    0.964
            airplane          0.835    0.881        0.937    0.952   0.962                    0.969
            sofa              0.894    0.930        0.960    0.975   0.975                    0.982
            rifle             0.755    0.871        0.914    0.938   0.947                    0.960
            lamp              0.735    0.859        0.914    0.926   0.931                    0.956
            mean (selected)   0.822    0.881        0.929    0.952   0.957                    0.967
            mean (all)        0.825    0.888        0.934    0.953   0.955                    0.965
Chamfer ↓   table             0.041    0.036        0.029    0.026   0.026                    0.026
            car               0.082    0.083        0.067    0.066   0.062                    0.062
            chair             0.058    0.044        0.031    0.029   0.028                    0.027
            airplane          0.037    0.028        0.020    0.019   0.018                    0.017
            sofa              0.051    0.042        0.032    0.030   0.030                    0.029
            rifle             0.046    0.025        0.018    0.017   0.016                    0.014
            lamp              0.090    0.050        0.038    0.036   0.035                    0.032
            mean (selected)   0.058    0.040        0.034    0.032   0.031                    0.030
            mean (all)        0.072    0.052        0.041    0.040   0.039                    0.038
F-Score ↑   table             0.961    0.982        0.998    0.999   0.999                    0.999
            car               0.830    0.852        0.888    0.892   0.898                    0.899
            chair             0.890    0.943        0.990    0.992   0.994                    0.997
            airplane          0.948    0.982        0.994    0.993   0.994                    0.995
            sofa              0.918    0.967        0.988    0.986   0.986                    0.990
            rifle             0.922    0.987        0.998    0.997   0.998                    0.999
            lamp              0.820    0.945        0.970    0.971   0.970                    0.975
            mean (selected)   0.898    0.951        0.975    0.976   0.977                    0.979
            mean (all)        0.858    0.933        0.967    0.966   0.966                    0.970
Fig. 8. Visualization of shape autoencoding results (surface reconstruction from point clouds from ShapeNet). Columns: Input, GT, OccNet, ConvONet, IF-Net, 3DILG, Proposed (Learnable Queries), Proposed (Point Queries).

[Figure row labels: PVD, Grid-8³, 3DILG, Ours.]

$$\text{Rendering-FID} = \|\mu_g - \mu_r\| + \mathrm{Tr}\left(\Sigma_g + \Sigma_r - 2(\Sigma_g \Sigma_r)^{1/2}\right)$$ (24)
where $g$ and $r$ denote the generated and training datasets, respectively. $\mu$ and $\Sigma$ are the statistical mean and covariance matrix of the feature distribution extracted by the Inception network.

The Rendering-KID is defined as,

$$\text{Rendering-KID} = \left(\text{MMD}\left(\frac{1}{|\mathcal{R}|} \sum_{\mathbf{x} \in \mathcal{R}} \max_{\mathbf{y} \in \mathcal{G}} D(\mathbf{x}, \mathbf{y})\right)\right)^2$$ (25)

where $D(\mathbf{x}, \mathbf{y})$ is a polynomial kernel function to evaluate the similarity of two samples, and $\mathcal{G}$ and $\mathcal{R}$ are the feature distributions of the generated set and the reference set, respectively. The function $\text{MMD}(\cdot)$ is Maximum Mean Discrepancy. However, the rendering-based FID and KID are essentially designed to understand 3D shapes from 2D images. Thus, they have the inherent issue of not accurately understanding shape compositions in the 3D world. To compensate for their drawbacks, we also adapt the FID and KID to 3D shapes directly. For each generated or ground-truth shape, we sample 4096 points (with normals) from the surface mesh and then feed them into a pre-trained PointNet++ [Qi et al. 2017b] to extract a global latent vector, representing the global structure of the 3D shape. The PointNet++ is first pretrained on shape classification on ShapeNet-55. As we use point clouds, we call the FID and KID for 3D shapes Fréchet PointNet++ Distance (FPD) and Kernel PointNet++ Distance (KPD). The two metrics are defined similarly to Eq. (24) and Eq. (25), except that the features are extracted from a PointNet++ network.

7.3 Implementation
For the shape auto-encoder, we use a point cloud of size 2048 as input. At each iteration, we individually sample 1024 query points from the bounding volume ($[-1, 1]^3$) and another 1024 points from the near-surface region for occupancy prediction. The shape auto-encoder is trained on 8 A100 GPUs with a batch size of 512 for $T = 1{,}600$ epochs. The learning rate is linearly increased to $lr_{\max} = 5\mathrm{e}{-5}$ in the first $t_0 = 80$ epochs, and then gradually decreased using the cosine decay schedule $lr_{\max} \cdot 0.5^{\,1 + \cos\left(\frac{t - t_0}{T - t_0}\right)}$ until reaching the minimum value of 1e-6. The diffusion models are trained on 4 A100 GPUs with a batch size of 256 for $T = 8{,}000$ epochs. The learning rate is linearly increased to $lr_{\max} = 1\mathrm{e}{-4}$ in the first $t_0 = 800$ epochs, and then gradually decreased using the above-mentioned decay schedule until reaching 1e-6. We use the default settings for the hyperparameters of EDM [Karras et al. 2022]. During sampling, we obtain the final latent set via only 18 denoising steps.

8 RESULTS
We present our results for multiple applications: 1) shape auto-encoding, 2) unconditional generation, 3) category-conditioned generation, 4) text-conditioned generation, 5) shape completion, 6) image-conditioned generation. Finally, we perform a shape novelty analysis to validate that we are not overfitting to the dataset.

8.1 Shape Auto-Encoding
We show the quantitative results in Tab. 3 for a deterministic autoencoder without the KL block described in Sec. 5.2. In particular, we show results for the largest 7 categories as well as averaged results over the categories. The two design choices of shape encoding described in Sec. 5.1 are also investigated. Using the subsampled point cloud as queries matches or outperforms learnable queries in every category. Thus we use subsampled point clouds in our later experiments. The visualization of reconstruction results can be found in Fig. 8. We visualize some extremely difficult shapes from the datasets (test split). These shapes often contain some thin structures. However, our method still performs well.

Both our method and the competitor 3DILG use a transformer as the main backbone. However, we differ in nature. 1) For encoding, 3DILG uses KNN to aggregate local information and we use cross attention. KNN manually selects neighboring points according to spatial similarities (distances) while cross attention learns the similarities on the go. 2) 3DILG uses a set of points and one latent per point. Our representation only contains a set of latents. This simplification makes the second-stage generative model training easier. 3) For decoding, 3DILG applies spatial interpolation and we use interpolation in feature space. The cross attention we use can be seen as learnable interpolation. This gives us more flexibility.

Fig. 10. Category-conditional generation. From top to bottom, we show category (airplane, chair, table) conditioned generation results. Rows within each category: Grid-8³, 3DILG, NW, Ours.

The numerical results for the reconstruction are significant. The maximum achievable number for the metrics IoU and F1 is 1. The improvement has to be interpreted in terms of how much closer we get to 1. For example, the mean IoU over all categories in Tab. 3 rises from 0.953 (3DILG) to 0.965 (Point Queries), which closes roughly a quarter of the remaining gap to 1. The visualizations also highlight the improvement.

Ablation study of the number of latents. The number $M$ is the number of latent vectors used in the network. Intuitively, a larger $M$ leads to a better reconstruction. We show results for different $M$ in Tab. 4. Thus, in all of our experiments, $M$ is set to 512. We are limited by computation time to work with larger $M$.
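The warmup-then-decay learning-rate schedule described in Sec. 7.3 can be sketched as below. This sketch assumes the conventional half-cosine form, $lr_{\min} + (lr_{\max} - lr_{\min}) \cdot 0.5\left(1 + \cos\left(\pi \cdot \frac{t - t_0}{T - t_0}\right)\right)$, with the floor $lr_{\min} = 10^{-6}$; the factor $\pi$, the explicit floor term, and the function name are assumptions rather than the formula exactly as printed in the text.

```python
import math


def learning_rate(t, lr_max, t0, T, lr_min=1e-6):
    """Linear warmup for t < t0, then cosine decay from lr_max to lr_min at t = T.

    Assumed standard half-cosine form; not the formula as printed in Sec. 7.3.
    """
    if t < t0:
        return lr_max * t / t0
    progress = (t - t0) / (T - t0)
    return lr_min + (lr_max - lr_min) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Settings quoted in Sec. 7.3: auto-encoder and diffusion model, per epoch.
autoencoder = [learning_rate(t, 5e-5, 80, 1600) for t in range(1601)]
diffusion = [learning_rate(t, 1e-4, 800, 8000) for t in range(8001)]
```

With these settings, both schedules start at zero, peak at $lr_{\max}$ at the end of warmup, and reach the 1e-6 floor at the final epoch.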
Ablation study of the KL block. We described the KL block in Sec. 5.2 that leads to additional compression. In addition, this block changes the deterministic shape encoding into a variational autoencoder. The introduced hyperparameter is $C_0$. A smaller $C_0$ leads to a higher compression rate. The choice of $C_0$ is ablated in Tab. 5. Clearly, larger $C_0$ gives better results. The reconstruction results of $C_0 = 8, 16, 32, 64$ are very close. However, they differ significantly in the second stage, because a larger latent size could make the training of diffusion models more difficult. This result is very encouraging for our model, because it indicates that aggressively increasing the compression in the KL block does not decrease reconstruction performance too much. We can also see that compressing with the KL block by decreasing $C_0$ is much better than compressing using fewer latent vectors $M$.

8.2 Unconditional Shape Generation
Comparison with surface generation. We evaluate the task of unconditional shape generation with the proposed metrics in Tab. 6. We also compare our method with a baseline method proposed in [Zhang et al. 2022]. The method is called Grid-8³ because the latent grid size is $8^3$, which is exactly the same as in AutoSDF [Mittal et al. 2022]. The table also shows the results of different $C_0$. Our results are best when $C_0 = 32$ in all metrics. When $C_0 = 64$ the results become worse. This also aligns with our conjecture that a larger latent size makes the training more difficult.

Comparison with point cloud generation. Additionally, we compare our method with PVD [Zhou et al. 2021], which is a point cloud diffusion method. We re-train PVD using the officially released code on our preprocessed dataset and splits. We use the same evaluation protocol as before but with one major difference. Since PVD can only generate point clouds without normals, we use another pretrained PointNet++ (without normals) as the feature extractor to calculate Surface-FPD and Surface-KPD. Tab. 7 shows that we outperform PVD by a large margin. Additionally, we also show the metrics calculated on rendered images. Visualization of generated results can be found in Fig. 9.

8.3 Category-conditioned generation
We train a category-conditioned generation model using our method. We evaluate our models in Tab. 8. We should note that the competitor method NeuralWavelet [Hui et al. 2022] trains models for categories separately; thus, NeuralWavelet is not a true category-conditioned model. We also visualize some results (airplane, chair, and table) in Fig. 10. Our training is more challenging, as we train on a dataset that is an order of magnitude larger and we train for all classes jointly. While NeuralWavelet already has good results, the joint training is necessary / beneficial for many subsequent applications.

Additionally, we show evaluation metrics and more competitor methods in Tab. 9. First, we use precision and recall (P&R) [Sajjadi et al. 2018] to quantify the percentage of generated samples that are similar to the training data and the percentage of training data that can be generated, respectively. 3DILG, NeuralWavelet, and our method all achieve high precision, which means they generate shapes similar to the training data. However, our method also shows significantly better recall, which means our method can generate a higher percentage of the training data. For 3DShapeGen and AutoSDF, both precision and recall are low compared to other methods. Second, we show other metrics based on point cloud distances (CD and EMD) [Achlioptas et al. 2018]. The smaller the better for MMD and the larger the better for COV. These metrics are often used to evaluate point cloud generation.

8.4 Text-conditioned generation
The results of our text-conditioned generation model can be found in Fig. 11. Since the model is a probabilistic model, we can sample shapes given a text prompt. The results are very encouraging and they constitute the first demonstration of text-conditioned 3D shape generation using diffusion models. To the best of our knowledge, there are no published competing methods at the point of submitting this work.

Fig. 11. Text conditioned generation. For each text prompt, we generate 3 shapes. Our results (Right) are compared with AutoSDF (Left). Prompts shown: "horizontal slats on top of back", "one big hole between back and seat", "this chair has wheels", "vertical back ribs".

8.5 Probabilistic shape completion
We also extend our diffusion model for probabilistic shape completion by using a partial point cloud as conditioning input. The comparison against ShapeFormer [Yan et al. 2022] is depicted in Fig. 12. As seen, our latent set diffusion can predict more accurate completions, and we also have the ability to achieve more diverse generations.

8.6 Image-conditioned shape generation.
We also provide comparisons on the task of single-view 3D object reconstruction in Fig. 13. Compared to other deterministic methods including OccNet [Mescheder et al. 2019] and IM-Net [Chen and Zhang 2019], our latent set diffusion can not only reconstruct more accurate surface details (e.g., long rods and tiny holes in the back),
Table 4. Ablation study for different number of latents 𝑀 for shape autoencoding.

Table 5. Ablation study for different number of channels 𝐶₀ for shape (variational) autoencoding.

Table 6. Unconditional generation on full ShapeNet.
Metric                    Grid-8³   3DILG   Ours (𝐶₀ = 8)   Ours (𝐶₀ = 16)   Ours (𝐶₀ = 32)   Ours (𝐶₀ = 64)
Surface-FPD ↓             4.03      1.89    2.71            1.87             0.76             0.97
Surface-KPD (×10³) ↓      6.15      2.17    3.48            2.42             0.66             1.11
Rendering-FID ↓           32.78     24.83   28.25           27.26            17.08            24.24
Rendering-KID (×10³) ↓    14.12     10.51   14.60           19.37            6.75             11.76

Table 7. Unconditional generation on full ShapeNet.
Metric                    PVD      Ours
Surface-FPD ↓             2.33     0.63
Surface-KPD (×10³) ↓      2.65     0.53
Rendering-FID ↓           270.64   17.08
Rendering-KID (×10³) ↓    281.54   6.75
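As a rough illustration of how a Fréchet-style score such as Surface-FPD might be computed from extracted features, the sketch below assumes the usual FID construction: fit a Gaussian to each feature set and measure the Fréchet distance between the two Gaussians. The function name `frechet_distance` and the use of plain NumPy arrays in place of PointNet++ features are assumptions made for illustration; this is not the evaluation code used for Tab. 6 and Tab. 7.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets.

    Hypothetical helper: in a real evaluation the rows would be features
    from a pretrained network (e.g. PointNet++), not raw arrays.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; these are real and non-negative for
    # positive semi-definite inputs, up to numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    cov_term = float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
    return mean_term + cov_term
```

Under this construction, two identical feature sets give a distance of zero, and shifting every feature by a constant vector increases the score by the squared length of that shift, which matches the "lower is better" direction of the table.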
Table 8. Category conditioned generation. NW is short for NeuralWavelet. The dash sign “-” means the method NeuralWavelet does not release models trained on these categories.

Table 9. Category conditioned generation II. We show results for additional metrics and additional methods for category conditioned generation.

chair
Metric               3DILG   3DShapeGen   AutoSDF   NW      Ours
Precision ↑          0.87    0.56         0.42      0.89    0.86
Recall ↑             0.65    0.45         0.23      0.57    0.86
MMD-CD (×10²) ↓      1.78    2.14         7.27      2.14    1.78
MMD-EMD (×10²) ↓     9.43    10.55        19.57     11.15   9.41
COV-CD (×10²) ↑      31.95   28.01        6.31      29.19   37.48
COV-EMD (×10²) ↑     36.29   36.69        18.34     34.91   45.36

table
Metric               3DILG   3DShapeGen   AutoSDF   NW      Ours
Precision ↑          0.85    0.64         0.64      0.83    0.83
Recall ↑             0.59    0.52         0.69      0.68    0.89
MMD-CD (×10²) ↓      2.85    2.65         2.77      2.68    2.38
MMD-EMD (×10²) ↓     11.02   9.53         9.63      9.60    8.81
COV-CD (×10²) ↑      18.54   23.61        21.55     21.71   25.83
COV-EMD (×10²) ↑     27.73   43.26        29.16     30.74   43.58
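The MMD-CD and COV-CD rows of Tab. 9 follow the point cloud metrics of [Achlioptas et al. 2018]. A minimal sketch of these two quantities is given below. It assumes one common Chamfer convention (mean squared nearest-neighbour distance, summed over both directions); the exact convention, sampling density, and scaling used for the table are not specified here, and the function names are hypothetical.

```python
import numpy as np


def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds p (N, 3) and q (M, 3).

    Assumed convention: mean squared nearest-neighbour distance,
    summed over both directions.
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


def mmd_and_cov(generated, reference, dist=chamfer_distance):
    """Return (MMD, COV) for lists of generated and reference point clouds."""
    # dists[i, j] = distance between generated cloud i and reference cloud j.
    dists = np.array([[dist(g, r) for r in reference] for g in generated])
    # MMD: for each reference cloud, the distance to its closest generated
    # cloud, averaged over the reference set (lower is better).
    mmd = float(dists.min(axis=0).mean())
    # COV: fraction of reference clouds that are the nearest neighbour of
    # at least one generated cloud (higher is better).
    matched = np.unique(dists.argmin(axis=1))
    cov = len(matched) / len(reference)
    return mmd, cov
```

A generator that reproduces only a single reference shape can still reach a small per-sample distance but covers just one reference cloud, which is why COV is reported alongside MMD. Replacing `dist` with an earth mover's distance would give the EMD variants.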
but also support multi-modal prediction, which is a desired property to deal with severe occlusions.

8.7 Shape novelty analysis
We use shape retrieval to demonstrate that we are not simply overfitting to the training set. Given a generated shape, we measure the Chamfer distance between it and the training shapes and retrieve the closest one. The visualization of retrieved shapes can be found in Fig. 14. Clearly, the model can synthesize new shapes with realistic structures.

8.8 Limitations
While our method shows convincing results on a variety of tasks, our design choices also have drawbacks that we would like to discuss. For instance, we require a two-stage training strategy. While this leads to improved performance in terms of generation quality, training the first stage is more time consuming than relying on manually-designed features such as wavelets [Hui et al. 2022]. In addition, the first stage might require retraining if the shape data in consideration changes, and for the second stage, the core of our diffusion architecture, training time is also relatively high. Overall, we believe that there is significant potential for future research avenues to speed up training, in particular in the context of diffusion models.

9 CONCLUSION
We have introduced 3DShape2VecSet, a novel shape representation for neural fields that is tailored to generative diffusion models. To this end, we combine ideas from radial basis functions, previous neural field architectures, variational autoencoding, as well as cross attention and self-attention to design a learnable representation. Our shape representation can take a variety of inputs including triangle meshes and point clouds and encode 3D shapes as neural fields on top of a set of latent vectors. As a result, our method demonstrates improved performance in 3D shape encoding and 3D shape generative modeling tasks, including unconditioned generation, category-conditioned generation, text-conditioned generation, point-cloud completion, and image-conditioned generation.

In future work, we see many exciting possibilities. Most importantly, we believe that our model further advances the state of the art in point cloud and shape processing on a large variety of tasks. In particular, we would like to employ the network architecture of 3DShape2VecSet to tackle the problem of surface reconstruction from scanned point clouds. In addition, we can see many applications for content-creation tasks, for example 3D shape generation of textured models along with their material properties. Finally, we would like to explore editing and manipulation tasks using pretrained diffusion models for prompt-to-prompt shape editing, building on the recent advances in image diffusion models.

ACKNOWLEDGMENTS
We would like to acknowledge Anna Frühstück for helping with figures and the video voiceover. This work was supported by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI) as well as the ERC Starting Grant Scan2CAD (804724).

Fig. 12. Point cloud conditioned generation. We show three generated results given a partial cloud. The ground-truth point cloud and the partial point cloud used as condition are shown on the left. We compare our results (right) with ShapeFormer (middle). Columns: GT, Condition, ShapeFormer, Ours.
Fig. 13. Image conditioned generation. In the left column we show the condition image. In the middle we show results obtained by the method IM-Net and OccNet. Our generated results are shown on the right. Columns: Condition, IM-Net, OccNet, Ours.

Fig. 14. Shape generation novelty. For a generated shape, we retrieve the top-1 similar shape in the training set. The similarity is measured using Chamfer distance of sampled surface point clouds. In each pair, we show the retrieved shape (left) and the generated shape (right). The generated shapes are from our category-conditioned generation results. Columns: Ref, Gen (four pairs).

REFERENCES
Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. 2018. Learning representations and generative models for 3d point clouds. In International conference on machine learning. PMLR, 40–49.
Panos Achlioptas, Judy Fan, Robert Hawkins, Noah Goodman, and Leonidas J Guibas. 2019. ShapeGlot: Learning language for shape differentiation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8938–8947.
Alexandre Boulch and Renaud Marlet. 2022. Poco: Point convolution for surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6302–6314.
Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. 2016. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236 (2016).
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European conference on computer vision. Springer, 213–229.
Jonathan C Carr, Richard K Beatson, Jon B Cherrie, Tim J Mitchell, W Richard Fright, Bruce C McCallum, and Tim R Evans. 2001. Reconstruction and representation of 3D objects with radial basis functions. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques. 67–76.
Rohan Chabra, Jan E Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. 2020. Deep local shapes: Learning local sdf priors for detailed 3d reconstruction. In European Conference on Computer Vision. Springer, 608–625.
Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. 2022. Efficient Geometry-aware 3D Generative Adversarial Networks. In CVPR.
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015).
Zhiqin Chen and Hao Zhang. 2019. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5939–5948.
An-Chieh Cheng, Xueting Li, Sifei Liu, Min Sun, and Ming-Hsuan Yang. 2022. Autoregressive 3d shape generation via canonical mapping. arXiv preprint arXiv:2204.01955 (2022).
Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. 2020. Implicit functions in feature space for 3d shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6970–6981.
Gene Chou, Yuval Bahat, and Felix Heide. 2022. DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions. arXiv preprint arXiv:2211.13757 (2022).
Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. European Conference on Computer Vision (2016), 628–644.
Angela Dai, Christian Diller, and Matthias Nießner. 2020. Sg-nn: Sparse generative neural networks for self-supervised scene completion of rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 849–858.
Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. 2017. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5868–5877.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021).
Philipp Erler, Paul Guerrero, Stefan Ohrhallinger, Niloy J Mitra, and Michael Wimmer. 2020. Points2surf learning implicit surfaces from point clouds. In European Conference on Computer Vision. Springer, 108–124.
Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12873–12883.
Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, and Thomas Funkhouser. 2020. Local deep implicit functions for 3d shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4857–4866.
Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. 2016. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision. Springer, 484–499.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014), 2672–2680.
Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. 2021. Pct: Point cloud transformer. Computational Visual Media 7, 2 (2021), 187–199.
Christian Häne, Shubham Tulsiani, and Jitendra Malik. 2017. Hierarchical surface prediction for 3d object reconstruction. In 2017 International Conference on 3D Vision (3DV). IEEE, 412–420.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. 2022. Neural wavelet-domain diffusion for 3d shape generation. In SIGGRAPH Asia 2022 Conference Papers. 1–9.
Moritz Ibing, Isaak Lim, and Leif Kobbelt. 2021. 3d shape generation with grid-based implicit functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13559–13568.
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International conference on machine learning. PMLR, 4651–4664.
Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, Thomas Funkhouser, et al. 2020. Local implicit grid representations for 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6001–6010.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. Elucidating the Design Space of Diffusion-Based Generative Models. In Proc. NeurIPS.
Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), Yoshua Bengio and Yann LeCun (Eds.).
Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. 2006. A tutorial on energy-based learning. Predicting structured data 1, 0 (2006).
Tianyang Li, Xin Wen, Yu-Shen Liu, Hua Su, and Zhizhong Han. 2022. Learning deep implicit functions for 3D shapes with dynamic code clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12840–12850.
William E Lorensen and Harvey E Cline. 1987. Marching cubes: A high resolution 3D surface construction algorithm. ACM siggraph computer graphics 21, 4 (1987), 163–169.
Andreas Lugmayr, Martin Danelljan, Andrés Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. ArXiv abs/2201.09865 (2022).
Shitong Luo and Wei Hu. 2021. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2837–2845.
Donald JR Meagher. 1980. Octree encoding: A new technique for the representation, manipulation and display of arbitrary 3-d objects by computer. Electrical and Systems Engineering Department Rensselaer Polytechnic . . . .
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4460–4470.
Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. 2019. Deep level sets: Implicit surface representations for 3d shape inference. arXiv preprint arXiv:1901.06802 (2019).
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shubham Tulsiani. 2022. Autosdf: Shape priors for 3d completion, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 306–315.
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas Guibas. 2019. StructureNet: Hierarchical Graph Networks for 3D Shape Generation. ACM Transactions on Graphics (TOG), Siggraph Asia 2019 38, 6 (2019), Article 242.
Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. 2020. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning. PMLR, 7220–7229.
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 165–174.
Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020. Convolutional occupancy networks. In European Conference on Computer Vision. Springer, 523–540.
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022).
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 652–660.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30 (2017).
Aditya Ramesh. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents.
Danilo Rezende and Shakir Mohamed. 2015. Variational Inference with Normalizing Flows. In International Conference on Machine Learning. 1530–1538.
Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. 2017b. Octnetfusion: Learning depth fusion from data. In 2017 International Conference on 3D Vision (3DV). IEEE, 57–66.
Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. 2017a. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 3.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. Assessing generative models via precision and recall. Advances in neural information processing systems 31 (2018).
Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. 2022. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6229–6238.
J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 2022. 3D Neural Field Generation using Triplane Diffusion. arXiv preprint arXiv:2211.16677 (2022).
Yongbin Sun, Yue Wang, Ziwei Liu, Joshua Siegel, and Sanjay Sarma. 2020. Pointgrow: Autoregressively learned point cloud generation with self-attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 61–70.
Jiapeng Tang, Jiabao Lei, Dan Xu, Feiying Ma, Kui Jia, and Lei Zhang. 2021. Sa-convonet: Sign-agnostic optimization of convolutional occupancy networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6504–6513.
Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. 2017. Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs. In 2017 IEEE International Conference on Computer Vision (ICCV). 2107–2115.
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in neural information processing systems 30 (2017).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36, 4 (2017), 72.
Peng-Shuai Wang, Chun-Yu Sun, Yang Liu, and Xin Tong. 2018. Adaptive O-CNN: a patch-based deep representation of 3D shapes. In SIGGRAPH Asia 2018 Technical Papers. ACM, 217.
Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38, 5 (2019), 1–12.
Francis Williams, Zan Gojcic, Sameh Khamis, Denis Zorin, Joan Bruna, Sanja Fidler, and Or Litany. 2022. Neural fields as learnable kernels for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18500–18510.
Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems. 82–90.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1912–1920.
Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. 2016. A theory of generative convnet. In International Conference on Machine Learning. PMLR, 2635–2644.
Jianwen Xie, Yifei Xu, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu. 2021. Generative pointnet: Deep energy-based learning on unordered point sets for 3d generation, reconstruction and classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14976–14985.
Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu. 2020. Generative VoxelNet: learning energy-based models for 3D shape synthesis and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
Xingguang Yan, Liqiang Lin, Niloy J Mitra, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. 2022. Shapeformer: Transformer-based shape completion via sparse representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6239–6249.
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. 2019. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4541–4550.
Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. 2022. LION: Latent Point Diffusion Models for 3D Shape Generation. arXiv preprint arXiv:2210.06978 (2022).
Biao Zhang, Matthias Nießner, and Peter Wonka. 2022. 3DILG: Irregular Latent Grids for 3D Generative Modeling. In Advances in Neural Information Processing Systems. https://openreview.net/forum?id=RO0wSr3R7y-
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. 2021. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16259–16268.
Xin-Yang Zheng, Yang Liu, Peng-Shuai Wang, and Xin Tong. 2022. SDF-StyleGAN: Implicit SDF-Based StyleGAN for 3D Shape Generation. In Comput. Graph. Forum (SGP).
Linqi Zhou, Yilun Du, and Jiajun Wu. 2021. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5826–5835.