Learning A Structured Latent Space For Unsupervised Point Cloud Completion

Yingjie Cai1, Kwan-Yee Lin1*, Chao Zhang2, Qiang Wang2, Xiaogang Wang1, Hongsheng Li1*
1 CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong
2 Samsung Research Institute China - Beijing (SRC-B)
initial code, unsuitable learning rate or too many iterations, etc. The inversion process is much more time-consuming than direct methods (∼3500×). Another representative unsupervised work [41] exploits cycle supervisions to enhance consistency indirectly by learning bidirectional transformations between the latent spaces of complete and incomplete shapes (point clouds). However, the bidirectional transformations need to be separately modeled and are difficult to learn, especially for the complete-to-partial mapping. If one direction is not learned well, the other direction is influenced correspondingly. In summary, without direct and accurate paired supervision, designing proper supervision and applying it to unsupervised point cloud completion is of great importance to this task.

To this end, we propose to create a unified and structured latent space for encoding both partial and complete shapes. To apply strong supervisions for unsupervised point cloud completion, we make the assumption that each partial shape is created by occluding a complete one. If a complete shape is occluded to become partial in 3D space, its code in the latent space should also be "occluded" from a complete shape code accordingly. We model the "occlusion" of a complete code in the latent space as weighting each of its dimensions with a weight in [0, 1]. However, instead of manually determining the occlusion weights, we learn them from the training data. In this way, the complete and partial shapes are strongly bound in a unified latent space. In addition, to better regularize the relation between partial point clouds from the same complete shape, the occlusion code of a more occluded shape is required to have smaller weights than that of a less occluded shape.

Specifically, to learn the unified latent space, we represent any partial or complete point cloud by two codes: a complete shape code and an occlusion code. The complete shape code can be fed into a completion decoder to reconstruct the corresponding complete point cloud. The "occluded" shape code, obtained by multiplying the above two codes, can be fed into a partial decoder to reconstruct the partial shape. Furthermore, we create a series of related partial point clouds by gradually removing more points from a partial shape, and apply ranking constraints to their occlusion codes via the N-pair loss [36] according to their relative occlusion degrees. Their complete shape codes are required to be equal since they represent the same object. By adopting such properly designed strong supervisions, more accurate complete point clouds with better geometric consistency and shape details can be reconstructed.

We experiment on popular point cloud completion benchmarks, including synthetic datasets derived from ShapeNet [5] and real datasets (KITTI [13], ScanNet [9] and Matterport3D [3]). The proposed method outperforms state-of-the-art unsupervised methods [6, 41, 44, 52]. Our main contributions are summarized as follows:

• We propose to learn a unified and structured latent space for unsupervised point cloud completion, which encodes both partial and complete point clouds to improve partial-complete geometry consistency and leads to better shape completion accuracy.

• We propose to constrain the complete and occlusion codes of a series of related partial point clouds to enhance the learning of the structured latent space.

• Experimental results demonstrate the superiority of the proposed method over state-of-the-art unsupervised point cloud completion methods on both synthetic and real datasets.

2. Related Work

Point Cloud Completion. Point cloud completion plays an important role in many downstream applications such as robotics [12, 31] and perception [1, 2, 10, 20], and has seen significant development since the pioneering work PCN [51] was proposed. Most existing approaches, e.g., [6, 11, 22, 25, 29, 37, 39, 42, 48, 50, 53], are trained in a fully-supervised manner. Although supervised point cloud completion methods have achieved impressive results, they are difficult to generalize to real-world scans, since paired data is difficult to collect for actual scans and the data distributions might not match well. Pcl2pcl [6] first proposed to complete partial shapes in an unsupervised manner without the need for paired data: it trains two separate auto-encoders, for reconstructing complete and partial shapes respectively, and learns a latent code transformation from the latent space of partial shapes to that of complete ones. Its subsequent work [44] outputs multiple plausible complete shapes from a partial input. Based on Pcl2pcl, Cycle4Completion [41] exploits an extra complete-to-partial latent space transformation in addition to the partial-to-complete direction to capture the bidirectional geometric correspondence between incomplete and complete shapes. Another unsupervised work, ShapeInversion [52], applies GAN inversion, which utilizes the shape prior learned by a pre-trained generator to complete the partial shape in an unsupervised manner. However, the inverse optimization process is time-consuming compared with forward-based methods, and the results easily get stuck at local minima, which greatly limits the practical application of inversion-based methods. Different from existing methods, we propose to learn a unified latent space supervised by tailored structured latent constraints to reconstruct better complete shapes.

Structured Ranking Losses. Deep metric learning plays an important role in various applications of computer vision, such as image retrieval [30, 36], clustering [18], and transfer learning [33]. The loss function is one of the essential components in successful deep metric learning frameworks, and a large variety of loss functions have been proposed.
Contrastive loss [7, 16] captures the relationship between pairwise data points, i.e., their similarity or dissimilarity. Triplet-based losses are widely studied [8, 35, 38]; composed of an anchor point, a positive data point, and a negative point, they aim to pull the anchor closer to the positive point than to the negative point by a fixed margin δ. Inspired by this, recent ranking-motivated approaches [24, 30, 32, 33, 35, 36] take richer structured information across multiple data points into consideration and achieve impressive performance. Different from the triplet loss, which considers one negative point, the N-pair loss [36] aims to identify one positive example from multiple negative examples.

Figure 2. Overview. (a) A series of related partial point clouds are encoded into multiple complete shape and occlusion code pairs. Their element-wise multiplications are the representations in the unified latent space. (b) Reconstructing the partial input P̂ and predicting the completed point cloud Ĉ simultaneously with a shape latent code discriminator and a complete point cloud discriminator. (c) The real complete shape codes and point clouds are provided by the complete point cloud auto-encoder. Best viewed in color.

3. Method

The goal of our work is to reconstruct the complete point cloud from an input partial point cloud with only unpaired data. Designing proper and strong supervisions is of great importance for tackling this challenging problem. We propose to learn a unified latent space for encoding both complete and partial point clouds (shapes). We first introduce the unified latent space in Section 3.1, which encodes complete and partial point clouds in a joint space. Then, structured latent supervisions over a series of related partial point clouds are adopted to further regularize the learning of the space (Section 3.2). The overall architecture is depicted in Figure 2.

3.1. A Unified Latent Space for Point Cloud Encoding

We introduce the unified latent space to establish the relations between partial and complete point clouds in an unpaired manner. Partial point clouds can be considered as being created by occluding complete shapes. A partial point cloud represents the same object as its corresponding complete point cloud, and the difference between the complete and partial point clouds is just their occlusion degrees, as shown in Figure 2 (a). Therefore, we embed the incomplete and the complete point clouds into a unified latent space equipped with different occlusion degrees.

Specifically, as illustrated in Figure 2 (b), we map any partial point cloud P into a complete shape code z ∈ R^d and a corresponding occlusion code o ∈ R^d via a point cloud encoder E_p [46] consisting of EdgeConv [40] layers. Each entry o_i, i ∈ [1, . . . , d], of the occlusion code is bounded in [0, 1] by a sigmoid function, and the occlusion code has the same length as the complete shape code. Occlusion of the complete shape in the latent space is modeled as softly "gating" each dimension of the complete shape code: a smaller occlusion value denotes more occlusion of the complete shape. The embedding of the partial shape in the unified latent space can then be obtained by the element-wise multiplication of the two codes.
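As a concrete illustration of this encoding, a minimal PyTorch-style sketch is given below. The EdgeConv backbone of E_p is abstracted as a generic feature extractor, and every name here (PartialEncoder, shape_head, occlusion_head) is a placeholder of ours, not the authors' released code.

```python
import torch
import torch.nn as nn

class PartialEncoder(nn.Module):
    """Sketch of E_p: maps a partial cloud to a complete shape code z
    and an occlusion code o in [0, 1]^d."""

    def __init__(self, feat_dim=256, d=96):
        super().__init__()
        # Stand-in for the EdgeConv [40] feature extractor.
        self.backbone = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.shape_head = nn.Linear(feat_dim, d)      # -> z
        self.occlusion_head = nn.Linear(feat_dim, d)  # -> o (pre-sigmoid)

    def forward(self, points):                           # points: (B, N, 3)
        feat = self.backbone(points).max(dim=1).values   # global max-pool
        z = self.shape_head(feat)                        # complete shape code
        o = torch.sigmoid(self.occlusion_head(feat))     # occlusion code
        return z, o, z * o   # z * o: "occluded" code in the unified space
```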
The complete and partial codes are then fed into two separate decoders, D_c and D_p, to generate the completed point cloud Ĉ and to reconstruct the input partial point cloud P̂, respectively. The two decoders adopt the same architecture, a multi-layer perceptron (MLP) following [41]. Both Ĉ and P̂ are supervised by a point-wise Chamfer Distance (CD) loss with respect to the partial input. The point-wise reconstruction loss is expressed as

\mathcal{L}_{rec} = \mathcal{L}_{CD}(P, \hat{P}) + \mathcal{L}_{CD}(P, \mathrm{Deg}(\hat{C})). \quad (1)

For (P, Ĉ), the bi-directional Chamfer Distance cannot be utilized directly, yet the Unidirectional Chamfer Distance (UCD) alone cannot provide enough supervision for the inference of the missing parts. We therefore degrade (i.e., Deg) Ĉ into a partial point cloud following the degradation module of [52], where only the top-k nearest points with respect to the partial point cloud are kept.
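A short sketch of how this degradation step can be realized follows; it is our reading of the top-k nearest-point rule, not the released implementation of [52].

```python
import torch

def degrade(complete, partial, k):
    """Keep only the k points of `complete` that are closest to `partial`.
    complete: (B, M, 3), partial: (B, N, 3) -> (B, k, 3)."""
    dist = torch.cdist(complete, partial)   # (B, M, N) pairwise distances
    nearest = dist.min(dim=2).values        # distance to nearest partial point
    idx = nearest.topk(k, dim=1, largest=False).indices  # k closest predictions
    return torch.gather(complete, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
```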
5535
Authorized licensed use limited to: Tianjin University of Technology. Downloaded on April 26,2023 at 02:35:29 UTC from IEEE Xplore. Restrictions apply.
In order to further encourage the predicted complete point clouds to represent reasonable shapes, a point cloud discriminator is adopted. We formulate the point cloud discriminator with the WGAN-GP [15] loss as

\mathcal{L}_{D}^{p} = \mathbb{E}_{\hat{C}}\, D(\hat{C}) - \mathbb{E}_{Y}\, D(Y) + \lambda_{gp} \mathcal{T}_{D}, \quad (2)

where λ_gp is a pre-defined weight factor and T_D is the gradient penalty term, denoted as

\mathcal{T}_{D} = \mathbb{E}_{\hat{C}}\left[\left(\left\|\nabla_{\hat{C}} D(\hat{C})\right\|_{2} - 1\right)^{2}\right]. \quad (3)

The adversarial training loss for the encoder E_p and decoder D_c is

\mathcal{L}_{G}^{p} = -\mathbb{E}_{\hat{C}}\, D(\hat{C}). \quad (4)

Note that, during inference, only the encoder E_p and the decoder D_c are needed.
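For concreteness, the adversarial terms of Eqs. (2)-(4) can be sketched in PyTorch as follows. D stands for an arbitrary point cloud critic, and the penalty is evaluated at the generated clouds exactly as Eq. (3) is written; classic WGAN-GP [15] instead penalizes at real-fake interpolations.

```python
import torch

def gradient_penalty(D, c_hat):
    """T_D of Eq. (3): unit-gradient penalty at the generated clouds."""
    c_hat = c_hat.detach().requires_grad_(True)
    grad = torch.autograd.grad(D(c_hat).sum(), c_hat, create_graph=True)[0]
    return ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def adversarial_losses(D, c_hat, y, lambda_gp=1.0):
    """Eqs. (2) and (4): critic loss L_D^p and generator loss L_G^p.
    c_hat: generated complete clouds; y: unpaired real complete clouds."""
    loss_d = D(c_hat.detach()).mean() - D(y).mean() \
             + lambda_gp * gradient_penalty(D, c_hat)   # Eq. (2)
    loss_g = -D(c_hat).mean()                           # Eq. (4)
    return loss_d, loss_g
```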
3.2. Structural Regularization of the Unified Space

To further regularize the learning of the structured latent space, we create a series of related partial point clouds and propose several properly designed latent code supervisions, including a structured ranking regularization, a latent code swapping constraint, and a latent code distribution supervision. Specifically, given a partial input P, we can create a series of related partial point clouds (see {P, P′, P′′} in Figure 2 (a)) by gradually removing more points. For P′ and P′′, there are K and 2K points removed from the initial partial shape P, respectively. Therefore, for a triplet of such related partial point clouds S = {P, P′, P′′}, the occlusion degrees gradually increase.
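One simple way to build such a triplet is sketched below. The paper does not state how the removed points are selected; dropping the K points nearest to a random seed point, to mimic additional occlusion, is our own assumption.

```python
import torch

def drop_region(points, k):
    """Remove the k points nearest to a random seed point (assumed
    region-style removal). points: (N, 3) -> (N - k, 3)."""
    seed = points[torch.randint(points.size(0), (1,))]   # (1, 3)
    dist = (points - seed).norm(dim=1)                   # (N,)
    keep = dist.topk(points.size(0) - k, largest=True).indices
    return points[keep]

def make_triplet(p, k=500):
    """Build S = {P, P', P''} with K and 2K points removed from P."""
    p1 = drop_region(p, k)    # P': K fewer points
    p2 = drop_region(p1, k)   # P'': 2K fewer points, nested inside P'
    return p, p1, p2
```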
Structured Ranking Regularization. Since the related partial point clouds represent the same object, their complete shape latent codes z, z′, z′′ are required to be equal, and we adopt a Smooth L1 loss to constrain them:

L_{z} = L_{1}(\mathbf{z}, \mathbf{z}') + L_{1}(\mathbf{z}, \mathbf{z}''). \quad (5)

Furthermore, since the occlusion codes represent increasing degrees of occlusion, their weights shall become smaller as the occlusion degrees increase. Such relations between the occlusion codes can be expressed as

o''_i \leq o'_i \leq o_i \leq 1 \quad \text{for} \quad i = 1, \cdots, d, \quad (6)

where o''_i, o'_i, o_i are the i-th entries of the occlusion codes of P′′, P′, P, respectively. To implement such a constraint, we adopt the N-pair loss [36] to constrain the proposed relative relations in Eq. (6):

L(\mathbf{a}, \mathbf{p}, \{\mathbf{n}_j\}_{j=1}^{N}) = \log\Big(1 + \sum_{j=1}^{N} \exp\big(\mathbf{a}^\top \mathbf{n}_j - \mathbf{a}^\top \mathbf{p}\big)\Big). \quad (7)

For the N-pair loss, there are one anchor sample a ∈ R^d, one positive sample p ∈ R^d and N negative samples n_j ∈ R^d, as shown in Eq. (7). By minimizing the loss function, the positive sample is pulled closer to the anchor, and the negative samples are pushed farther away from the anchor. Here, we apply the N-pair loss with different occlusion codes as anchors, which can be written as the following sets, where 1 ∈ R^d is an all-one vector:

\{\mathbf{a} = \mathbf{1}, \; \mathbf{p} = \mathbf{o}, \; \{\mathbf{n}_j\}_{j=1}^{N} = \{\mathbf{o}', \mathbf{o}''\}\},
\{\mathbf{a} = \mathbf{o}, \; \mathbf{p} = \mathbf{o}', \; \{\mathbf{n}_j\}_{j=1}^{N} = \{\mathbf{o}''\}\},
\{\mathbf{a} = \mathbf{o}'', \; \mathbf{p} = \mathbf{o}', \; \{\mathbf{n}_j\}_{j=1}^{N} = \{\mathbf{o}\}\}. \quad (8)

The proposed relative ranking relations of Eq. (6) between occlusion codes are thus constrained by applying the N-pair loss to each of the sets in Eq. (8). Through such strong ranking regularization, the unified latent space is trained to be more structured.
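A minimal sketch of Eq. (7) and its application to the three anchor sets of Eq. (8) follows; o, o1 and o2 stand for the occlusion codes of P, P′ and P′′.

```python
import torch

def n_pair_loss(a, p, negs):
    """Eq. (7): log(1 + sum_j exp(a^T n_j - a^T p)). a, p: (d,) codes."""
    scores = torch.stack([a @ n - a @ p for n in negs])
    return torch.log1p(scores.exp().sum())

def ranking_loss(o, o1, o2):
    """Apply Eq. (7) to the three sets of Eq. (8), which enforces the
    ranking o'' <= o' <= o <= 1 of Eq. (6) in a soft manner."""
    one = torch.ones_like(o)                 # the all-one vector 1
    return (n_pair_loss(one, o, [o1, o2])
            + n_pair_loss(o, o1, [o2])
            + n_pair_loss(o2, o1, [o]))
```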
Figure 3. Illustration of the latent code swapping. To better decouple the complete shape and occlusion codes, we swap the complete shape codes between related partial point clouds and apply a point-wise reconstruction loss to supervise the reconstructed partial point clouds. Best viewed in color.

Latent Code Swapping. To further regularize the complete shape codes and occlusion codes, we employ a latent code swapping constraint. Specifically, as illustrated in Figure 3, we swap the complete shape and occlusion codes extracted from a partial point cloud P and a more occluded version P′ to reconstruct the corresponding partial point clouds. Based on our assumption on the unified space, z and z′ represent the same complete object, and the partial degrees are decided by the occlusion codes. Therefore, no matter whether o′ is combined with z or z′, the same partial point cloud should be reconstructed. Accordingly, we also feed the fused code from z and o′ to the decoder D_p and apply the point-wise reconstruction loss L_rec as a constraint; the fused code from z′ and o is processed in the same way. Through such a latent code swapping constraint, the disentanglement of the complete shape codes and the occlusion codes is greatly improved, which leads to better shape completion.
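In code, the swapping constraint amounts to decoding the cross-combined codes and reusing the reconstruction loss; the sketch below uses our own naming, with decoder_p and chamfer standing for the partial decoder D_p and the CD loss.

```python
def swap_loss(decoder_p, z, o, z1, o1, p, p1, chamfer):
    """Latent code swapping: z and z1 encode the same object, so a
    partial cloud is determined by its occlusion code alone."""
    p1_from_z = decoder_p(z * o1)    # (z, o') must reproduce P'
    p_from_z1 = decoder_p(z1 * o)    # (z', o) must reproduce P
    return chamfer(p1, p1_from_z) + chamfer(p, p_from_z1)
```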
Latent Code Distribution. In order to further constrain the realism of the complete shape codes, a shape latent code discriminator is applied to directly supervise whether the complete shape codes learned from partial point clouds match well with the real complete shape codes extracted from a complete shape auto-encoder. As illustrated in Figure 2 (c), the real shape latent code z_c ∈ R^d can be obtained from a complete point cloud auto-encoder following [41]. The input Y of the auto-encoder is a shape randomly sampled from a complete point cloud set that is not paired with P. The objective functions for updating the latent code discriminator, L_D^c, and the latent code generator, L_G^c, are similar to Eqs. (2) and (4), respectively.

In summary, through the constraints on the unified latent space, the complete shape and occlusion codes can be well learned to enhance the relation between the complete and partial point clouds.

Overall Loss. The overall training objective for the two discriminators is

\mathcal{L}_{D} = \mathcal{L}_{D}^{p} + \mathcal{L}_{D}^{c}. \quad (9)

The overall training loss for the encoders and decoders, including E_p, D_p, E_c and D_c, is the weighted sum of the point-wise reconstruction loss, the structured latent supervisions and the adversarial losses:

\mathcal{L} = \gamma \mathcal{L}_{rec} + \beta \mathcal{L}_{z} + \mathcal{L}_{npair} + \mathcal{L}_{G}^{p} + \mathcal{L}_{G}^{c}, \quad (10)

where γ and β are pre-defined weight factors.
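In one training step, the generator-side terms are combined as in Eq. (10); a sketch with the paper's weights γ=100 and β=10 follows, where the individual loss values are assumed to be computed as sketched above, and Eq. (5) is included for completeness.

```python
import torch.nn.functional as F

def shape_code_loss(z, z1, z2):
    """Eq. (5): Smooth L1 between the shape codes of {P, P', P''}."""
    return F.smooth_l1_loss(z, z1) + F.smooth_l1_loss(z, z2)

def total_generator_loss(l_rec, l_z, l_npair, l_g_p, l_g_c,
                         gamma=100.0, beta=10.0):
    """Eq. (10): weighted sum of reconstruction, shape-code equality,
    N-pair ranking and the two adversarial generator terms."""
    return gamma * l_rec + beta * l_z + l_npair + l_g_p + l_g_c
```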
4. Experiments

We evaluate the proposed method through extensive experiments. Besides shape completion on virtual scan benchmarks, we also demonstrate its effectiveness compared with other methods on widely used real-world partial scans.

Datasets. For a comprehensive comparison, we conduct experiments on both synthetic and real-world partial shapes following state-of-the-art unsupervised point cloud completion methods [6, 41, 44, 52]. We evaluate our method on three synthetic datasets, CRN [39], 3D-EPN [11] and PartNet [29], which are all derived from ShapeNet [5]. For real-world scans, we evaluate on objects extracted from three datasets covering indoor and outdoor scenes: KITTI (cars) [14], ScanNet (chairs and tables) [9], and MatterPort3D (chairs and tables) [4].

Evaluation Metrics. For datasets equipped with ground truth, we evaluate the shape completion performance using CD and F1-score following previous unsupervised point cloud completion methods [6, 41, 52], where the F1-score is the harmonic average of the accuracy and the completeness. The Chamfer Distance is defined as

\mathcal{L}_{CD}(\mathbf{x}_{out}, \mathbf{x}_{in}) = \frac{1}{|\mathbf{x}_{out}|} \sum_{p \in \mathbf{x}_{out}} \min_{q \in \mathbf{x}_{in}} \|p - q\|_{2}^{2} + \frac{1}{|\mathbf{x}_{in}|} \sum_{q \in \mathbf{x}_{in}} \min_{p \in \mathbf{x}_{out}} \|p - q\|_{2}^{2}, \quad (11)

where x_out and x_in are two point clouds. The smaller the distance value, the more accurate the reconstructed point cloud. For the synthetic dataset PartNet utilized by [44], we follow that method and use the Minimum Matching Distance (MMD) metric to evaluate the accuracy of the completion: the MMD measures the quality of the completed shapes, and we calculate it between the set of completed shapes and the set of test shapes. For real-world scans where no ground truth is provided, we follow [44, 47] to evaluate the generated shapes in terms of UCD and MMD, respectively. The UCD evaluates consistency and computes the Chamfer distance from the partial input to the predicted complete point cloud.
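Eq. (11) translates directly into a few lines of PyTorch. The dense O(MN) sketch below is ours (evaluation code typically relies on an optimized CUDA kernel); the UCD variant used for the real scans is included as well.

```python
import torch

def chamfer_distance(x_out, x_in):
    """Eq. (11): symmetric Chamfer Distance with squared distances.
    x_out: (M, 3), x_in: (N, 3)."""
    d = torch.cdist(x_out, x_in) ** 2   # (M, N) pairwise squared dists
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def ucd(partial, completed):
    """Unidirectional CD: from the partial input to the prediction."""
    d = torch.cdist(partial, completed) ** 2
    return d.min(dim=1).values.mean()
```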
(11) provements (from 14.6/84.2 to 12.1/86.4 for chair and from
22.5/82.7 to 19.8/85.5 for table). There is a 0.9 gap between
our method and [41] on the car category. Through evalua-
Table 1. Shape completion performance on the CRN benchmark. The numbers shown are [CD↓/F1↑], where CD is scaled by 10^4.

Methods | Plane | Cabinet | Car | Chair | Lamp | Sofa | Table | Boat | Average
Pcl2pcl [6] | 9.7/89.1 | 27.1/68.4 | 15.8/80.8 | 26.9/70.4 | 25.7/70.4 | 34.1/58.4 | 23.6/79.0 | 15.7/77.8 | 22.4/74.2
Cycle. [41] | 5.2/94.0 | 14.7/82.1 | 12.4/82.1 | 18.0/77.5 | 17.3/77.4 | 21.0/75.2 | 18.9/81.2 | 11.5/84.8 | 14.9/81.8
Inversion. [52] | 5.6/94.3 | 16.1/77.2 | 13.0/85.8 | 15.4/81.2 | 18.0/81.7 | 24.6/78.4 | 16.2/85.5 | 10.1/87.0 | 14.9/83.9
Ours | 3.9/95.9 | 13.5/83.3 | 8.7/90.4 | 13.9/82.3 | 15.8/81.0 | 14.8/81.6 | 17.1/82.6 | 10.0/87.6 | 12.2/85.6
Table 2. Shape completion performance on the 3D-EPN benchmark. The numbers shown are [CD↓/F1↑], where CD is scaled by 10^4.

Methods | Plane | Cabinet | Car | Chair | Lamp | Sofa | Table | Boat | Average
Pcl2pcl [6] | 4.0/– | 19.0/– | 10.0/– | 20.0/– | 23.0/– | 26.0/– | 26.0/– | 11.0/– | 17.4/–
Cycle. [41] | 3.7/96.4 | 12.6/87.1 | 8.1/91.8 | 14.6/84.2 | 18.2/80.6 | 26.2/71.7 | 22.5/82.7 | 8.7/89.8 | 14.3/85.5
Inversion. [52] | 4.3/96.2 | 20.7/79.4 | 11.9/86.0 | 20.6/81.1 | 25.9/78.4 | 54.8/74.7 | 38.0/80.2 | 12.8/85.2 | 23.6/82.7
Ours | 3.5/96.8 | 12.2/86.4 | 9.0/88.4 | 12.1/86.4 | 17.6/81.6 | 26.0/75.5 | 19.8/85.5 | 8.6/89.8 | 13.6/86.3
Through evaluation on the two popular synthetic datasets across eight categories, our method outperforms existing methods consistently, which proves the superiority of the proposed framework that learns a unified latent space with effective and efficient structured regularization.

Comparison on PartNet dataset. We also conduct experiments on the PartNet dataset utilized by MPC [44]. The PartNet benchmark is generated by removing semantic parts from the ShapeNet dataset. We follow [44] and adopt the Minimum Matching Distance on three categories to evaluate the quality of the completed shapes. As shown in Table 3, our method consistently outperforms existing state-of-the-art unsupervised methods on all three categories.

Table 3. Shape completion performance on the PartNet benchmark. We evaluate the results with MMD↓, which is scaled by 10^2.

Methods | Chair | Lamp | Table | Average
Pcl2pcl [6] | 1.90 | 2.50 | 1.90 | 2.10
MPC [44] | 1.52 | 1.97 | 1.46 | 1.65
Cycle. [41] | 1.71 | 3.46 | 1.56 | 2.24
Inversion. [52] | 1.68 | 2.54 | 1.74 | 1.98
Ours | 1.43 | 1.95 | 1.37 | 1.58

Qualitative Results. Figure 4 illustrates qualitative results on the same samples from the ShapeNet dataset. Despite the efforts of previous approaches, they usually fail to deal with severe occlusion cases and cannot maintain shape details. As shown by the couch and car in the first two rows of Figure 4, in the case of severe occlusion, the completed point clouds of previous methods do not represent the target objects (see the large missing regions which are not recovered correctly). In contrast, our method can accurately recover the complete point cloud of the target object, even when only fairly limited information is available under severe occlusion. Moreover, our method reconstructs more accurate complete point clouds with better fine-grained shape details. As shown by the red dotted boxes, our method generates more accurate complete shapes at the corners of the lamps, the tails of the planes, and the legs of chairs and tables. We attribute these results to the learned unified latent space and the properly applied structured latent supervisions, which lead to more reasonable predicted complete point clouds and better consistency between partial and complete point clouds.

4.2. Completion Results on Real-World Scans

We investigate the generalization of the proposed method on various real-world datasets covering both outdoor and indoor scenes, where the objects tend to be more incomplete and noisier. The car, chair and table models trained on the CRN dataset are directly used to predict complete point clouds on the KITTI, ScanNet and MatterPort3D datasets without any further fine-tuning. As shown in Table 4 (Cycle. [41] vs. Ours), our method significantly outperforms Cycle4Completion [41] across multiple categories on all three real-scan datasets. For the comparison with ShapeInversion [52], as the inversion process directly minimizes the UCD loss between partial-complete pairs, it is unfair to compare our method, which does not involve GAN inversion, with ShapeInversion [52]. However, our method is also compatible with ShapeInversion [52]: when integrating GAN inversion on top of our method, it surpasses ShapeInversion on all the real-world scans, as shown in Table 4 (Inversion. [52] vs. Ours + Inversion), which demonstrates that our method can enhance consistency between the predicted complete point cloud and the partial input. On the other hand, the results of our method in Table 5 also surpass those of other unsupervised methods consistently.

In addition, as shown in Tables 4 and 5, we also compare the generalization of our models with that of state-of-the-art fully-supervised methods [48, 50] on the real scans. Their authors' officially released models are used here. Our unsupervised model can outperform them on multiple categories of real scans, which demonstrates that the proposed unsupervised method has better generalization ability on real-world scans than the supervised methods, which are specifically trained to fit only their original synthetic data. Figure 5 shows completion results of our method on real data, which indicates that even under severe occlusions (such as KITTI's cars), our method can still generate reasonable complete shapes.
Figure 4. Point cloud completion results on the ShapeNet dataset. From left to right: partial input, results of Cycle4Completion [41], ShapeInversion [52], ours, and ground truth. Our results achieve more accurate completion under severe occlusion and recover better fine-grained shape details compared with state-of-the-art methods. Best viewed in color with zoom-in.
Table 4. Shape completion performance on the real scans. The results are evaluated by UCD↓, where UCD is scaled by 10^4. sup.: supervised methods.

Methods | sup. | ScanNet Chair | ScanNet Table | MatterPort3D Chair | MatterPort3D Table | KITTI Car
GRNet [48] | yes | 1.6 | 1.6 | 1.6 | 1.5 | 2.2
PoinTr [50] | yes | 1.7 | 1.5 | 1.8 | 1.3 | 1.9
Pcl2pcl [6] | no | 17.3 | 9.1 | 15.9 | 6.0 | 9.2
Cycle. [41] | no | 9.4 | 4.3 | 4.9 | 4.9 | 9.4
Inversion. [52] | no | 3.2 | 3.3 | 3.6 | 3.1 | 2.9
Ours | no | 3.2 | 2.7 | 3.3 | 2.7 | 4.2
Ours + Inversion | no | 1.1 | 0.87 | 1.1 | 0.87 | 0.76

Table 5. Shape completion performance on the real scans. We evaluate the results with MMD↓, where MMD is scaled by 10^2. sup.: supervised methods.

Methods | sup. | ScanNet Chair | ScanNet Table | MatterPort3D Chair | MatterPort3D Table | KITTI Car
GRNet [48] | yes | 6.070 | 6.302 | 6.147 | 6.911 | 2.845
PoinTr [50] | yes | 6.001 | 6.089 | 6.248 | 6.648 | 2.790
Cycle. [41] | no | 6.278 | 5.727 | 6.022 | 6.535 | 3.033
Inversion. [52] | no | 6.370 | 6.222 | 6.360 | 7.110 | 2.850
Ours | no | 5.893 | 5.541 | 5.770 | 6.076 | 2.742

4.3. Ablation Study

To verify the effectiveness of each component of the proposed method, we conduct a series of experiments on four representative categories of the ShapeNet CRN dataset.

Effect of Unified Latent Space for Point Cloud Encoding. To evaluate the benefits of introducing the unified latent code space for unsupervised point cloud completion, we compare two alternative strategies that do not encode occlusion codes as soft weighting vectors: instead of fusing the codes via element-wise multiplication, we test fusing the shape and occlusion codes via concatenation or element-wise addition. Table 6 shows the quantitative results of the compared schemes. As the baseline (denoted as "Uni. Space" in Table 6), we employ our simplified model, which multiplies the shape and occlusion codes and has the point and code discriminators and the code swapping constraint under our unified space design, but does not use the ranking constraints. The average CD rises from 18.3 to 19.1 and 18.6 when fusing the two codes via concatenation or addition, respectively, which proves that the learned unified latent space is conducive to unsupervised point cloud completion, and also creates a foundation for integrating our stronger ranking supervision.
Effect of Structured Ranking Supervision. To evaluate
Table 6. Comparison of different schemes for fusing complete shape and occlusion codes. CD↓ scaled by 10^4 is reported here.

Method | Chair | Lamp | Sofa | Table | Avg.
Uni. Space | 16.2 | 20.3 | 16.1 | 20.5 | 18.3
Concatenation | 17.2 | 20.7 | 17.5 | 20.8 | 19.1
Addition | 17.2 | 20.1 | 16.3 | 20.8 | 18.6

Table 8. Effect of discriminators and latent code swapping. CD↓ scaled by 10^4 is reported here.

Method | Chair | Lamp | Sofa | Table | Avg.
Full Model | 13.9 | 15.8 | 14.8 | 17.1 | 15.4
w/o pointD | 14.7 | 23.6 | 20.0 | 19.4 | 19.4
w/o codeD | 18.5 | 21.7 | 22.0 | 20.8 | 20.8
w/o code swap | 14.7 | 20.0 | 17.0 | 19.2 | 17.7
References

[1] Yingjie Cai, Xuesong Chen, Chao Zhang, Kwan-Yee Lin, Xiaogang Wang, and Hongsheng Li. Semantic scene completion via integrating instances and scene in-the-loop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 324–333, 2021.
[2] Yingjie Cai, Buyu Li, Zeyu Jiao, Hongsheng Li, Xingyu Zeng, and Xiaogang Wang. Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10478–10485, 2020.
[3] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
[4] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] Xuelin Chen, Baoquan Chen, and Niloy J Mitra. Unpaired point cloud completion on real scans using adversarial training. arXiv preprint arXiv:1904.00069, 2019.
[7] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 539–546. IEEE, 2005.
[8] Yin Cui, Feng Zhou, Yuanqing Lin, and Serge Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1153–1162, 2016.
[9] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
[10] Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Jürgen Sturm, and Matthias Nießner. Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2018.
[11] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5868–5877, 2017.
[12] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
[13] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[14] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
[15] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
[16] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742. IEEE, 2006.
[17] Zhizhong Han, Xiyang Wang, Chi-Man Vong, Yu-Shen Liu, Matthias Zwicker, and CL Chen. 3dviewgraph: Learning global features for 3d shapes from a graph of unordered views with attention. arXiv preprint arXiv:1905.07503, 2019.
[18] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2016.
[19] Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d semantic instance segmentation of rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4421–4430, 2019.
[20] Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d semantic instance segmentation of rgb-d scans. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2019.
[21] Tao Hu, Zhizhong Han, and Matthias Zwicker. 3d shape completion with multi-view consistent inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10997–11004, 2020.
[22] Zitian Huang, Yikuan Yu, Jiawen Xu, Feng Ni, and Xinyi Le. Pf-net: Point fractal network for 3d point cloud completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7662–7670, 2020.
[23] Marcin Kopaczka, Justus Schock, and Dorit Merhof. Super-realtime facial landmark detection and shape fitting by deep regression of shape model parameters. arXiv preprint, 2019.
[24] Marc T Law, Raquel Urtasun, and Richard S Zemel. Deep spectral clustering learning. In International Conference on Machine Learning, pages 1985–1994. PMLR, 2017.
[25] Minghua Liu, Lu Sheng, Sheng Yang, Jing Shao, and Shi-Min Hu. Morphing and sampling network for dense point cloud completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11596–11603, 2020.
[26] Xinhai Liu, Zhizhong Han, Fangzhou Hong, Yu-Shen Liu, and Matthias Zwicker. Lrc-net: Learning discriminative features on point clouds by encoding local region contexts. Computer Aided Geometric Design, 79:101859, 2020.
[27] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Fine-grained 3d shape classification with hierarchical part-view attention. IEEE Transactions on Image Processing, 30:1744–1758, 2021.
[28] Xinhai Liu, Zhizhong Han, Xin Wen, Yu-Shen Liu, and Matthias Zwicker. L2g auto-encoder: Understanding point clouds by local-to-global reconstruction with hierarchical self-attention. In Proceedings of the 27th ACM International Conference on Multimedia, pages 989–997, 2019.
[29] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019.
[30] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pages 360–368, 2017.
[31] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
[32] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5382–5390, 2017.
[33] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
[34] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[35] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[36] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
[37] Lyne P Tchapmi, Vineet Kosaraju, Hamid Rezatofighi, Ian Reid, and Silvio Savarese. Topnet: Structural point cloud decoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 383–392, 2019.
[38] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
[39] Xiaogang Wang, Marcelo H Ang Jr, and Gim Hee Lee. Cascaded refinement network for point cloud completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 790–799, 2020.
[40] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
[41] Xin Wen, Zhizhong Han, Yan-Pei Cao, Pengfei Wan, Wen Zheng, and Yu-Shen Liu. Cycle4completion: Unpaired point cloud completion using cycle transformation with missing region coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13080–13089, 2021.
[42] Xin Wen, Tianyang Li, Zhizhong Han, and Yu-Shen Liu. Point cloud completion by skip-attention network with hierarchical folding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1939–1948, 2020.
[43] Xin Wen, Peng Xiang, Zhizhong Han, Yan-Pei Cao, Pengfei Wan, Wen Zheng, and Yu-Shen Liu. Pmp-net: Point cloud completion by learning multi-step point moving paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[44] Rundi Wu, Xuelin Chen, Yixin Zhuang, and Baoquan Chen. Multimodal shape completion via conditional generative adversarial networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV, pages 281–296. Springer, 2020.
[45] Peng Xiang, Xin Wen, Yu-Shen Liu, Yan-Pei Cao, Pengfei Wan, Wen Zheng, and Zhizhong Han. SnowflakeNet: Point cloud completion by snowflake point deconvolution with skip-transformer. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
[46] Chulin Xie, Chuxin Wang, Bo Zhang, Hao Yang, Dong Chen, and Fang Wen. Style-based point generator with adversarial rendering for point cloud completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4619–4628, June 2021.
[47] Chulin Xie, Chuxin Wang, Bo Zhang, Hao Yang, Dong Chen, and Fang Wen. Style-based point generator with adversarial rendering for point cloud completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4619–4628, 2021.
[48] Haozhe Xie, Hongxun Yao, Shangchen Zhou, Jiageng Mao, Shengping Zhang, and Wenxiu Sun. Grnet: Gridding residual network for dense point cloud completion. In European Conference on Computer Vision, pages 365–381. Springer, 2020.
[49] Kangxue Yin, Hui Huang, Daniel Cohen-Or, and Hao Zhang. P2p-net: Bidirectional point displacement net for shape transform. ACM Transactions on Graphics (TOG), 37(4):1–13, 2018.
[50] Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, and Jie Zhou. Pointr: Diverse point cloud completion with geometry-aware transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12498–12507, 2021.
[51] Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert. Pcn: Point completion network. In 3DV, pages 728–737. IEEE, 2018.
[52] Junzhe Zhang, Xinyi Chen, Zhongang Cai, Liang Pan, Haiyu Zhao, Shuai Yi, Chai Kiat Yeo, Bo Dai, and Chen Change Loy. Unsupervised 3d shape completion through gan inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1768–1777, 2021.
[53] Wenxiao Zhang, Qingan Yan, and Chunxia Xiao. Detail preserved point cloud completion via separated feature aggregation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, pages 512–528. Springer, 2020.