Figure 2: Pointwise convolutional neural network. The input point cloud is fed into each convolution operator, and all outputs are concatenated before being fed to a final convolution layer for dense semantic segmentation, or to fully connected layers for object recognition. In this figure, we assume a point cloud with n points and c attributes (colors, normals, coordinates, etc.). We use 9 output channels for each convolution operator before concatenation. Source code is available at our homepage [13].
dataset from Silberman et al. [33], there has been a spark in the direction of RGBD semantic segmentation. The work from Long et al. [19] showed how to adapt a conventional classification network for the semantic segmentation problem. Since then, different techniques have been proposed to further improve the segmentation results. Some notable examples are SegNet [2], which employs an encoder-decoder architecture, or the dilation filter [43].

In the 3D domain, interactive semantic segmentation [40, 39] relied on user strokes to propagate segmentation. McCormac et al. [21] explored transferring semantic segmentation from 2D predictions to the 3D domain. An advantage of such methods is that they can produce high-resolution segmentation. However, none of the predictions can be performed directly in the 3D domain.

SSCNet [36] applied a convolutional neural network to a 3D volume representation to classify each voxel in the scene. This could be flexible as real-time scene reconstruction techniques such as KinectFusion [23] and voxel hashing [24] are often based on volumes. PointNet [26] can also be used for semantic segmentation with minor modifications from their object recognition network.

Recently, Qi et al. [29] proposed to build a graph neural network for semantic segmentation on a point cloud, where each graph node is a group of points and graph edges are constructed by nearest neighbor search on the point cloud. Their results are shown with RGB-D images, where color features from a pre-trained VGG-16 network [34, 5] are used to initialize the prediction. Here, we demonstrate a fully convolutional neural network for 3D point cloud segmentation. Compared to the method by Qi et al. [29], we train our network from scratch. The input point cloud is also more general, such as CAD models or 3D meshes reconstructed from RGB-D sensors.

3. Pointwise Convolution

Before presenting pointwise convolution, we briefly revise a few possibilities to represent 3D data for neural networks. The most straightforward approach is perhaps to employ a volumetric representation. For example, VoxNet [20] represents each object by a volume of up to 64 × 64 × 64 resolution. This is natural because almost all existing networks are designed for grids and volumes. PointNet [26] implements point feature learning by fully connected layers.

The previous limitations motivate us to design fully convolutional networks for point clouds. The basic building block of our architecture is a convolution operator applied at each point in a point cloud, which we term the pointwise convolution. This operator works as follows.

Convolution. A convolution kernel is centered at each point of a point cloud. Neighbor points within the kernel support can contribute to the center point. Each kernel has a size or radius value, which can be adjusted to account for different numbers of neighbor points in each convolution layer. Figure 1 shows a diagram that demonstrates this idea. Formally, pointwise convolution can be written as

x_i^\ell = \sum_k \frac{1}{|\Omega_i(k)|} \sum_{p_j \in \Omega_i(k)} w_k \, x_j^{\ell-1},    (1)

where k iterates over all sub-domains in the kernel support; \Omega_i(k) is the k-th sub-domain of the kernel centered at point i; p_i is the coordinate of point i; |\cdot| counts all points within the sub-domain; w_k is the kernel weight at the k-th sub-domain; x_i and x_j are the values at points i and j; and \ell - 1 and \ell are the indices of the input and output layers.

Gradient backpropagation. To make pointwise convolution trainable, it is necessary to compute the gradients with respect to the input data and the kernel weights. Let L be the loss function. The gradient with respect to the input can be defined as

\frac{\partial L}{\partial x_j^{\ell-1}} = \sum_{i \in \Omega_j} \frac{\partial L}{\partial x_i^\ell} \, \frac{\partial x_i^\ell}{\partial x_j^{\ell-1}},    (2)

where we iterate over all neighbor points i of a given point j. In the chain rule, \partial L / \partial x_i^\ell is the gradient up to layer \ell, which is known during back propagation. The derivative \partial x_i^\ell / \partial x_j^{\ell-1} can be written as

\frac{\partial x_i^\ell}{\partial x_j^{\ell-1}} = \sum_k \frac{1}{|\Omega_i(k)|} \sum_{p_j \in \Omega_i(k)} w_k \cdot 1.    (3)
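To make Equation (1) concrete, below is a minimal NumPy sketch of the forward pass for a single uniform 3 × 3 × 3 grid kernel. The function name, the cell_size parameter, the per-cell weight matrices, and the brute-force neighbor search are illustrative assumptions on our part; the actual implementation uses a grid acceleration structure and runs in TensorFlow.

```python
import numpy as np

def pointwise_conv_forward(points, feats, weights, cell_size=0.1):
    """Sketch of Eq. (1) with a uniform 3x3x3 grid kernel.

    points  : (N, 3) point coordinates.
    feats   : (N, C_in) input features x^{l-1}.
    weights : (27, C_in, C_out) one weight matrix per kernel cell w_k
              (the paper uses one scalar weight per cell and channel pair;
              a matrix per cell is an equivalent multi-channel layout).
    cell_size : edge length of one kernel cell (assumed parameter).
    """
    n, _ = feats.shape
    c_out = weights.shape[2]
    out = np.zeros((n, c_out), dtype=feats.dtype)
    half = 1.5 * cell_size  # the kernel spans 3 cells per axis around the center
    for i in range(n):
        # Brute-force neighbor search; the paper accelerates this with a grid.
        offset = points - points[i]                      # (N, 3)
        inside = np.all(np.abs(offset) < half, axis=1)   # points in the kernel support
        # Map each neighbor to one of the 27 sub-domains Omega_i(k).
        cell = np.floor(offset[inside] / cell_size + 1.5).astype(int)  # 0..2 per axis
        k = cell[:, 0] * 9 + cell[:, 1] * 3 + cell[:, 2]                # flat cell index
        neigh_feats = feats[inside]
        for kk in np.unique(k):
            members = neigh_feats[k == kk]               # x_j^{l-1} for p_j in Omega_i(k)
            # Average within the cell, then apply the cell's weights w_k.
            out[i] += members.mean(axis=0) @ weights[kk]
    return out
```

Empty sub-domains simply contribute nothing, which matches the per-cell normalization by |\Omega_i(k)| in Equation (1).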
Note that the above formula does not assume a specific shape for the convolution kernel. Here we simply use a uniform grid kernel. In conjunction with an acceleration structure for neighbor query, e.g., a grid, the convolution operator can be efficiently implemented on both CPU and GPU. In this paper, we use convolution kernels of size 3 × 3 × 3. All points within each kernel cell have the same weights.

Unlike convolution in volumes, in our design, we do not use pooling. There are some advantages of doing so. First, it is no longer required to deal with point cloud downsampling and upsampling, which is not straightforward when the point attributes become high dimensional as the point cloud is processed in the network. Second, by keeping the point cloud unchanged in the entire network, acceleration structures for neighbor query only need to be built once. This significantly speeds up computation and simplifies network design.

Point order. A notable difference between our design and PointNet [26] is how points are ordered before being fed to the network. In PointNet, the point cloud is orderless, and the training process of PointNet learns a symmetric function to make an ordered point cloud order invariant. However, we argue that this might not be necessary. In our method, we input points sorted in a specific order, e.g., by XYZ coordinates or along a Morton curve [22], to the network and can still achieve competitive performance in the object recognition task. In this task, the order of the points only affects the final global feature vector used to predict the object category. In semantic segmentation, in principle we can leverage local features at each point, and hence point order is not necessary.
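As a concrete illustration of the Morton ordering mentioned above, the following sketch quantizes coordinates and interleaves their bits to obtain Z-order codes. The number of quantization bits and the normalization are our own choices, not details given in the paper.

```python
import numpy as np

def morton_sort(points, bits=10):
    """Sort points along a Morton (Z-order) curve; an illustrative sketch.

    points : (N, 3) coordinates.
    bits   : quantization levels per axis (assumed parameter).
    """
    # Normalize coordinates to integers in [0, 2^bits - 1].
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scale = (2 ** bits - 1) / np.maximum(maxs - mins, 1e-9)
    q = ((points - mins) * scale).astype(np.uint64)

    codes = np.zeros(len(points), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            # Interleave bit b of each axis into position 3*b + axis.
            codes |= ((q[:, axis] >> np.uint64(b)) & np.uint64(1)) << np.uint64(3 * b + axis)
    order = np.argsort(codes)
    return points[order], order
```

Because nearby points get similar Morton codes, sorting by these codes also improves memory coherence during neighbor queries, which is the benefit noted above.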
À-trous convolution. The original pointwise convolution can be easily extended to à-trous convolution by including a stride parameter that determines the gaps between kernel cells. The benefit of pointwise à-trous convolution is that it is possible to extend the kernel size, and hence the receptive field, without actually processing too many points in the convolution. This yields a significant speed-up without sacrificing accuracy, as will be demonstrated in our experiments.
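To illustrate how a stride (dilation) parameter could introduce gaps between kernel cells, here is one possible cell-assignment rule; the paper does not give an exact formulation, so the tap spacing below is an assumption on our part.

```python
import numpy as np

def atrous_cell_index(offset, cell_size=0.1, dilation=2):
    """Map a neighbor offset (3,) to one of the 27 kernel cells of an
    a-trous pointwise convolution, or return None if it falls in a gap.
    Cell taps per axis sit at {-dilation, 0, +dilation} * cell_size.
    """
    taps = np.array([-dilation, 0, dilation]) * cell_size
    idx = np.empty(3, dtype=int)
    for axis in range(3):
        d = np.abs(offset[axis] - taps)      # distance to the three taps
        t = int(np.argmin(d))
        if d[t] >= cell_size / 2:            # the offset lies in a gap: ignore it
            return None
        idx[axis] = t
    return idx[0] * 9 + idx[1] * 3 + idx[2]  # flat index k in [0, 26]
```

With dilation = 1 this reduces to the ordinary 3 × 3 × 3 kernel; with dilation = 2 the support spans five cells per axis while only neighbors near the three sampled taps per axis are processed, which is where the speed-up comes from.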
Point attributes. For easy housekeeping in the implementation of our convolution operator, we separately store point coordinates and other point attributes such as colors, normals, or other high-dimensional features output from preceding convolutional layers. Point coordinates can be passed to any layer regardless of the layer depth so that they can be used for neighbor queries to determine which points can participate in the convolution at a particular point. Point attributes can then be retrieved accordingly.

Relevance to geometric deep learning. Our pointwise convolution is relevant to geodesic convolution in geometric deep learning [3], which is more robust for tasks such as non-rigid shape correspondences and retrieval. To compute a geodesic convolution at a particular point, only neighbor points on its local surface manifold are considered. This is achieved by definition because the filter support in geodesic convolution is directly defined on the surface manifold. By contrast, our pointwise convolution operates adaptively in the 3D Euclidean space, and does not require any surface definition to operate.

4. Evaluations

Semantic segmentation. We evaluate our pointwise convolutional neural network with semantic scene segmentation and object recognition. For scene segmentation, we first experiment with the S3DIS dataset [1], which has 13 categories of indoor scene objects. Each point has 9 attributes: XYZ coordinates, RGB color, and normalized coordinates w.r.t. the room it belongs to. To perform segmentation of a scene, each square-meter block of the scene (measured on the floor), sampled to 4096 points, is fed into the network. The predictions of all blocks are then assembled to obtain the prediction of the entire scene.

We report the per-point accuracy of the semantic segmentation. As shown in Table 1, our network is able to produce comparable accuracy to PointNet [26], with an accuracy of 81.5%. Table 2 reports per-class accuracy. Figure 3 shows visualizations of predictions and ground truths of the scenes in the evaluation dataset.

Network         Accuracy (per class)   Accuracy
PointNet [26]   -                      87.0
Ours            56.5                   81.5

Table 1: Comparison of scene segmentation on the S3DIS dataset [1].

To further test semantic segmentation with more categories and more complex indoor scenes, we annotate 76 scenes from the SceneNN dataset [13] with 40 categories defined by the NYU v2 dataset [33]. Scenes in this dataset appear to be more cluttered, which poses great challenges to semantic segmentation. We use 56 scenes for training and 20 scenes for evaluation. In each scene, a 2 m × 2 m window with a stride of 0.2 meters and a height of 2 meters is used to scan the floor area, resulting in approximately 30K scene blocks for training and 15K blocks for testing. Each block is sampled to 4096 points.

For the SceneNN dataset, we additionally compare with VoxNet [20], a voxel-based representation technique, and SemanticFusion [21], a multi-view 2D-3D semantic segmentation with RGB-D images. For VoxNet [20], we apply their network to predict labels of scene blocks as described above and gather all outputs into a final scene prediction. For SemanticFusion [21], we perform 2D semantic segmentation on the RGB-D images independently and then integrate all 2D predictions into a 3D point cloud to generate the final segmentation.
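For reference, a minimal sketch of the block scanning described above (a 2 m × 2 m window swept with a 0.2 m stride over the floor plane, each block resampled to 4096 points). Random resampling with replacement for sparse blocks and the function name are our own assumptions, not the authors' exact preprocessing.

```python
import numpy as np

def extract_blocks(points, feats, block=2.0, stride=0.2, num_sample=4096):
    """Slice a scene into overlapping blocks on the floor (XY) plane and
    sample a fixed number of points per block. Returns the sampled points,
    features, and the original point indices for later re-assembly.
    """
    rng = np.random.default_rng(0)
    x_min, y_min = points[:, 0].min(), points[:, 1].min()
    x_max, y_max = points[:, 0].max(), points[:, 1].max()
    blocks = []
    for x0 in np.arange(x_min, x_max, stride):
        for y0 in np.arange(y_min, y_max, stride):
            mask = ((points[:, 0] >= x0) & (points[:, 0] < x0 + block) &
                    (points[:, 1] >= y0) & (points[:, 1] < y0 + block))
            idx = np.flatnonzero(mask)
            if len(idx) == 0:
                continue
            # Sample with replacement only when the block is too sparse.
            choice = rng.choice(idx, num_sample, replace=len(idx) < num_sample)
            blocks.append((points[choice], feats[choice], choice))
    return blocks
```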
Network         ceiling  floor  wall   column
PointNet [26]   98.3     98.8   83.3   63.4
Ours            97.4     99.1   89.1   56.2

                door     table  chair  sofa   clutter
PointNet [26]   84.6     70.3   66.0   56.7   69.0
Ours            62.9     73.7   68.4   54.6   65.2

Table 2: Per-class accuracy of semantic segmentation on the S3DIS dataset [1].

The visualizations of the predictions and ground truth are shown in Figure 4. It can be seen that structures like wall and floor have very good accuracy, and small objects are moderately well segmented. A notable issue is noise due to prediction inconsistency in the overlap regions of the blocks. This could be addressed by a conditional random field and would be interesting future work.

Table 3 reports the accuracy of a few common categories. While structures and chairs are quite accurate, table and desk are often ambiguous, resulting in lower accuracy for both classes. In general, the performance of VoxNet [20] is inferior to ours and SemanticFusion [21] due to limited resolution (we used a 64³ volume). Our method works competitively to SemanticFusion, but note that our method does not apply any label smoothing while SemanticFusion has a conditional random field to remove noise after propagating predictions from 2D to 3D.

Figure 3: Semantic segmentation on the S3DIS dataset [1]. (a) Our predictions. (b) Ground truth.

Figure 4: Semantic segmentation on the SceneNN dataset [13]. (a) Our predictions. (b) Ground truth.
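The paper states that block predictions are assembled into a scene prediction and notes the inconsistency in block overlap regions. One simple, hypothetical assembly rule is to average the logits of every block that covers a point, as sketched below; the paper does not specify the exact rule, so this is an assumption.

```python
import numpy as np

def assemble_scene_predictions(num_points, block_logits, block_indices, num_classes):
    """Merge per-block predictions back into one label per scene point by
    averaging logits over all blocks that contain the point.

    block_logits  : list of (num_sample, num_classes) arrays, one per block.
    block_indices : list of (num_sample,) original point indices, one per block.
    """
    acc = np.zeros((num_points, num_classes))
    cnt = np.zeros(num_points)
    for logits, idx in zip(block_logits, block_indices):
        np.add.at(acc, idx, logits)   # accumulate logits for each scene point
        np.add.at(cnt, idx, 1.0)
    cnt = np.maximum(cnt, 1.0)        # points never covered keep zero logits
    return np.argmax(acc / cnt[:, None], axis=1)
```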
Network   wall   floor  chair  table  desk
Ours      93.8   88.6   58.6   23.5   29.5

Table 3: Per-class accuracy of semantic segmentation on the SceneNN dataset [13].

(Figure: training and testing accuracy curves; vertical axis: Accuracy; legend: Train, Test.)
point set and report the results in Table 7(a). We found that point orders sorted by space filling curve techniques such as the Morton curve [22] yield comparable accuracy, which means that it is sufficient to just follow an order, but not a particular one. However, a benefit is that space filling curves organize points such that nearby points in space are stored close to each other in memory, allowing more memory coherence.

Neighborhood radius. So far we have been setting the radius for neighbor query as a constant in each convolution layer. In our experience, this works well for both tasks. We also explore the capability of an adaptive radius using k-nearest neighbors. The modification of the convolution operator is as follows. At each point, a k-nearest neighbor query is performed, and the query radius is set to the distance to the furthest neighbor. This radius is used each time neighbor points have to be queried for convolution. To compute gradients for backpropagation for this operator, it is worth noting that in this case, neighbor lookup is no longer symmetric. Therefore, at a point j, it is required to look up all points i such that point i can contribute to point j in the forward convolution.

We compare the performance of the k-nearest neighbor and the fixed-radius convolution for the object recognition task. The result is shown in Table 7(b). In general, we found no significant difference in terms of accuracy.

(a)
Order    Accuracy
ZYX      86.1
Morton   86.0

(b)
Neighbor query       Accuracy
Fixed-size radius    86.1
K-nearest neighbor   85.7

Table 7: (a) Object recognition with different ways of ordering the input point cloud. (b) Object recognition with convolution using neighbor queries with adaptive radius.
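A small sketch of the adaptive radius described above: each point's query radius is set to the distance to its k-th nearest neighbor. The value of k and the brute-force distance computation are illustrative assumptions; in practice one would reuse the grid acceleration structure instead of an O(N²) distance matrix.

```python
import numpy as np

def adaptive_radii(points, k=16):
    """Per-point kernel radius equal to the distance of the k-th nearest
    neighbor (excluding the point itself)."""
    # Pairwise distances; fine for a sketch, not for large point clouds.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Column 0 after sorting is the zero distance to the point itself,
    # so the k-th nearest neighbor sits at column k.
    radii = np.sort(dist, axis=1)[:, k]
    return radii
```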
Deeper networks. Finally, we study the capability of learning with deeper networks using pointwise convolution. From the basic model, we increase the number of layers from 4 to 8 and 16, and then retrain from scratch. The performance is reported in Table 8 below. Generally, it takes longer to train networks with 8 and 16 layers, resulting in slightly lower accuracy. Experimenting with residual learning [11] for training would be interesting future work.

Network    Accuracy
4 layers   86.1
8 layers   82.1
16 layers  82.6

Table 8: Deep pointwise convolutional neural network. We compare object recognition performance with 4-, 8-, and 16-layer architectures.

Running time. A key challenge when implementing pointwise convolution is how to perform fast nearest neighbor queries without impacting the network training and prediction time too much. To make training feasible, we choose to use a grid for neighbor query because it is a lightweight and GPU-friendly data structure to build and query on the fly. In fact, we experimented with a kd-tree, but found that on modern CPUs and GPUs, a kd-tree query does not outperform a grid unless the number of points is more than 16K, not to mention the extra time needed for tree construction, which has O(n log n) complexity.

Our pointwise convolution is currently implemented in TensorFlow. We report the running time, including grid build and query each time convolution is invoked, as follows. For a batch size of 128 point clouds, each with 2048 points, a forward convolution of our network takes 1.272 seconds on an Intel Core i7 6900K with 16 threads, and a backward propagation takes 2.423 seconds to compute the gradients. Our GPU implementation on an NVIDIA TITAN X can further improve the running time by about 10%. Compared to PointNet [26] and VoxNet [20], which leverage TensorFlow's optimized convolution operators, our pointwise convolution is not yet engineering-optimized. Our training time is about 2× slower, which we currently compensate for by using multiple CPUs and GPUs.

5. Conclusion

In this paper, we proposed pointwise convolution and leveraged it to build convolutional neural networks for scene understanding with point cloud data. We demonstrated two scene understanding applications including scene segmentation and object recognition. We showed that it is practical to simply sort input point clouds in a specific order before feature learning for object classification. Our pointwise convolution can offer competitive accuracy while being simple to implement, allowing us to create effective and simple neural networks for learning local features of point clouds.

There are several research avenues to be further explored. For example, finding a robust solution to handle large-scale point clouds for scene understanding would be interesting future work. Currently, we circumvent the large-scale issue in semantic segmentation by simply dividing the scene into blocks and resampling each block to a fixed number of points for prediction. In addition, it would be of great interest to extend pointwise convolutional neural networks to geometric point cloud processing [44], or to explore the connection of pointwise convolution to tensor voting [41], which was used in the literature to detect structures in a local point neighborhood.
Network airplane bathtub bed bench bookshelf bottle bowl car chair cone
PointNet [26] 100 80.0 94.0 75.0 93.0 94.0 100.0 97.9 96.0 100.0
Ours 100 82.0 93.0 68.4 91.8 93.9 95.0 95.6 96.0 80.0
cup curtain desk door dresser flower pot glass box guitar keyboard lamp
PointNet [26] 70.0 90.0 79.0 95.0 65.1 30.0 94.0 100.0 100.0 90.0
Ours 60.0 80.0 76.7 75.0 67.4 10.0 80.8 98.0 100.0 83.3
laptop mantel monitor night stand person piano plant radio range hood sink
PointNet [26] 100.0 96.0 95.0 82.6 85.0 88.8 73.0 70.0 91.0 80.0
Ours 95.0 93.9 92.9 70.2 89.5 84.5 78.8 65.0 88.9 65.0
sofa stairs stool table tent toilet tv stand vase wardrobe xbox
PointNet [26] 96.0 85.0 90.0 88.0 95.0 99.0 87.0 78.8 60.0 70.0
Ours 96.0 80.0 83.3 90.9 90.0 94.9 84.5 81.3 30.0 75.0
Table 9: Per-class accuracy of object recognition on the ModelNet40 dataset. Average: PointNet: 86.3. Ours: 81.4.
Network chair display desk book storage box table bin bag keyboard
PointNet [26] 84.2 85.4 56.7 30.1 62.5 23.8 80.0 75.0 47.4 82.4
Ours 83.1 85.4 70.0 57.7 45.8 23.8 60.0 65.0 36.8 88.2
sofa bookshelf pillow machine pc case light oven cup printer bed
PointNet [26] 76.5 23.1 84.6 18.2 36.4 77.8 60.0 37.5 50.0 28.6
Ours 88.2 38.5 76.9 18.2 54.5 88.9 30.0 75.0 12.5 42.9
Table 10: Per-class accuracy of object recognition on the ObjectNN dataset. Average: PointNet: 56.0. Ours: 57.1.
A. Layer Visualization
Intuitively, pointwise convolution works by summarizing local spatial point distributions to build feature vectors for each point in a point cloud. As shown in the per-class accuracy tables, local features work most effectively in classifying structures such as ceilings, floors, or walls and common furniture such as tables and chairs. In our observation, it is quite challenging to differentiate between tables (for dining) and desks (for study and work).

We visualize the filters of the first four layers of the object recognition network in Figure 6. Here we display each 3 × 3 × 3 filter on a row in the visualization. The number of rows is equal to the product of the total number of input and output channels of each filter (27 for the first layer, and 81 for the subsequent layers). In the visualization, blue and red represent positive and negative values, respectively. White represents zero. This shows that the filters in the network are relatively sparse and smooth. We also observed that positive and negative values dominate the filters interchangeably in each layer.

Figure 6: Visualization of the filters in the pointwise convolution network for object recognition. (a) Layer 1. (b) Layer 2. (c) Layer 3. (d) Layer 4.

Acknowledgement. We thank Quang-Hieu Pham for helping with the 2D-to-3D semantic segmentation experiment and proofreading the paper, and Quang-Trung Truong and Benjamin Kang Yue Sheng for their kind support for the neural network training experiments.

Binh-Son Hua and Sai-Kit Yeung are supported by the SUTD Digital Manufacturing and Design Centre, which is supported by the Singapore National Research Foundation (NRF). Sai-Kit Yeung is also supported by Singapore MOE Academic Research Fund MOE2016-T2-2-154, the Heritage Research Grant of the National Heritage Board, Singapore, and the Singapore NRF under its IDM Futures Funding Initiative and Virtual Singapore Award No. NRF2015VSGAA3DCM001-014.
References

[1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, 2016.

[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.

[3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. arXiv:1611.08097, 2016.

[4] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3d model retrieval. In Computer Graphics Forum, volume 22, pages 223–232. Wiley Online Library, 2003.

[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016.

[6] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014.

[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[10] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[12] B. K. P. Horn. Extended gaussian images. Proceedings of the IEEE, 72(12):1671–1686, 1984.

[13] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung. Scenenn: A scene meshes dataset with annotations. In International Conference on 3D Vision (3DV), 2016. http://www.scenenn.net.

[14] B.-S. Hua, Q.-T. Truong, M.-K. Tran, Q.-H. Pham, A. Kanezaki, T. Lee, H. Chiang, W. Hsu, B. Li, Y. Lu, et al. Shrec17: Rgb-d to cad retrieval with objectnn dataset.

[15] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Symposium on Geometry Processing, volume 6, pages 156–164, 2003.

[16] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV, 2017.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[18] Y. Li, R. Bu, M. Sun, and B. Chen. Pointcnn. arXiv:1801.07791, 2018.

[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[20] D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, 2015.

[21] J. McCormac, A. Handa, A. Davison, and S. Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. arXiv:1609.05130, 2016.

[22] G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, New York, 1966.

[23] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In The IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011.

[24] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 2013.

[25] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM Transactions on Graphics (TOG), 21(4):807–832, 2002.

[26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.

[27] C. R. Qi, H. Su, M. Niessner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, 2016.

[28] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv:1706.02413, 2017.

[29] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3d graph neural networks for RGBD semantic segmentation. In ICCV, 2017.

[30] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In CVPR, 2017.

[31] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.

[32] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano: Deep panoramic representation for 3-d shape recognition. Signal Processing Letters, IEEE, 22(12):2339–2343, 2015.

[33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.

[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[35] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR, 2015.

[36] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[37] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015.

[38] J. Sun, M. Ovsjanikov, and L. Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer Graphics Forum, volume 28, pages 1383–1392. Wiley Online Library, 2009.

[39] D. Thanh Nguyen, B.-S. Hua, L.-F. Yu, and S.-K. Yeung. A robust 3d-2d interactive tool for scene segmentation and annotation. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2017.

[40] J. Valentin, V. Vineet, M.-M. Cheng, D. Kim, J. Shotton, P. Kohli, M. Nießner, A. Criminisi, S. Izadi, and P. Torr. Semanticpaint: Interactive 3d labeling and learning at your fingertips. ACM Transactions on Graphics, 2015.

[41] T.-P. Wu, S. K. Yeung, J. Jia, C.-K. Tang, and G. G. Medioni. A closed-form solution to tensor voting: Theory and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[42] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.

[43] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.

[44] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng. Pu-net: Point cloud upsampling network. In CVPR, 2018.