
Pointwise Convolutional Neural Networks

Binh-Son Hua, Minh-Khoi Tran, Sai-Kit Yeung

The University of Tokyo · Singapore University of Technology and Design

Abstract

Deep learning with 3D data such as reconstructed point clouds and CAD models has received great research interest recently. However, the capability of using point clouds with convolutional neural networks has so far not been fully explored. In this paper, we present a convolutional neural network for semantic segmentation and object recognition with 3D point clouds. At the core of our network is pointwise convolution, a new convolution operator that can be applied at each point of a point cloud. Our fully convolutional network design, while being surprisingly simple to implement, can yield competitive accuracy in both the semantic segmentation and object recognition tasks.

Figure 1: Pointwise convolution. We define a new convolution operator for point cloud input. For each point, nearest neighbors are queried on the fly and binned into kernel cells before convolving with kernel weights. By stacking pointwise convolution operators together, we can build fully convolutional neural networks for scene segmentation and object recognition on point clouds.

1. Introduction

Deep learning with 3D data has received great research interest recently, which has led to noticeable advances in typical applications including scene understanding, shape completion, and shape matching. Among these, scene understanding is considered one of the most important tasks for robots and drones as it can assist exploratory scene navigation. Tasks such as semantic scene segmentation and object recognition are often performed to predict contextual information about objects in both indoor and outdoor scenes.

Unfortunately, deep learning in 3D was deemed difficult due to the fact that there are several ways to represent 3D data, such as volumes, point clouds, or multi-view images. The volume representation is a true 3D representation and straightforward to implement, but it often requires a large amount of memory for data storage. By contrast, the multi-view representation is not a true 3D representation but shows promising prediction accuracy as existing pre-trained weights from 2D networks can be utilized. Among such representations, point clouds have been the most flexible as they are compact and can be exported from a wide range of CAD modelling and 3D reconstruction software. However, the capability of using point clouds with neural networks has so far not been fully explored.

In this paper, we present a convolutional neural network for semantic segmentation and object recognition with 3D point clouds. At the core of our network is a new convolution operator, called pointwise convolution, which can be applied at each point in a point cloud to learn pointwise features. This leads to surprisingly simple and fully convolutional network designs for scene segmentation and object recognition. Our experiments show that pointwise convolution can yield competitive accuracy to previous techniques while being much simpler to implement. In summary, our contributions are:

• A pointwise convolution operator that can output features at each point in a point cloud;

• Two pointwise convolutional neural networks for semantic scene segmentation and object recognition.

(This work was done when Binh-Son Hua was a postdoctoral researcher at the Singapore University of Technology and Design in 2017.)

[Figure 2 diagram: point cloud (n × c) and coordinates (n × 3); pointwise convolution outputs (n × 9) each, concatenated to (n × 36); a convolution output (n × 40) for semantic segmentation; fully connected layers (512) and (40) with dropout 0.5 for the object category.]

Figure 2: Pointwise convolutional neural network. The input point cloud is fed into each convolution operator, and all outputs are concatenated before being fed to a final convolution layer for dense semantic segmentation, or to fully connected layers for object recognition. In this figure, we assume a point cloud with n points and c attributes (colors, normals, coordinates, etc.). We use 9 output channels for each convolution operator before concatenation. Source code is available at our homepage [13].
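To make the data flow of Figure 2 concrete, below is a rough NumPy-style sketch of one plausible reading of the caption. The pointwise_conv primitive is the operator defined in Section 3, and the layer sizes (9 channels per operator, a 512-unit fully connected layer, dropout 0.5) follow the figure; everything else, including the function names and how the concatenated features are flattened, is our own assumption and not the authors' released code.

```python
import numpy as np

def segmentation_branch(points, feats, pointwise_conv, num_classes):
    """Four stacked pointwise convolutions; their 9-channel outputs are
    concatenated and a final pointwise convolution predicts per-point scores."""
    h1 = pointwise_conv(points, feats, out_channels=9)
    h2 = pointwise_conv(points, h1, out_channels=9)
    h3 = pointwise_conv(points, h2, out_channels=9)
    h4 = pointwise_conv(points, h3, out_channels=9)
    concat = np.concatenate([h1, h2, h3, h4], axis=1)          # (n, 36)
    return pointwise_conv(points, concat, out_channels=num_classes)

def recognition_branch(concat, w1, b1, w2, b2, drop_rate=0.5, train=False):
    """Concatenated point features flattened into a single vector, then two
    fully connected layers (512 units, then one score per category)."""
    flat = concat.reshape(1, -1)
    hidden = np.maximum(flat @ w1 + b1, 0.0)                   # (1, 512)
    if train:                                                  # inverted dropout
        hidden *= (np.random.rand(*hidden.shape) > drop_rate) / (1.0 - drop_rate)
    return hidden @ w2 + b2                                    # (1, num_categories)
```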

2. Related Works

Recently, there have been a great number of works on deep learning with 3D data. Let us focus on those for scene understanding tasks such as semantic segmentation and object recognition.

2.1. Shape descriptors

Hand-crafted shape descriptors were widely used in computer vision and graphics applications before the era of deep learning. For example, 3D shapes can be projected into 2D images and represented by a set of 2D descriptors on such images. Shapes can then be represented as histograms or bag-of-feature models constructed from surface normals and curvatures [12]. 3D shapes can also be represented by their inherent statistical properties, such as distance distributions [25] and harmonic descriptors [15]. Heat kernel signatures extract shape descriptions by simulating a heat diffusion process on 3D shapes [38]. The Light Field Descriptor (LFD) is another popular descriptor useful in shape classification tasks; it extracts geometric and Fourier descriptors from object silhouettes rendered from several different viewpoints [4]. Despite their long history and wide use, hand-crafted 3D shape descriptors do not generalize well across different domains.

2.2. Object recognition

Convolutional neural networks (CNNs) [17] have been successfully applied in various areas of computer vision and artificial intelligence. Recently, significant achievements have been reached in understanding images through features learned by CNNs. Large RGB image datasets like ImageNet [7] can be used to train a CNN, which is in turn able to learn general-purpose image descriptors from such datasets. Image descriptors generated by CNNs have proved to greatly outperform other hand-crafted features for various tasks, including object detection [9], scene recognition [8], texture recognition [31, 6], and classification [10].

Recently, several approaches that use 3D convolutional networks to extract shape descriptors have been proposed, ranging from voxel-based representations [42, 20], panoramas [32], and feature pooling from 2D projections of multiple viewpoints [37, 27], to point sets [27]. Among these, PointNet by Qi et al. [26] is one of the first network architectures that can handle point cloud data. PointNet is robust as it can learn an order-invariant function to canonicalize input point clouds. Subsequently, PointCNN [18] explored the idea of equivariance instead of invariance and demonstrated competitive performance to PointNet. To achieve scalability, it is also possible to learn representations on unstructured point clouds by building computational graphs based on hierarchical data structures such as octrees [30] and kd-trees [16].

Despite their competitive performance, network structures based on PointNet [26] are rather complex. In this work, we show that it is possible to perform scene understanding tasks such as semantic segmentation and object recognition on ordered point clouds. We design pointwise convolution, a simple convolution operator for 3D point clouds, and use it to build (fully) convolutional neural networks for object recognition and semantic segmentation. With the availability of our pointwise convolution, we aim to pave the way towards adapting many existing network architectures designed for scene understanding with color and RGB-D images [34, 19, 35] to the 3D domain.

2.3. Semantic segmentation

There are a considerable number of related works on semantic segmentation. Since the introduction of the NYUv2 dataset by Silberman et al. [33], there has been a spark in the direction of RGB-D semantic segmentation. The work of Long et al. [19] showed how to adapt a conventional classification network to the semantic segmentation problem. Since then, different techniques have been proposed to further improve the segmentation results. Some notable examples are SegNet [2], which employs an encoder-decoder architecture, and the dilation filter [43].

In the 3D domain, interactive semantic segmentation [40, 39] relied on user strokes to propagate segmentation. McCormac et al. [21] explored transferring semantic segmentation from 2D predictions to the 3D domain. An advantage of such methods is that they can produce high-resolution segmentation. However, none of the predictions are performed directly in the 3D domain.

SSCNet [36] applied a convolutional neural network to a 3D volume representation to classify each voxel in the scene. This can be flexible because real-time scene reconstruction techniques such as KinectFusion [23] and voxel hashing [24] are often based on volumes. PointNet [26] can also be used for semantic segmentation with minor modifications of their object recognition network.

Recently, Qi et al. [29] proposed to build a graph neural network for semantic segmentation on a point cloud, where each graph node is a group of points and graph edges are constructed by nearest neighbor search on the point cloud. Their results are shown with RGB-D images, where color features from a pre-trained VGG-16 network [34, 5] are used to initialize the prediction. Here, we demonstrate a fully convolutional neural network for 3D point cloud segmentation. Compared to the method by Qi et al. [29], we train our network from scratch. Our input point clouds are also more general, such as CAD models or 3D meshes reconstructed from RGB-D sensors.

3. Pointwise Convolution

Before presenting pointwise convolution, we briefly revisit a few possibilities to represent 3D data for neural networks. The most straightforward approach is perhaps to employ a volumetric representation. For example, VoxNet [20] represents each object by a volume of up to 64 × 64 × 64 resolution. This is natural because almost all existing network architectures for image applications can be adopted. However, a significant drawback is that the volumetric representation requires a large amount of memory while the number of non-zero values in a volume only accounts for a very small percentage. This can be addressed by a sparse representation [30].

A second possibility is to use point clouds. This is a direct representation, as a point cloud is often the output of many applications such as RGB-D reconstruction and CAD modeling. However, mapping a point cloud to a neural network is not natural because traditional convolution operators are only designed for grids and volumes. PointNet [26] implements point feature learning by fully connected layers.

These limitations motivate us to design fully convolutional networks for point clouds. The basic building block of our architecture is a convolution operator applied at each point in a point cloud, which we term the pointwise convolution. This operator works as follows.

Convolution. A convolution kernel is centered at each point of a point cloud. Neighbor points within the kernel support can contribute to the center point. Each kernel has a size or radius value, which can be adjusted to account for different numbers of neighbor points in each convolution layer. Figure 1 shows a diagram that demonstrates this idea. Formally, pointwise convolution can be written as

x_i^\ell = \sum_k \frac{1}{|\Omega_i(k)|} \sum_{p_j \in \Omega_i(k)} w_k \, x_j^{\ell-1},    (1)

where k iterates over all sub-domains in the kernel support; \Omega_i(k) is the k-th sub-domain of the kernel centered at point i; p_i is the coordinate of point i; |\cdot| counts all points within the sub-domain; w_k is the kernel weight at the k-th sub-domain; x_i and x_j are the values at points i and j; and \ell - 1 and \ell index the input and output layers.

Gradient backpropagation. To make pointwise convolution trainable, it is necessary to compute the gradients with respect to the input data and the kernel weights. Let L be the loss function. The gradient with respect to the input is

\frac{\partial L}{\partial x_j^{\ell-1}} = \sum_{i \in \Omega_j} \frac{\partial L}{\partial x_i^\ell} \frac{\partial x_i^\ell}{\partial x_j^{\ell-1}},    (2)

where we iterate over all neighbor points i of a given point j. In the chain rule, \partial L / \partial x_i^\ell is the gradient up to layer \ell, which is known during backpropagation. The derivative \partial x_i^\ell / \partial x_j^{\ell-1} can be written as

\frac{\partial x_i^\ell}{\partial x_j^{\ell-1}} = \sum_k \frac{1}{|\Omega_i(k)|} \, w_k \sum_{p_j \in \Omega_i(k)} 1.    (3)

Similarly, the gradient with respect to the kernel weights is defined by iterating over all points i:

\frac{\partial L}{\partial w_k} = \sum_i \frac{\partial L}{\partial x_i^\ell} \frac{\partial x_i^\ell}{\partial w_k},    (4)

where

\frac{\partial x_i^\ell}{\partial w_k} = \frac{1}{|\Omega_i(k)|} \sum_{p_j \in \Omega_i(k)} x_j^{\ell-1}.    (5)

Note that the above formulation does not assume a specific shape for the convolution kernel. Here we simply use a uniform grid kernel. In conjunction with an acceleration structure for neighbor queries, e.g., a grid, the convolution operator can be efficiently implemented on both CPU and GPU. In this paper, we use convolution kernels of size 3 × 3 × 3. All points within each kernel cell have the same weights.
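As an illustration, here is a minimal NumPy sketch of Eq. (1) with a 3 × 3 × 3 uniform grid kernel, together with the weight gradient of Eqs. (4)–(5). It uses a brute-force neighbor query for clarity (the paper uses a grid for acceleration), and the function names, the per-cell weight matrices, and the radius value are our own illustrative choices, not the authors' released implementation [13].

```python
import numpy as np

def cell_index(offset, radius, cells_per_axis=3):
    """Map relative offsets (dx, dy, dz) to one of the 27 kernel cells."""
    bins = np.clip(((offset / radius + 1.0) * 0.5 * cells_per_axis).astype(int),
                   0, cells_per_axis - 1)
    return bins[..., 0] * cells_per_axis**2 + bins[..., 1] * cells_per_axis + bins[..., 2]

def pointwise_conv(points, feats, weights, radius):
    """Eq. (1): for each point i, average the features of neighbors that fall
    into each kernel cell k and weigh them by w_k.

    points:  (n, 3) coordinates, feats: (n, c_in) attributes,
    weights: (27, c_in, c_out) one weight matrix per kernel cell.
    Returns the (n, c_out) output and the per-cell averages (n, 27, c_in),
    which are reused by the gradient computation."""
    n, c_in = feats.shape
    k_cells, _, c_out = weights.shape
    cell_avg = np.zeros((n, k_cells, c_in))
    for i in range(n):
        offset = points - points[i]                      # brute-force neighbor query
        mask = np.max(np.abs(offset), axis=1) <= radius  # cubic kernel support
        cells = cell_index(offset[mask], radius)
        for k in range(k_cells):
            members = feats[mask][cells == k]            # points p_j in Omega_i(k)
            if len(members) > 0:
                cell_avg[i, k] = members.mean(axis=0)    # (1/|Omega_i(k)|) * sum x_j
    out = np.einsum('nki,kio->no', cell_avg, weights)    # sum_k w_k * averaged x_j
    return out, cell_avg

def grad_weights(cell_avg, grad_out):
    """Eqs. (4)-(5): dL/dw_k = sum_i dL/dx_i * (averaged x_j in cell k)."""
    return np.einsum('nki,no->kio', cell_avg, grad_out)

# Toy usage: 256 random points with 9 attributes, 27-cell kernel, 9 output channels.
pts = np.random.rand(256, 3)
x = np.random.rand(256, 9)
w = 0.01 * np.random.randn(27, 9, 9)
y, cache = pointwise_conv(pts, x, w, radius=0.1)
dw = grad_weights(cache, np.ones_like(y))
```

The gradient with respect to the input features follows the same pattern as Eqs. (2)–(3), scattering the output gradient back to the neighbors in each cell.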
Unlike convolution on volumes, our design does not use pooling. There are some advantages of doing so. First, it is no longer required to deal with point cloud downsampling and upsampling, which is not straightforward once the point attributes become high dimensional as the point cloud is processed in the network. Second, by keeping the point cloud unchanged throughout the entire network, acceleration structures for neighbor queries only need to be built once. This significantly speeds up computation and simplifies network design.
Point order. A notable difference between our design and PointNet [26] is how points are ordered before being fed to the network. In PointNet, the point cloud is orderless, and the training process of PointNet learns a symmetric function to turn the input into an order-invariant representation. However, we argue that this might not be necessary. In our method, we input points sorted in a specific order, e.g., by XYZ coordinates or along a Morton curve [22], and can still achieve competitive performance in the object recognition task. In this task, the order of the points only affects the final global feature vector used to predict the object category. In semantic segmentation, in principle we can leverage local features at each point, and hence point order is not necessary.
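To make such an ordering concrete, a small sketch of Morton (Z-order) sorting is shown below; the quantization resolution and helper names are our own choices for illustration.

```python
import numpy as np

def morton_code(points, bits=10):
    """Interleave the bits of quantized x, y, z coordinates (Z-order curve)."""
    lo = points.min(axis=0)
    span = np.ptp(points, axis=0) + 1e-9
    q = np.floor((points - lo) / span * (2**bits - 1)).astype(np.int64)
    codes = np.zeros(len(points), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((q[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def sort_by_morton(points, feats):
    """Reorder a point cloud so that spatially nearby points also sit close
    together in memory."""
    order = np.argsort(morton_code(points))
    return points[order], feats[order]
```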
À-trous convolution. The original pointwise convolution can be easily extended to à-trous convolution by including a stride parameter that determines the gaps between kernel cells. The benefit of pointwise à-trous convolution is that it is possible to extend the kernel size, and hence the receptive field, without actually processing too many points in the convolution. This yields a significant speed-up without sacrificing accuracy, as demonstrated in our experiments.
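Below is one way the cell binning could change in the à-trous variant, following the binning convention of the earlier sketch: cells keep their size but their centers are spread apart by a stride factor, and points falling into the gaps are skipped. This is our assumption of how such binning might look, not the paper's exact implementation.

```python
import numpy as np

def atrous_cell_index(offset, cell_size, stride=2, cells_per_axis=3):
    """Bin relative offsets into an a-trous kernel: cells keep their size but
    their centers are spaced stride * cell_size apart; offsets that fall into
    the gaps return -1 and are simply not accumulated."""
    centers = (np.arange(cells_per_axis) - cells_per_axis // 2) * stride * cell_size
    bins = np.full(offset.shape[:-1] + (3,), -1, dtype=int)
    for axis in range(3):
        d = np.abs(offset[..., axis, None] - centers)    # distance to every cell center
        nearest = np.argmin(d, axis=-1)
        inside = np.min(d, axis=-1) <= cell_size / 2     # within half a cell of a center
        bins[..., axis] = np.where(inside, nearest, -1)
    valid = np.all(bins >= 0, axis=-1)
    idx = bins[..., 0] * cells_per_axis**2 + bins[..., 1] * cells_per_axis + bins[..., 2]
    return np.where(valid, idx, -1)
```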
Point attributes. For easy housekeeping in the implementation of our convolution operator, we separately store point coordinates and other point attributes such as colors, normals, or other high-dimensional features output from preceding convolutional layers. Point coordinates can be passed to any layer regardless of its depth, so that they can be used for neighbor queries to determine which points participate in the convolution at a particular point. Point attributes can then be retrieved accordingly.
Relevance to geometric deep learning. Our pointwise convolution is related to geodesic convolution in geometric deep learning [3], which is more robust for tasks such as non-rigid shape correspondence and retrieval. To compute a geodesic convolution at a particular point, only neighbor points on its local surface manifold are considered. This is achieved by definition because the filter support in geodesic convolution is directly defined on the surface manifold. By contrast, our pointwise convolution operates adaptively in 3D Euclidean space and does not require any surface definition to operate.

4. Evaluations

Semantic segmentation. We evaluate our pointwise convolutional neural network on semantic scene segmentation and object recognition. For scene segmentation, we first experiment with the S3DIS dataset [1], which has 13 categories of indoor scene objects. Each point has 9 attributes: XYZ coordinates, RGB color, and normalized coordinates w.r.t. the room it belongs to. To perform segmentation of a scene, each square-meter block of the scene (measured on the floor), sampled to 4096 points, is fed into the network. The predictions of all blocks are then assembled to obtain the prediction of the entire scene.
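A rough sketch of this block protocol follows; the exact splitting and sampling rules here are our assumption, not taken from the authors' code.

```python
import numpy as np

def scene_to_blocks(points, feats, block=1.0, num_points=4096):
    """Split a scene into block x block columns on the floor plane (x, y)
    and resample every non-empty column to a fixed number of points."""
    cells = np.floor(points[:, :2] / block).astype(int)
    blocks = []
    for key in {tuple(c) for c in cells}:
        idx = np.where((cells == key).all(axis=1))[0]
        # Sample with replacement when a block holds fewer than num_points points.
        sel = np.random.choice(idx, num_points, replace=len(idx) < num_points)
        blocks.append((points[sel], feats[sel]))
    return blocks
```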

We report per-point accuracy of the semantic segmentation. As shown in Table 1, our network is able to produce comparable accuracy to PointNet [26], with an accuracy of 81.5%. Table 2 reports per-class accuracy. Figure 3 shows visualizations of predictions and ground truth for scenes in the evaluation dataset.

Network         Accuracy (per class)   Accuracy
PointNet [26]   -                      87.0
Ours            56.5                   81.5

Table 1: Comparison of scene segmentation on the S3DIS dataset [1].

Network         ceiling  floor  wall   column
PointNet [26]   98.3     98.8   83.3   63.4
Ours            97.4     99.1   89.1   56.2

Network         door     table  chair  sofa   clutter
PointNet [26]   84.6     70.3   66.0   56.7   69.0
Ours            62.9     73.7   68.4   54.6   65.2

Table 2: Per-class accuracy of semantic segmentation on the S3DIS dataset [1].

To further test semantic segmentation with more categories and more complex indoor scenes, we annotate 76 scenes from the SceneNN dataset [13] with the 40 categories defined by the NYUv2 dataset [33]. Scenes in this dataset appear to be more cluttered, which poses great challenges to semantic segmentation. We use 56 scenes for training and 20 scenes for evaluation. In each scene, a 2 × 2 sqm window with a stride of 0.2 meters and a height of 2 meters is used to scan the floor area, resulting in approximately 30K scene blocks for training and 15K blocks for testing. Each block is sampled to 4096 points.

For the SceneNN dataset, we additionally compare with VoxNet [20], a voxel-based representation technique, and SemanticFusion [21], a multi-view 2D-3D semantic segmentation with RGB-D images. For VoxNet [20], we apply their network to predict labels of scene blocks as described above and gather all outputs into a final scene prediction. For SemanticFusion [21], we perform 2D semantic segmentation on the RGB-D images independently and then integrate all 2D predictions into a 3D point cloud to generate the final segmentation.

Figure 3: Semantic segmentation on the S3DIS dataset [1]. (a) Our predictions. (b) Ground truth.

The visualization of the predictions and ground truth is shown in Figure 4. It can be seen that structures like walls and floors have very good accuracy, and small objects are moderately well segmented. A notable issue is noise due to prediction inconsistency in the overlap regions of the blocks. This could be addressed by a conditional random field and would be interesting future work.

Table 3 reports the accuracy of a few common categories. While structures and chairs are quite accurate, table and desk are often ambiguous, resulting in lower accuracy for both classes. In general, the performance of VoxNet [20] is inferior to ours and SemanticFusion [21] due to its limited resolution (we used a 64³ volume). Our method works competitively with SemanticFusion, but note that our method does not apply any label smoothing while SemanticFusion has a conditional random field to remove noise after propagating predictions from 2D to 3D.

Figure 4: Semantic segmentation on the SceneNN dataset [13]. (a) Our predictions. (b) Ground truth.

Network               wall   floor  chair  table  desk
VoxNet [20]           82.8   74.3   3.1    0.8    5.4
SemanticFusion [21]   72.8   94.4   46.3   70.1   28.1
Ours                  93.8   88.6   58.6   23.5   29.5

Table 3: Per-class accuracy of semantic segmentation on the SceneNN dataset [13].

Object recognition. We evaluate object recognition with two datasets, ModelNet40 [42] and ObjectNN [14]. ModelNet40 is a CAD model dataset of 40 categories which has served as a standard benchmark for object recognition in recent years. ObjectNN, on the other hand, is an object dataset from RGB-D scene reconstruction mixed with CAD models for studying 3D object retrieval. Objects in ObjectNN are particularly difficult to classify because they are reconstructed from noisy RGB-D data and often have missing parts.

For object recognition, our point attributes are simply XYZ coordinates. In fact, we also trained the network with point attributes set to one, making the convolution equivalent to density estimation, and found no significant change in accuracy. Our results on ModelNet40 are shown in Table 4. As can be seen, our network performs comparably to state-of-the-art methods. Note that compared to VoxNet [20], our input point cloud is more compact. Our network is also significantly simpler in design compared to PointNet [26] and PointNet++ [28] while being close to their accuracy.

Network                  Accuracy (per class)   Accuracy
VoxNet [20]              83                     -
MO-SubvolumeSup [27]     86                     89.2
PointNet [26]            86.2                   89.2
PointNet++ [28]          -                      90.7
Ours                     81.4                   86.1

Table 4: Comparison of performance of network architectures using 3D object representations on the ModelNet40 dataset [42].

The results on ObjectNN are shown in Table 5. On this dataset, our method again performs comparably to PointNet, but overall both methods are less effective due to the ambiguity in learning features from both CAD models and RGB-D objects. Table 9 and Table 10 further provide per-class accuracy on the ModelNet40 and the ObjectNN dataset, respectively.

Network         Accuracy (per class)   Accuracy
PointNet [26]   57.1                   65.6
Ours            57.1                   65.1

Table 5: Comparison of object recognition accuracy on the ObjectNN dataset [14].

Convergence. Figure 5 shows a plot of the training and test accuracy of our networks over time. The graph shows that our pointwise convolutional neural network can be trained effectively.

Figure 5: Train and test accuracy over time. (a) Scene segmentation. (b) Object recognition.

Ablation experiments. Here we analyze the effectiveness of pointwise convolution. We start with a basic 4-layer model as in Figure 2. The accuracy improvements when more features are added are presented in Table 6. As can be seen, feature concatenation, à-trous convolution, SELU activation, and dropout each contribute a small improvement to the final result.

Base  Concat.  À-trous  SELU  Dropout  Accuracy
X                                       78.6
X  X                                    78.0
X  X                                    75.0
X  X  X                                 82.5
X  X                                    81.7
X  X  X                                 81.9
X  X  X  X                              85.2
X  X  X  X  X                           86.1

Table 6: Ablation experiment. Accuracy improvement is achieved when pointwise convolution is combined with feature concatenation (Concat.), à-trous convolution, the self-normalizing activation function (SELU), and dropout.

Point order. In object recognition, the order of the input points determines the order of the features in the fully connected layers. As long as this layer has an order, it is sufficient to discriminate the features and predict the categories. We experiment with different orderings of the input point set and report the results in Table 7(a). We found that point orders given by space-filling curve techniques such as the Morton curve [22] yield comparable accuracy, which means that it is sufficient to just follow an order, but not a particular one. However, a benefit is that space-filling curves organize points such that nearby points in space are stored close to each other in memory, allowing more memory coherence.
Neighborhood radius. So far we have set the radius for the neighbor query as a constant in each convolution layer. In our experience, this works well for both tasks. We also explore the capability of an adaptive radius using k-nearest neighbors. The modification to the convolution operator is as follows. At each point, a k-nearest-neighbor query is performed, and the query radius is set to the distance to the furthest neighbor. This radius is used each time neighbor points have to be queried for convolution. To compute gradients for backpropagation for this operator, it is worth noting that in this case neighbor lookup is no longer symmetric. Therefore, at a point j, it is required to look up all points i such that point i can contribute to point j in the forward convolution. We compare the performance of the k-nearest-neighbor and the fixed-radius convolution for the object recognition task. The result is shown in Table 7(b). In general, we found no significant difference in terms of accuracy.

(a) Order     Accuracy        (b) Neighbor query        Accuracy
    ZYX       86.1                Fixed-size radius     86.1
    Morton    86.0                K-nearest neighbor    85.7

Table 7: (a) Object recognition with different ways of ordering the input point cloud. (b) Object recognition with convolution using neighbor queries with adaptive radius.
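A small sketch of the adaptive-radius idea described above follows, using a brute-force k-nearest-neighbor search for clarity (a real implementation would reuse the grid); the function names and the value of k are our own illustrative choices.

```python
import numpy as np

def knn_radius(points, k=16):
    """Per-point query radius = distance to the k-th nearest neighbor."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (n, n)
    d.sort(axis=1)            # column 0 is the point itself (distance 0)
    return d[:, k]

def reverse_neighbors(points, radii, j):
    """Backward-pass helper: all points i whose adaptive ball contains point j,
    needed because the adaptive-radius lookup is not symmetric."""
    return np.where(np.linalg.norm(points - points[j], axis=1) <= radii)[0]
```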
Deeper networks. Finally, we study the capability of learning with deeper networks using pointwise convolution. From the basic model, we increase the number of layers from 4 to 8 and 16, and then retrain from scratch. The performance is reported in Table 8 below. Generally, it takes longer to train networks with 8 and 16 layers, and the resulting accuracy is slightly lower. Experimenting with residual learning [11] for training would be interesting future work.

Network     Accuracy
4 layers    86.1
8 layers    82.1
16 layers   82.6

Table 8: Deep pointwise convolutional neural network. We compare object recognition performance with 4-, 8-, and 16-layer architectures.

Running time. A key challenge when implementing pointwise convolution is how to perform fast nearest neighbor queries without impacting the network training and prediction time too much. To make training feasible, we choose to use a grid for the neighbor query because it is a lightweight and GPU-friendly data structure to build and query on the fly. In fact, we experimented with a kd-tree, but found that on modern CPUs and GPUs a kd-tree query does not outperform a grid unless the number of points is more than 16K, not to mention the extra time needed for tree construction, which has O(n log n) complexity. Our pointwise convolution is currently implemented in TensorFlow. We report the running time, including grid build and query each time convolution is invoked, as follows. For a batch size of 128 point clouds, each with 2048 points, a forward convolution of our network takes 1.272 seconds on an Intel Core i7 6900K with 16 threads, and a backward propagation takes 2.423 seconds to compute the gradients. Our GPU implementation on an NVIDIA TITAN X can further improve the running time by about 10%. Compared to PointNet [26] and VoxNet [20], which leverage TensorFlow's optimized convolution operators, our pointwise convolution is not yet engineering-optimized. Our training time is about 2× slower, which we currently compensate for by using multiple CPUs and GPUs.
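For reference, a minimal version of the kind of uniform-grid neighbor query described above is sketched below (hash points into voxel cells once, then scan the 27 surrounding cells per query). The dictionary-based layout is for clarity only; the paper's implementation is a GPU-friendly grid, and the function names here are our own.

```python
import numpy as np
from collections import defaultdict

def build_grid(points, cell_size):
    """Hash every point into its voxel cell; built once per point cloud."""
    keys = np.floor(points / cell_size).astype(int)
    grid = defaultdict(list)
    for idx, key in enumerate(map(tuple, keys)):
        grid[key].append(idx)
    return grid, keys

def query_radius(points, grid, keys, i, radius):
    """Neighbors of point i within `radius`, scanning only the 27 cells
    around its own cell (valid when radius <= cell_size)."""
    cx, cy, cz = keys[i]
    candidates = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                candidates.extend(grid.get((cx + dx, cy + dy, cz + dz), []))
    candidates = np.asarray(candidates, dtype=int)
    dist = np.linalg.norm(points[candidates] - points[i], axis=1)
    return candidates[dist <= radius]
```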
5. Conclusion

In this paper, we proposed pointwise convolution and leveraged it to build convolutional neural networks for scene understanding with point cloud data. We demonstrated two scene understanding applications: scene segmentation and object recognition. We showed that it is practical to simply sort input point clouds in a specific order before feature learning for object classification. Our pointwise convolution can offer competitive accuracy while being simple to implement, allowing us to create effective and simple neural networks for learning local features of point clouds.

There are several research avenues to be further explored. For example, finding a robust solution to handle large-scale point clouds for scene understanding would be interesting future work. Currently, we circumvent the large-scale issue in semantic segmentation by simply dividing the scene into blocks and resampling each block to a fixed number of points for prediction. In addition, it would be of great interest to extend pointwise convolutional neural networks to geometric point cloud processing [44], or to explore the connection of pointwise convolution to tensor voting [41], which has been used in the literature to detect structures in a local point neighborhood.

Network airplane bathtub bed bench bookshelf bottle bowl car chair cone
PointNet [26] 100 80.0 94.0 75.0 93.0 94.0 100.0 97.9 96.0 100.0
Ours 100 82.0 93.0 68.4 91.8 93.9 95.0 95.6 96.0 80.0
cup curtain desk door dresser flower pot glass box guitar keyboard lamp
PointNet [26] 70.0 90.0 79.0 95.0 65.1 30.0 94.0 100.0 100.0 90.0
Ours 60.0 80.0 76.7 75.0 67.4 10.0 80.8 98.0 100.0 83.3
laptop mantel monitor night stand person piano plant radio range hood sink
PointNet [26] 100.0 96.0 95.0 82.6 85.0 88.8 73.0 70.0 91.0 80.0
Ours 95.0 93.9 92.9 70.2 89.5 84.5 78.8 65.0 88.9 65.0
sofa stairs stool table tent toilet tv stand vase wardrobe xbox
PointNet [26] 96.0 85.0 90.0 88.0 95.0 99.0 87.0 78.8 60.0 70.0
Ours 96.0 80.0 83.3 90.9 90.0 94.9 84.5 81.3 30.0 75.0

Table 9: Per-class accuracy of object recognition on the ModelNet40 dataset. Average: PointNet: 86.3. Ours 81.4.

Network chair display desk book storage box table bin bag keyboard
PointNet [26] 84.2 85.4 56.7 30.1 62.5 23.8 80.0 75.0 47.4 82.4
Ours 83.1 85.4 70.0 57.7 45.8 23.8 60.0 65.0 36.8 88.2
sofa bookshelf pillow machine pc case light oven cup printer bed
PointNet [26] 76.5 23.1 84.6 18.2 36.4 77.8 60.0 37.5 50.0 28.6
Ours 88.2 38.5 76.9 18.2 54.5 88.9 30.0 75.0 12.5 42.9

Table 10: Per-class accuracy of object recognition on the ObjectNN dataset. Average: PointNet: 56.0. Ours: 57.1.

A. Layer Visualization
Intuitively, pointwise convolution works by summarizing local spatial point distributions to build feature vectors for each point in a point cloud. As shown in the per-class accuracy tables, local features work most effectively in classifying structures such as ceilings, floors, or walls and common furniture such as tables and chairs. In our observation, it is quite challenging to differentiate between tables (for dining) and desks (for study and work).

We visualize the filters of the first four layers of the object recognition network in Figure 6. Here we display each 3 × 3 × 3 filter on a row in the visualization. The number of rows is equal to the product of the total number of input and output channels of each filter (27 for the first layer, and 81 for the subsequent layers). In the visualization, blue and red represent positive and negative values, respectively. White represents zero. This shows that the filters in the network are relatively sparse and smooth. We also observed that positive and negative values dominate the filters interchangeably in each layer.

Figure 6: Visualization of the filters in the pointwise convolution network for object recognition. (a) Layer 1, (b) Layer 2, (c) Layer 3, (d) Layer 4.

Acknowledgement. We thank Quang-Hieu Pham for helping with the 2D-to-3D semantic segmentation experiment and proofreading the paper, and Quang-Trung Truong and Benjamin Kang Yue Sheng for their kind support for the neural network training experiments.

Binh-Son Hua and Sai-Kit Yeung are supported by the SUTD Digital Manufacturing and Design Centre, which is supported by the Singapore National Research Foundation (NRF). Sai-Kit Yeung is also supported by Singapore MOE Academic Research Fund MOE2016-T2-2-154, the Heritage Research Grant of the National Heritage Board, Singapore, and Singapore NRF under its IDM Futures Funding Initiative and Virtual Singapore Award No. NRF2015VSGAA3DCM001-014.

References

[1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In CVPR, 2016. 4, 5
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015. 3
[3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. arXiv:1611.08097, 2016. 4
[4] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3D model retrieval. In Computer Graphics Forum, volume 22, pages 223–232. Wiley Online Library, 2003. 2
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016. 3
[6] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014. 2
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 2
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014. 2
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 2
[10] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015. 2
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 7
[12] B. K. P. Horn. Extended Gaussian images. Proceedings of the IEEE, 72(12):1671–1686, 1984. 2
[13] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung. SceneNN: A scene meshes dataset with annotations. In International Conference on 3D Vision (3DV), 2016. http://www.scenenn.net. 2, 4, 5, 6
[14] B.-S. Hua, Q.-T. Truong, M.-K. Tran, Q.-H. Pham, A. Kanezaki, T. Lee, H. Chiang, W. Hsu, B. Li, Y. Lu, et al. SHREC'17: RGB-D to CAD retrieval with ObjectNN dataset. 6
[15] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In Symposium on Geometry Processing, volume 6, pages 156–164, 2003. 2
[16] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In ICCV, 2017. 2
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 2
[18] Y. Li, R. Bu, M. Sun, and B. Chen. PointCNN. arXiv:1801.07791, 2018. 2
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 2, 3
[20] D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, 2015. 2, 3, 4, 5, 6, 7
[21] J. McCormac, A. Handa, A. Davison, and S. Leutenegger. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. arXiv:1609.05130, 2016. 3, 4, 5, 6
[22] G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, New York, 1966. 4, 7
[23] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011. 3
[24] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 2013. 3
[25] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM Transactions on Graphics (TOG), 21(4):807–832, 2002. 2
[26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017. 2, 3, 4, 5, 6, 7, 8
[27] C. R. Qi, H. Su, M. Niessner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, 2016. 2, 6
[28] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. arXiv:1706.02413, 2017. 6
[29] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D graph neural networks for RGBD semantic segmentation. In ICCV, 2017. 3
[30] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017. 2, 3
[31] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, pages 806–813, 2014. 2
[32] B. Shi, S. Bai, Z. Zhou, and X. Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015. 2
[33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012. 3, 4
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014. 2, 3
[35] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, 2015. 2
[36] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017. 3

[37] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller.
Multi-view convolutional neural networks for 3d shape recog-
nition. In ICCV, 2015. 2
[38] J. Sun, M. Ovsjanikov, and L. Guibas. A concise and provably
informative multi-scale signature based on heat diffusion.
In Computer graphics forum, volume 28, pages 1383–1392.
Wiley Online Library, 2009. 2
[39] D. Thanh Nguyen, B.-S. Hua, L.-F. Yu, and S.-K. Yeung. A
robust 3d-2d interactive tool for scene segmentation and an-
notation. IEEE Transactions on Visualization and Computer
Graphics (TVCG), 2017. 3
[40] J. Valentin, V. Vineet, M.-M. Cheng, D. Kim, J. Shotton,
P. Kohli, M. Nießner, A. Criminisi, S. Izadi, and P. Torr.
Semanticpaint: Interactive 3d labeling and learning at your
fingertips. ACM Transactions on Graphics, 2015. 3
[41] T.-P. Wu, S. K. Yeung, J. Jia, C.-K. Tang, and G. G. Medioni.
A closed-form solution to tensor voting: Theory and applica-
tions. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2012. 7
[42] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3d shapenets:
A deep representation for volumetric shapes. In CVPR, 2015.
2, 6
[43] F. Yu and V. Koltun. Multi-scale context aggregation by
dilated convolutions. In ICLR, 2016. 3
[44] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng. Pu-net:
Point cloud upsampling network. In CVPR, 2018. 7

