
KPConv: Flexible and Deformable Convolution for Point Clouds

Hugues Thomas¹  Charles R. Qi²  Jean-Emmanuel Deschaud¹  Beatriz Marcotegui¹  François Goulette¹  Leonidas J. Guibas²,³

¹Mines ParisTech  ²Facebook AI Research  ³Stanford University

Abstract

We present Kernel Point Convolution¹ (KPConv), a new design of point convolution, i.e. one that operates on point clouds without any intermediate representation. The convolution weights of KPConv are located in Euclidean space by kernel points, and applied to the input points close to them. Its capacity to use any number of kernel points gives KPConv more flexibility than fixed grid convolutions. Furthermore, these locations are continuous in space and can be learned by the network. Therefore, KPConv can be extended to deformable convolutions that learn to adapt kernel points to local geometry. Thanks to a regular subsampling strategy, KPConv is also efficient and robust to varying densities. Whether they use deformable KPConv for complex tasks, or rigid KPConv for simpler tasks, our networks outperform state-of-the-art classification and segmentation approaches on several datasets. We also offer ablation studies and visualizations to provide understanding of what has been learned by KPConv and to validate the descriptive power of deformable KPConv.

¹Project page: https://github.com/HuguesTHOMAS/KPConv

1. Introduction

The dawn of deep learning has boosted modern computer vision with discrete convolution as its fundamental building block. This operation combines the data of local neighborhoods on a 2D grid. Thanks to this regular structure, it can be computed with high efficiency on modern hardware, but when deprived of this regular structure, the convolution operation has yet to be defined properly, with the same efficiency as on 2D grids.

Many applications relying on such irregular data have grown with the rise of 3D scanning technologies. For example, 3D point cloud segmentation or 3D simultaneous localization and mapping rely on non-grid structured data: point clouds. A point cloud is a set of points in 3D (or higher-dimensional) space. In many applications, the points are coupled with corresponding features like colors. In this work, we will always consider a point cloud as those two elements: the points P ∈ R^{N×3} and the features F ∈ R^{N×D}. Such a point cloud is a sparse structure that has the property of being unordered, which makes it very different from a grid. However, it shares a common property with a grid which is essential to the definition of convolutions: it is spatially localized. In a grid, the features are localized by their index in a matrix, while in a point cloud, they are localized by their corresponding point coordinates. Thus, the points are to be considered as structural elements, and the features as the real data.

Figure 1. KPConv illustrated on 2D points. Input points with a constant scalar feature (in grey) are convolved through a KPConv that is defined by a set of kernel points (in black) with filter weights on each point.

Various approaches have been proposed to handle such data, and they can be grouped into different categories that we will develop in the related work section. Several methods fall into the grid-based category, whose principle is to project the sparse 3D data on a regular structure where a convolution operation can be defined more easily [23, 28, 33]. Other approaches use multilayer perceptrons (MLP) to process point clouds directly, following the idea proposed by [47, 25].
More recently, some attempts have been made to design a convolution that operates directly on points [2, 44, 19, 14, 13]. These methods use the spatial localization property of a point cloud to define point convolutions with spatial kernels. They share the idea that a convolution should define a set of customizable spatial filters applied locally in the point cloud.

This paper introduces a new point convolution operator named Kernel Point Convolution (KPConv). KPConv also consists of a set of local 3D filters, but overcomes previous point convolution limitations as shown in related work. KPConv is inspired by image-based convolution, but in place of kernel pixels, we use a set of kernel points to define the area where each kernel weight is applied, as shown in Figure 1. The kernel weights are thus carried by points, like the input features, and their area of influence is defined by a correlation function. The number of kernel points is not constrained, making our design very flexible. Despite the resemblance of vocabulary, our work differs from [31], which is inspired by point cloud registration techniques, and uses kernel points without any weights to learn local geometric patterns.

Furthermore, we propose a deformable version of our convolution [7], which consists of learning local shifts applied to the kernel points (see Figure 3). Our network generates different shifts at each convolution location, meaning that it can adapt the shape of its kernels for different regions of the input cloud. Our deformable convolution is not designed the same way as its image counterpart. Due to the different nature of the data, it needs a regularization to help the deformed kernels fit the point cloud geometry and avoid empty space. We use Effective Receptive Field (ERF) [21] and ablation studies to compare rigid KPConv with deformable KPConv.

As opposed to [40, 2, 44, 19], we favor radius neighborhoods instead of k-nearest neighbors (KNN). As shown by [13], KNN is not robust in non-uniform sampling settings. The robustness of our convolution to varying densities is ensured by the combination of radius neighborhoods and regular subsampling of the input cloud [37]. Compared to normalization strategies [13, 14], our approach also alleviates the computational cost of our convolution.

In our experiments section, we show that KPConv can be used to build very deep architectures for classification and segmentation, while keeping fast training and inference times. Overall, rigid and deformable KPConv both perform very well, topping competing algorithms on several datasets. We find that rigid KPConv achieves better performances on simpler tasks, like object classification, or small segmentation datasets. Deformable KPConv thrives on more difficult tasks, like large segmentation datasets offering many object instances and greater diversity. We also show that deformable KPConv is more robust to a lower number of kernel points, which implies a greater descriptive power. Last but not least, a qualitative study of KPConv ERF shows that deformable kernels improve the network's ability to adapt to the geometry of the scene objects.

2. Related Work

In this section, we briefly review previous deep learning methods to analyze point clouds, paying particular attention to the methods closest to our definition of point convolutions.

Projection networks. Several methods project points to an intermediate grid structure. Image-based networks are often multi-view, using a set of 2D images rendered from the point cloud at different viewpoints [34, 4, 17]. For scene segmentation, these methods suffer from occluded surfaces and density variations. Instead of choosing a global projection viewpoint, [35] proposed projecting local neighborhoods to local tangent planes and processing them with 2D convolutions. However, this method relies heavily on tangent estimation.

In the case of voxel-based methods, the points are projected on 3D grids in Euclidean space [23, 29, 3]. Using sparse structures like octrees or hash-maps allows larger grids and enhanced performances [28, 9], but these networks still lack flexibility as their kernels are constrained to use 3³ = 27 or 5³ = 125 voxels. Using a permutohedral lattice instead of a Euclidean grid reduces the kernel to 15 lattices [33], but this number is still constrained, while KPConv allows any number of kernel points. Moreover, avoiding intermediate structures should make the design of more complex architectures like instance mask detectors or generative models more straightforward in future works.

Graph convolution networks. The definition of a convolution operator on a graph has been addressed in different ways. A convolution on a graph can be computed as a multiplication on its spectral representation [8, 46], or it can focus on the surface represented by the graph [22, 5, 32, 24]. Despite the similarity between point convolutions and the most recent graph convolutions [38, 42], the latter learn filters on edge relationships instead of points' relative positions. In other words, a graph convolution combines features on local surface patches, while being invariant to the deformations of those patches in Euclidean space. In contrast, KPConv combines features locally according to the 3D geometry, thus capturing the deformations of the surfaces.

Pointwise MLP networks. PointNet [25] is considered a milestone in point cloud deep learning. This network uses a shared MLP on every point individually, followed by a global max-pooling. The shared MLP acts as a set of learned spatial encodings, and the global signature of the input point cloud is computed as the maximal response among all the points for each of these encodings. The network's performances are limited because it does not consider local spatial relationships in the data. Following PointNet, some hierarchical architectures have been developed to aggregate local neighborhood information with MLPs [26, 18, 20].
Figure 2. Comparison between an image convolution (left) and a KPConv (right) on 2D points for a simpler illustration. In the image, each pixel feature vector is multiplied by a weight matrix (W_k)_{k<K} assigned by the alignment of the kernel with the image. In KPConv, input points are not aligned with kernel points, and their number can vary. Therefore, each point feature f_i is multiplied by all the kernel weight matrices, with a correlation coefficient h_{ik} depending on its relative position to kernel points.

As shown by [40, 19, 13], the kernel of a point convolution can be implemented with an MLP, because of its ability to approximate any continuous function. However, using such a representation makes the convolution operator more complex and the convergence of the network harder. In our case, we define an explicit convolution kernel, like image convolutions, whose weights are directly learned, without the intermediate representation of an MLP. Our design also offers a straightforward deformable version, as offsets can directly be applied to kernel points.

Point convolution networks. Some very recent works also defined explicit convolution kernels for points, but KPConv stands out with unique design choices.

Pointwise CNN [14] locates the kernel weights with voxel bins, and thus lacks flexibility like grid networks. Furthermore, their normalization strategy burdens their network with unnecessary computations, while the KPConv subsampling strategy alleviates both varying densities and computational cost.

SpiderCNN [44] defines its kernel as a family of polynomial functions applied with a different weight for each neighbor. The weight applied to a neighbor depends on the neighbor's distance-wise order, making the filters spatially inconsistent. By contrast, KPConv weights are located in space and its result is invariant to point order.

Flex-convolution [10] uses linear functions to model its kernel, which could limit its representative power. It also uses KNN, which is not robust to varying densities as discussed above.

PCNN [2] is the design closest to KPConv. Its definition also uses points to carry kernel weights, and a correlation function. However, this design is not scalable because it does not use any form of neighborhood, making the convolution computations quadratic in the number of points. In addition, it uses a Gaussian correlation where KPConv uses a simpler linear correlation, which helps gradient backpropagation when learning deformations [7].

We show that KPConv networks outperform all comparable networks in the experiments section. Furthermore, to the best of our knowledge, none of the previous works experimented with a spatially deformable point convolution.

3. Kernel Point Convolution

3.1. A Kernel Function Defined by Points

Like previous works, KPConv can be formulated with the general definition of a point convolution (Eq. 1), inspired by image convolutions. For the sake of clarity, we call x_i and f_i the points from P ∈ R^{N×3} and their corresponding features from F ∈ R^{N×D}. The general point convolution of F by a kernel g at a point x ∈ R³ is defined as:

$$(F * g)(x) = \sum_{x_i \in \mathcal{N}_x} g(x_i - x)\, f_i \tag{1}$$

We stand with [13] in advising radius neighborhoods to ensure robustness to varying densities; therefore, $\mathcal{N}_x = \{\, x_i \in P \mid \lVert x_i - x \rVert \le r \,\}$ with r ∈ R being the chosen radius. In addition, [37] showed that hand-crafted 3D point features offer a better representation when computed with radius neighborhoods than with KNN. We believe that having a consistent spherical domain for the function g helps the network to learn meaningful representations.
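To make Eq. 1 and the radius neighborhoods concrete, here is a minimal NumPy sketch of a generic point convolution. It is our own brute-force illustration (function and parameter names included), not the authors' released TensorFlow implementation, which batches and accelerates these computations.

```python
import numpy as np

def radius_neighbors(points, queries, r):
    """Indices of N_x = {x_i in P : ||x_i - x|| <= r} for each query x."""
    # Brute-force pairwise distances; a real pipeline would use a spatial index.
    d = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=-1)
    return [np.flatnonzero(row <= r) for row in d]

def point_convolution(points, features, queries, r, g, d_out):
    """Generic point convolution of Eq. 1: (F * g)(x) = sum over N_x of g(x_i - x) f_i."""
    out = np.zeros((len(queries), d_out))
    for j, idx in enumerate(radius_neighbors(points, queries, r)):
        for y_i, f_i in zip(points[idx] - queries[j], features[idx]):
            out[j] += f_i @ g(y_i)  # g(y_i) is a (D_in, D_out) weight matrix
    return out
```

Any kernel function g mapping a centered neighbor position to a weight matrix can be plugged in; the KPConv choice of g is defined next.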
The crucial part in Eq. 1 is the definition of the kernel function g, which is where the singularity of KPConv lies. g takes the neighbor positions centered on x as input. We call them y_i = x_i − x in the following. As our neighborhoods are defined by a radius r, the domain of definition of g is the ball $B_r^3 = \{\, y \in \mathbb{R}^3 \mid \lVert y \rVert \le r \,\}$. Like image convolution kernels (see Figure 2 for a detailed comparison between image convolution and KPConv), we want g to apply different weights to different areas inside this domain. There are many ways to define areas in 3D space, and points are the most intuitive, as features are also localized by them.
Let $\{\tilde{x}_k \mid k < K\} \subset B_r^3$ be the kernel points and $\{W_k \mid k < K\} \subset \mathbb{R}^{D_{in} \times D_{out}}$ be the associated weight matrices that map features from dimension D_in to D_out. We define the kernel function g for any point $y_i \in B_r^3$ as:

$$g(y_i) = \sum_{k<K} h(y_i, \tilde{x}_k)\, W_k \tag{2}$$

where h is the correlation between $\tilde{x}_k$ and $y_i$, which should be higher when $\tilde{x}_k$ is closer to $y_i$. Inspired by the bilinear interpolation in [7], we use the linear correlation:

$$h(y_i, \tilde{x}_k) = \max\left(0,\; 1 - \frac{\lVert y_i - \tilde{x}_k \rVert}{\sigma}\right) \tag{3}$$

where σ is the influence distance of the kernel points, and will be chosen according to the input density (see Section 3.3). Compared to a Gaussian correlation, which is used by [2], linear correlation is a simpler representation. We advocate this simpler correlation to ease gradient backpropagation when learning kernel deformations. A parallel can be drawn with the rectified linear unit, which is the most popular activation function for deep neural networks, thanks to its efficiency for gradient backpropagation.
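As an illustration of how Eqs. 2 and 3 combine, here is a small NumPy sketch of the kernel function g (variable names are ours; σ, the kernel points and the weight matrices would come from the layer definition). It can be passed as the g argument of the point_convolution sketch above.

```python
import numpy as np

def kernel_function(y, kernel_points, weights, sigma):
    """g(y) of Eq. 2: sum_k h(y, x~_k) W_k, with the linear correlation of Eq. 3.

    y:             (3,)   neighbor position centered on x
    kernel_points: (K, 3) kernel point positions x~_k
    weights:       (K, D_in, D_out) weight matrices W_k
    """
    dist = np.linalg.norm(y - kernel_points, axis=1)  # ||y - x~_k|| for every k
    h = np.maximum(0.0, 1.0 - dist / sigma)           # Eq. 3: zero beyond sigma
    return np.tensordot(h, weights, axes=1)           # (D_in, D_out) matrix
```

Note how a neighbor only receives contributions from kernel points within distance σ, which keeps the gradients of the linear correlation simple.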
3.2. Rigid or Deformable Kernel

Kernel point positions are critical to the convolution operator. Our rigid kernels in particular need to be arranged regularly to be efficient. As we claimed that one of the KPConv strengths is its flexibility, we need to find a regular disposition for any K. We chose to place the kernel points by solving an optimization problem where each point applies a repulsive force on the others. The points are constrained to stay in the sphere with an attractive force, and one of them is constrained to be at the center. We detail this process and show some regular dispositions in the supplementary material. Eventually, the surrounding points are rescaled to an average radius of 1.5σ, ensuring a small overlap between each kernel point's area of influence and a good space coverage.
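A possible sketch of such a disposition by gradient descent is given below, assuming inverse-square repulsion and a simple pull toward the center; the exact forces and schedule used for the paper are those of the supplementary material, so this is only illustrative.

```python
import numpy as np

def kernel_point_disposition(K, n_iter=2000, lr=0.01, seed=0):
    """Spread K points in the unit ball by mutual repulsion, keeping one at
    the center. After convergence, the surrounding points would be rescaled
    to an average radius of 1.5 * sigma, as described above."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1.0, 1.0, (K, 3))
    pts[0] = 0.0                                       # central point stays fixed
    for _ in range(n_iter):
        diff = pts[:, None, :] - pts[None, :, :]       # pairwise offsets (K, K, 3)
        d = np.linalg.norm(diff, axis=-1) + np.eye(K)  # avoid dividing by zero
        repulsion = (diff / d[..., None] ** 3).sum(axis=1)
        pts += lr * (repulsion - pts)                  # repulsion + pull to center
        pts[0] = 0.0
        norms = np.linalg.norm(pts, axis=1, keepdims=True)
        pts = np.where(norms > 1.0, pts / norms, pts)  # constrain to the sphere
    return pts
```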
With properly initialized kernels, the rigid version of KPConv is extremely efficient, in particular when given a large enough K to cover the spherical domain of g. However, it is possible to increase its capacity by learning the kernel point positions. The kernel function g is indeed differentiable with respect to $\tilde{x}_k$, which means the kernel points are learnable parameters. We could consider learning one global set of $\{\tilde{x}_k\}$ for each convolution layer, but it would not bring more descriptive power than a fixed regular disposition. Instead, the network generates a set of K shifts ∆(x) for every convolution location x ∈ R³, like [7], and we define deformable KPConv as:

$$(F * g)(x) = \sum_{x_i \in \mathcal{N}_x} g_{\mathrm{deform}}(x_i - x, \Delta(x))\, f_i \tag{4}$$

$$g_{\mathrm{deform}}(y_i, \Delta(x)) = \sum_{k<K} h(y_i, \tilde{x}_k + \Delta_k(x))\, W_k \tag{5}$$

Figure 3. Deformable KPConv illustrated on 2D points.

We define the offsets ∆_k(x) as the output of a rigid KPConv mapping D_in input features to 3K values, as shown in Figure 3. During training, the network learns the rigid kernel generating the shifts and the deformable kernel generating the output features simultaneously, but the learning rate of the first one is set to 0.1 times the global network learning rate.

Unfortunately, this straightforward adaptation of image deformable convolutions does not fit point clouds. In practice, the kernel points end up being pulled away from the input points. These kernel points are lost by the network, because the gradients of their shifts ∆_k(x) are null when no neighbors are in their influence range. More details on these "lost" kernel points are given in the supplementary material. To tackle this behaviour, we propose a "fitting" regularization loss which penalizes the distance between a kernel point and its closest neighbor among the input neighbors. In addition, we also add a "repulsive" regularization loss between all pairs of kernel points whose influence areas overlap, so that they do not collapse together. As a whole, our regularization loss for all convolution locations x ∈ R³ is:

$$L_{\mathrm{reg}} = \sum_x L_{\mathrm{fit}}(x) + L_{\mathrm{rep}}(x) \tag{6}$$

$$L_{\mathrm{fit}}(x) = \sum_{k<K} \min_{y_i} \left( \frac{\lVert y_i - (\tilde{x}_k + \Delta_k(x)) \rVert}{\sigma} \right)^2 \tag{7}$$

$$L_{\mathrm{rep}}(x) = \sum_{k<K} \sum_{l \neq k} h\left(\tilde{x}_k + \Delta_k(x),\, \tilde{x}_l + \Delta_l(x)\right)^2 \tag{8}$$

With this loss, the network generates shifts that fit the local geometry of the input point cloud. We show this effect in the supplementary material.
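The two regularization terms translate almost directly into code. Below is a NumPy sketch for a single convolution location x, assuming the centered neighbors y_i and the shifted kernel points x~_k + ∆_k(x) are already available (names are ours):

```python
import numpy as np

def h_corr(a, b, sigma):
    """Linear correlation of Eq. 3 between broadcastable sets of 3D positions."""
    return np.maximum(0.0, 1.0 - np.linalg.norm(a - b, axis=-1) / sigma)

def fitting_loss(neighbors, shifted_kp, sigma):
    """Eq. 7: every shifted kernel point should have a close input neighbor."""
    d = np.linalg.norm(shifted_kp[:, None, :] - neighbors[None, :, :], axis=-1)
    return np.sum((d.min(axis=1) / sigma) ** 2)

def repulsive_loss(shifted_kp, sigma):
    """Eq. 8: penalize pairs of kernel points whose influence areas overlap."""
    h = h_corr(shifted_kp[:, None, :], shifted_kp[None, :, :], sigma)
    return np.sum(h ** 2) - np.trace(h ** 2)  # drop the l == k diagonal terms
```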

3.3. Kernel Point Network Layers

This section elucidates how we effectively put the KPConv theory into practice. For further details, we have released our code using the TensorFlow library.
Subsampling to deal with varying densities. As explained in the introduction, we use a subsampling strategy to control the density of input points at each layer. To ensure a spatial consistency of the point sampling locations, we favor grid subsampling. Thus, the support points of each layer, carrying the feature locations, are chosen as barycenters of the original input points contained in all non-empty grid cells.
Pooling layer. To create architectures with multiple layer scales, we need to reduce the number of points progressively. As we already have a grid subsampling, we double the cell size at every pooling layer, along with the other related parameters, incrementally increasing the receptive field of KPConv. The features pooled at each new location can either be obtained by a max-pooling or a KPConv. We use the latter in our architectures and call it "strided KPConv", by analogy to the image strided convolution.

KPConv layer. Our convolution layer takes as input the points P ∈ R^{N×3}, their corresponding features F ∈ R^{N×D_in}, and the matrix of neighborhood indices N ∈ [[1, N]]^{N′×n_max}. N′ is the number of locations where the neighborhoods are computed, which can be different from N (in the case of "strided" KPConv). The neighborhood matrix is forced to have the size of the biggest neighborhood, n_max. Because most of the neighborhoods comprise fewer than n_max neighbors, the matrix N thus contains unused elements. We call them shadow neighbors, and they are ignored during the convolution computations.
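One way to realize this fixed-size neighborhood matrix is the usual padding trick sketched below. We assume a convention where shadow slots point to an extra zero feature row, so that gathering them contributes nothing to the convolution sum; the exact convention in the released code may differ.

```python
import numpy as np

def pad_neighborhoods(neighbor_lists, shadow_index):
    """Pack variable-size neighborhoods into an (N', n_max) index matrix,
    filling unused slots with shadow_index."""
    n_max = max(len(idx) for idx in neighbor_lists)
    mat = np.full((len(neighbor_lists), n_max), shadow_index, dtype=np.int64)
    for row, idx in enumerate(neighbor_lists):
        mat[row, :len(idx)] = idx
    return mat

# Append a zero row so that shadow neighbors gather zero features:
# padded_features = np.vstack([features, np.zeros((1, features.shape[1]))])
# neighb_mat = pad_neighborhoods(neighbor_lists, shadow_index=len(features))
```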
Network parameters. Each layer j has a cell size dl_j from which we infer the other parameters. The kernel point influence distance is set as σ_j = Σ × dl_j. For rigid KPConv, the convolution radius is automatically set to 2.5σ_j, given that the average kernel point radius is 1.5σ_j. For deformable KPConv, the convolution radius can be chosen as r_j = ρ × dl_j. Σ and ρ are proportional coefficients set for the whole network. Unless stated otherwise, we will use the following set of parameters, chosen by cross-validation, for all experiments: K = 15, Σ = 1.0 and ρ = 5.0. The first subsampling cell size dl_0 will depend on the dataset and, as stated above, dl_{j+1} = 2 × dl_j.
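As a concrete reading of these hyperparameters, here is a short sketch deriving the per-layer values from dl_0 with the defaults quoted above (the printed values are simple arithmetic, not additional measured settings):

```python
def layer_params(dl0, n_layers, Sigma=1.0, rho=5.0):
    """Per-layer cell size, influence distance and convolution radii."""
    for j in range(n_layers):
        dl = dl0 * 2 ** j            # dl_{j+1} = 2 * dl_j
        sigma = Sigma * dl           # kernel point influence distance
        r_rigid = 2.5 * sigma        # rigid radius: 1.5 sigma average + margin
        r_deform = rho * dl          # deformable convolution radius
        print(f"layer {j}: dl={dl:.2f} m, sigma={sigma:.2f} m, "
              f"r_rigid={r_rigid:.2f} m, r_deform={r_deform:.2f} m")

layer_params(dl0=0.02, n_layers=5)   # ModelNet40 setting: dl_0 = 2 cm
```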
3.4. Kernel Point Network Architectures

Combining analogy with successful image networks and empirical studies, we designed two network architectures for the classification and segmentation tasks. Diagrams detailing both architectures are available in the supplementary material.

KP-CNN is a 5-layer classification convolutional network. Each layer contains two convolutional blocks, the first one being strided except for the first layer. Our convolutional blocks are designed like bottleneck ResNet blocks [12], with a KPConv replacing the image convolution, batch normalization, and leaky ReLU activation. After the last layer, the features are aggregated by a global average pooling and processed by fully connected and softmax layers, as in an image CNN. For the results with deformable KPConv, we only use deformable kernels in the last 5 KPConv blocks (see architecture details in the supplementary material).

KP-FCNN is a fully convolutional network for segmentation. The encoder part is the same as in KP-CNN, and the decoder part uses nearest upsampling to get the final pointwise features. Skip links are used to pass the features between intermediate layers of the encoder and the decoder. Those features are concatenated to the upsampled ones and processed by a unary convolution, which is the equivalent of a 1×1 convolution in images or a shared MLP in PointNet. It is possible to replace the nearest upsampling operation by a KPConv, in the same way as the strided KPConv, but it does not lead to a significant improvement of the performances.
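A sketch of one decoder step follows, assuming a precomputed nearest_idx array that maps each fine point to its closest coarse support point (the names and the plain linear-layer form are ours):

```python
import numpy as np

def nearest_upsample(coarse_features, nearest_idx):
    """Copy each coarse feature to the fine points nearest to it."""
    return coarse_features[nearest_idx]

def unary_conv(features, W, b):
    """Equivalent of a 1x1 image convolution: a shared linear map per point."""
    return features @ W + b

# Decoder block: upsample, concatenate the skip features, then mix channels.
# up = nearest_upsample(coarse_features, nearest_idx)
# out = unary_conv(np.concatenate([up, skip_features], axis=1), W, b)
```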
4. Experiments

4.1. 3D Shape Classification and Segmentation

Data. First, we evaluate our networks on two common model datasets. We use ModelNet40 [43] for classification and ShapeNetPart [45] for part segmentation. ModelNet40 contains 12,311 meshed CAD models from 40 categories. ShapeNetPart is a collection of 16,681 point clouds from 16 categories, each with 2-6 part labels. For benchmarking purposes, we use the data provided by [26]. In both cases, we follow standard train/test splits and rescale objects to fit them into a unit sphere (and consider units to be meters for the rest of this experiment). We ignore normals because they are only available for artificial data.

Classification task. We set the first subsampling grid size to dl_0 = 2 cm. We do not add any feature as input; each input point is assigned a constant feature equal to 1, as opposed to empty space, which can be considered as 0. This constant feature encodes the geometry of the input points. Like [2], our augmentation procedure consists of scaling, flipping and perturbing the points. In this setup, we are able to process 2.9 batches of 16 clouds per second on an Nvidia Titan Xp. Because of our subsampling strategy, the input point clouds do not all have the same number of points, which is not a problem as our networks accept variable input point cloud sizes. On average, a ModelNet40 object point cloud comprises 6,800 points in our framework. The other training parameters are detailed in the supplementary material, along with the architecture details. We also include the number of parameters and the training/inference speeds for both rigid and deformable KPConv.

                      ModelNet40   ShapeNetPart
Methods               OA           mcIoU   mIoU
SPLATNet [33]         -            83.7    85.4
SGPN [41]             -            82.8    85.8
3DmFV-Net [3]         91.6         81.0    84.3
SynSpecCNN [46]       -            82.0    84.7
RSNet [15]            -            81.4    84.9
SpecGCN [39]          91.5         -       85.4
PointNet++ [26]       90.7         81.9    85.1
SO-Net [18]           90.9         81.0    84.9
PCNN by Ext [2]       92.3         81.8    85.1
SpiderCNN [44]        90.5         82.4    85.3
MCConv [13]           90.9         -       85.9
FlexConv [10]         90.2         84.7    85.0
PointCNN [19]         92.2         84.6    86.1
DGCNN [42]            92.2         85.0    84.7
SubSparseCNN [9]      -            83.3    86.0
KPConv rigid          92.9         85.0    86.2
KPConv deform         92.7         85.1    86.4

Table 1. 3D shape classification and segmentation results. For generalizability to real data, we only consider scores obtained without shape normals on the ModelNet40 dataset. The metrics are overall accuracy (OA) for ModelNet40, class average IoU (mcIoU) and instance average IoU (mIoU) for ShapeNetPart.

As shown in Table 1, our networks outperform other state-of-the-art methods using only points (we do not take into account methods using normals as additional input). We also notice that rigid KPConv performances are slightly better. We suspect that this can be explained by the task simplicity. If deformable kernels add more descriptive power, they also increase the overall network complexity, which can disturb the convergence or lead to overfitting on simpler tasks like this shape classification.
Segmentation task. For this task, we use the KP-FCNN architecture with the same parameters as in the classification task, adding the positions (x, y, z) as additional features to the constant 1, and using the same augmentation procedure. We train a single network with multiple heads to segment the parts of each object class. The clouds are smaller (2,300 points on average), and we can process 4.1 batches of 16 shapes per second. Table 1 shows the instance average and the class average mIoU. We detail each class mIoU in the supplementary material. KP-FCNN outperforms all other algorithms, including those using additional inputs like images or normals. Shape segmentation is a more difficult task than shape classification, and we see that KPConv has better performances with deformable kernels.

4.2. 3D Scene Segmentation

Data. Our second experiment shows how our segmentation architecture generalizes to real indoor and outdoor data. To this end, we chose to test our network on 4 datasets of different natures: Scannet [6], for indoor cluttered scenes, S3DIS [1], for indoor large spaces, Semantic3D [11], for outdoor fixed scans, and Paris-Lille-3D [30], for outdoor mobile scans. Scannet contains 1,513 small training scenes and 100 test scenes for online benchmarking, all annotated with 20 semantic classes. S3DIS covers six large-scale indoor areas from three different buildings for a total of 273 million points annotated with 13 classes. Like [36], we advocate the use of Area-5 as the test scene to better measure the generalization ability of our method. Semantic3D is an online benchmark comprising several fixed lidar scans of different outdoor scenes. More than 4 billion points are annotated with 8 classes in this dataset, but they mostly cover ground, building or vegetation, and there are fewer object instances than in the other datasets. We favor the reduced-8 challenge because it is less biased by the objects close to the scanner. Paris-Lille-3D contains more than 2 km of streets in 4 different cities and is also an online benchmark. The 160 million points of this dataset are annotated with 10 semantic classes.

Pipeline for real scene segmentation. The 3D scenes in these datasets are too big to be segmented as a whole. Our KP-FCNN architecture is used to segment small subclouds contained in spheres. At training, the spheres are picked randomly in the scenes. At testing, we pick spheres regularly in the point clouds but ensure each point is tested multiple times by different sphere locations. As in a voting scheme on model datasets, the predicted probabilities for each point are averaged. When datasets are colorized, we use the three color channels as features. We still keep the constant 1 feature to ensure black/dark points are not ignored. To our convolution, a point with all features equal to zero is equivalent to empty space. The input sphere radius is chosen as 50 × dl_0 (in accordance with the ModelNet40 experiment).
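The voting scheme at test time reduces to averaging overlapping sphere predictions. Here is a minimal sketch under the assumption that the network has already produced per-sphere class probabilities (helper names are ours):

```python
import numpy as np

def vote_predictions(n_points, n_classes, sphere_indices, sphere_probs):
    """Average per-point class probabilities over every sphere containing the point.

    sphere_indices: list of index arrays, the points inside each test sphere
    sphere_probs:   list of (len(indices), n_classes) probability arrays
    """
    prob_sum = np.zeros((n_points, n_classes))
    counts = np.zeros(n_points)
    for idx, probs in zip(sphere_indices, sphere_probs):
        prob_sum[idx] += probs
        counts[idx] += 1
    counts = np.maximum(counts, 1.0)   # regular picking should cover every point
    return prob_sum / counts[:, None]  # argmax over classes gives the labels
```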
Results. Because outdoor objects are larger than indoor objects, we use dl_0 = 6 cm on Semantic3D and Paris-Lille-3D, and dl_0 = 4 cm on Scannet and S3DIS. As shown in Table 2, our architecture ranks second on Scannet and outperforms all other segmentation architectures on the other datasets. Compared to other point convolution architectures [2, 19, 40], KPConv performances exceed previous scores by 19 mIoU points on Scannet and 9 mIoU points on S3DIS. The SubSparseCNN score on Scannet was not reported in their original paper [9], so it is hard to compare without knowing their experimental setup. We can notice that, in the same experimental setup on ShapeNetPart segmentation, KPConv outperformed SubSparseCNN by nearly 2 mIoU points.

Methods               Scannet   Sem3D   S3DIS   PL3D
PointNet [25]         -         -       41.1    -
PointNet++ [26]       33.9      -       -       -
SnapNet [4]           -         59.1    -       -
SPLATNet [33]         39.3      -       -       -
SegCloud [36]         -         61.3    48.9    -
RF MSSF [37]          -         62.7    49.8    56.3
Eff3DConv [48]        -         -       51.8    -
TangentConv [35]      43.8      -       52.6    -
MSDVN [29]            -         65.3    54.7    66.9
RSNet [15]            -         -       56.5    -
FCPN [27]             44.7      -       -       -
PointCNN [19]         45.8      -       57.3    -
PCNN [2]              49.8      -       -       -
SPGraph [16]          -         73.2    58.0    -
ParamConv [40]        -         -       58.3    -
SubSparseCNN [9]      72.5      -       -       -
KPConv rigid          68.6      74.6    65.4    72.3
KPConv deform         68.4      73.1    67.1    75.9

Table 2. 3D scene segmentation scores (mIoU). Scannet, Semantic3D and Paris-Lille-3D (PL3D) scores are taken from their respective online benchmarks (reduced-8 challenge for Semantic3D). S3DIS scores are given for Area-5 (see supplementary material for k-fold).

Among these 4 datasets, KPConv deformable kernels improved the results on Paris-Lille-3D and S3DIS, while the rigid version was better on Scannet and Semantic3D. If we follow our assumption, we can explain the lower scores on Semantic3D by the lack of diversity in this dataset. Indeed, despite comprising 15 scenes and 4 billion points, it contains a majority of ground, building and vegetation points and few real objects such as cars or pedestrians. Although this is not the case for Scannet, which comprises more than 1,500 scenes with various objects and shapes, our validation studies are not reflected by the test scores on this benchmark. We found that deformable KPConv outperformed its rigid counterpart on several different validation sets (see Section 4.3). As a conclusion, these results show that the descriptive power of deformable KPConv is useful to the network on large and diverse datasets. We believe KPConv could thrive on larger datasets because its kernel combines a strong descriptive power (compared to other simpler representations, like the linear kernels of [10]) and great learnability (the weights of MLP convolutions like [19, 40] are more complex to learn). An illustration of segmented scenes on Semantic3D and S3DIS is shown in Figure 4. More result visualizations are provided in the supplementary material.

Figure 4. Outdoor and indoor scenes, respectively from Semantic3D and S3DIS, classified by KP-FCNN with deformable kernels.

4.3. Ablation Study

We conduct an ablation study to support our claim that deformable KPConv has a stronger descriptive power than rigid KPConv. The idea is to impede the capabilities of the network, in order to reveal the real potential of deformable kernels. We use the Scannet dataset (same parameters as before) and the official validation set, because the test set cannot be used for such evaluations. As depicted in Figure 5, deformable KPConv only loses 1.5% mIoU when restricted to 4 kernel points. In the same configuration, rigid KPConv loses 3.5% mIoU.

Figure 5. Ablation study on the Scannet validation set. Evolution of the mIoU when reducing the number of kernel points.

As stated in Section 4.2, we can also see that deformable KPConv performs better than rigid KPConv with 15 kernel points. Although this is not the case on the test set, we tried different validation sets that confirmed the superior performances of deformable KPConv. This is not surprising, as we obtained the same results on S3DIS. Deformable KPConv seems to thrive on indoor datasets, which offer more diversity than outdoor datasets. To understand why, we need to go beyond numbers and see what is effectively learned by the two versions of KPConv.

4.4. Learned Features and Effective Receptive Field

To achieve a deeper understanding of KPConv, we offer two insights into the learning mechanisms.

Learned features. Our first idea was to visualize the features learned by our network. In this experiment, we trained KP-CNN on ModelNet40 with rigid KPConv. We added random rotation augmentations around the vertical axis to increase the input shape diversity. Then we visualize each learned feature by coloring the points according to their level of activation for this feature. In Figure 6, we chose input point clouds maximizing the activation for different features at the first and third layer. For a cleaner display, we projected the activations from the layer's subsampled points to the original input points.
Figure 6. Low and high level features learned in KP-CNN. Each feature is displayed on 2 input point clouds taken from ModelNet40. High activations are in red and low activations in blue.

We observe that, in its first layer, the network is able to learn low-level features like vertical/horizontal planes (a/b), linear structures (c), or corners (d). In the later layers, the network detects more complex shapes like small buttresses (e), balls (f), cones (g), or stairs (h). However, it is difficult to see a difference between rigid and deformable KPConv. This tool is very useful to understand what KPConv can learn in general, but we need another one to compare the two versions.

Effective Receptive Field. To apprehend the differences between the representations learned by rigid and deformable KPConv, we can compute the Effective Receptive Field (ERF) [21] at different locations. The ERF is a measure of the influence that each input point has on the result of a KPConv layer at a particular location. It is computed as the gradient of the KPConv responses at this particular location with respect to the input point features. As we can see in Figure 7, the ERF varies depending on the object it is centered on. We see that the rigid KPConv ERF has a relatively consistent range on every type of object, whereas the deformable KPConv ERF seems to adapt to the object size. Indeed, it covers the whole bed, and concentrates more on the chair than on the surrounding ground. When centered on a flat surface, it also seems to ignore most of it and reach for further details in the scene. This adaptive behavior shows that deformable KPConv improves the network's ability to adapt to the geometry of the scene objects, and explains the better performances on indoor datasets.

Figure 7. KPConv ERF at layer 4 of KP-FCNN, trained on Scannet. The green dots represent the ERF centers. ERF values are merged with scene colors as red intensity. The more red a point is, the more influence it has on the green point features.
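In code, the ERF boils down to one backward pass. Below is a minimal sketch using PyTorch autograd for brevity (the paper's implementation is in TensorFlow, and layer_response here stands for any function returning the features of the studied KPConv layer):

```python
import torch

def effective_receptive_field(layer_response, points, features, center_idx):
    """ERF: gradient of a layer's response at one location w.r.t. input features."""
    features = features.clone().requires_grad_(True)
    responses = layer_response(points, features)  # (N', D) layer features
    responses[center_idx].sum().backward()        # scalar probe at the chosen location
    return features.grad.norm(dim=1)              # (N,) influence of each input point
```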
5. Conclusion

In this work, we propose KPConv, a convolution that operates on point clouds. KPConv takes radius neighborhoods as input and processes them with weights spatially located by a small set of kernel points. We define a deformable version of this convolution operator that learns local shifts, effectively deforming the convolution kernels to make them fit the point cloud geometry. Depending on the diversity of the datasets, or the chosen network configuration, deformable and rigid KPConv are both valuable, and our networks brought new state-of-the-art performances for nearly every tested dataset. We release our source code, hoping to help further research on point cloud convolutional architectures. Beyond the proposed classification and segmentation networks, KPConv can be used in any other application addressed by CNNs. We believe that deformable convolutions can thrive in larger datasets or challenging tasks such as object detection, lidar flow computation, or point cloud completion.

Acknowledgement. The authors gratefully acknowledge the support of ONR MURI grant N00014-13-1-0341, NSF grant IIS-1763268, a Vannevar Bush Faculty Fellowship, and a gift from the Adobe and Autodesk corporations.
References

[1] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016. http://buildingparser.stanford.edu/dataset.html.

[2] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. ACM Transactions on Graphics (TOG), 37(4):71, 2018.

[3] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. 3dmfv: Three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters, 3(4):3145–3152, 2018.

[4] Alexandre Boulch, Bertrand Le Saux, and Nicolas Audebert. Unstructured point cloud semantic labeling using deep segmentation networks. In Proceedings of the Workshop on 3D Object Retrieval (3DOR), 2017.

[5] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[6] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017. http://kaldir.vc.in.tum.de/scannet_benchmark.

[7] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.

[8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[9] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9224–9232, 2018.

[10] Fabian Groh, Patrick Wieschollek, and Hendrik P.A. Lensch. Flex-convolution. In Asian Conference on Computer Vision, pages 105–122. Springer, 2018.

[11] Timo Hackel, Nikolay Savinov, Lubor Ladicky, Jan D. Wegner, Konrad Schindler, and Marc Pollefeys. Semantic3d.net: A new large-scale point cloud classification benchmark. arXiv preprint arXiv:1704.03847, 2017. http://www.semantic3d.net.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[13] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Álvaro Vinacua, and Timo Ropinski. Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Transactions on Graphics (TOG), 37(6):235, 2018.

[14] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 984–993, 2018.

[15] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.

[16] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018.

[17] Felix Järemo Lawin, Martin Danelljan, Patrik Tosteberg, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Deep projective 3d semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pages 95–107. Springer, 2017.

[18] Jiaxin Li, Ben M. Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9397–9406, 2018.

[19] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.

[20] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. arXiv preprint arXiv:1811.02565, 2018.

[21] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 4898–4906, 2016.

[22] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.

[23] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.

[24] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.

[25] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[26] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[27] Dario Rethage, Johanna Wald, Jurgen Sturm, Nassir Navab, and Federico Tombari. Fully-convolutional point networks for large-scale point clouds. In Proceedings of the European Conference on Computer Vision (ECCV), pages 596–611, 2018.

[28] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[29] Xavier Roynard, Jean-Emmanuel Deschaud, and François Goulette. Classification of point cloud scenes with multiscale voxel deep network. arXiv preprint arXiv:1804.03583, 2018.

[30] Xavier Roynard, Jean-Emmanuel Deschaud, and François Goulette. Paris-lille-3d: A large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification. The International Journal of Robotics Research, 37(6):545–557, 2018. http://npm3d.fr.

[31] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[32] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3693–3702, 2017.

[33] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.

[34] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[35] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.

[36] Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In 2017 International Conference on 3D Vision (3DV), pages 537–547. IEEE, 2017.

[37] Hugues Thomas, François Goulette, Jean-Emmanuel Deschaud, and Beatriz Marcotegui. Semantic classification of 3d point clouds with multiscale spherical neighborhoods. In 2018 International Conference on 3D Vision (3DV), pages 390–398. IEEE, 2018.

[38] Nitika Verma, Edmond Boyer, and Jakob Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2598–2606, 2018.

[39] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–66, 2018.

[40] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2589–2597, 2018.

[41] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2569–2578, 2018.

[42] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 2019.

[43] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[44] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.

[45] Li Yi, Vladimir G. Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, Leonidas J. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.

[46] Li Yi, Hao Su, Xingwen Guo, and Leonidas J. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6584–6592, 2017.

[47] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R. Salakhutdinov, and Alexander J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.

[48] Chris Zhang, Wenjie Luo, and Raquel Urtasun. Efficient convolutions for real-time semantic segmentation of 3d point clouds. In 2018 International Conference on 3D Vision (3DV), pages 399–408. IEEE, 2018.
