posals, while the PointNet-based set abstraction operation preserves accurate location information with flexible receptive fields. We argue that the integration of these two types of feature learning frameworks can help learn more discriminative features for accurate, fine-grained box refinement.

The main challenge is how to effectively combine the two types of feature learning schemes, specifically the 3D voxel CNN with sparse convolutions [6, 5] and the PointNet-based set abstraction [24], into a unified framework. An intuitive solution would be to uniformly sample several grid points within each 3D proposal and adopt the set abstraction to aggregate the 3D voxel-wise features surrounding these grid points for proposal refinement. However, this strategy is highly memory-intensive, since both the number of voxels and the number of grid points could be quite large to achieve satisfactory performance.

Therefore, to better integrate these two types of point cloud feature learning networks, we propose a two-step strategy with a first voxel-to-keypoint scene encoding step and a second keypoint-to-grid RoI feature abstraction step. Specifically, a voxel CNN with 3D sparse convolution is adopted for voxel-wise feature learning and accurate proposal generation. To mitigate the above-mentioned issue of requiring too many voxels for encoding the whole scene, a small set of keypoints is selected by farthest point sampling (FPS) to summarize the overall 3D information from the voxel-wise features. The features of each keypoint are aggregated by grouping the neighboring voxel-wise features via PointNet-based set abstraction, which summarizes multi-scale point cloud information. In this way, the overall scene can be effectively and efficiently encoded by a small number of keypoints with associated multi-scale features.
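As a concrete illustration of the keypoint selection step, the sketch below implements plain farthest point sampling over an (N, 3) array of point coordinates. It is a minimal NumPy version for illustration only; the sampled count n and the input array name in the usage line are placeholders rather than values prescribed by this paper.

import numpy as np

def farthest_point_sampling(points: np.ndarray, n: int) -> np.ndarray:
    """Greedy FPS: iteratively pick the point farthest from the chosen set.

    points: (N, 3) array of xyz coordinates.
    n:      number of keypoints to sample (n <= N).
    Returns the indices of the sampled keypoints, shape (n,).
    """
    num_points = points.shape[0]
    selected = np.zeros(n, dtype=np.int64)
    # Distance from every point to the nearest already-selected keypoint.
    min_dist = np.full(num_points, np.inf)
    farthest = 0  # start from an arbitrary point
    for i in range(n):
        selected[i] = farthest
        diff = points - points[farthest]
        dist = np.sum(diff * diff, axis=1)
        min_dist = np.minimum(min_dist, dist)
        farthest = int(np.argmax(min_dist))
    return selected

# Usage sketch (illustrative keypoint count):
# keypoints = lidar_points[farthest_point_sampling(lidar_points[:, :3], n=2048)]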
For the second keypoint-to-grid RoI feature abstraction step, given each box proposal with its grid point locations, a RoI-grid pooling module is proposed, where a keypoint set abstraction layer with multiple radii is adopted for each grid point to aggregate the features from the keypoints with multi-scale context. All grid points' aggregated features can then be jointly used for the succeeding proposal refinement.

Our proposed PV-RCNN effectively takes advantage of both point-based and voxel-based networks to encode discriminative features at each box proposal for accurate confidence prediction and fine-grained box refinement.

Our contributions are four-fold. (1) We propose the PV-RCNN framework, which effectively takes advantage of both the voxel-based and point-based methods for 3D point-cloud feature learning, leading to improved 3D object detection performance with manageable memory consumption. (2) We propose the voxel-to-keypoint scene encoding scheme, which encodes multi-scale voxel features of the whole scene into a small set of keypoints by the voxel set abstraction layer. These keypoint features not only preserve accurate location but also encode rich scene context, which boosts the 3D detection performance significantly. (3) We propose a multi-scale RoI feature abstraction layer for the grid points in each proposal, which aggregates richer context information from the scene with multiple receptive fields for accurate box refinement and confidence prediction. (4) Our proposed method PV-RCNN outperforms all previous methods with remarkable margins and ranks 1st on the highly competitive KITTI 3D detection benchmark [10], and also surpasses previous methods on the large-scale Waymo Open Dataset with a large margin.

2. Related Work

3D Object Detection with Grid-based Methods. To tackle the irregular data format of point clouds, most existing works project the point clouds to regular grids to be processed by 2D or 3D CNNs. The pioneering work MV3D [1] projects the point clouds to 2D bird-view grids and places lots of predefined 3D anchors for generating 3D bounding boxes; the following works [11, 17, 16] develop better strategies for multi-sensor fusion, while [36, 35, 12] propose more efficient frameworks with the bird-view representation. Some other works [27, 41] divide the point clouds into 3D voxels to be processed by 3D CNNs, and 3D sparse convolution [5] is introduced in [34] for efficient 3D voxel processing. [30, 42] utilize multiple detection heads, while [26] explores the object part locations for improving the performance. These grid-based methods are generally efficient for accurate 3D proposal generation, but their receptive fields are constrained by the kernel size of the 2D/3D convolutions.

3D Object Detection with Point-based Methods. F-PointNet [22] first proposes to apply PointNet [23, 24] for 3D detection on point clouds cropped with 2D image bounding boxes. PointRCNN [25] generates 3D proposals directly from the whole point cloud instead of 2D images, performing 3D detection with point clouds only, and the following work STD [37] proposes a sparse-to-dense strategy for better proposal refinement. [21] proposes a Hough voting strategy for better object feature grouping. These point-based methods are mostly built on the PointNet series, especially the set abstraction operation [24], which enables flexible receptive fields for point cloud feature learning.

Representation Learning on Point Clouds. Recently, representation learning on point clouds has drawn lots of attention for improving the performance of point cloud classification and segmentation [23, 24, 41, 31, 7, 38, 15, 28, 33, 8, 29, 3]. In terms of 3D detection, previous methods generally project the point clouds to regular bird-view grids [1, 36] or 3D voxels [41, 2] for processing with 2D/3D CNNs. 3D sparse convolutions [6, 5] are adopted in [34, 26] to effectively learn sparse voxel-wise features from the point clouds. Qi et al. [23, 24] propose PointNet to directly learn point-wise features from the raw point clouds, where the set abstraction operation enables flexible receptive fields by setting different search radii.
[19] combines both the voxel-based CNN and the point-based shared MLP for efficient point cloud feature learning. In comparison, our proposed PV-RCNN takes advantage of both voxel-based feature learning (i.e., 3D sparse convolution) and PointNet-based feature learning (i.e., the set abstraction operation) to enable both high-quality 3D proposal generation and flexible receptive fields for improving the 3D detection performance.

3. PV-RCNN for Point Cloud Object Detection

In this paper, we propose PointVoxel-RCNN (PV-RCNN), a two-stage 3D detection framework aiming at more accurate 3D object detection from point clouds. State-of-the-art 3D detection approaches are based on either a 3D voxel CNN with sparse convolution or PointNet-based networks as the backbone. Generally, the 3D voxel CNNs with sparse convolution are more efficient [34, 26] and are able to generate high-quality 3D object proposals, while the PointNet-based methods can capture more accurate contextual information with flexible receptive fields.

Our PV-RCNN deeply integrates the advantages of the two types of networks. As illustrated in Fig. 2, PV-RCNN adopts a 3D voxel CNN with sparse convolution as the backbone for efficient feature encoding and proposal generation. Given each 3D object proposal, to effectively pool its corresponding features from the scene, we propose two novel operations: the voxel-to-keypoint scene encoding, which summarizes all the voxels of the overall scene feature volumes into a small number of feature keypoints, and the keypoint-to-grid RoI feature abstraction, which effectively aggregates the scene keypoint features to RoI grids for proposal confidence prediction and location refinement.

3.1. 3D Voxel CNN for Efficient Feature Encoding and Proposal Generation

Voxel CNNs with 3D sparse convolution [6, 5, 34, 26] are a popular choice of state-of-the-art 3D detectors for efficiently converting the point clouds into sparse 3D feature volumes. Because of its high efficiency and accuracy, we adopt this architecture as the backbone of our framework for feature encoding and 3D proposal generation.

3D voxel CNN. The input points P are first divided into small voxels with a spatial resolution of L × W × H, where the features of the non-empty voxels are directly calculated as the mean of the point-wise features of all the points inside them. The commonly used features are the 3D coordinates and the reflectance intensities. The network utilizes a series of 3 × 3 × 3 3D sparse convolutions to gradually convert the point clouds into feature volumes with 1×, 2×, 4×, 8× downsampled sizes. Such sparse feature volumes at each level can be viewed as a set of voxel-wise feature vectors.
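To make the voxelization step concrete, the following NumPy sketch groups raw points into voxels and averages the point-wise features (x, y, z, intensity) inside each non-empty voxel, as described above. The function is an illustration of the data preparation only, not the released implementation; the usage line reuses the KITTI voxel size and detection range reported later in Sec. 4.1.

import numpy as np

def voxelize_mean(points: np.ndarray, voxel_size, pc_range):
    """Average point-wise features inside each non-empty voxel.

    points:     (N, C) array whose first three columns are x, y, z
                (e.g., C = 4 with reflectance intensity as the 4th column).
    voxel_size: (3,) voxel edge lengths along x, y, z.
    pc_range:   (6,) [x_min, y_min, z_min, x_max, y_max, z_max].
    Returns (voxel_coords, voxel_features) for the non-empty voxels.
    """
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    pc_range = np.asarray(pc_range, dtype=np.float32)
    # Keep only points inside the detection range.
    mask = np.all((points[:, :3] >= pc_range[:3]) & (points[:, :3] < pc_range[3:]), axis=1)
    points = points[mask].astype(np.float32)
    # Integer voxel index of every point.
    coords = np.floor((points[:, :3] - pc_range[:3]) / voxel_size).astype(np.int64)
    # Group points sharing a voxel index and average their features.
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    feats = np.zeros((uniq.shape[0], points.shape[1]), dtype=np.float32)
    counts = np.bincount(inverse, minlength=uniq.shape[0]).astype(np.float32)
    np.add.at(feats, inverse, points)
    feats /= counts[:, None]
    return uniq, feats

# Usage sketch with the KITTI settings of Sec. 4.1:
# coords, feats = voxelize_mean(lidar_points, voxel_size=(0.05, 0.05, 0.1),
#                               pc_range=(0, -40, -3, 70.4, 40, 1))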
3D proposal generation. By converting the encoded 8× downsampled 3D feature volumes into 2D bird-view feature maps, high-quality 3D proposals are generated following the anchor-based approaches [34, 12]. Specifically, we stack the 3D feature volume along the Z axis to obtain the L/8 × W/8 bird-view feature maps. Each class has 2 × L/8 × W/8 3D anchor boxes, which adopt the average 3D object size of this class, and two anchors with 0° and 90° orientations are evaluated for each pixel of the bird-view feature maps. As shown in Table 4, the adopted 3D voxel CNN backbone with the anchor-based scheme achieves higher recall performance than the PointNet-based approaches [25, 37].
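The anchor layout described above can be written out explicitly. The sketch below builds, for one class, two anchors (0° and 90° yaw) centered at every pixel of the L/8 × W/8 bird-view map. The class-average size and anchor height are placeholder arguments, and the (x, y, z, l, w, h, θ) box parameterization is only one common convention assumed here for illustration, not necessarily the exact one of the released code.

import numpy as np

def build_bev_anchors(bev_shape, bev_stride, voxel_size, pc_range,
                      class_size, anchor_z):
    """Generate 2 anchors (0 and 90 degree yaw) per BEV pixel for one class.

    bev_shape:  (rows, cols) of the 8x-downsampled bird-view feature map.
    bev_stride: downsample factor of the BEV map w.r.t. the voxel grid (8 here).
    voxel_size: (vx, vy) voxel edge lengths along x and y.
    pc_range:   (x_min, y_min) origin of the voxelized area.
    class_size: (l, w, h) average object size of this class.
    anchor_z:   z-center used for all anchors of this class.
    Returns an array of shape (rows * cols * 2, 7) with rows (x, y, z, l, w, h, yaw).
    """
    rows, cols = bev_shape
    vx, vy = voxel_size
    # Real-world centers of every BEV pixel.
    xs = pc_range[0] + (np.arange(cols) + 0.5) * vx * bev_stride
    ys = pc_range[1] + (np.arange(rows) + 0.5) * vy * bev_stride
    cx, cy = np.meshgrid(xs, ys)                       # (rows, cols)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)
    num = centers.shape[0]
    anchors = []
    for yaw in (0.0, np.pi / 2):                       # two orientations per pixel
        a = np.zeros((num, 7), dtype=np.float32)
        a[:, 0:2] = centers
        a[:, 2] = anchor_z
        a[:, 3:6] = class_size
        a[:, 6] = yaw
        anchors.append(a)
    return np.concatenate(anchors, axis=0)             # (rows * cols * 2, 7)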
Discussions. State-of-the-art detectors mostly adopt two-stage frameworks, which require pooling RoI-specific features from the resulting 3D feature volumes or 2D maps for further proposal refinement. However, these 3D feature volumes from the 3D voxel CNN have major limitations in the following aspects. (i) These feature volumes are generally of low spatial resolution as they are downsampled by up to 8 times, which hinders accurate localization of objects in the input scene. (ii) Even if one can upsample to obtain feature volumes/maps of larger spatial sizes, they are generally still quite sparse. The commonly used trilinear or bilinear interpolation in the RoIPooling/RoIAlign operations can only extract features from very small neighborhoods (i.e., 4 and 8 nearest neighbors for bilinear and trilinear interpolation, respectively). The conventional pooling approaches would therefore obtain features with mostly zeros and waste much computation and memory on the stage-2 refinement.

On the other hand, the set abstraction operation proposed in the variants of PointNet [23, 24] has shown a strong capability of encoding feature points from a neighborhood of arbitrary size. We therefore propose to integrate a 3D voxel CNN with a series of set abstraction operations for conducting accurate and robust stage-2 proposal refinement.

A naive solution of using the set abstraction operation for pooling the scene feature voxels would be to directly aggregate the multi-scale feature volumes of a scene to the RoI grids. However, this intuitive strategy occupies much memory and is inefficient in practice. For instance, a common scene from the KITTI dataset might result in 18,000 voxels in the 4× downsampled feature volumes. If one uses 100 box proposals for each scene and each box proposal has 3 × 3 × 3 grids, the 2,700 × 18,000 pairwise distances and feature aggregations cannot be efficiently computed, even after distance thresholding.

To tackle this issue, we propose a two-step approach: first encode the voxels at different neural layers of the entire scene into a small number of keypoints, and then aggregate the keypoint features to the RoI grids for box proposal refinement.
[Figure 2: raw point cloud → 3D sparse convolution backbone with RPN producing 3D box proposals; keypoints sampled by FPS → voxel set abstraction module with the predicted keypoint weighting module → RoI-grid pooling module → FC (256, 256) heads for confidence prediction and box refinement.]
Figure 2. The overall architecture of our proposed PV-RCNN. The raw point clouds are first voxelized and fed into the 3D sparse-convolution-based encoder to learn multi-scale semantic features and generate 3D object proposals. Then the learned voxel-wise feature volumes at multiple neural layers are summarized into a small set of keypoints via the novel voxel set abstraction module. Finally, the keypoint features are aggregated to the RoI-grid points to learn proposal-specific features for fine-grained proposal refinement and confidence prediction.
3.2. Voxel-to-keypoint Scene Encoding via Voxel Set Abstraction

Our proposed framework first aggregates the voxels at the multiple neural layers representing the entire scene into a small number of keypoints K = {p_1, ..., p_n}, which are sampled from the raw point cloud P by the farthest point sampling (FPS) algorithm so that they are roughly uniformly distributed over the scene and serve as the intermediate representation between the 3D voxel CNN and the proposal refinement network.

Voxel Set Abstraction Module. We denote F^{(l_k)} = {f_1^{(l_k)}, ..., f_{N_k}^{(l_k)}} as the set of voxel-wise feature vectors in the k-th level of the 3D voxel CNN, and V^{(l_k)} = {v_1^{(l_k)}, ..., v_{N_k}^{(l_k)}} as their 3D coordinates, calculated from the voxel indices and the actual voxel sizes of the k-th level, where N_k is the number of non-empty voxels in the k-th level. For each keypoint p_i, we first identify its neighboring non-empty voxels at the k-th level within a radius r_k to retrieve the set of voxel-wise feature vectors as

S_i^{(l_k)} = { [f_j^{(l_k)}; v_j^{(l_k)} − p_i]^T  |  ‖v_j^{(l_k)} − p_i‖^2 < r_k,  ∀v_j^{(l_k)} ∈ V^{(l_k)},  ∀f_j^{(l_k)} ∈ F^{(l_k)} },   (1)

where we concatenate the local relative coordinates v_j^{(l_k)} − p_i to indicate the relative location of the semantic voxel feature f_j^{(l_k)}. The voxel-wise features within the neighboring voxel set S_i^{(l_k)} of p_i are then transformed by a PointNet-block [23] to generate the feature of keypoint p_i as

f_i^{(pv_k)} = max{ G(M(S_i^{(l_k)})) },   (2)

where M(·) denotes randomly sampling at most a fixed number of voxels from the neighboring set S_i^{(l_k)} to save computation, G(·) denotes a multi-layer perceptron network that encodes the voxel-wise features and relative locations, and max{·} is the channel-wise max-pooling that maps the neighboring voxel set to a single feature vector. The features aggregated from the different levels of the 3D voxel CNN are concatenated as the multi-scale semantic feature of keypoint p_i,

f_i^{(pv)} = [f_i^{(pv_1)}, f_i^{(pv_2)}, f_i^{(pv_3)}, f_i^{(pv_4)}],  for i = 1, ..., n,   (3)

which combines both the 3D voxel CNN-based feature learning from the voxel-wise features f_j^{(l_k)} and the PointNet-based feature learning from the voxel set abstraction of Eq. (2). Besides, the 3D coordinate of p_i also preserves accurate location information.
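A minimal NumPy sketch of the per-keypoint set abstraction of Eqs. (1)–(2) is given below. It gathers the voxel features within radius r_k of each keypoint, concatenates the relative coordinates, and reduces the neighborhood with a caller-supplied point-wise encoder followed by max-pooling. The random subsampling cap and the identity default encoder are illustrative stand-ins for the shared MLP G(·), so this is a sketch of the data flow rather than the paper's exact module.

import numpy as np

def voxel_set_abstraction(keypoints, voxel_xyz, voxel_feats, radius,
                          max_neighbors=32, encoder=None, rng=None):
    """Aggregate voxel-wise features around each keypoint (Eqs. (1)-(2)).

    keypoints:   (n, 3) keypoint coordinates.
    voxel_xyz:   (N, 3) centers of the non-empty voxels of one CNN level.
    voxel_feats: (N, C) voxel-wise feature vectors of that level.
    radius:      neighborhood radius r_k of this level.
    encoder:     callable mapping (m, C + 3) -> (m, C') applied point-wise;
                 identity by default (stands in for the shared MLP G).
    Returns (n, C') keypoint features obtained by channel-wise max-pooling.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    encoder = (lambda x: x) if encoder is None else encoder
    out_dim = encoder(np.zeros((1, voxel_feats.shape[1] + 3))).shape[1]
    out = np.zeros((keypoints.shape[0], out_dim), dtype=np.float32)
    for i, p in enumerate(keypoints):
        d2 = np.sum((voxel_xyz - p) ** 2, axis=1)
        idx = np.where(d2 < radius ** 2)[0]
        if idx.size == 0:
            continue                       # empty neighborhood -> zero feature
        if idx.size > max_neighbors:       # M(.): random subsampling for speed
            idx = rng.choice(idx, size=max_neighbors, replace=False)
        # Eq. (1): concatenate voxel features with relative coordinates.
        group = np.concatenate([voxel_feats[idx], voxel_xyz[idx] - p], axis=1)
        # Eq. (2): point-wise encoder G(.) followed by channel-wise max-pooling.
        out[i] = encoder(group).max(axis=0)
    return out

# The multi-scale keypoint feature of Eq. (3) is the concatenation of the
# outputs of this routine over the four sparse-CNN levels (radii r_1 .. r_4).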
Extended VSA Module. We extend the VSA module by further enriching the keypoint features with the raw point cloud P and the 8× downsampled 2D bird-view feature maps (as described in Sec. 3.1), where the raw point cloud partially makes up for the quantization loss of the initial point-cloud voxelization, while the 2D bird-view maps have larger receptive fields along the Z axis. The raw point-cloud feature f_i^{(raw)} is also aggregated as in Eq. (2). For the bird-view feature maps, we project the keypoint p_i to the 2D bird-view coordinate system and utilize bilinear interpolation to obtain the feature f_i^{(bev)} from the bird-view feature maps. Hence, the keypoint feature of p_i is further enriched by concatenating all its associated features,

f_i^{(p)} = [f_i^{(pv)}, f_i^{(raw)}, f_i^{(bev)}],  for i = 1, ..., n,   (4)

which has the strong capability of preserving the 3D structural information of the entire scene and also boosts the final detection performance by large margins.
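The bird-view feature f_i^{(bev)} described above can be obtained with standard bilinear interpolation. The sketch below projects a keypoint onto the 8×-downsampled BEV map and interpolates the four surrounding feature pixels; the map layout (rows indexing x, columns indexing y) is an assumption of this illustration and may differ from the released implementation.

import numpy as np

def bev_bilinear_interpolate(bev_feats, keypoints_xy, voxel_size_xy,
                             pc_range_xy, bev_stride=8):
    """Sample BEV features at the projected keypoint locations.

    bev_feats:     (H, W, C) 8x-downsampled bird-view feature map,
                   assumed indexed as [x_index, y_index, channel].
    keypoints_xy:  (n, 2) keypoint x, y coordinates in meters.
    voxel_size_xy: (vx, vy) voxel sizes along x and y.
    pc_range_xy:   (x_min, y_min) origin of the voxelized area.
    Returns (n, C) bilinearly interpolated features.
    """
    H, W, C = bev_feats.shape
    # Continuous pixel coordinates of the keypoints on the BEV map.
    u = (keypoints_xy[:, 0] - pc_range_xy[0]) / (voxel_size_xy[0] * bev_stride)
    v = (keypoints_xy[:, 1] - pc_range_xy[1]) / (voxel_size_xy[1] * bev_stride)
    u0 = np.clip(np.floor(u).astype(int), 0, H - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, W - 2)
    du = np.clip(u - u0, 0.0, 1.0)[:, None]
    dv = np.clip(v - v0, 0.0, 1.0)[:, None]
    # Weighted sum of the four neighboring BEV pixels.
    f00 = bev_feats[u0, v0]
    f01 = bev_feats[u0, v0 + 1]
    f10 = bev_feats[u0 + 1, v0]
    f11 = bev_feats[u0 + 1, v0 + 1]
    return ((1 - du) * (1 - dv) * f00 + (1 - du) * dv * f01
            + du * (1 - dv) * f10 + du * dv * f11)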
[Figure 3: the keypoint coordinates (n × 3) are checked against the 3D ground-truth boxes to generate foreground training labels, which supervise (via focal loss) a sigmoid branch that predicts an n × 1 foreground confidence used to re-weight the n × C keypoint features.]
Figure 3. Illustration of the Predicted Keypoint Weighting module.

Predicted Keypoint Weighting. After the overall scene is encoded by a small number of keypoints, they will be further utilized by the succeeding stage for conducting proposal refinement. The keypoints are chosen by the farthest point sampling strategy, and some of them might only represent the background regions. Intuitively, keypoints belonging to the foreground objects should contribute more to the accurate refinement of the proposals, while the ones from the background regions should contribute less.

Hence, we propose a Predicted Keypoint Weighting (PKW) module (see Fig. 3) to re-weight the keypoint features with extra supervision from point-cloud segmentation. The segmentation labels can be directly generated from the 3D detection box annotations, i.e., by checking whether each keypoint is inside or outside of a ground-truth 3D box, since the 3D objects in autonomous driving scenes are naturally separated in 3D space. The predicted feature weighting for each keypoint feature f_i^{(p)} can be formulated as

f̃_i^{(p)} = A(f_i^{(p)}) · f_i^{(p)},   (5)

where A(·) is a three-layer MLP network with a sigmoid function that predicts a foreground confidence in [0, 1]. The PKW module is trained with focal loss [18] using its default hyper-parameters, to handle the imbalanced numbers of foreground/background points in the training set.
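A compact sketch of the PKW supervision and re-weighting is shown below. Foreground labels are produced by testing whether each keypoint falls inside any yaw-rotated ground-truth box, and the predicted sigmoid confidence scales the keypoint features as in Eq. (5). The three-layer MLP A(·) is abstracted into a caller-supplied scoring function, so this illustrates the data flow under those assumptions rather than the exact network.

import numpy as np

def keypoint_foreground_labels(keypoints, gt_boxes):
    """Label a keypoint as foreground (1) if it lies inside any GT box.

    keypoints: (n, 3) xyz coordinates.
    gt_boxes:  (B, 7) boxes as (cx, cy, cz, l, w, h, yaw).
    """
    labels = np.zeros(keypoints.shape[0], dtype=np.float32)
    for cx, cy, cz, l, w, h, yaw in gt_boxes:
        # Rotate keypoints into the box's local frame (inverse yaw around z).
        c, s = np.cos(-yaw), np.sin(-yaw)
        dx = keypoints[:, 0] - cx
        dy = keypoints[:, 1] - cy
        lx = c * dx - s * dy
        ly = s * dx + c * dy
        lz = keypoints[:, 2] - cz
        inside = (np.abs(lx) <= l / 2) & (np.abs(ly) <= w / 2) & (np.abs(lz) <= h / 2)
        labels[inside] = 1.0
    return labels

def pkw_reweight(keypoint_feats, foreground_score_fn):
    """Eq. (5): scale each keypoint feature by its predicted foreground confidence."""
    scores = foreground_score_fn(keypoint_feats)          # (n,) values in [0, 1]
    return keypoint_feats * scores[:, None]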
3.3. Keypoint-to-grid RoI Feature Abstraction for Proposal Refinement

In the previous step, the whole scene is summarized into a small number of keypoints with multi-scale semantic features. Given each 3D proposal (RoI) generated by the 3D voxel CNN, the features of each RoI need to be aggregated from the keypoint features F̃ = {f̃_1^{(p)}, ..., f̃_n^{(p)}} for accurate and robust proposal refinement. We propose the keypoint-to-grid RoI feature abstraction, based on the set abstraction operation, for multi-scale RoI feature encoding.

[Figure 4: RoI-grid point features; the legend distinguishes grid points, keypoints and raw points.]
Figure 4. Illustration of the RoI-grid pooling module. Rich context information of each 3D RoI is aggregated by the set abstraction operation with multiple receptive fields.

RoI-grid Pooling via Set Abstraction. Given each 3D RoI, as shown in Fig. 4, we propose the RoI-grid pooling module to aggregate the keypoint features to the RoI-grid points with multiple receptive fields. We uniformly sample 6 × 6 × 6 grid points within each 3D proposal, denoted as G = {g_1, ..., g_216}. The set abstraction operation is adopted to aggregate the features of the grid points from the keypoint features. Specifically, we first identify the neighboring keypoints of a grid point g_i within a radius r̃ as

Ψ̃ = { [f̃_j^{(p)}; p_j − g_i]^T  |  ‖p_j − g_i‖^2 < r̃,  ∀p_j ∈ K,  ∀f̃_j^{(p)} ∈ F̃ },   (6)

where p_j − g_i is appended to indicate the local relative location of the feature f̃_j^{(p)} from keypoint p_j. A PointNet-block [23] is then adopted to aggregate the neighboring keypoint feature set Ψ̃ and generate the feature of grid point g_i as

f̃_i^{(g)} = max{ G(M(Ψ̃)) },   (7)

where M(·) and G(·) are defined in the same way as in Eq. (2). We set multiple radii r̃ and aggregate keypoint features with different receptive fields, which are concatenated together to capture richer multi-scale contextual information.

After obtaining the aggregated features of each grid point from its surrounding keypoints, all the RoI-grid features of the same RoI can be vectorized and transformed by a two-layer MLP with 256 feature dimensions to represent the overall proposal.

Compared with the point-cloud 3D RoI pooling operations of previous works [25, 37, 26], our proposed RoI-grid pooling operation, which targets the keypoints, is able to capture much richer contextual information with flexible receptive fields, where the receptive fields can even reach beyond the RoI boundaries to capture the surrounding keypoint features outside the 3D RoI. In contrast, the previous state-of-the-art methods either simply average all point-wise features within the proposal as the RoI feature [25], or pool many uninformative zeros as the RoI features [26, 37].
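For the grid-point generation step of the RoI-grid pooling module, the sketch below uniformly places a 6 × 6 × 6 lattice inside a rotated 3D proposal box; the resulting points can then be fed to the same kind of set-abstraction routine sketched after Sec. 3.2 (Eqs. (6)–(7)), now with the keypoints as the support set. Placing grid points at cell centers (offset by half a cell) is an assumption of this sketch.

import numpy as np

def roi_grid_points(box, grid_size=6):
    """Uniformly sample grid_size^3 grid points inside a 3D proposal.

    box: (7,) proposal as (cx, cy, cz, l, w, h, yaw).
    Returns (grid_size**3, 3) grid point coordinates in the world frame.
    """
    cx, cy, cz, l, w, h, yaw = box
    # Normalized cell-center offsets in [-0.5, 0.5) along each axis.
    t = (np.arange(grid_size, dtype=np.float32) + 0.5) / grid_size - 0.5
    gx, gy, gz = np.meshgrid(t * l, t * w, t * h, indexing="ij")
    local = np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)  # (216, 3)
    # Rotate the local grid by the box yaw around the z axis, then translate.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32)
    return local @ rot.T + np.array([cx, cy, cz], dtype=np.float32)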
3D Proposal Refinement and Confidence Prediction. Given the RoI feature of each box proposal, the proposal refinement network learns to predict the size and location (i.e., center, size and orientation) residuals relative to the input 3D proposal. The refinement network adopts a 2-layer MLP and has two branches for confidence prediction and box refinement, respectively.

For the confidence prediction branch, we follow [14, 9, 26] to adopt the 3D Intersection-over-Union (IoU) between the 3D RoIs and their corresponding ground-truth boxes as the training targets. For the k-th 3D RoI, its confidence training target y_k is normalized to be in [0, 1] as

y_k = min(1, max(0, 2 IoU_k − 0.5)),   (8)

where IoU_k is the IoU of the k-th RoI w.r.t. its ground-truth box. Our confidence branch is then trained to minimize the cross-entropy loss on predicting the confidence targets,

L_iou = −y_k log(ỹ_k) − (1 − y_k) log(1 − ỹ_k),   (9)

where ỹ_k is the score predicted by the network. Our experiments in Table 9 show that this quality-aware confidence prediction strategy achieves better performance than the traditional classification targets.

The box regression targets of the box refinement branch are encoded by the traditional residual-based method as in [34, 26] and are optimized with a smooth-L1 loss function.
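The quality-aware confidence target of Eqs. (8)–(9) is easy to state in code. The short sketch below maps the 3D IoU of each RoI to its training target and evaluates the binary cross-entropy against the predicted scores; the small epsilon clamp is only added here for numerical safety.

import numpy as np

def iou_guided_confidence_loss(pred_scores, ious, eps=1e-7):
    """Eqs. (8)-(9): IoU-guided confidence targets with cross-entropy loss.

    pred_scores: (K,) predicted confidences in (0, 1) for the K RoIs.
    ious:        (K,) 3D IoU of each RoI with its matched ground-truth box.
    Returns the mean binary cross-entropy over all RoIs.
    """
    # Eq. (8): y_k = min(1, max(0, 2 * IoU_k - 0.5)).
    targets = np.clip(2.0 * ious - 0.5, 0.0, 1.0)
    p = np.clip(pred_scores, eps, 1.0 - eps)
    # Eq. (9): binary cross-entropy against the soft targets.
    loss = -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
    return loss.mean()

# Example: an RoI with IoU 0.75 gets target 1.0; IoU 0.5 gets 0.5; IoU <= 0.25 gets 0.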
3.4. Training Losses

The proposed PV-RCNN framework is trained end-to-end with the region proposal loss L_rpn, the keypoint segmentation loss L_seg and the proposal refinement loss L_rcnn. (1) We adopt the same region proposal loss L_rpn as [34],

L_rpn = L_cls + β Σ_{r ∈ {x, y, z, l, h, w, θ}} L_smooth-L1(Δr̂_a, Δr_a),   (10)

where the anchor classification loss L_cls is calculated with focal loss [18] using its default hyper-parameters, and the smooth-L1 loss is utilized for anchor box regression with the predicted residual Δr̂_a and the regression target Δr_a. (2) The keypoint segmentation loss L_seg is also calculated with focal loss, as mentioned in Sec. 3.2. (3) The proposal refinement loss L_rcnn includes the IoU-guided confidence prediction loss L_iou and the box refinement loss,

L_rcnn = L_iou + Σ_{r ∈ {x, y, z, l, h, w, θ}} L_smooth-L1(Δr̂_p, Δr_p),   (11)

where Δr̂_p is the predicted box residual and Δr_p is the proposal regression target, encoded in the same way as Δr_a. The overall training loss is then the sum of these three losses with equal loss weights. Further training loss details are provided in the supplementary file.
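The overall objective can be assembled as in the following sketch, which combines Eqs. (10)–(11) with equal weights and uses a standard smooth-L1 formulation for the residual regression. The per-term classification, segmentation and confidence losses are passed in as precomputed scalars, and the regression weight value shown for β is illustrative only.

import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Element-wise smooth-L1 (Huber) loss, summed over the 7 box residuals."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.sum(axis=-1).mean()

def total_loss(l_cls, anchor_res_pred, anchor_res_gt,      # RPN terms (Eq. 10)
               l_seg,                                       # keypoint segmentation
               l_iou, roi_res_pred, roi_res_gt,             # refinement terms (Eq. 11)
               beta_rpn=2.0):
    """Sum of L_rpn, L_seg and L_rcnn with equal loss weights.

    The residual arrays hold the (x, y, z, l, h, w, theta) targets of Eqs. (10)-(11);
    beta_rpn stands for the regression weight beta of Eq. (10) (value illustrative).
    """
    l_rpn = l_cls + beta_rpn * smooth_l1(anchor_res_pred, anchor_res_gt)
    l_rcnn = l_iou + smooth_l1(roi_res_pred, roi_res_gt)
    return l_rpn + l_seg + l_rcnn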
4. Experiments

In this section, we introduce the implementation details of our PV-RCNN framework (Sec. 4.1) and compare with previous state-of-the-art methods on both the highly competitive KITTI dataset [4] (Sec. 4.2) and the newly introduced large-scale Waymo Open Dataset [20, 40] (Sec. 4.3). In Sec. 4.4, we conduct extensive ablation studies to investigate each component of PV-RCNN and validate our design.

4.1. Experimental Setup

Datasets. The KITTI dataset [4] is one of the most popular datasets for 3D detection in autonomous driving. It contains 7,481 training samples and 7,518 test samples, where the training samples are generally divided into the train split (3,712 samples) and the val split (3,769 samples). We compare PV-RCNN with state-of-the-art methods on both the val split and the test split of the online leaderboard.

The Waymo Open Dataset is a recently released and currently the largest dataset for 3D detection in autonomous driving. It contains 798 training sequences with around 158,361 LiDAR samples in total, and 202 validation sequences with 40,077 LiDAR samples. It annotates the objects in the full 360° field of view instead of the 90° field of view of the KITTI dataset. We evaluate our model on this large-scale dataset to further validate the effectiveness of our proposed method.

Network Architecture. As shown in Fig. 2, the 3D voxel CNN has four levels with feature dimensions 16, 32, 64, 64, respectively. The two neighboring radii r_k of each level in the VSA module are set as (0.4m, 0.8m), (0.8m, 1.2m), (1.2m, 2.4m), (2.4m, 4.8m), and the neighborhood radii of the set abstraction for raw points are (0.4m, 0.8m). For the proposed RoI-grid pooling operation, we uniformly sample 6 × 6 × 6 grid points in each 3D proposal, and the two neighboring radii r̃ of each grid point are (0.8m, 1.6m).

For the KITTI dataset, the detection range is [0, 70.4]m for the X axis, [−40, 40]m for the Y axis and [−3, 1]m for the Z axis, which is voxelized with a voxel size of (0.05m, 0.05m, 0.1m) along each axis. For the Waymo Open Dataset, the detection range is [−75.2, 75.2]m for the X and Y axes and [−2, 4]m for the Z axis, and we set the voxel size to (0.1m, 0.1m, 0.15m).

Training and Inference Details. Our PV-RCNN framework is trained from scratch in an end-to-end manner with the ADAM optimizer. For the KITTI dataset, we train the entire network with batch size 24 and learning rate 0.01 for 80 epochs on 8 GTX 1080 Ti GPUs, which takes around 5 hours. For the Waymo Open Dataset, we train the entire network with batch size 64 and learning rate 0.01 for 30 epochs on 32 GTX 1080 Ti GPUs. The cosine annealing learning rate strategy is adopted for the learning rate decay. For the proposal refinement stage, we randomly sample 128 proposals with a 1:1 ratio of positive and negative proposals, where a proposal is considered positive for the box refinement branch if it has at least 0.55 3D IoU with the ground-truth boxes; otherwise it is treated as negative.

During training, we utilize the widely adopted data augmentation strategies for 3D object detection, including random flipping along the X axis, global scaling with a random scaling factor sampled from [0.95, 1.05], and global rotation around the Z axis with a random angle sampled from [−π/4, π/4]. We also conduct the ground-truth sampling augmentation.
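The three global augmentations listed above (random X-axis flip, global scaling in [0.95, 1.05], global rotation in [−π/4, π/4]) can be applied jointly to the points and the ground-truth boxes as in the sketch below; the (x, y, z, l, w, h, yaw) box layout and the flip-as-y-negation convention are assumptions of this illustration.

import numpy as np

def augment_scene(points, gt_boxes, rng=None):
    """Apply random flip, global scaling and global rotation to one scene.

    points:   (N, 3+) array whose first three columns are x, y, z.
    gt_boxes: (B, 7) boxes as (x, y, z, l, w, h, yaw).
    """
    rng = np.random.default_rng() if rng is None else rng
    points, gt_boxes = points.copy(), gt_boxes.copy()

    # Random flipping along the X axis (mirror the y coordinate).
    if rng.random() < 0.5:
        points[:, 1] *= -1
        gt_boxes[:, 1] *= -1
        gt_boxes[:, 6] *= -1

    # Global scaling with a factor sampled from [0.95, 1.05].
    scale = rng.uniform(0.95, 1.05)
    points[:, :3] *= scale
    gt_boxes[:, :6] *= scale

    # Global rotation around the Z axis with an angle from [-pi/4, pi/4].
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    gt_boxes[:, :2] = gt_boxes[:, :2] @ rot.T
    gt_boxes[:, 6] += angle
    return points, gt_boxes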
Method | Reference | Modality | Car - 3D Detection (Easy / Mod. / Hard) | Car - BEV Detection (Easy / Mod. / Hard) | Cyclist - 3D Detection (Easy / Mod. / Hard) | Cyclist - BEV Detection (Easy / Mod. / Hard)
MV3D [1] CVPR 2017 RGB + LiDAR 74.97 63.63 54.00 86.62 78.93 69.80 - - - - - -
ContFuse [17] ECCV 2018 RGB + LiDAR 83.68 68.78 61.67 94.07 85.35 75.88 - - - - - -
AVOD-FPN [11] IROS 2018 RGB + LiDAR 83.07 71.76 65.73 90.99 84.82 79.62 63.76 50.55 44.93 69.39 57.12 51.09
F-PointNet [22] CVPR 2018 RGB + LiDAR 82.19 69.79 60.59 91.17 84.67 74.77 72.27 56.12 49.01 77.26 61.37 53.78
UberATG-MMF [16] CVPR 2019 RGB + LiDAR 88.40 77.43 70.22 93.67 88.21 81.99 - - - - - -
SECOND [34] Sensors 2018 LiDAR only 83.34 72.55 65.82 89.39 83.77 78.59 71.33 52.08 45.83 76.50 56.05 49.45
PointPillars [12] CVPR 2019 LiDAR only 82.58 74.31 68.99 90.07 86.56 82.81 77.10 58.65 51.92 79.90 62.73 55.58
PointRCNN [25] CVPR 2019 LiDAR only 86.96 75.64 70.70 92.13 87.39 82.72 74.96 58.82 52.53 82.56 67.24 60.28
3D IoU Loss [39] 3DV 2019 LiDAR only 86.16 76.50 71.39 91.36 86.22 81.20 - - - - - -
Fast Point R-CNN [2] ICCV 2019 LiDAR only 85.29 77.40 70.24 90.87 87.84 80.52 - - - - - -
STD [37] ICCV 2019 LiDAR only 87.95 79.71 75.09 94.74 89.19 86.42 78.69 61.59 55.30 81.36 67.23 59.35
Patches [13] Arxiv 2019 LiDAR only 88.67 77.20 71.82 92.72 88.39 83.19 - - - - - -
Part-A2-Net [26] TPAMI 2020 LiDAR only 87.81 78.49 73.51 91.70 87.79 84.61 - - - - - -
PV-RCNN (Ours) - LiDAR only 90.25 81.43 76.82 94.98 90.65 86.14 78.60 63.71 57.65 82.49 68.89 62.41
Improvement - - +1.58 +1.72 +1.73 +0.24 +1.46 -0.28 -0.06 +2.12 +2.35 -0.07 +1.65 +2.13
Table 1. Performance comparison on the KITTI test set. The results are evaluated by the mean Average Precision with 40 recall positions.
Difficulty | Method | 3D mAP (IoU=0.7): Overall / 0-30m / 30-50m / 50m-Inf | 3D mAPH (IoU=0.7): Overall / 0-30m / 30-50m / 50m-Inf | BEV mAP (IoU=0.7): Overall / 0-30m / 30-50m / 50m-Inf | BEV mAPH (IoU=0.7): Overall / 0-30m / 30-50m / 50m-Inf
PointPillar [12] 56.62 81.01 51.75 27.94 - - - - 75.57 92.1 74.06 55.47 - - - -
LEVEL 1 MVF [40] 62.93 86.30 60.02 36.02 - - - - 80.40 93.59 79.21 63.09 - - - -
PV-RCNN (Ours) 70.30 91.92 69.21 42.17 69.69 91.34 68.53 41.31 82.96 97.35 82.99 64.97 82.06 96.71 82.01 63.15
Improvement +7.37 +5.62 +9.19 +6.15 - - - - +2.56 +3.76 +3.78 +1.88 - - - -
LEVEL 2 PV-RCNN (Ours) 65.36 91.58 65.13 36.46 64.79 91.00 64.49 35.70 77.45 94.64 80.39 55.39 76.60 94.03 79.40 53.82
Table 5. Performance comparison on the Waymo Open Dataset (version 1.0 released in August, 2019) with 202 validation sequences for
the vehicle detection. Note that the results of PointPillar [12] on the Waymo Open Dataset are reproduced by [40].
Method | Reference | Vehicle (LEVEL 1): mAP / mAPH | Vehicle (LEVEL 2): mAP / mAPH | Ped. (LEVEL 1): mAP / mAPH | Ped. (LEVEL 2): mAP / mAPH | Cyc. (LEVEL 1): mAP / mAPH | Cyc. (LEVEL 2): mAP / mAPH
*StarNet [20] NeurIPSw 2019 53.70 - - - 66.80 - - - - - - -
*PointPillar [12] CVPR 2019 56.62 - - - 59.25 - - - - - - -
*MVF [40] CoRL 2019 62.93 - - - 65.33 - - - - - - -
†SECOND [34] Sensors 2018 72.27 71.69 63.85 63.33 68.70 58.18 60.72 51.31 60.62 59.28 58.34 57.05
PV-RCNN (Ours) - 77.51 76.89 68.98 68.41 75.01 65.65 66.04 57.61 67.81 66.35 65.39 63.98
Table 6. Performance comparison on the Waymo Open Dataset (version 1.2 released in March 2020) with 202 validation sequences for
three categories. †: re-implemented by ourselves with their open source code. ∗: performance on the version 1.0 of Waymo Open Dataset.
aggregating the multi-scale feature volumes of the encoder to the RoI-grid points, as mentioned in Sec. 3.1. As shown in the 2nd and 3rd rows of Table 7, the voxel-to-keypoint scene encoding strategy contributes significantly to the performance in all three difficulty levels. This benefits from the fact that the keypoints enlarge the receptive fields by bridging the 3D voxel CNN and the RoI-grid points, and the segmentation supervision on the keypoints also enables better multi-scale feature learning from the 3D voxel CNN. Besides, a small set of keypoints as the intermediate feature representation also decreases the GPU memory usage compared with the direct pooling strategy.

Effects of different features for the VSA module. In Table 8, we investigate the importance of each feature component of the keypoints in Eq. (3) and Eq. (4). The 1st row shows that the performance drops a lot if we only aggregate features from f_i^{(raw)}, since the shallow semantic information is not enough for the proposal refinement. The high-level semantic information from f_i^{(pv_3)}, f_i^{(pv_4)} and f_i^{(bev)} improves the performance significantly, as shown in the 2nd to 5th rows. As shown in the last four rows, adding the relatively shallow semantic features f_i^{(pv_1)}, f_i^{(pv_2)}, f_i^{(raw)} further improves the performance slightly, and the best performance is achieved with all the feature components as the keypoint features.

PKW  RoI Pooling        Confidence Prediction  Easy   Moderate  Hard
✗    RoI-grid Pooling   IoU-guided scoring     92.09  82.95     81.93
✓    RoI-aware Pooling  IoU-guided scoring     92.54  82.97     80.30
✓    RoI-grid Pooling   Classification         91.71  82.50     81.41
✓    RoI-grid Pooling   IoU-guided scoring     92.57  84.83     82.69
Table 9. Effects of the predicted keypoint weighting module, the RoI-grid pooling module and the IoU-guided confidence prediction.

Effects of the PKW module. We propose the predicted keypoint weighting (PKW) module in Sec. 3.2 to re-weight the point-wise features of the keypoints with extra keypoint segmentation supervision. Table 9 (1st and 4th rows) shows that removing the PKW module degrades the performance considerably, which demonstrates that the PKW module enables better multi-scale feature aggregation by focusing more on the foreground keypoints, since they are more important for the succeeding proposal refinement network.

Effects of the RoI-grid pooling module. We investigate the effects of the RoI-grid pooling module by replacing it with the RoI-aware pooling [26] and keeping the other modules consistent. Table 9 shows that the performance drops significantly when the RoI-grid pooling module is replaced, which validates that our proposed set-abstraction-based RoI-grid pooling can learn much richer contextual information, and the pooled features also encode more discriminative RoI features by pooling more effective features with large search radii for each grid point. The 1st and 2nd rows of Table 7 also show that, compared with the 3D voxel RPN, the performance increases a lot after the proposals are refined with the features aggregated by the RoI-grid pooling module.

5. Conclusion

We have presented the PV-RCNN framework, a novel method for accurate 3D object detection from point clouds. Our method integrates both the multi-scale 3D voxel CNN features and the PointNet-based features into a small set of keypoints by the newly proposed voxel set abstraction layer, and the learned discriminative keypoint features are then aggregated to the RoI-grid points with multiple receptive fields to capture much richer context information for fine-grained proposal refinement. Experimental results on the KITTI dataset and the Waymo Open Dataset demonstrate that our proposed voxel-to-keypoint scene encoding and keypoint-to-grid RoI feature abstraction strategies significantly improve the 3D object detection performance compared with previous state-of-the-art methods.

References

[1] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[2] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Fast point r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[3] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
[4] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[5] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. CVPR, 2018.
[6] Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. CoRR, abs/1706.01307, 2017.
[7] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.
[8] Maximilian Jaritz, Jiayuan Gu, and Hao Su. Multi-view pointnet for 3d scene understanding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[9] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–799, 2018.
[10] KITTI leaderboard of the 3D object detection benchmark. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d, accessed on 2019-11-15.
[11] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven Waslander. Joint 3d proposal generation and object detection from view aggregation. IROS, 2018.
[12] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. CVPR, 2019.
[13] Johannes Lehner, Andreas Mitterecker, Thomas Adler, Markus Hofmarcher, Bernhard Nessler, and Sepp Hochreiter. Patch refinement - localized 3d object detection. CoRR, abs/1910.04093, 2019.
[14] Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. Gs3d: An efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1019–1028, 2019.
[15] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
[16] Ming Liang*, Bin Yang*, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-task multi-sensor fusion for 3d object detection. In CVPR, 2019.
[17] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In ECCV, 2018.
[18] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[19] Zhijian Liu, Haotian Tang, Yujun Lin, and Song Han. Point-voxel CNN for efficient 3d deep learning. CoRR, abs/1907.03739, 2019.
[20] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, Zhifeng Chen, Jonathon Shlens, and Vijay Vasudevan. Starnet: Targeted computation for object detection in point clouds. CoRR, abs/1908.11069, 2019.
[21] Charles R. Qi, Or Litany, Kaiming He, and Leonidas J. Guibas. Deep hough voting for 3d object detection in point clouds. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[22] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[23] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[24] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[25] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–779, 2019.
[26] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[27] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
[28] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
[29] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[30] Bei Wang, Jianping An, and Jiayan Cao. Voxel-fpn: multi-scale voxel feature aggregation in 3d object detection from point clouds. CoRR, abs/1907.05286, 2019.
[31] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019.
[32] Zhixin Wang and Kui Jia. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IROS. IEEE, 2019.
[33] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
[34] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[35] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3d object detection. In 2nd Conference on Robot Learning (CoRL), 2018.
[36] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
[37] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. STD: sparse-to-dense 3d object detector for point cloud. ICCV, 2019.
[38] Hengshuang Zhao, Li Jiang, Chi-Wing Fu, and Jiaya Jia. Pointweb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5565–5573, 2019.
[39] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. Iou loss for 2d/3d object detection. In International Conference on 3D Vision (3DV). IEEE, 2019.
[40] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. CoRR, abs/1910.06528, 2019.
[41] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
[42] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3d object detection. CoRR, abs/1908.09492, 2019.