Li, Kong 2022 - SRIF-RCNN Sparsely Represented Inputs Fusion of Different Sensors For 3D Object Detection
https://doi.org/10.1007/s10489-022-03594-1
Abstract
3D object detection is a vital task in many practical applications, such as autonomous driving, augmented reality and robot navigation. Significant advances have been made in recent LiDAR-only 3D detection methods, but sensor fusion 3D detection methods have received less attention and have not made comparable progress. This paper aims to lift the 3D detection performance of sensor fusion methods. To this end, we present a novel sensor fusion strategy to effectively extract and fuse the features from different sensors. Firstly, the different sensor outputs are transformed into sparsely represented inputs. Secondly, features are extracted from these inputs through an efficient backbone. Finally, the extracted features of the different sensors are fused in a point-wise manner with the help of a gate mechanism. In addition, color supervision is introduced for the first time to learn the color distribution of points, which provides discriminative features for proposal refinement. Based on the sensor fusion strategy and color distribution estimation, a multi-sensor 3D object detection network, named Sparsely Represented Inputs Fusion RCNN (SRIF-RCNN), is proposed. It achieves state-of-the-art performance on the highly competitive KITTI official 3D detection leaderboard, ranking 1st among sensor fusion methods and 2nd among LiDAR-only methods with published works. Extensive experiments were conducted to validate the effectiveness of the proposed network.
Keywords 3D object detection · LiDAR point cloud · RGB image · Sensor fusion · Autonomous driving
Theoretically speaking, sensor fusion methods should obtain more accurate 3D bounding boxes than image-only or LiDAR-only methods because of the extra input. However, existing sensor fusion methods [12-23] do not perform well compared with LiDAR-only methods [24-46]. The reasons may be summarized into two aspects. (i) There is no effective unified pattern to directly process the two kinds of inputs. RGB images and LiDAR point clouds are fundamentally different data, and hence their representations are not consistent: the data format of RGB images is regular and dense, while that of LiDAR point clouds is irregular and sparse. Some existing methods [15-19] utilize the two kinds of inputs in a cascade manner: they first use mature 2D object detectors to generate 2D proposals from the dense RGB images and project them into 3D space to obtain 3D proposals, and then use PointNet-series methods [47, 48] to refine the proposals. The performance of such methods, however, heavily relies on the performance of the utilized 2D object detectors. Other methods [12-14] process the two kinds of inputs in parallel: they project the 3D LiDAR point cloud into a dense LiDAR BEV map and employ 2D convolutions to process the LiDAR BEV map and the RGB image for LiDAR and RGB feature extraction, respectively. But the projection from the 3D LiDAR point cloud to the BEV map compresses features along the height channel, which causes spatial information loss. (ii) There is no optimal way to fuse the features learnt from different sensors. LiDAR point clouds are in 3D view and contain 3D information of the surroundings; RGB images are in front view and contain RGB information of the surroundings. The two types of inputs therefore differ in both the information they contain and the view they are observed from. Most existing methods fuse the RGB image feature map with the LiDAR BEV feature map or the LiDAR front-view feature map. In detail, Chen et al. [12] used a pooling operation to normalize the different feature maps to the same resolution and fused them by an element-wise mean. Ku et al. [13] used a resize-and-crop operation to normalize the RGB image feature map and the LiDAR BEV feature map to the same resolution. Liang et al. [14] adopted continuous convolution to build dense LiDAR BEV feature maps and performed point-wise feature fusion with dense image feature maps by an element-wise mean operation. However, pooling and resize-and-crop operations cause inaccurate correspondence between the features from different sensors, and element-wise mean or summation is inappropriate for fusing such view-different and information-different features. It can thus be seen that how to effectively and efficiently process the different sensor inputs and fuse the features from different sensors has become a challenge for the 3D object detection task.

In recent years, voxelization has become a mature way to encode LiDAR point clouds, and sparse convolution [49, 50] has been widely adopted for 3D object detection due to its efficiency. But for RGB images, almost all existing sensor fusion methods directly use dense RGB images as inputs and employ traditional convolutions to extract features. Such a practice further reduces the efficiency of sensor fusion methods, which need to process extra inputs. We found that not all pixels in RGB images are meaningful for 3D object detection with different sensors. In detail, the camera always has a bigger vertical field of view (FOV) than the LiDAR, which means that RGB images are "taller" than the LiDAR point cloud. The upper part of RGB images usually contains sky and upper building facades, which belong to the background and cannot be scanned by the LiDAR, while the commonly detected objects such as cars are always in the lower part of the image and can be scanned by the LiDAR. That is to say, the overlapped part between the LiDAR point cloud and the RGB image is the most reliable information for object detection in real scenes. Motivated by this observation, we use the LiDAR point cloud as the benchmark to segment the overlapped part of the RGB image, and transfer the information in the dense RGB image to lightweight, sparse pixels. In this way, the different sensor inputs are all transformed to sparsely represented inputs (sparse voxels and sparse pixels), which can be processed by efficient sparse convolutions. We also observe that in every down-sampling or up-sampling stage of 2D or 3D backbones, the coordinates of the LiDAR points are unchanged, while the coordinates of feature map pixels or voxels change continuously. The LiDAR points can therefore act as effective middle hosts to encode, store and fuse features from different sensors, which removes the sensitivity to the alignment of different feature maps: the 2D or 3D feature maps only need to be aligned with the unchanging middle host points instead of with the other feature maps whose coordinates keep changing. For feature fusion, a feature complement module with a gate mechanism is proposed to control the proportion of encoded features: encoded features with high gate weights pass through to the fused features, while encoded features with low gate weights are prevented from flowing to the fused features. As RGB information is utilized, color supervision is also introduced to learn the color distribution of LiDAR points, which provides discriminative features for 3D object detection.

Based on these observations, we propose a new state-of-the-art sensor fusion 3D object detection network, SRIF-RCNN, which lifts the 3D object detection performance. Our contributions can be summarized in the following five aspects. (1) Information in dense RGB images is transferred to sparse pixels for efficient image feature extraction for the first time. (2) A novel sensor fusion backbone is proposed to process the different sensor inputs; it extracts features from the sparsely represented inputs in parallel and encodes them into point-wise features.
(3) A feature complement module is proposed to fuse the three types of point-wise features with the help of a gate mechanism, which can effectively supplement single-source features with features from the other sources. (4) Color supervision is introduced to learn the color distribution of points, which has never been explored in previous works. (5) The proposed SRIF-RCNN network outperforms all sensor fusion methods with published works on the highly competitive KITTI official 3D detection benchmark in the commonly used moderate subset for cars.

2 Related works

2.1 Image-only 3D object detection methods

Chen et al. [5] generated proposals by fusing several features from monocular images, such as semantic information, contour, object shape, context and location priors, and refined the proposals through several convolution layers. The authors also proposed a method [6] that can generate high-quality proposals from stereo imagery, which showed good results at that time. Mousavian et al. [7] used DCNNs to regress the relatively stable 3D object properties (direction, length, width and height), and then combined these properties with the geometric constraints of the 2D target bounding boxes to generate complete 3D bounding boxes. Chabot et al. [8] and Xiang et al. [9] estimated 3D object information with the help of CAD models of 3D objects, while Wang et al. [10] and You et al. [11] first recovered pseudo point clouds from corresponding stereo RGB images and then predicted 3D bounding boxes with general 3D detection baselines. The abovementioned methods explored several ways to predict 3D bounding boxes from 2D RGB images. However, they are not good at recovering depth information and cannot meet the requirements of real-time, high-quality 3D detection.

2.2 LiDAR-only 3D object detection methods

Recently published works are mostly based on the LiDAR point cloud due to the rich 3D information it contains. A large fraction of LiDAR-only methods transform the raw point cloud into regular 2D grids or 3D voxels, while a minority of them directly learn features from the raw point cloud.

3D detection by grid-based representation Early works of this branch primarily use DCNNs to process 2D grids or 3D voxels. Among them, Yang et al. [35] effectively exploited four-view features of point clouds for better semantic feature learning and 3D detection. Lang et al. [25] and Yang et al. [36, 37] also proposed efficient 3D detection frameworks based on 2D BEV representations of the LiDAR point cloud. Zhou et al. [38] first proposed an end-to-end network based on equally distributed 3D voxels, which can learn more spatial features than 2D grid-based methods. With the help of the works on sparse submanifold convolution [49, 50], Yan et al. [24] improved VoxelNet and designed an efficient 3D convolution structure to learn voxel-wise features, which reduced the computation cost of the 3D voxel-based backbone. Based on this, He et al. [26] and Shi et al. [27] dug out the potential information in ground truth labels to boost the 3D detection performance, and Shi et al. [28] deeply integrated the advantages of the grid-based 3D voxel CNN backbone and the point-based set abstraction operation, achieving impressive 3D detection performance. Zheng et al. [32] lifted the performance by introducing a knowledge distillation mechanism. Other researchers [33, 34, 39-42, 51-53] also proposed effective networks to detect 3D objects based on grids. In summary, grid-based methods are mainly built on 2D or 3D CNNs, which can learn rich voxel-wise features and therefore generate accurate 3D proposals. However, they suffer from 3D information loss during progressive down-sampling, and the associated field is limited due to the small kernel size of 3D convolutions.

3D detection by point-based representation With the help of PointNet and PointNet++ [47, 48], Shi et al. [30] and Yang et al. [42] proposed point-based backbones to directly generate 3D proposals from the raw point cloud instead of using grids or voxels. Qi et al. [43] utilized a Hough voting strategy for more effective feature grouping. Li et al. [44] utilized a PointNet-based encoder and decoder to generate proposals and gave a solution to the intersection-over-union (IoU) assignment mismatching problem for better proposal refinement. Yang et al. [45] proposed a novel point sampling method to reduce the computation cost of point-based methods. Most point-based methods rely on the set abstraction operation [47], which provides more flexible associated fields for point feature learning. However, point-based methods are inefficient for big scenes with massive numbers of raw points.

2.3 Sensor fusion 3D object detection methods

Existing sensor fusion methods generally extract and fuse features following a deep fusion strategy. Among them, MV3D [12], which was extended from a classical 2D object detector, Faster R-CNN, projected the point cloud to bird's-eye view and front view to extract LiDAR features, then fused them with RGB image features for 3D object detection. Ku et al. [13] and Liang et al. [14] further improved MV3D by fusing full-resolution features from different views and by adding depth information to the fusion process, respectively. Yoo et al. [54] proposed a cross-view feature mapping module to obtain dense RGB voxel features, which are then fused with LiDAR voxel features
for better 3D detection. In addition to the deep fusion strategy, Pang et al. [55] conducted sensor fusion following a late fusion strategy. F-PointNet [16] built the connection between image and point cloud in a cascade manner, and Xu et al. [15] and Du et al. [17] also utilized a similar 2D-driven-3D idea for 3D object detection. Although sensor fusion methods exploit extra RGB information as input, their 3D detection performance is still lower than that of the top LiDAR-only methods.

3 SRIF-RCNN: Sparsely represented inputs fusion for 3D object detection from RGB image and LiDAR point cloud

In this section, the proposed SRIF-RCNN is introduced in detail. Section 3.1 introduces the overall architecture of the proposed network. Sections 3.2 and 3.3 introduce the details of stage I and stage II of the proposed network. Section 3.4 presents the overall loss functions of the proposed framework.

3.1 Architecture of SRIF-RCNN

The overall architecture of the proposed SRIF-RCNN is shown in Fig. 1. In stage I, the LiDAR point cloud and the RGB image are first transformed into sparsely represented inputs (sparse voxels for the LiDAR point cloud and sparse pixels for the RGB image) and fed into a backbone to extract sparsely represented features. The backbone contains three sub-backbones: (i) the first is a 3D sub-backbone [27], which extracts 3D voxel-wise features and encodes them to the middle host points at every down-sampling and up-sampling stage; (ii) the second is a 2D sub-backbone composed of 2D sparse and submanifold convolutions that processes the 2D lightweight sparse pixels, extracts 2D pixel-wise features and encodes them to the middle host points by bilinear interpolation at every down-sampling and up-sampling stage; (iii) the last is a point-wise sub-backbone composed of several multi-layer perceptron (MLP) blocks that directly learns middle host point features from the raw point cloud. The three types of encoded point-wise features are then fused by the proposed feature complement module. Finally, a region proposal network (RPN) is adopted to obtain 3D proposals. In stage II, the fused features are first fed into a multi-task point head to predict the color distribution, point scores and intra-object part locations [27]. The fused point features, together with the color distribution and intra-object part locations, are then combined and fed into a RoI-grid pooling module [28] to aggregate the color distribution information and intra-object part locations to several evenly distributed grid points. Finally, the 3D proposals are refined and scored based on the aggregated features through several fully connected layers.
3.2 Stage I: Feature extraction and fusion of sparsely represented inputs and 3D proposal generation

In this section, Stage I of the proposed SRIF-RCNN is introduced in detail. It aims to extract features from the different sensors and to effectively fuse the obtained features, which are of different forms, into unified point-wise features.

3.2.1 Sparsely represented inputs generation

Different from previous works [12-14], we first transform the dense RGB image into a sparse form instead of processing it directly. In detail, front-view images with sparse pixels are first generated from the LiDAR point cloud, and the RGB information in the raw RGB image is projected to the corresponding sparse pixels, guided by the correspondence between the camera and LiDAR sensors. In this way, the information in dense RGB image pixels is transferred to lightweight sparse pixels. On the other hand, we follow the previous practice [24, 26, 28, 38] of transforming the LiDAR point cloud into sparse 3D voxels, which causes less spatial information loss than projecting it to a BEV map.

2D sparse pixels generation Denote P: p_i = (x_i, y_i, z_i), i = 1, 2, 3, ..., N as the LiDAR points with size N x 3, where N is the number of points and x, y and z are the coordinates along the depth, width and height dimensions. The points in P can be projected onto the front-view plane to form a 2D front-view image by spherical projection [56]; the pixel coordinates in the obtained image are normalized to [0, 1] and can be formulated as (1):

\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} 0.5\,[1 - \arctan(y, x)/\pi] \times w \\ [1 - (\arcsin(z/r) + fov_{up})/fov] \times h \end{pmatrix}   (1)

where u and v are the pixel coordinates along the width and height dimensions, r = \sqrt{x^2 + y^2 + z^2} is the range between the point and the LiDAR sensor, w and h are pre-customized values referring to the width and height of the front-view image (512 and 64 in this paper), and fov = fov_{up} + fov_{down} is the vertical FOV of the LiDAR sensor.

In order to transfer the RGB information to the front-view image, the correspondence between RGB pixels and raw points needs to be found. Taking the KITTI dataset as an example, the autonomous vehicle is equipped with 2 grey cameras (0 and 1) and 2 RGB cameras (2 and 3), and grey camera 0 is used as the reference camera. The correspondence between RGB images and LiDAR point clouds is formulated as (2), where P_i is the intrinsic matrix of camera i, R_0 is the rectifying rotation matrix of camera 0 and Tr_{velo\_to\_cam} is the extrinsic matrix that projects points from the LiDAR to camera 0:

(u_i, v_i, 1)^T = P_i \cdot R_0 \cdot Tr_{velo\_to\_cam} \cdot (x, y, z, 1)^T   (2)

Guided by (1) and (2), the raw dense RGB image and the front-view image are associated through the LiDAR points: the RGB values in the raw dense RGB image can be projected to the LiDAR points and then to the corresponding sparse pixels of the generated front-view image. Two issues should be noted. (i) In the generation of the front-view image, multiple points may map to the same pixel, so several projected points may lie in one pixel region. Facing this issue, we use the features of the LiDAR point with the minimum range as the initial value of a projected pixel that is associated with multiple points. (ii) Due to the scan range of the LiDAR and the space between LiDAR scan lines, not all pixels in the front-view image have associated points; that is, there will be some empty pixels in the generated front-view image. Instead of encoding the empty pixels with a constant value [12, 13, 57] to obtain a dense image, such pixels are not encoded in the proposed method, and thus sparsity is preserved in the generated front-view image. Finally, a lightweight sparse front-view image is generated, in which the non-empty pixels are encoded by the red, green and blue channel values, the reflective intensity and the range; the dense RGB image can be replaced by the 2D sparse pixels of this lightweight sparse front-view image. About 15000 non-empty pixels are encoded by the proposed strategy, which is only about 3.22% of the raw RGB image with a resolution of 1242 x 375. Figure 2a and b show a raw RGB image and the generated lightweight sparse front-view image. The yellow bounding box in Fig. 2a marks the overlapped part between the LiDAR point cloud and the RGB image; the upper figure in Fig. 2b shows the lightweight sparse front-view image colored by the distance between each point and the LiDAR, and the lower figure in Fig. 2b shows the binary image of the lightweight front-view image, where white pixels represent the generated 2D sparse pixels and black pixels represent the non-encoded pixels. It can be seen that pixels that do not contain foreground LiDAR points are not encoded. Such a practice saves computation time compared with directly using raw RGB images, while the important foreground information in the RGB images is retained.
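To make the two mappings concrete, the sketch below chains (1) and (2) with NumPy: each LiDAR point is assigned a front-view pixel by spherical projection, and its RGB value is fetched by projecting the same point into camera 2 with the KITTI calibration matrices. The function, its argument names and the FOV defaults are illustrative assumptions based on the description above, not the authors' implementation; the closest-point rule for pixels hit by several points follows issue (i).

```python
import numpy as np

def build_sparse_front_view(points, rgb_image, P2, R0, Tr_velo_to_cam,
                            w=512, h=64,
                            fov_up=np.deg2rad(3.0), fov_down=np.deg2rad(25.0)):
    """points: (N, 4) x, y, z, reflectance in LiDAR coordinates.
    rgb_image: (H, W, 3) camera-2 image.  P2 (3, 4), R0 (4, 4), Tr_velo_to_cam (4, 4)
    are the KITTI calibration matrices, already padded to homogeneous form.
    Returns an (h, w, 5) sparse front-view image holding [R, G, B, intensity, range]."""
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    rng = np.linalg.norm(points[:, :3], axis=1)
    fov = fov_up + fov_down

    # Equation (1): spherical projection to front-view pixel coordinates.
    # The sign convention of fov_up mirrors (1) as printed in the paper.
    u = (0.5 * (1.0 - np.arctan2(y, x) / np.pi) * w).astype(np.int32)
    v = ((1.0 - (np.arcsin(z / rng) + fov_up) / fov) * h).astype(np.int32)

    # Equation (2): project the same points into camera 2 to fetch RGB values.
    pts_h = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])   # (N, 4)
    cam = (P2 @ R0 @ Tr_velo_to_cam @ pts_h.T).T                        # (N, 3)
    px = (cam[:, 0] / cam[:, 2]).astype(np.int32)
    py = (cam[:, 1] / cam[:, 2]).astype(np.int32)

    H, W, _ = rgb_image.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & \
            (px >= 0) & (px < W) & (py >= 0) & (py < H) & (cam[:, 2] > 0)

    front = np.zeros((h, w, 5), dtype=np.float32)
    best_range = np.full((h, w), np.inf)
    for i in np.flatnonzero(valid):
        # Keep the closest point when several points fall into the same pixel.
        if rng[i] < best_range[v[i], u[i]]:
            best_range[v[i], u[i]] = rng[i]
            front[v[i], u[i], :3] = rgb_image[py[i], px[i]] / 255.0
            front[v[i], u[i], 3] = intensity[i]
            front[v[i], u[i], 4] = rng[i]
    return front
```

Empty pixels simply stay zero, so the sparsity described above is preserved without any constant-value padding.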
Fig. 2 (a) Raw RGB image; (b) generated lightweight sparse front-view image and its binary mask
3D sparse voxels generation As aforementioned, voxelization is employed to transform the irregularly distributed LiDAR point cloud into regular 3D voxels. Denote D, W and H as the depth, width and height of the segmented 3D space, and d = [dx, dy, dz] as the quantization step along the depth, width and height dimensions ([0.05, 0.05, 0.1] m in this paper). The 3D space can be divided into {D/dx} x {W/dy} x {H/dz} 3D voxels, where {.} represents the floor function. LiDAR points are then assigned to the obtained 3D voxels according to their coordinates, i.e., if a point lies inside a voxel, the point is assigned to that voxel. For non-empty voxels, the mean value of the point input features (3D coordinates and reflectance intensity) is used as the voxel feature. Finally, the irregular LiDAR points are transformed into regular 3D voxels with a resolution of 4 x {D/dx} x {W/dy} x {H/dz}.
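A minimal NumPy sketch of this voxelization step is given below; the voxel size follows the values quoted above, the detection range follows Section 4.1, and the grouping logic is an illustrative reading of the text rather than the authors' code.

```python
import numpy as np

def voxelize_mean(points,
                  voxel_size=(0.05, 0.05, 0.1),
                  lower=(0.0, -40.0, -1.0),
                  upper=(70.4, 40.0, 3.0)):
    """Assign LiDAR points (N, 4: x, y, z, reflectance) to sparse 3D voxels and use
    the mean of the point features inside each voxel as the voxel feature."""
    lower = np.asarray(lower)
    upper = np.asarray(upper)
    size = np.asarray(voxel_size)

    # Keep only points inside the detection range.
    mask = np.all((points[:, :3] >= lower) & (points[:, :3] < upper), axis=1)
    pts = points[mask]

    # Integer voxel indices along depth, width and height ({.} = floor).
    idx = np.floor((pts[:, :3] - lower) / size).astype(np.int64)

    # Group points by voxel index and average their features (x, y, z, intensity).
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    feats = np.zeros((keys.shape[0], pts.shape[1]), dtype=np.float64)
    counts = np.zeros(keys.shape[0], dtype=np.int64)
    np.add.at(feats, inverse, pts)
    np.add.at(counts, inverse, 1)
    feats /= counts[:, None]
    return keys, feats      # sparse voxel coordinates and 4-channel mean features
```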
3.2.2 Sparse convolution based sensor fusion backbone for multi-sensor feature extraction

In this section, a sparse convolution based sensor fusion backbone is introduced for feature extraction from the sparsely represented inputs. Figure 3 shows the details of the proposed backbone. For the 3D voxel flow, an encoder-decoder structure [27, 58] is utilized to learn voxel-wise features and full-resolution point features. The 3D sparse voxels with resolution 4 x {D/dx} x {W/dy} x {H/dz} are down-sampled by 1, 2, 4 and 8 times, and then up-sampled by the same proportions back to the original resolution. The down-sampling and up-sampling blocks are all made of sparse convolution layers (sparse convolution and submanifold sparse convolution). For the 2D pixel flow, an encoder-decoder structure similar to the 3D voxel flow is adopted; the major difference is that 2D sparse convolution layers are employed instead of 3D sparse convolution layers. Due to the inevitable information loss of the voxelization and spherical projection in Section 3.2.1, a point flow is additionally adopted: it stacks several convolution layers with 1x1 kernel size to form MLPs and directly learns point-wise features to supply extra information from the raw LiDAR points. The parameters of each layer in the proposed backbone are given in Fig. 3.

3.2.3 Point-wise feature fusion of multi-sensor

In Section 3.2.2, voxel-wise, pixel-wise and point-wise features are extracted from the different inputs. In this section, the three types of features are encoded to the middle host points, and the encoded features are effectively fused by the proposed feature complement module, which provides discriminative features for point-wise feature learning and for proposal refinement in Stage II. Denote F_v \in R^{N_v \times S}, F_p \in R^{N_p \times S} and F_{mh} \in R^{M \times S} as the extracted 3D voxel-wise features, 2D pixel-wise features and point-wise features, respectively. The encoding operation of voxels or pixels to the middle host points can be formulated as (3), where \varphi(\cdot) is the encoding operation function, F \in R^{M \times S} is the encoded feature, and m and n represent elements of the voxels (pixels) and of the middle host points, respectively:

F(m) = \varphi(F_v, F_p, n)   (3)
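The point flow is the simplest of the three sub-backbones; the following PyTorch sketch shows the idea of stacking 1x1 convolutions into shared MLPs over the raw middle host points. Channel sizes are placeholders, since the actual layer parameters are only given in Fig. 3.

```python
import torch
import torch.nn as nn

class PointFlowMLP(nn.Module):
    """Point flow of the backbone: a stack of 1x1 convolutions (shared MLPs)
    applied to the raw middle host points."""
    def __init__(self, in_channels=4, hidden=(32, 64), out_channels=128):
        super().__init__()
        layers, c = [], in_channels
        for ch in hidden:
            layers += [nn.Conv1d(c, ch, kernel_size=1),
                       nn.BatchNorm1d(ch),
                       nn.ReLU(inplace=True)]
            c = ch
        layers.append(nn.Conv1d(c, out_channels, kernel_size=1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, points):                  # points: (B, N, C_in)
        x = points.transpose(1, 2)              # -> (B, C_in, N) for Conv1d
        return self.mlp(x).transpose(1, 2)      # -> (B, N, C_out) point-wise features


# Example: 2048 middle host points with (x, y, z, intensity) input features.
feats = PointFlowMLP()(torch.randn(2, 2048, 4))   # -> (2, 2048, 128)
```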
13
5538 X. Li and D. Kong
Fig. 3 Details of the proposed sparse convolution based sensor fusion backbone
In this paper, the set abstraction operation is utilized to encode the voxel-wise features to the middle host points. For voxels, the center coordinates are used as the coordinates of each voxel, hence the 3D voxels can be regarded as a set of points. The encoding operation from voxels to middle host points is given in (4), where Concatenate(.) is the concatenation operation for features grouped with two different radii, G(.) is the grouping operation, r is the grouping radius, MLP represents a block composed of two stacked convolution-ReLU-batch normalization sub-blocks, and Max(.) is a max pooling layer that retains the discriminative features from the grouped voxel-wise features:

F_{v2mh}(m) = \varphi(F_v, n) = Concatenate_{i=1,\dots,7}\big( Max( MLP( G(F_v(n), r) ) ) \big)   (4)

Different from the encoding operation for 3D voxels, we use four-neighbor bilinear interpolation instead of set abstraction to encode pixel features to the middle host points. The initial resolution of the 2D front-view image used in this paper is 512 x 64, which is lower than the voxel resolution; a pixel represents roughly a [0.16, 0.08] m grid in the width and height dimensions, so the four neighboring pixels provide a decent associated field for pixel-wise feature encoding. The pixel-wise feature encoding is formulated in (5), where n_{ul}, n_{ur}, n_{bl}, n_{br} are the four neighbor pixels around the middle host point to be encoded, and the subscripts x and y denote the coordinates of the pixels along the width and height dimensions of the front-view image, respectively:

F_{p2mh}(m) = \varphi(F_p, n) = \big[\, 1-(n_x - n_{ul,x}) \quad 1-(n_x - n_{bl,x}) \,\big] \begin{pmatrix} F_p(n_{ul}) & F_p(n_{ur}) \\ F_p(n_{bl}) & F_p(n_{br}) \end{pmatrix} \begin{pmatrix} 1-(n_y - n_{ul,y}) \\ 1-(n_y - n_{bl,y}) \end{pmatrix}   (5)

By using set abstraction and bilinear interpolation, the voxel-wise and pixel-wise features are encoded in a point-wise manner, and the three kinds of features can therefore be fused to obtain discriminative features.
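The four-neighbor bilinear encoding of (5) can be written compactly as below. The dense (h, w, S) layout of the front-view feature map and the clipping at the image border are assumptions made for illustration.

```python
import numpy as np

def encode_pixels_to_points(pixel_feat, pt_uv):
    """Four-neighbour bilinear interpolation of pixel-wise features, cf. (5).

    pixel_feat: (h, w, S) front-view feature map (zeros at empty pixels).
    pt_uv:      (M, 2) continuous (u, v) coordinates of the middle host points
                projected into the front-view image.
    Returns (M, S) point-wise encoded features F_p2mh."""
    h, w, _ = pixel_feat.shape
    u, v = pt_uv[:, 0], pt_uv[:, 1]
    u0 = np.clip(np.floor(u).astype(int), 0, w - 2)    # left neighbours
    v0 = np.clip(np.floor(v).astype(int), 0, h - 2)    # upper neighbours
    du, dv = u - u0, v - v0                             # fractional offsets

    f00 = pixel_feat[v0,     u0]                        # upper-left
    f01 = pixel_feat[v0,     u0 + 1]                    # upper-right
    f10 = pixel_feat[v0 + 1, u0]                        # bottom-left
    f11 = pixel_feat[v0 + 1, u0 + 1]                    # bottom-right

    wu, wv = du[:, None], dv[:, None]
    return ((1 - wu) * (1 - wv) * f00 + wu * (1 - wv) * f01 +
            (1 - wu) * wv * f10 + wu * wv * f11)
```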
However, there would be information redundancy and pollution if the three kinds of point-wise features were fused by direct summation, element-wise mean, concatenation or multiplication. In other words, the importance of the three kinds of features is not equal; they should be re-weighted and then carefully complemented with each other for feature fusion. Inspired by the maturely developed gate mechanism, a novel multi-sensor feature fusion module, named the feature complement module (FCM), is proposed. The encoded point-wise features F_{v2mh} \in R^{M \times S}, F_{p2mh} \in R^{M \times S} and F_{mh} \in R^{M \times S} pass through a weight learning function wlf(.) composed of several linear layers (blue blocks in Fig. 4) followed by batch normalization layers (green blocks) and a ReLU layer to learn weights W, which are regarded as point-wise soft masks that weight the importance of the features. Furthermore, we add a gate structure, shown in the right part of Fig. 4, to control the proportion of features to be complemented. The gate-based fusion operation is formulated in (6) and (7):

F_{cv} = wlf(F_{v2mh}) F_{v2mh} + [1 - wlf(F_{v2mh})] \times [\, wlf(F_{p2mh}) F_{p2mh} + wlf(F_{mh}) F_{mh} \,]   (6)

F_{cp} = wlf(F_{p2mh}) F_{p2mh} + [1 - wlf(F_{p2mh})] \times [\, wlf(F_{v2mh}) F_{v2mh} + wlf(F_{mh}) F_{mh} \,]   (7)

As shown in (6), if the weight wlf(F_{v2mh}) is low, the gated parameter 1 - wlf(F_{v2mh}) will be high, and more features from F_{p2mh} and F_{mh} will flow into the complemented feature F_{cv}; conversely, the features from F_{p2mh} and F_{mh} will be limited if wlf(F_{v2mh}) is high.
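A small PyTorch sketch of the gate-based complement in (6) and (7) is given below. The final sigmoid in the weight-learning function is our addition to keep the gate weights in [0, 1]; the paper itself only specifies linear, batch normalization and ReLU layers.

```python
import torch
import torch.nn as nn

class WeightLearn(nn.Module):
    """wlf(.): linear layers with batch normalization and ReLU producing
    point-wise soft masks (sigmoid added here as an assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(channels, channels), nn.BatchNorm1d(channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):               # x: (M, S)
        return self.net(x)


def feature_complement(f_v2mh, f_p2mh, f_mh, wlf_v, wlf_p, wlf_mh):
    """Gate-based fusion of (6) and (7): a feature with a low own-gate weight is
    topped up with the gated features of the other two sources."""
    w_v, w_p, w_mh = wlf_v(f_v2mh), wlf_p(f_p2mh), wlf_mh(f_mh)
    f_cv = w_v * f_v2mh + (1 - w_v) * (w_p * f_p2mh + w_mh * f_mh)     # Eq. (6)
    f_cp = w_p * f_p2mh + (1 - w_p) * (w_v * f_v2mh + w_mh * f_mh)     # Eq. (7)
    return f_cv, f_cp


# Example with M = 2048 middle host points and S = 128 channels.
S = 128
g_v, g_p, g_mh = WeightLearn(S), WeightLearn(S), WeightLearn(S)
f_cv, f_cp = feature_complement(torch.randn(2048, S), torch.randn(2048, S),
                                torch.randn(2048, S), g_v, g_p, g_mh)
```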
In addition, after the sparse inputs are up-sampled to the initial resolution at the end of the backbone, full-resolution features of the voxels and pixels are obtained. As shown in Fig. 3, we use trilinear interpolation to encode these full-resolution features to the middle host points as a supplement to the finally fused features, which can be formulated as (8) and (9):

F_{frv}(n) = \varphi(F_v, m) = \sum_{m=1}^{M} \frac{w(m)}{\sum_{m=1}^{M} w(m)} F_v(m)   (8)

F_{frp}(n) = \varphi(F_p, m) = \sum_{m=1}^{M} \frac{w(m)}{\sum_{m=1}^{M} w(m)} F_p(m)   (9)

where F_v and F_p are the full-resolution voxel-wise and pixel-wise features, and F_{frv} and F_{frp} are the features encoded from them, respectively. w(m) is the interpolation weight given in (10); it is calculated following the inverse distance weighting method, where r_{bp} is the neighbor radius used in the interpolation and \eta(m) and \eta(n) are the coordinates of elements m and n, respectively:

w(m) = \begin{cases} 1/\|\eta(m) - \eta(n)\|_2 & \|\eta(m) - \eta(n)\|_2 \le r_{bp} \\ 0 & \text{otherwise} \end{cases}   (10)

Finally, the fused point features are obtained by concatenating the fused features of each stage and the interpolated full-resolution features, as formulated in (11), where i indexes the sampling stages (four down-sampling and three up-sampling stages):

F_{PF} = Concatenate_{i=1,\dots,7}\big( F^i_{cv}, F^i_{cp}, F_{frv}, F_{frp} \big)   (11)
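The inverse-distance-weighted encoding of (8)-(10) can be sketched as follows; the neighbor radius value and the brute-force neighbor search are placeholders for illustration.

```python
import numpy as np

def idw_encode(src_xyz, src_feat, host_xyz, radius=1.6):
    """Inverse-distance-weighted interpolation, cf. (8)-(10).

    src_xyz:  (Nv, 3) coordinates of full-resolution voxels (or pixels).
    src_feat: (Nv, S) their features.
    host_xyz: (M, 3) middle host point coordinates.
    radius:   neighbour radius r_bp (placeholder value).
    Returns (M, S) interpolated features; hosts with no neighbour stay zero."""
    out = np.zeros((host_xyz.shape[0], src_feat.shape[1]), dtype=src_feat.dtype)
    for n, p in enumerate(host_xyz):
        dist = np.linalg.norm(src_xyz - p, axis=1)
        nbr = dist <= radius                        # Eq. (10): zero weight outside r_bp
        if not np.any(nbr):
            continue
        w = 1.0 / np.maximum(dist[nbr], 1e-8)       # inverse-distance weights
        out[n] = (w[:, None] * src_feat[nbr]).sum(0) / w.sum()   # Eq. (8)/(9)
    return out
```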
3.2.4 3D proposals generation

In this paper, the voxel feature map is used to assign anchors and generate 3D proposals, for three reasons. (i) The size of objects varies in the front-view 2D image according to their distance from the camera, but is similar in the 3D or BEV view, which makes it reasonable to set anchors there. (ii) Objects occlude each other in the front view but not in the 3D or BEV view, so assigning anchors on the image feature map would inevitably cause occlusion issues because bounding boxes cannot separate the objects. (iii) The voxel feature map is obtained from inputs with higher resolution, which provides fine-grained features for 3D proposal generation. As shown in Fig. 1, the backbone produces an 8x down-sampled voxel feature map with resolution S x D/8 x W/8 x H/8. It is first processed by a series of 2D convolution layers to obtain a BEV feature map with resolution SH/8 x D/8 x W/8. After that, an RPN similar to [27-29] is utilized to generate high-quality 3D proposals.
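The height-to-channel reshape that turns the S x D/8 x W/8 x H/8 voxel feature map into the SH/8 x D/8 x W/8 BEV map can be expressed in one line; the tensor layout and the toy sizes below are assumptions for illustration.

```python
import torch

# Assumed layout: (batch, channels S, depth D/8, width W/8, height H/8).
voxel_map = torch.randn(2, 128, 176, 200, 2)            # toy sizes for illustration

# Collapse the height axis into the channel axis:
# S x D/8 x W/8 x H/8  ->  (S * H/8) x D/8 x W/8.
B, S, D, W, H = voxel_map.shape
bev_map = voxel_map.permute(0, 1, 4, 2, 3).reshape(B, S * H, D, W)
print(bev_map.shape)                                     # torch.Size([2, 256, 176, 200])
```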
3.3 Stage II: Point-wise supervision and proposal refinement

In this section, Stage II of the proposed SRIF-RCNN, which aims to refine the 3D proposals and predict accurate 3D bounding boxes, is introduced in detail. The labels of 2D ground truth boxes provide object positions in the image, but due to occlusion, pixels from different objects cannot be clearly separated. For the labels of 3D ground truth boxes in LiDAR points, in contrast, the points of each object can be completely separated, so the ground truth information of the points can be used as regression masks to learn point-wise features such as intra-object part locations [27]. As discussed in Section 3.2, the proposed sensor fusion backbone produces fused point features with spatial, color and texture information, which makes it feasible to predict spatial-based and color-based properties of the foreground points, both of which are crucial for object detection. Therefore, three point-wise tasks are carried out in this stage: color distribution estimation, point score prediction and intra-object part location estimation. The fused point features are fed into two MLP blocks followed by sigmoid functions to predict the color distribution, point scores and intra-part locations.

3.3.1 Point-wise tasks

Color distribution estimation As discussed, 3D coordinates provide rich information for object detection and are made full use of in previous 3D object detection methods [26-28], but color information has not received enough attention. In fact, RGB information is also crucial for recognizing 3D objects, and reflective intensity helps in recognizing object parts with different textures. For example, wheels made of rubber are always black and have low reflectivity, while the large car body made of metal is in a uniform color such as red, yellow or silver and has high reflective intensity. In other words, the wheel points at the four lower locations of a car are black with low reflective intensity, while almost all the other points around the car center have a uniform color and high reflective intensity. Therefore, color distribution estimation is carried out in this section. A new parameter named distance-color is proposed to measure the spatial distribution of the colors of objects; it is encoded by the RGB value, the reflective intensity and the distance to the car center.

Take a ground truth box as an example, and denote (C_{i,r}, C_{i,g}, C_{i,b}) and (c_{i,r}, c_{i,g}, c_{i,b}) as the suppressed RGB value and the ground truth RGB value, respectively, where i indexes the foreground points in the ground truth box. Firstly, the ground truth RGB values of the points are suppressed to [0, 1]:

(C_{i,r}, C_{i,g}, C_{i,b}) = \big( c_{i,r}/255,\; c_{i,g}/255,\; c_{i,b}/255 \big)   (12)

Secondly, the distance between the foreground points in the ground truth box and the ground truth box center is calculated by (13), where the subscript c denotes the ground truth box center and the subscript gt denotes the ground truth:

Dist_i = \frac{ \|(x_i, y_i, z_i) - (x_c, y_c, z_c)\|_2 }{ \|(w_{gt}, h_{gt}, l_{gt})\|_2 }   (13)

In order to integrate intensity information, the reflective intensity RI_i is multiplied with the suppressed RGB value (C_{i,r}, C_{i,g}, C_{i,b}) and the relative distance Dist_i. Finally, the proposed distance-color is obtained as (14):

DC_i = \big( R_i^{(dc)}, G_i^{(dc)}, B_i^{(dc)} \big) = RI_i \cdot Dist_i \cdot \big( C_{i,r}, C_{i,g}, C_{i,b} \big)   (14)

Since C_{i,r}, C_{i,g}, C_{i,b}, Dist_i and RI_i all lie in [0, 1], the distance-color DC also lies in [0, 1]; a binary cross entropy loss is therefore applied to the points, formulated as (15), where the subscript pd denotes the network prediction:

L^{(dc)} = -DC_{pd} \log DC_{gt} - (1 - DC_{pd}) \log(1 - DC_{gt})   (15)
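The distance-color targets of (12)-(14) and the loss of (15) reduce to a few lines of NumPy; the sketch below mirrors the formulas as printed and is not taken from the authors' code.

```python
import numpy as np

def distance_color_targets(points_rgb, points_intensity, points_xyz,
                           box_center, box_whl):
    """Distance-color targets of (12)-(14) for the foreground points of one
    ground-truth box.  RGB values are in [0, 255]; reflective intensity is
    assumed to be already normalized to [0, 1]."""
    C = points_rgb / 255.0                                          # Eq. (12)
    dist = (np.linalg.norm(points_xyz - box_center, axis=1)
            / np.linalg.norm(box_whl))                              # Eq. (13)
    return points_intensity[:, None] * dist[:, None] * C            # Eq. (14)


def distance_color_bce(dc_pred, dc_gt, eps=1e-7):
    """Binary cross entropy between predicted and target distance-colors,
    mirroring (15) as printed."""
    dc_pred = np.clip(dc_pred, eps, 1 - eps)
    dc_gt = np.clip(dc_gt, eps, 1 - eps)
    return float(np.mean(-dc_pred * np.log(dc_gt)
                         - (1 - dc_pred) * np.log(1 - dc_gt)))
```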
Point segmentation and intra-object part location estimation In large-scene LiDAR point clouds, foreground points are far fewer than background points, but they should contribute more than the background points. Therefore, the foreground segmentation task [28] is carried out to predict foreground scores that re-weight the fused point features. To overcome the imbalance between foreground and background points, focal loss [59] is used to calculate the foreground segmentation loss. For intra-object part location estimation, we follow the definition and settings in [27] to define the relative positions and calculate the loss.

3.3.2 3D proposal refinement

For 3D proposal refinement, the point features in a 3D proposal should be aggregated and pooled to the same size. In this section, we divide the 3D proposals into evenly distributed grid points (voxel centers) following [27, 28], and aggregate the point features to the grid points by the set abstraction operation. Specifically, the fused point features are first re-weighted by the predicted scores obtained in Section 3.3.1 and concatenated with the distance-colors and intra-part locations, as listed in (16):

F = Concatenate\big( F_{fused}, R_{pd}^{(dc)}, G_{pd}^{(dc)}, B_{pd}^{(dc)}, x_{pd}^{(part)}, y_{pd}^{(part)}, z_{pd}^{(part)} \big)   (16)

Then, n x n x n grid points are evenly sampled inside each 3D proposal along the three dimensions, and F is aggregated to the sampled grid points by the set abstraction operation; the equation is similar to (4) and listed in (17):

F_{ag} = Concatenate_{r=r_1, r_2}\big( Max( MLP( G(F, r) ) ) \big)   (17)

As shown in Fig. 1, after aggregating the discriminative point features F to the grid points, the aggregated grid point features F_{ag} are flattened to represent the whole feature of the 3D proposal. Based on the flattened proposal features, the box confidence scores and the 3D bounding box sizes are predicted by two fully connected branches. A 3D IoU-based cross entropy loss [26-28] is utilized for the box confidence loss, and the classical Smooth-L1 loss [3] is used for the box residual regression loss.
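A simplified sketch of the aggregation in (16) and (17) is given below: the concatenated point features are grouped around each RoI grid point with two radii and max-pooled. The shared MLP that the paper applies before the max is omitted here for brevity, and the radius values are placeholders.

```python
import numpy as np

def aggregate_to_grid(point_xyz, point_feat, grid_xyz, radii=(0.8, 1.6)):
    """Simplified sketch of (17): radius grouping G(., r) followed by Max(.),
    concatenated over the two grouping radii."""
    out = []
    for r in radii:
        pooled = np.zeros((grid_xyz.shape[0], point_feat.shape[1]),
                          dtype=point_feat.dtype)
        for g, centre in enumerate(grid_xyz):
            nbr = np.linalg.norm(point_xyz - centre, axis=1) <= r   # G(., r)
            if np.any(nbr):
                pooled[g] = point_feat[nbr].max(axis=0)             # Max(.)
        out.append(pooled)
    return np.concatenate(out, axis=1)                              # over radii

# Eq. (16): the input features are the fused point features concatenated with the
# predicted distance-colors and intra-object part locations, e.g.
#   F = np.concatenate([f_fused, dc_pred, part_pred], axis=1)
```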
3.4 Overall loss

The overall loss of the proposed network is composed of the region proposal loss L_{rpn}, the middle host point loss L_{mhp} and the proposal refinement loss L_{rcnn} with equal weights.

Region proposal loss For the region proposal loss L_{rpn}, we adopt a loss similar to PV-RCNN [28], which is composed of L_{rpn}^{(cls)}, L_{rpn}^{(reg)} and L_{rpn}^{(dir)}. Facing the imbalance between positive and negative anchors, focal loss [59] with default hyper-parameters is used to calculate the classification loss L_{rpn}^{(cls)}. For the box regression loss L_{rpn}^{(reg)}, the regression targets are encoded into residuals following (18), where x, y and z are the center coordinates, w, l and h are the width, length and height, \theta is the yaw angle and d is the diagonal of the box base; ground truth and anchor are indicated by the superscripts gt and a, respectively:

res = \Big( \frac{x^{gt} - x^a}{d^a}, \frac{y^{gt} - y^a}{d^a}, \frac{z^{gt} - z^a}{h^a}, \log\frac{w^{gt}}{w^a}, \log\frac{l^{gt}}{l^a}, \log\frac{h^{gt}}{h^a}, \sin(\theta^{gt} - \theta^a) \Big)   (18)

The Smooth-L1 loss is then adopted to regress the predicted residuals of the anchors, as listed in (19):

L_{rpn}^{(reg)} = \text{Smooth-L1}\big( res_{pd}, res_{gt} \big)   (19)

L_{rpn}^{(dir)} is a direction classification loss that handles the ambiguity between the two opposite directions and is calculated by a cross entropy loss. To summarize, the overall region proposal loss is obtained by (20):

L_{rpn} = L_{rpn}^{(cls)} + 2 L_{rpn}^{(reg)} + 0.2 L_{rpn}^{(dir)}   (20)

Middle host point loss The middle host point loss L_{mhp} is composed of the distance-color loss L_{mhp}^{(dc)}, the foreground segmentation loss L_{mhp}^{(seg)} and the intra-object part location loss L_{mhp}^{(ip)}. For the distance-color loss L_{mhp}^{(dc)} and the intra-part location loss L_{mhp}^{(ip)}, binary cross entropy loss [27] is utilized. Facing a similar imbalance issue as L_{rpn}^{(cls)}, L_{mhp}^{(seg)} is also calculated by focal loss with default hyper-parameters for point foreground-background segmentation. The middle host point loss is as follows:

L_{mhp} = L_{mhp}^{(dc)} + L_{mhp}^{(seg)} + L_{mhp}^{(ip)}   (21)

Proposal refinement loss As discussed in Section 3.3.2, the proposal refinement loss L_{rcnn} utilizes the 3D IoU-based cross entropy loss L_{rcnn}^{(cd)} for confidence prediction, the Smooth-L1 loss L_{rcnn}^{(reg)} for 3D box refinement and the cross entropy loss L_{rcnn}^{(dir)} for orientation direction classification. The box regression targets used in the Smooth-L1 loss are encoded in the same way as (18). The proposal refinement loss is listed in (22):

L_{rcnn} = L_{rcnn}^{(cd)} + L_{rcnn}^{(reg)} + L_{rcnn}^{(dir)}   (22)
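The residual encoding of (18) and the loss weighting of (20) are straightforward to write down; the sketch below uses an illustrative (x, y, z, w, l, h, theta) box layout and is not the authors' implementation.

```python
import numpy as np

def encode_box_residuals(gt, anchor):
    """Regression targets of (18).  Boxes are (x, y, z, w, l, h, theta);
    d is the diagonal of the anchor base."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    d = np.sqrt(wa ** 2 + la ** 2)
    return np.array([(xg - xa) / d, (yg - ya) / d, (zg - za) / ha,
                     np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
                     np.sin(tg - ta)])


def rpn_loss(l_cls, l_reg, l_dir):
    """Overall region proposal loss with the weights of (20)."""
    return l_cls + 2.0 * l_reg + 0.2 * l_dir
```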
4 Experiments

In this section, experiments are conducted on the frequently used KITTI dataset. Section 4.1 introduces the KITTI dataset and the implementation details of training and inference. Section 4.2 presents the performance comparison with state-of-the-art approaches. In Section 4.3, several extensive experiments are implemented to demonstrate the effectiveness, usefulness and universality of the proposed sensor fusion strategy. Several qualitative results are illustrated and analyzed in Section 4.4 to show the performance of the proposed network. Section 4.5 lists some future works that could further improve the 3D detection performance.

4.1 Details of KITTI dataset and experiment implementation

Dataset details The KITTI dataset [60], one of the most authoritative and competitive datasets in the field of autonomous driving, was used in our experiments. There are 7481 training samples and 7518 test samples in the KITTI object detection task, with a [0, -40, -1] m to [70.4, 40, 3] m 3D detection range along the depth, width and height dimensions. Following the frequently used partition [24-28], we split the training samples (LiDAR points and corresponding RGB images) into a 3712-sample training split for training and a 3769-sample validation split for parameter tuning and ablation studies. To evaluate the performance on the KITTI test server, 6000 samples were randomly selected from the 7481 training samples as the training split, and the remaining 1481 samples were used as the validation split. Since the vehicle is the most commonly used category in previous works, we evaluate our method on the car subset of the KITTI dataset.

Training and inference details In the training stage for the 3712-sample training split, a set of 2048 points was sampled as the middle host points by the farthest point sampling strategy. The proposed network was trained on a single RTX 2080Ti GPU, and adaptive moment estimation (Adam) was used as the optimizer. The batch size, learning rate and number of training epochs were 2, 0.00125 and 50, respectively, and the learning rate was decayed following a cosine annealing strategy. One complete training period took about 35 hours. In our experiments, 128 proposals (64 positive and 64 negative) were randomly selected for proposal refinement; specifically, a proposal with more than 0.55 IoU with its corresponding ground truth box is considered positive, otherwise it is negative. For data augmentation, the input points were (i) randomly flipped along the X axis and globally scaled with a value randomly sampled in [0.95, 1.05]; (ii) globally rotated around the Z axis by an angle randomly sampled in [-pi/4, pi/4]; and (iii) augmented with the "copy & paste" strategy [24, 28].
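The three global augmentations listed above can be sketched as follows; the box parameterization and the yaw update under flipping follow the usual KITTI convention and are our assumptions, and the ground-truth "copy & paste" augmentation is not shown.

```python
import numpy as np

def augment_scene(points, boxes, rng=None):
    """Global augmentations: random flip along the X axis, global scaling in
    [0.95, 1.05] and global rotation around the Z axis in [-pi/4, pi/4].
    points: (N, 4) x, y, z, intensity; boxes: (K, 7) x, y, z, w, l, h, yaw."""
    rng = rng or np.random.default_rng()

    # Random flip along the X axis (mirror the Y coordinate and the yaw).
    if rng.random() < 0.5:
        points[:, 1] *= -1.0
        boxes[:, 1] *= -1.0
        boxes[:, 6] *= -1.0

    # Global scaling of coordinates and box sizes.
    s = rng.uniform(0.95, 1.05)
    points[:, :3] *= s
    boxes[:, :6] *= s

    # Global rotation around the Z axis.
    a = rng.uniform(-np.pi / 4, np.pi / 4)
    c, si = np.cos(a), np.sin(a)
    rot = np.array([[c, -si, 0.0], [si, c, 0.0], [0.0, 0.0, 1.0]])
    points[:, :3] = points[:, :3] @ rot.T
    boxes[:, :3] = boxes[:, :3] @ rot.T
    boxes[:, 6] += a
    return points, boxes
```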
In the inference stage, the top 100 proposals according to the confidence scores obtained from Stage I are selected and fed into Stage II for further refinement. Finally, rotated NMS with a threshold of 0.1 is adopted to remove redundant predicted boxes.

4.2 3D detection results comparison with state-of-the-art approaches

Results on test set For an authoritative and fair comparison with other state-of-the-art methods, 3D detection results on the KITTI test set were produced and uploaded to the official KITTI test server. Table 1 shows the performance of the proposed network and of the sensor fusion methods published in the last three years on the official KITTI 3D object detection leaderboard as of July 15th, 2021. In Tables 1-5, 7 and 10-11, bold and underlined numbers indicate the best and second-best results, respectively; in Tables 6, 8 and 9, bold numbers indicate beneficial improvements. To evaluate the detection results on the KITTI test set, we use the authoritatively recommended metric [5, 61], average precision (AP) with 40 recall positions, and 0.7 is used as the IoU threshold for prediction assignment. For a comprehensive comparison with existing methods, we also list the results of recently published high-performance LiDAR-only methods on the KITTI test leaderboard in Table 2.
Table 1 Performance comparison on the KITTI official test server among sensor fusion methods. Results were reported by AP with 40 recall positions. Columns: Method, Reference, Car BEV AP (Easy / Mod. / Hard / mAP), Car 3D AP (Easy / Mod. / Hard / mAP), inference time (ms)
MV3D [12] CVPR2017 86.49 78.98 72.23 79.23 74.97 63.63 54.00 64.20 360
AVOD [13] IROS2018 90.99 84.82 79.62 85.14 83.07 71.76 65.73 73.52 100
UberATG-ContFuse [14] ECCV2018 94.07 85.35 75.88 85.10 83.68 68.78 61.67 71.38 60
F-PointNet [16] CVPR2018 91.17 84.67 74.77 83.54 82.19 69.79 60.59 70.86 170
F-ConvNet [19] CVPR2019 91.51 85.84 76.11 84.49 87.36 76.39 66.69 76.81 470
UberATG-MMF [21] CVPR2019 93.67 88.21 81.99 87.96 88.40 77.43 70.22 78.68 80
3D-CVF [54] ECCV2020 93.52 89.56 82.45 88.51 89.20 80.05 73.11 80.79 75
CLOCs PVCas [55] IROS2020 93.05 89.80 86.57 89.81 88.94 80.67 77.15 82.25 100
PI-RCNN [18] AAAI2020 91.44 85.81 81.00 86.08 84.37 74.82 70.03 76.41 100
PointPainting [20] CVPR2020 92.45 88.11 83.36 87.97 82.11 71.70 67.08 73.63 400
SRIF-RCNN(Ours) - 92.10 88.77 86.06 88.98 88.45 82.04 77.54 82.68 94.69
Table 2 Performance comparison on the KITTI official test server among LiDAR-only methods. Results were reported by AP with 40 recall positions. Columns: Method, Reference, Car BEV AP (Easy / Mod. / Hard / mAP), Car 3D AP (Easy / Mod. / Hard / mAP), inference time (ms)
SECOND [24] SENSORS2018 89.39 83.77 78.59 83.92 83.34 72.55 65.82 73.90 250
PointPillars [25] CVPR2019 90.07 86.56 82.81 86.48 82.58 74.31 68.99 75.29 23.6
PointRCNN [30] CVPR2019 92.13 87.39 82.72 87.41 86.96 75.64 70.70 77.77 100
STD [42] ICCV2019 94.74 89.19 86.42 90.12 87.95 79.71 75.09 80.92 80
RangeRCNN [57] arXiv2020 92.15 88.40 85.74 88.76 88.47 81.33 77.09 82.30 60
SA-SSD [26] CVPR2020 95.03 91.03 85.96 90.67 88.75 79.79 74.16 80.9 40.1
EBM3DOD [51] arXiv2020 95.44 89.63 84.34 89.80 91.05 80.12 72.78 81.32 120
TANET [31] AAAI2020 91.58 86.54 81.19 86.44 84.39 75.94 68.82 76.38 34.75
Point-GNN [52] CVPR2020 93.11 89.17 83.90 88.73 88.33 79.47 72.29 80.03 643
3DSSD [45] CVPR2020 92.66 89.02 85.86 89.18 88.36 79.57 74.55 80.83 38
PartA2Net [27] IEEEPAMI2020 91.70 87.79 84.61 88.03 87.81 78.49 73.51 79.94 80
PV-RCNN [28] CVPR2020 94.98 90.65 86.14 90.59 90.25 81.43 76.82 82.83 80
DVFENet [53] NeuroComputing2021 90.93 87.68 84.40 87.67 86.20 79.18 74.58 79.99 33.1
P2V-RCNN [40] IEEEAccess2021 92.72 88.63 86.14 89.16 88.34 81.45 77.20 82.33 100
SIE-NET [34] arXiv2021 92.38 88.65 86.03 89.02 88.22 81.71 77.22 82.38 80
Voxel-RCNN [39] AAAI2021 94.85 88.83 86.13 89.94 90.90 81.62 77.06 83.19 40
FromVoxelToPoint [41] ACM2021 92.23 88.61 86.11 88.98 88.53 81.58 77.37 82.49 100
CIA-SSD [33] AAAI2021 93.74 89.84 82.39 88.66 89.59 80.28 72.87 80.91 30.76
SE-SSD [32] CVPR2021 95.68 91.84 86.72 91.41 91.49 82.54 77.15 83.73 30.56
SRIF-RCNN(Ours) 92.10 88.77 86.06 88.98 88.45 82.04 77.54 82.68 94.69
As shown in Table 1, the proposed method ranks 2nd in hard and average BEV detection, which shows balanced performance across the three difficulty levels. Among the methods, the best performance for easy BEV detection is achieved by UberATG-ContFuse, but its moderate and hard BEV detection results show that it cannot provide high-quality BEV detections when the objects are relatively difficult to detect (occluded or distant). CLOCs PVCas achieves the best performance in moderate and hard BEV detection. For 3D detection, however, the proposed SRIF-RCNN outperforms all the other sensor fusion methods in the important moderate subset and the difficult hard subset: it leads the 2nd-best CLOCs PVCas by 1.37% and 0.39% on the moderate and hard subsets, respectively. Although the AP for easy cars is not in the top 2, the average AP of the proposed SRIF-RCNN still ranks 1st among all sensor fusion methods. The results show that the proposed SRIF-RCNN achieves balanced and comparable performance in BEV detection and top performance in 3D detection among sensor fusion methods. As shown in the last column of Table 1, UberATG-ContFuse is the most efficient sensor fusion network; however, its AP falls behind the proposed method by large margins (4.77%, 13.26%, 15.87% and 11.3% for the respective subsets). Compared with 3D-CVF and CLOCs PVCas, which have relatively high detection precision, the proposed method runs second fastest while achieving the best performance. The proposed sensor fusion process requires about 15 extra milliseconds compared with the LiDAR baseline.

As shown in Tables 1 and 2, SE-SSD shows top performance in both BEV and 3D detection among LiDAR-only and sensor fusion methods. It uses the simple structure of CIA-SSD [33] and introduces a knowledge distillation mechanism in the training process. As shown in Table 2, CIA-SSD itself only ranks 9th in the moderate car subset and does not achieve state-of-the-art performance; it can therefore be deduced that the performance of SE-SSD is mainly attributed to the employed knowledge distillation mechanism. The proposed sensor fusion SRIF-RCNN still achieves second place in the commonly used moderate subset and first place in the difficult hard subset, which reflects a certain competitiveness for moderate and hard object detection compared with the top LiDAR-only method.

Results on validation set Following previous works, experiments were also conducted on the 3769-sample validation set to validate the effectiveness of the proposed network. Because almost all sensor fusion methods only report AP with 11 recall positions on the validation set, we use this metric to evaluate the performance. The performance comparison on the validation set among sensor fusion methods is reported in Table 3.

As shown in Table 3, the proposed SRIF-RCNN outperforms all the other sensor fusion methods on the important moderate and hard subsets by a large margin. Specifically, the proposed method leads the 2nd-best 3D-CVF by 4.82% and 0.71% on the moderate and hard subsets. In addition, it also ranks 1st on average and leads 3D-CVF by 1.74%, which shows a balanced capability to handle objects with different detection difficulties. Because CLOCs PVCas provided results with 40 recall positions, we compare the proposed method with CLOCs PVCas in Table 4 using 40 recall positions.
Table 3 Performance comparison on 3769 validation set among sensor fusion methods. Results were reported by AP with 11 recall positions
As shown in Table 4, CLOCs PVCas outperforms the proposed method on the 3769-sample validation set. We attribute this to the two detectors employed in CLOCs PVCas (PV-RCNN [28] and Cascade-RCNN [62]), which are mature 3D and 2D detectors and give satisfactory 3D and 2D detection results; based on these results, CLOCs PVCas achieves improved performance on the 3769-sample validation set. It should be noted, however, that the data distributions of the 3712-sample training set and the 3769-sample validation set are consistent, while the data distribution of the authoritative test set differs from that of the training and validation sets. Although the proposed method is worse than CLOCs PVCas on the 3769-sample validation set, it achieves better performance on the authoritative test set, as shown in Table 1, which indicates good generalization ability. Because real scenes are variable and rarely consistent, such an ability is important in real applications. On the other hand, CLOCs PVCas needs to generate 2D and 3D candidates from the 2D and 3D object detectors first and then fuse the candidates to obtain more accurate 3D boxes; such a strategy needs more time than the proposed method, as can be seen from the last column of Table 1.

Table 4 Performance comparison with CLOCs PVCas on the 3769-sample validation set. Results were reported by AP with 40 recall positions. Columns: Method, Reference, Car 3D AP (Easy / Mod. / Hard / mAP)
CLOCs PVCas [55] IROS2020 92.78 85.94 83.25 87.32
SRIF-RCNN (Ours) - 92.23 85.43 83.17 86.94

We also report a performance comparison among high-performance LiDAR-only methods in Table 5 for a comprehensive comparison. As shown in Table 5, the proposed SRIF-RCNN also shows state-of-the-art capability in detecting hard objects. Besides, the proposed method outperforms the other LiDAR-only methods in the important moderate subset except SE-SSD, and ranks 1st on average except for SIENET and SE-SSD.

In addition, several state-of-the-art methods [27, 28, 30] with well open-sourced code were re-tested in the same experimental environment, and the recalls of RoIs and of predicted boxes are reported in Fig. 5. As shown in Fig. 5, PartA2Net-free [27] and Point-RCNN [30], which use a point-based proposal generation strategy, have low recall of RoIs, while PartA2Net [27], PV-RCNN [28] and the proposed SRIF-RCNN, which use a voxel-based proposal generation scheme, achieve better recall of RoIs. It can also be seen from Fig. 5 that the proposed SRIF-RCNN is able to generate high-quality proposals and predict accurate 3D bounding boxes with the help of the proposed sensor fusion strategy and the rich multi-modal information in the point features.

It has been shown that point-wise supervision influences the performance of the network [26]. PV-RCNN only carries out the foreground segmentation task for point-wise supervision, while the proposed SRIF-RCNN carries out the extra color distribution and intra-part location estimation tasks. In Table 6, the improvements of the proposed SRIF-RCNN over PV-RCNN are reported. As shown in Table 6, the recall of RoIs is increased by 0.86%, which shows that the extra point-wise supervision tasks contribute to an improvement in proposal quality. The improvement in the recall of predicted boxes shows that the intra-object part locations [27] and the proposed color distribution make the fused point features more discriminative for accurate 3D proposal refinement.
Table 5 Performance comparison on the 3769 validation set among LiDAR-only methods. Results were reported by AP with 11 recall positions. Columns: Method, Reference, Car 3D AP (Easy / Mod. / Hard)
Table 6 Improvements of the proposed SRIF-RCNN over PV-RCNN. Results were reported by AP with 40 recall positions. Columns: Method, Recall of RoI (IoU=0.7), Recall of predicted boxes (IoU=0.7)
Table 7 Effects of different components. Results were reported by AP with 40 recall positions
Voxel Flow Point Flow Pixel Flow Point Segmentation Intra-object Part location Color Distribution FCM Easy Mod. Hard
√ √ √ √  91.86 84.55 82.60
√ √ √ √ √  92.24 84.67 82.77
√ √ √ √ √ √  91.97 85.12 82.89
√ √ √ √ √ √  92.15 85.16 82.98
√ √ √ √ √ √ √  92.23 85.43 83.17
The point score predicted by the point segmentation head weighs the probability that a point belongs to the foreground or the background; that is, foreground points are expected to have high scores and background points low scores. Figure 6 shows the mean foreground and background point scores of the proposed method and of PV-RCNN. PV-RCNN predicts point scores using point features obtained from the LiDAR sensor only, while the proposed SRIF-RCNN predicts point scores using the fused point features obtained from different sensors. As shown in Fig. 6, the background point scores are effectively suppressed by the proposed sensor fusion strategy and are lower than those of PV-RCNN, although the boundaries of the foreground point scores remain ambiguous.
Table 8 Mean point scores of SRIF-RCNN and PV-RCNN. Columns: Method, Mean foreground point score, Mean background point score
Table 10 Performance comparison of FCM and other feature fusion methods. Results were reported by AP with 40 recall positions. Columns: Method, Car 3D AP (%)
To sum up, the employed encoder-decoder structure is optimal for our sensor fusion strategy.

4.4 Qualitative results

In this section, several prediction results on the KITTI test set are illustrated. Figure 7 shows 3D prediction results in six scenes, where the upper row of each scene shows the 3D point cloud and the detected 3D bounding boxes, and the lower row shows the corresponding real-world 2D image with 2D bounding boxes projected from the 3D bounding boxes. As illustrated in Fig. 7, the proposed SRIF-RCNN gives satisfactory 3D detection results in various scenes; cars are well detected even when they are distant or occluded by other objects.

Figure 8 shows 3D detection results of SRIF-RCNN and PV-RCNN in six scenes of the KITTI test set, where the detected 3D bounding boxes of SRIF-RCNN are shown in the upper part of each scene and those of PV-RCNN in the lower part; red circles in the lower part mark false detections. As shown in the three scenes of Fig. 8a, PV-RCNN misses some distant targets, while the proposed SRIF-RCNN successfully detects them. We consider this to be because distant objects contain only a few points, which lack geometry information for LiDAR-only methods, whereas the proposed SRIF-RCNN effectively detects such objects with the help of the extra color information. As shown in the three scenes of Fig. 8b, grasses, a glass guardrail and a ladder are incorrectly detected as cars by PV-RCNN. As grasses differ from cars in almost every way in RGB images, the proposed SRIF-RCNN does not identify them as cars. Because the glass guardrail and the ladder look similar to cars from some views of the LiDAR point cloud, they can be ambiguous for LiDAR-only methods in some cases; for the proposed SRIF-RCNN, such cases are unlikely to happen thanks to the different sensors. The results of SRIF-RCNN in Fig. 8 show that it successfully identifies such ambiguous objects and gives satisfactory 3D bounding boxes.

4.5 Future works

As discussed, the proposed SRIF-RCNN achieves state-of-the-art precision compared with previous sensor fusion methods. However, some concerns should be further studied. (i) Throughout the works published in recent years, feature extraction backbones based on sparse convolutions have been widely employed, and the backbone structure has not changed much. Meanwhile, transformer [64] structures have shown dominant performance in many traditional computer vision tasks such as image classification [65, 66] and 2D object detection [67].
Several researchers [63, 68, 69] have explored ways to introduce such structures into the 3D object detection task and achieved good performance. However, transformer structures usually incur a large cost in computation and storage.
of study, whereas sensor fusion methods need to process fusion methods for 3D detection, and achieves competitive
extra inputs. (ii) Compared with the research on network performance with state-of-the-art LiDAR-only methods
(ii) Compared with research on the network structure, mechanisms such as knowledge distillation could be another direction worth studying to improve 3D detection performance. Specifically, powered by a knowledge distillation mechanism, SE-SSD [32] trained a simple structure and achieved top performance for 3D detection. On the other hand, knowledge distillation can also improve the efficiency of sensor fusion methods, since the student structure is simpler than the teacher structure.
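To make the mechanism concrete, the sketch below shows a generic soft-target distillation loss in PyTorch, in which a lightweight student is trained to match the temperature-softened class logits of a frozen teacher. This is a textbook formulation given for illustration only; it is not the exact consistency loss used by SE-SSD [32].

```python
# Sketch: generic soft-target knowledge distillation loss (illustrative only).
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)


student_logits = torch.randn(8, 3, requires_grad=True)  # e.g. 8 proposals, 3 classes
teacher_logits = torch.randn(8, 3)                      # predictions of a frozen teacher
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```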
In addition, BtcDet and SPG [70, 71] show the importance of shape augmentation mechanisms for 3D object detection. The progressive resizing technique [72] has also proved effective for prediction tasks and could be incorporated into the training process.
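The schedule below sketches how progressive resizing might be adapted to the training process; in [72] the technique is applied to chest X-ray classification, and the scales, epoch counts and base resolution used here (roughly the KITTI image size) are purely illustrative assumptions.

```python
# Sketch: a coarse-to-fine progressive-resizing schedule (illustrative only).
def resize_schedule(base_hw=(375, 1242)):  # assumed, approximate KITTI image size
    """Yield (height, width, epochs) training stages from coarse to full resolution."""
    for scale, epochs in [(0.5, 10), (0.75, 10), (1.0, 20)]:
        yield int(base_hw[0] * scale), int(base_hw[1] * scale), epochs


for h, w, epochs in resize_schedule():
    # In a real pipeline the dataloader would be rebuilt here so that the inputs
    # are resampled to (h, w) before training for `epochs` epochs at this stage.
    print(f"train {epochs} epochs at {h}x{w}")
```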
(iii) The adaptation of sensor fusion methods to practical use should also be further studied. For sensor fusion methods, RGB images or LiDAR points may be lost in real applications, and the performance will then degrade. Several co-learning [73] methods have been proposed to address such issues in classification tasks (image, video, sound and media events), sentiment analysis, video activity recognition and vision-language tasks. Based on these works, we will explore appropriate ways to apply co-learning strategies to the 3D object detection task to improve the adaptation of sensor fusion methods.
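One simple strategy in this spirit, sketched below purely for illustration, is modality dropout: during training, the point-wise features of one randomly chosen sensor are occasionally zeroed so that the fused detector does not over-rely on either input and degrades more gracefully when a sensor stream is missing. The function name, feature shapes and dropout probability are assumptions, not details of SRIF-RCNN or of the methods surveyed in [73].

```python
# Sketch: modality dropout for sensor fusion robustness (illustrative only).
import torch


def modality_dropout(lidar_feats, image_feats, p_drop: float = 0.2, training: bool = True):
    """lidar_feats, image_feats: (num_points, channels) point-wise features."""
    if training and torch.rand(1).item() < p_drop:
        # Zero exactly one modality, chosen at random, for this sample.
        if torch.rand(1).item() < 0.5:
            lidar_feats = torch.zeros_like(lidar_feats)
        else:
            image_feats = torch.zeros_like(image_feats)
    return torch.cat([lidar_feats, image_feats], dim=-1)


fused = modality_dropout(torch.randn(2048, 64), torch.randn(2048, 32))
print(fused.shape)  # torch.Size([2048, 96])
```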
(iv) In recent years, great progress has been made in 3D object detection powered by DCNNs, and companies such as Tesla and Baidu have begun to test their autonomous vehicles in real scenes. However, the complex, opaque and black-box nature of deep neural nets, especially of complicated multi-modal networks, limits their social acceptance and usability. This leads to the requirement for the interpretability of AI models. To this end, explainable AI (XAI) techniques [74] should be further studied to explain the internal working mechanisms of AI models. Besides promoting social acceptance, understanding these internal mechanisms can also help improve the design of new methods towards wide application, which is meaningful for the autonomous driving industry.
5 Conclusions

In this paper, a sensor fusion 3D object detection network, SRIF-RCNN, is proposed, which achieves top performance among all existing sensor fusion methods. The proposed SRIF-RCNN generates lightweight sparse front-view images for efficient image feature extraction for the first time. Color supervision is also introduced to obtain the color distribution of LiDAR points, which has never been explored in previous works. Experiment results on the authoritative and competitive KITTI dataset show that SRIF-RCNN outperforms all previous sensor fusion methods for 3D detection and achieves competitive performance with state-of-the-art LiDAR-only methods in some cases. In addition, extensive experiments were implemented, and the results show the effectiveness of each component of the proposed method. In our future work, we will continuously explore ways to lift the effectiveness, efficiency and adaptation of sensor fusion methods.

Acknowledgements The authors gratefully acknowledge the financial support from the National Natural Science Foundation of China (nos. 61501394 and 62173289) and the Natural Science Foundation of Hebei Province of China (no. F2016203155).

Declarations

Conflict of Interests The authors declare that they have no conflict of interest.

References

1. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
2. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on computer vision and pattern recognition, pp 779–788. https://doi.org/10.1109/CVPR.2016.91
3. Ren S, He K, Girshick R, Sun J (2017) Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
4. Yang W, Li Z, Wang C, Li J (2020) A multi-task faster r-cnn method for 3d vehicle detection based on a single image. Appl Soft Comput 95:106533. https://doi.org/10.1016/j.asoc.2020.106533
5. Simonelli A, Bulò SR, Porzi L, Lopez-Antequera M, Kontschieder P (2019) Disentangling monocular 3d object detection. In: 2019 IEEE/CVF International conference on computer vision, pp 1991–1999. https://doi.org/10.1109/ICCV.2019.00208
6. Chen X, Kundu K, Zhu Y, Ma H, Fidler S, Urtasun R (2018) 3d object proposals using stereo imagery for accurate object class detection. IEEE Trans Pattern Anal Mach Intell 40(5):1259–1272. https://doi.org/10.1109/TPAMI.2017.2706685
7. Mousavian A, Anguelov D, Flynn J, Kosecka J (2017) 3d bounding box estimation using deep learning and geometry. In: 2017 IEEE Conference on computer vision and pattern recognition, pp 7074–7082. https://doi.org/10.1109/CVPR.2017.597
8. Chabot F, Chaouch M, Rabarisoa J, Teulière C, Chateau T (2017) Deep manta: a coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In: 2017 IEEE Conference on computer vision and pattern recognition, pp 1827–1836. https://doi.org/10.1109/CVPR.2017.198
9. Xiang Y, Choi W, Lin Y, Savarese S (2017) Subcategory-aware convolutional neural networks for object proposals and detection. In: 2017 IEEE Winter conference on applications of computer vision, pp 924–933. https://doi.org/10.1109/WACV.2017.108
10. Wang Y, Chao W-L, Garg D, Hariharan B, Campbell M, Weinberger KQ (2019) Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition, pp 8437–8445. https://doi.org/10.1109/CVPR.2019.00864
11. You Y, Wang Y, Chao W-L, Garg D, Pleiss G, Hariharan B, Campbell M, Weinberger KQ (2019) Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv:1906.06310
12. Chen X, Ma H, Wan J, Li B, Xia T (2017) Multi-view 3d object detection network for autonomous driving. In: 2017 IEEE Conference on computer vision and pattern recognition, pp 6526–6534. https://doi.org/10.1109/CVPR.2017.691
13. Ku J, Mozifian M, Lee J, Harakeh A, Waslander SL (2018) Joint 3d proposal generation and object detection from view aggregation. In: 2018 IEEE/RSJ International conference on intelligent robots and systems, pp 1–8. https://doi.org/10.1109/IROS.2018.8594049
14. Liang M, Yang B, Wang S, Urtasun R (2018) Deep continuous fusion for multi-sensor 3d object detection. In: 2018 European conference on computer vision, pp 641–656
15. Xu D, Anguelov D, Jain A (2018) Pointfusion: Deep sensor fusion for 3d bounding box estimation. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp 244–253. https://doi.org/10.1109/CVPR.2018.00033
16. Qi CR, Liu W, Wu C, Su H, Guibas LJ (2018) Frustum pointnets for 3d object detection from rgb-d data. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp 918–927. https://doi.org/10.1109/CVPR.2018.00102
17. Du X, Ang MH, Karaman S, Rus D (2018) A general pipeline for 3d detection of vehicles. In: 2018 IEEE International conference on robotics and automation, pp 3194–3200. https://doi.org/10.1109/ICRA.2018.8461232
18. Xie L, Xiang C, Yu Z, Xu G, He X (2020) Pi-rcnn: an efficient multi-sensor 3d object detector with point-based attentive cont-conv fusion module. In: 2020 AAAI Conference on artificial intelligence, vol 34, pp 12460–12467. https://doi.org/10.1609/aaai.v34i07.6933
19. Wang Z, Jia K (2019) Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In: 2019 IEEE/RSJ International conference on intelligent robots and systems, pp 1742–1749. https://doi.org/10.1109/IROS40897.2019.8968513
20. Vora S, Lang AH, Helou B, Beijbom O (2020) Pointpainting: Sequential fusion for 3d object detection. In: 2020 IEEE/CVF Conference on computer vision and pattern recognition, pp 4603–4611. https://doi.org/10.1109/CVPR42600.2020.00466
21. Liang M, Yang B, Chen Y, Hu R, Urtasun R (2019) Multi-task multi-sensor fusion for 3d object detection. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition, pp 7337–7345. https://doi.org/10.1109/CVPR.2019.00752
22. Wu Y, Jiang X, Fang Z, Gao Y, Fujita H (2021) Multi-modal 3d object detection by 2d-guided precision anchor proposal and multi-layer fusion. Appl Soft Comput 108:107405. https://doi.org/10.1016/j.asoc.2021.107405
23. Tian Y, Wang K, Wang Y, Tian Y, Wang Z, Wang F-Y (2020) Adaptive and azimuth-aware fusion network of multimodal local features for 3d object detection. Neurocomputing 411:32–44. https://doi.org/10.1016/j.neucom.2020.05.086
24. Yan Y, Mao Y, Li B (2018) Second: Sparsely embedded convolutional detection. Sensors 18(10). https://doi.org/10.3390/s18103337
25. Lang AH, Vora S, Caesar H, Zhou L, Beijbom O (2019) Pointpillars: Fast encoders for object detection from point clouds. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition, pp 12689–12697. https://doi.org/10.1109/CVPR.2019.01298
26. He C, Zeng H, Huang J, Hua XS, Zhang L (2020) Structure aware single-stage 3d object detection from point cloud. In: 2020 IEEE/CVF Conference on computer vision and pattern recognition, pp 11870–11879. https://doi.org/10.1109/CVPR42600.2020.01189
27. Shi S, Wang Z, Shi J, Wang X, Li H (2021) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Trans Pattern Anal Mach Intell 43(8):2647–2664. https://doi.org/10.1109/TPAMI.2020.2977026
28. Shi S, Guo C, Jiang L, Wang Z, Shi J, Wang X, Li H (2020) Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In: 2020 IEEE/CVF Conference on computer vision and pattern recognition, pp 10526–10535. https://doi.org/10.1109/CVPR42600.2020.01054
29. Shi S, Jiang L, Deng J, Wang Z, Guo C, Shi J, Wang X, Li H (2021) Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. arXiv:2102.00463
30. Shi S, Wang X, Li H (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition, pp 770–779. https://doi.org/10.1109/CVPR.2019.00086
31. Liu Z, Zhao X, Huang T, Hu R, Bai X (2020) Tanet: Robust 3d object detection from point clouds with triple attention. In: 2020 AAAI Conference on artificial intelligence, vol 34, pp 11677–11684. https://doi.org/10.1609/aaai.v34i07.6837
32. Zheng W, Tang W, Jiang L, Fu C-W (2021) Se-ssd: Self-ensembling single-stage object detector from point cloud. In: 2021 IEEE/CVF Conference on computer vision and pattern recognition, pp 14494–14503
33. Zheng W, Tang W, Chen S, Jiang L, Fu C-W (2021) Cia-ssd: Confident iou-aware single-stage object detector from point cloud. In: 2021 AAAI Conference on artificial intelligence, vol 35, pp 3555–3562
34. Li Z, Yao Y, Quan Z, Yang W, Xie J (2021) Sienet: Spatial information enhancement network for 3d object detection from point cloud. arXiv:2103.15396
35. Yang Y, Chen F, Wu F, Zeng D, Ji Y-M, Jing X-Y (2020) Multi-view semantic learning network for point cloud based 3d object detection. Neurocomputing 397:477–485. https://doi.org/10.1016/j.neucom.2019.10.116
36. Yang B, Liang M, Urtasun R (2020) Hdnet: Exploiting hd maps for 3d object detection. arXiv:2012.11704
37. Yang B, Luo W, Urtasun R (2018) Pixor: Real-time 3d object detection from point clouds. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp 7652–7660. https://doi.org/10.1109/CVPR.2018.00798
38. Zhou Y, Tuzel O (2018) Voxelnet: End-to-end learning for point cloud based 3d object detection. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp 4490–4499. https://doi.org/10.1109/CVPR.2018.00472
39. Deng J, Shi S, Li P, Zhou W, Zhang Y, Li H (2021) Voxel r-cnn: Towards high performance voxel-based 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 1201–1209
40. Li J, Sun Y, Luo S, Zhu Z, Dai H, Krylov AS, Ding Y, Shao L (2021) P2v-rcnn: Point to voxel feature learning for 3d object detection from point clouds. IEEE Access 9:98249–98260. https://doi.org/10.1109/ACCESS.2021.3094562
41. Li J, Dai H, Shao L, Ding Y (2021) From voxel to point: Iou-guided 3d object detection for point cloud with voxel-to-point decoder. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3474085.3475314
42. Yang Z, Sun Y, Liu S, Shen X, Jia J (2019) Std: Sparse-to-dense 3d object detector for point cloud. In: 2019 IEEE/CVF International conference on computer vision, pp 1951–1960. https://doi.org/10.1109/ICCV.2019.00204
43. Qi CR, Litany O, He K, Guibas LJ (2019) Deep hough voting for 3d object detection in point clouds. In: 2019 IEEE/CVF International conference on computer vision, pp 9277–9286. https://doi.org/10.1109/ICCV.2019.00937
44. Li J, Luo S, Zhu Z, Dai H, Krylov AS, Ding Y, Shao L (2020) 3d iou-net: Iou guided 3d object detector for point clouds. arXiv:2004.04962
45. Yang Z, Sun Y, Liu S, Jia J (2020) 3dssd: Point-based 3d single stage object detector. In: 2020 IEEE/CVF Conference on computer vision and pattern recognition, pp 11040–11048. https://doi.org/10.1109/CVPR42600.2020.01105
46. Deng J, Zhou W, Zhang Y, Li H (2021) From multi-view to hollow-3d: Hallucinated hollow-3d r-cnn for 3d object detection. IEEE Trans Circuits Syst Video Technol 31(12):4722–4734. https://doi.org/10.1109/TCSVT.2021.3100848
47. Qi CR, Su H, Mo K, Guibas LJ (2017) Pointnet: Deep learning on point sets for 3d classification and segmentation. In: 2017 IEEE Conference on computer vision and pattern recognition, pp 652–660. https://doi.org/10.1109/CVPR.2017.16
48. Qi CR, Yi L, Su H, Guibas LJ (2017) Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv:1706.02413
49. Graham B, van der Maaten L (2017) Submanifold sparse convolutional networks. arXiv:1706.01307
50. Graham B, Engelcke M, Maaten LVD (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp 9224–9232. https://doi.org/10.1109/cvpr.2018.00961
51. Gustafsson FK, Danelljan M, Schön TB (2021) Accurate 3d object detection using energy-based models. In: 2021 IEEE/CVF Conference on computer vision and pattern recognition workshops, pp 2849–2858. https://doi.org/10.1109/CVPRW53098.2021.00320
52. Shi W, Rajkumar R (2020) Point-gnn: Graph neural network for 3d object detection in a point cloud. In: 2020 IEEE/CVF Conference on computer vision and pattern recognition, pp 1711–1719. https://doi.org/10.1109/CVPR42600.2020.00178
53. He Y, Xia G, Luo Y, Su L, Zhang Z, Li W, Wang P (2021) Dvfenet: Dual-branch voxel feature extraction network for 3d object detection. Neurocomputing 459:201–211. https://doi.org/10.1016/j.neucom.2021.06.046
54. Yoo JH, Kim Y, Kim J, Choi JW (2020) 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In: 16th European conference on computer vision. Springer Science and Business Media Deutschland GmbH, pp 720–736. https://doi.org/10.1007/978-3-030-58583-9_43
55. Pang S, Morris D, Radha H (2020) Clocs: Camera-lidar object candidates fusion for 3d object detection. In: 2020 IEEE/RSJ International conference on intelligent robots and systems, pp 10386–10393. https://doi.org/10.1109/IROS45743.2020.9341791
56. Milioto A, Vizzo I, Behley J, Stachniss C (2019) Rangenet++: Fast and accurate lidar semantic segmentation. In: 2019 IEEE/RSJ International conference on intelligent robots and systems. IEEE, pp 4213–4220. https://doi.org/10.1109/iros40897.2019.8967762
57. Liang Z, Zhang M, Zhang Z, Zhao X, Pu S (2020) Rangercnn: Towards fast and accurate 3d object detection with range image representation. arXiv:2009.00206
58. Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
59. Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2020) Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell 42(2):318–327. https://doi.org/10.1109/TPAMI.2018.2858826
60. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on computer vision and pattern recognition, pp 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074
61. KITTI 3D object detection benchmark leaderboard. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d. Accessed on 2021-7-15
62. Cai Z, Vasconcelos N (2019) Cascade r-cnn: High quality object detection and instance segmentation. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2019.2956516
63. Mao J, Xue Y, Niu M, Bai H, Feng J, Liang X, Xu H, Xu C (2021) Voxel transformer for 3d object detection. In: 2021 IEEE/CVF International conference on computer vision (ICCV), pp 3164–3173
64. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
65. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929
66. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International conference on computer vision (ICCV), pp 10012–10022
67. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: 2020 European conference on computer vision. Springer, pp 213–229. https://doi.org/10.1007/978-3-030-58452-8_13
68. Sheng H, Cai S, Liu Y, Deng B, Huang J, Hua X-S, Zhao M-J (2021) Improving 3d object detection with channel-wise transformer. In: 2021 IEEE/CVF International conference on computer vision (ICCV), pp 2743–2752
69. Guan T, Wang J, Lan S, Chandra R, Wu Z, Davis L, Manocha D (2022) M3detr: Multi-representation, multi-scale, mutual-relation 3d object detection with transformers. In: 2022 IEEE/CVF Winter conference on applications of computer vision (WACV), pp 772–782
70. Xu Q, Zhong Y, Neumann U (2021) Behind the curtain: Learning occluded shapes for 3d object detection. arXiv:2112.02205
71. Xu Q, Zhou Y, Wang W, Qi CR, Anguelov D (2021) Spg: Unsupervised domain adaptation for 3d object detection via semantic point generation. In: 2021 IEEE/CVF International conference on computer vision (ICCV), pp 15446–15456
72. Bhatt A, Ganatra A, Kotecha K (2021) Covid-19 pulmonary consolidations detection in chest x-ray using progressive resizing and transfer learning techniques. Heliyon 7(6):e07211. https://doi.org/10.1016/j.heliyon.2021.e07211
73. Rahate A, Walambe R, Ramanna S, Kotecha K (2022) Multimodal co-learning: challenges, applications with datasets, recent advances and future directions. Information Fusion 81:203–239. https://doi.org/10.1016/j.inffus.2021.12.003
74. Joshi G, Walambe R, Kotecha K (2021) A review on explainability in multimodal deep neural nets. IEEE Access 9:59800–59821. https://doi.org/10.1109/ACCESS.2021.3070212

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.