TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers
Xuyang Bai1  Zeyu Hu1  Xinge Zhu2  Qingqiu Huang2  Yilun Chen2  Hongbo Fu3  Chiew-Lan Tai1
1 Hong Kong University of Science and Technology   2 ADS, IAS BU, Huawei   3 City University of Hong Kong
Abstract

LiDAR and camera are two important sensors for 3D object detection in autonomous driving. Despite the increasing popularity of sensor fusion in this field, the robustness against inferior image conditions, e.g., bad illumination and sensor misalignment, is under-explored. Existing fusion methods are easily affected by such conditions, mainly due to a hard association of LiDAR points and image pixels, established by calibration matrices.

We propose TransFusion, a robust solution to LiDAR-camera fusion with a soft-association mechanism to handle inferior image conditions. Specifically, our TransFusion consists of convolutional backbones and a detection head based on a transformer decoder. The first layer of the decoder predicts initial bounding boxes from a LiDAR point cloud using a sparse set of object queries, and its second decoder layer adaptively fuses the object queries with useful image features, leveraging both spatial and contextual relationships. The attention mechanism of the transformer enables our model to adaptively determine where and what information should be taken from the image, leading to a robust and effective fusion strategy. We additionally design an image-guided query initialization strategy to deal with objects that are difficult to detect in point clouds. TransFusion achieves state-of-the-art performance on large-scale datasets. We provide extensive experiments to demonstrate its robustness against degenerated image quality and calibration errors. We also extend the proposed method to the 3D tracking task and achieve the 1st place on the nuScenes tracking leaderboard, showing its effectiveness and generalization capability. [code release]

1. Introduction

As one of the fundamental tasks in self-driving, 3D object detection aims to localize a set of objects in 3D space and recognize their categories. Thanks to the accurate depth information provided by LiDAR, early works such as VoxelNet [67] and PointPillars [14] achieve reasonably good results using only point clouds as input. However, these LiDAR-only methods are generally surpassed by methods using both LiDAR and camera data on large-scale datasets with sparser point clouds, such as nuScenes [1] and Waymo [42]. LiDAR-only methods are insufficient for robust 3D detection due to the sparsity of point clouds: small or distant objects, for example, are difficult to detect in the LiDAR modality, whereas such objects remain clearly visible and distinguishable in high-resolution images. The complementary roles of point clouds and images motivate researchers to design detectors utilizing the best of the two worlds, i.e., multi-modal detectors.

Existing LiDAR-camera fusion methods roughly fall into three categories: result-level, proposal-level, and point-level. The result-level methods, including FPointNet [29] and RoarNet [39], use off-the-shelf 2D detectors to seed 3D proposals, followed by a PointNet [30] for object localization. The proposal-level fusion methods, including MV3D [5] and AVOD [12], perform fusion at the region proposal level by applying RoIPool [31] in each modality for shared proposals. These coarse-grained fusion methods show unsatisfactory results since rectangular regions of interest (RoIs) usually contain lots of background noise. Recently, a majority of approaches have turned to point-level fusion and achieved promising results. They first find a hard association between LiDAR points and image pixels based on calibration matrices, and then augment the LiDAR features with the segmentation scores [46, 51] or CNN features [10, 22, 40, 47, 62] of the associated pixels through point-wise concatenation. Similarly, [16, 17, 50, 59] first project a point cloud onto the bird's eye view (BEV) plane and then fuse the image features with the BEV pixels.

Despite the impressive improvements, these point-level fusion methods suffer from two major problems, as shown in Fig. 1. First, they simply fuse the LiDAR features and image features through element-wise addition or concatenation, and thus their performance degrades seriously with low-quality image features, e.g., images under bad illumination conditions. Second, finding the hard association between sparse LiDAR points and dense image pixels not only wastes many image features with rich semantic information, but also relies heavily on high-quality calibration between the two sensors, which is usually hard to acquire due to the inherent spatial-temporal misalignment [63].

To address the shortcomings of the previous fusion approaches, we introduce an effective and robust multi-modal
[Figure 2 diagram: a LiDAR point cloud and multiview images pass through a 3D and a 2D backbone; a query initialization module feeds two sequential transformer decoder layers (the second with SMCA), each followed by a prediction FFN, with the image features fused in the second layer.]
Figure 2. Overall pipeline of TransFusion. Our model relies on standard 3D and 2D backbones to extract a LiDAR BEV feature map and an image feature map. Our detection head consists of two sequential transformer decoder layers: (1) The first layer produces initial 3D bounding boxes using a sparse set of object queries, initialized in an input-dependent and category-aware manner. (2) The second layer attentively associates and fuses the object queries (with initial predictions) from the first stage with the image features, providing rich texture and color cues for better detection results. A spatially modulated cross attention (SMCA) mechanism is introduced to add a locality inductive bias and help the network better attend to the related image regions. We additionally propose an image-guided query initialization strategy to bring image guidance onto the LiDAR BEV and help detect objects that are difficult to detect in sparse LiDAR point clouds.
… plane, the objects on the BEV plane are all in absolute scale …

3.3. Transformer Decoder and FFN

The decoder layer follows the design of DETR [23], and the detailed architecture is provided in the supplementary Sec. ??. The cross attention between object queries and the feature maps (either from point clouds or images) aggregates relevant context onto the object candidates, while the self attention between object queries reasons about pairwise relations between different object candidates. The query positions are embedded into a d-dimensional positional encoding with a Multilayer Perceptron (MLP) and element-wisely summed with the query features. This enables the network to reason about both context and position jointly.

The N object queries containing rich instance information are then independently decoded into boxes and class labels by a feed-forward network (FFN). Following CenterPoint [57], our FFN predicts the center offset from the query position as δx, δy, the bounding box height as z, the size l, w, h as log(l), log(w), log(h), the yaw angle α as sin(α), cos(α), and the velocity (if available) as vx, vy. We also predict a per-class probability p̂ ∈ [0, 1]^K for the K semantic classes. Each attribute is computed by a separate two-layer 1 × 1 convolution. By decoding each object query into a prediction in parallel, we obtain a set of predictions {b̂_t, p̂_t}_{t=1}^N as output, where b̂_t is the predicted bounding box of the t-th query. Following [23], we adopt the auxiliary decoding mechanism, which adds an FFN and supervision after each decoder layer. Hence, we obtain initial bounding box predictions from the first decoder layer. We leverage such initial predictions in the LiDAR-camera fusion module to constrain the cross attention, as explained in the next section.
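The box parameterization above can be made concrete with a short decoding routine. The sketch below is illustrative only: the tensor names (reg, height, dim, rot, vel) are our stand-ins for the per-attribute FFN outputs and need not match the released code.

```python
import torch

def decode_boxes(reg, height, dim, rot, vel, query_pos):
    """Decode FFN outputs into 3D boxes following the parameterization above (a sketch).

    reg:       (B, N, 2)  center offsets (dx, dy) from the query positions
    height:    (B, N, 1)  bounding box center height z
    dim:       (B, N, 3)  log-sizes (log l, log w, log h)
    rot:       (B, N, 2)  (sin a, cos a) of the yaw angle
    vel:       (B, N, 2)  velocities (vx, vy), if available
    query_pos: (B, N, 2)  BEV positions of the object queries
    """
    center = query_pos + reg                         # (dx, dy) is an offset from the query position
    lwh = dim.exp()                                  # sizes are regressed in log space
    yaw = torch.atan2(rot[..., 0:1], rot[..., 1:2])  # recover the angle from (sin a, cos a)
    return torch.cat([center, height, lwh, yaw, vel], dim=-1)  # (B, N, 9)

# Toy usage with random FFN outputs for B = 1, N = 200 queries.
B, N = 1, 200
boxes = decode_boxes(torch.randn(B, N, 2), torch.randn(B, N, 1),
                     torch.randn(B, N, 3), torch.randn(B, N, 2),
                     torch.randn(B, N, 2), torch.rand(B, N, 2) * 100)
print(boxes.shape)  # torch.Size([1, 200, 9])
```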
3.4. LiDAR-Camera Fusion

Image Feature Fetching. Although impressive improvements have been brought by point-level fusion methods [46, 47], their fusion quality is largely limited by the sparsity of LiDAR points. When an object contains only a small number of LiDAR points, it can fetch only the same number of image features, wasting the rich semantic information of high-resolution images. To mitigate this issue, we do not fetch the multiview image features based on the hard association between LiDAR points and image pixels. Instead, we retain all the image features F_C ∈ R^{N_v×H×W×d} as our memory bank, and use the cross-attention mechanism in the transformer decoder to perform feature fusion in a sparse-to-dense and adaptive manner, as shown in Fig. 2.

SMCA for Image Feature Fusion. Multi-head attention is a popular mechanism to perform information exchange and build a soft association between two sets of inputs, and it has been widely used for the feature matching task [34, 41]. To mitigate the sensitivity towards sensor calibration and inferior image features brought by the hard-association strategy, we leverage the cross-attention mechanism to build a soft association between LiDAR and images, enabling the network to adaptively determine where and what information should be taken from the images.

Specifically, we first identify the image in which each object query is located using the previous predictions as well as the calibration matrices, and then perform cross attention between the object queries and the corresponding image feature map. However, as the LiDAR features and image features come from completely different domains, the object queries might attend to visual regions unrelated to the bounding box to be predicted, leading to a long training time for the network to accurately identify the proper regions on images. Inspired by [9], we design a spatially modulated cross attention (SMCA) module, which weighs the cross attention by a 2D circular Gaussian mask around the projected 2D center of each query. The 2D Gaussian weight mask M is generated in a similar way as CenterNet [65],

M_{ij} = \exp\left(-\frac{(i - c_x)^2 + (j - c_y)^2}{\sigma r^2}\right),

where (i, j) is the spatial index of the weight mask M, (c_x, c_y) is the 2D center computed by projecting the query prediction onto the image plane, r is the radius of the minimum circumscribed circle of the projected corners of the 3D bounding box, and σ is a hyper-parameter modulating the bandwidth of the Gaussian distribution. This weight map is then element-wisely multiplied with the cross-attention maps of all the attention heads. In this way, each object query attends only to the related region around the projected 2D box, so that the network can learn where to select image features based on the input LiDAR features better and faster. The visualization of the attention map is shown in Fig. 3: the network typically focuses on the foreground pixels close to the object center and ignores irrelevant pixels, providing valuable semantic information for object classification and bounding box regression. After SMCA, we use another FFN to produce the final bounding box predictions using the object queries containing both LiDAR and image information.
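To make the SMCA weighting concrete, the following sketch builds the Gaussian mask M and applies it to a raw cross-attention map for a single query and a single head. It is a simplified illustration under our own assumptions; function and variable names are ours, not those of the released code.

```python
import torch

def smca_mask(h, w, center, radius, sigma=1.0):
    """2D circular Gaussian mask M_ij = exp(-((i-cx)^2 + (j-cy)^2) / (sigma * r^2)),
    centered at the projected 2D box center (a sketch of the SMCA weighting)."""
    cx, cy = center
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    dist2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return torch.exp(-dist2 / (sigma * radius ** 2))   # (h, w), values in (0, 1]

# Modulate a raw cross-attention map (one query, one head) over an h x w feature map.
h, w = 28, 50
attn = torch.softmax(torch.randn(h * w), dim=0).reshape(h, w)   # raw attention weights
mask = smca_mask(h, w, center=(25.0, 14.0), radius=6.0, sigma=1.0)
modulated = attn * mask                                          # element-wise modulation
modulated = modulated / modulated.sum()                          # renormalize to a distribution
print(modulated.shape)
```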
3.5. Label Assignment and Losses

Following DETR [23], we find a bipartite matching between the predictions and the ground-truth objects through the Hungarian algorithm [13], where the matching cost is defined by a weighted sum of classification, regression, and IoU costs:

C_{match} = \lambda_1 L_{cls}(p, \hat{p}) + \lambda_2 L_{reg}(b, \hat{b}) + \lambda_3 L_{iou}(b, \hat{b}),   (1)

where L_cls is the binary cross entropy loss, L_reg is the L1 loss between the predicted BEV centers and the ground-truth centers (both normalized in [0, 1]), and L_iou is the IoU loss [64] between the predicted boxes and the ground-truth boxes. λ_1, λ_2, λ_3 are the coefficients of the individual cost terms; we provide a sensitivity analysis of these terms in the supplementary Sec. ??. Since the number of predictions is usually larger than the number of GT boxes, the unmatched predictions are treated as negative samples. Given all matched pairs, we compute a focal loss [18] for the classification branch. The bounding box regression is supervised by an L1 loss for positive pairs only. For the heatmap prediction, we adopt a penalty-reduced focal loss following CenterPoint [57]. The total loss is the weighted sum of the losses of these components. We adopt the same label assignment strategy and loss formulation for both decoder layers.
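As a concrete illustration of this assignment step, the sketch below builds the cost matrix of Eq. (1) and solves it with the Hungarian algorithm. The classification term is simplified, the IoU term is passed in as a precomputed placeholder, and the λ weights are arbitrary example values rather than the paper's tuned ones.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_scores, pred_centers, gt_labels, gt_centers, iou_cost,
          lam_cls=1.0, lam_reg=1.0, lam_iou=1.0):
    """Bipartite matching between N predictions and M ground-truth boxes (Eq. 1, sketch).

    pred_scores:  (N, K) class probabilities
    pred_centers: (N, 2) normalized BEV centers in [0, 1]
    gt_labels:    (M,)   class indices
    gt_centers:   (M, 2) normalized BEV centers in [0, 1]
    iou_cost:     (N, M) precomputed IoU cost term (placeholder here)
    """
    eps = 1e-6
    p = pred_scores[:, gt_labels].clip(eps, 1 - eps)        # (N, M) probability of the GT class
    cls_cost = -np.log(p)                                   # cross-entropy of the positive class (simplified)
    reg_cost = np.abs(pred_centers[:, None] - gt_centers[None]).sum(-1)  # (N, M) L1 on centers
    cost = lam_cls * cls_cost + lam_reg * reg_cost + lam_iou * iou_cost
    rows, cols = linear_sum_assignment(cost)                # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))          # matched (prediction, GT) pairs

# Toy example: 5 predictions, 2 ground-truth boxes, 3 classes.
rng = np.random.default_rng(0)
pairs = match(rng.random((5, 3)), rng.random((5, 2)),
              np.array([0, 2]), rng.random((2, 2)), rng.random((5, 2)))
print(pairs)  # matched pairs; every unmatched prediction becomes a negative sample
```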
3.6. Image-Guided Query Initialization

Since our object queries are so far selected using only LiDAR features, this potentially leads to sub-optimal detection recall. Empirically, our model already achieves high recall and shows superior performance over the baselines (Sec. 5). Nevertheless, to further leverage the ability of high-resolution images in detecting small objects and to make our algorithm more robust against sparse LiDAR point clouds, we propose an image-guided query initialization strategy, which selects object queries leveraging both the LiDAR and camera information.

[Figure 4 diagram: the multiview image features are collapsed along the height axis (max-pool) and attended by the LiDAR BEV features through multi-head attention, producing fused BEV features for query initialization.]
Figure 4. We first condense the image features along the vertical dimension, and then project the features onto the BEV plane using cross attention with the LiDAR BEV features. Each image is processed by a separate multi-head attention layer, which captures the relation between image columns and BEV locations.

Specifically, we generate a LiDAR-camera BEV feature map F_LC by projecting the image features F_C onto the BEV plane through cross attention with the LiDAR BEV features F_L. Inspired by [32], we use the multiview image features collapsed along the height axis as the key-value sequence of the attention mechanism, as shown in Fig. 4. The collapsing operation is based on the observation that the relation between BEV locations and image columns can be established easily using camera geometry, and that usually there is at most one object along each image column. Therefore, collapsing along the height axis can significantly reduce the computation without losing critical information. Although some fine-grained image features might be lost during this process, it already meets our need, as only a hint on potential object positions is required. Afterward, similar to Sec. 3.2, we use F_LC to predict a heatmap, which is averaged with the LiDAR-only heatmap Ŝ to obtain the final heatmap Ŝ_LC. Using Ŝ_LC to select and initialize the object queries, our model is able to detect objects that are difficult to find in LiDAR point clouds alone.

Note that proposing a novel method to project image features onto the BEV plane is beyond the scope of this paper. We believe that our method could benefit from further research progress [26, 32, 33] in this direction.
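The sketch below illustrates the core of this step under simplified assumptions: a single camera view, randomly initialized features, a plain nn.MultiheadAttention layer in place of the full module, and equal weighting when averaging the two heatmaps. All shapes and names are ours; it only traces the data flow described above.

```python
import torch
import torch.nn as nn

d, Hb, Wb, Hi, Wi, K, N = 128, 180, 180, 28, 50, 10, 200

lidar_bev = torch.randn(Hb * Wb, 1, d)          # LiDAR BEV features F_L: (BEV positions, batch, channels)
img_feat = torch.randn(1, d, Hi, Wi)            # single-view image feature map F_C

# 1) Collapse the image features along the height axis (max-pool over rows).
img_cols = img_feat.max(dim=2).values           # (1, d, Wi)
img_cols = img_cols.permute(2, 0, 1)            # (Wi, 1, d) used as the key/value sequence

# 2) Cross attention: the LiDAR BEV features query the collapsed image columns.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8)
fused_bev, _ = attn(query=lidar_bev, key=img_cols, value=img_cols)   # (Hb*Wb, 1, d)

# 3) Predict class heatmaps from the fused map and average with the LiDAR-only heatmap.
head = nn.Conv2d(d, K, kernel_size=3, padding=1)
fused_map = fused_bev.permute(1, 2, 0).reshape(1, d, Hb, Wb)
heat_lc = head(fused_map).sigmoid()
heat_lidar = torch.rand(1, K, Hb, Wb)           # stand-in for the LiDAR-only heatmap S_hat
heat_final = 0.5 * (heat_lc + heat_lidar)       # S_hat_LC

# 4) Select the top-N BEV locations (over all classes) to initialize object queries.
scores, idx = heat_final.view(1, -1).topk(N, dim=1)
cls_id = torch.div(idx, Hb * Wb, rounding_mode="floor")
pos = idx % (Hb * Wb)
print(scores.shape, cls_id.shape, pos.shape)    # query scores, classes, BEV positions
```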
4. Implementation Details

Training. We implement our network in PyTorch [25] using the open-sourced MMDetection3D [6]. For nuScenes, we use the DLA34 [60] of a pretrained CenterNet as our 2D backbone and keep its weights frozen during training, following [47]. We set the image size to 448 × 800, which performs comparably with the full resolution (896 × 1600). VoxelNet [52, 67] is chosen as our 3D backbone. Our training consists of two stages: 1) We first train the 3D backbone with the first decoder layer and FFN for 20 epochs, which only needs the LiDAR point clouds as input and produces the initial 3D bounding box predictions. We adopt the same data augmentation and training schedules as prior LiDAR-only works [57, 68]. Note that we also find that the copy-and-paste augmentation strategy [52] benefits convergence but could disturb the real data distribution, so we disable this augmentation for the last 5 epochs following [47] (they call it a fade strategy). 2) We then train the LiDAR-camera fusion and the image-guided query initialization modules for another 6 epochs. We find that this two-step training scheme performs better than joint training, since we can adopt more flexible augmentations for the first training stage. See the supplementary Sec. ?? for the detailed hyper-parameters and the settings on Waymo.
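Freezing the 2D backbone is a one-liner in PyTorch; the sketch below shows the general pattern with a stand-in torchvision backbone rather than the DLA34 used in the paper.

```python
import torchvision

# Stand-in 2D backbone (the paper uses DLA34 from a pretrained CenterNet).
backbone_2d = torchvision.models.resnet18(weights=None)

# Keep the 2D backbone frozen during both training stages.
for p in backbone_2d.parameters():
    p.requires_grad_(False)
backbone_2d.eval()   # also freeze batch-norm statistics

# Only the remaining modules (3D backbone, decoder layers, FFNs) are optimized.
trainable = [p for p in backbone_2d.parameters() if p.requires_grad]
print(len(trainable))  # 0
```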
Method Modality Voxel Size (m) mAP NDS Car Truck C.V. Bus Trailer Barrier Motor. Bike Ped. T.C.
PointPillar [14] L (0.2, 0.2, 8) 40.1 55.0 76.0 31.0 11.3 32.1 36.6 56.4 34.2 14.0 64.0 45.6
CBGS [68] L (0.1, 0.1, 0.2) 52.8 63.3 81.1 48.5 10.5 54.9 42.9 65.7 51.5 22.3 80.1 70.9
CenterPoint [57] L (0.075, 0.075, 0.2) 60.3 67.3 85.2 53.5 20.0 63.6 56.0 71.1 59.5 30.7 84.6 78.4
PointPainting [46] LC (0.2, 0.2, 8) 46.4 58.1 77.9 35.8 15.8 36.2 37.3 60.2 41.5 24.1 73.3 62.4
3D-CVF [59] LC (0.05, 0.05, 0.2) 52.7 62.3 83.0 45.0 15.9 48.8 49.6 65.9 51.2 30.4 74.2 62.9
PointAugmenting [47] LC (0.075, 0.075, 0.2) 66.8 71.0 87.5 57.3 28.0 65.2 60.7 72.6 74.3 50.9 87.9 83.6
MVP [58] LC (0.075, 0.075, 0.2) 66.4 70.5 86.8 58.5 26.1 67.4 57.3 74.8 70.0 49.3 89.1 85.0
FusionPainting [51] LC (0.075, 0.075, 0.2) 68.1 71.6 87.1 60.8 30.0 68.5 61.7 71.8 74.7 53.5 88.3 85.0
TransFusion-L L (0.075, 0.075, 0.2) 65.5 70.2 86.2 56.7 28.2 66.3 58.8 78.2 68.3 44.2 86.1 82.0
TransFusion LC (0.075, 0.075, 0.2) 68.9 71.7 87.1 60.0 33.1 68.3 60.8 78.1 73.6 52.9 88.4 86.7
Table 1. Comparison with SOTA methods on the nuScenes test set. ‘C.V.’, ‘Ped.’, and ‘T.C.’ are short for construction vehicle, pedestrian,
and traffic cone, respectively. ‘L’ and ‘C’ represent LiDAR and Camera, respectively. The best results are in boldface (Best LiDAR-only
results are marked blue and best LC results are marked red). For FusionPainting [51], we report the results on the nuScenes website, which
are better than what they reported in their paper. Note that CenterPoint [57] and PointAugmenting [47] utilize double-flip testing while we
do not use any test time augmentation. Please find detailed results here.2
Testing. During inference, the final score is computed as the geometric average of the heatmap score Ŝ_ij and the classification score p̂_t. We use all the outputs as our final predictions without Non-Maximum Suppression (NMS) (see the effect of NMS in the supplementary Sec. ??). It is noteworthy that previous point-level fusion methods such as PointAugmenting [47] rely on two different models for the camera FOV and the LiDAR-only regions if the cameras do not cover 360 degrees, because only points inside the camera FOV can fetch the corresponding image features. In contrast, we use a single model to deal with both the camera FOV and the LiDAR-only regions, since object queries located outside the camera FOV simply skip the fusion stage, and the initial predictions from the first decoder layer serve as a safeguard.
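A minimal sketch of this scoring rule, with made-up tensor names: heat holds the heatmap scores Ŝ gathered at each query's initial location and class, and cls_prob holds the decoder's classification scores p̂.

```python
import torch

# Hypothetical per-query scores for N = 200 queries and K = 10 classes.
N, K = 200, 10
heat = torch.rand(N)                 # heatmap score at each query's initial position/class
cls_prob = torch.rand(N, K)          # per-class probability from the decoder

conf, labels = cls_prob.max(dim=1)   # predicted class and its confidence
final_score = (heat * conf).sqrt()   # geometric average of the two scores

# All N predictions are kept as-is: no NMS is applied at test time.
order = final_score.argsort(descending=True)
print(labels[order[:5]], final_score[order[:5]])
```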
5. Experiments

In this section, we first make comparisons with the state-of-the-art methods on nuScenes and Waymo. Then we conduct extensive ablation studies to demonstrate the importance of each key component of TransFusion. Moreover, we design experiments to show the robustness of our TransFusion against inferior image conditions. Besides TransFusion, we also include a model variant based on the first training stage only, i.e., producing the initial bounding box predictions using only point clouds. We denote it as TransFusion-L and believe that it can serve as a strong baseline for LiDAR-only detection. We provide qualitative results in the supplementary Sec. ??.

nuScenes Dataset. The nuScenes dataset is a large-scale autonomous-driving dataset for 3D detection and tracking, consisting of 700, 150, and 150 scenes for training, validation, and testing, respectively. Each frame contains one point cloud and six calibrated images covering the 360-degree horizontal FOV. For 3D detection, the main metrics are mean Average Precision (mAP) [7] and the nuScenes detection score (NDS). The mAP is defined by the BEV center distance instead of the 3D IoU, and the final mAP is computed by averaging over the distance thresholds of 0.5m, 1m, 2m, and 4m across ten classes. NDS is a consolidated metric of mAP and other attribute metrics, including translation, scale, orientation, velocity, and other box attributes. Following CenterPoint [57], we set the voxel size to (0.075m, 0.075m, 0.2m).

Waymo Open Dataset. This dataset consists of 798 scenes for training and 202 scenes for validation. The official metrics are mAP and mAPH (mAP weighted by heading accuracy). The mAP and mAPH are defined based on a 3D IoU threshold of 0.7 for vehicles and 0.5 for pedestrians and cyclists. These metrics are further broken down into two difficulty levels: LEVEL1 for boxes with more than five LiDAR points and LEVEL2 for boxes with at least one LiDAR point. Unlike the 360-degree cameras in nuScenes, the cameras in Waymo only cover around 250 degrees horizontally. The voxel size is set to (0.1m, 0.1m, 0.15m).

5.1. Main Results

nuScenes Results. We submitted our detection results to the nuScenes evaluation server. Without any test time augmentation or model ensemble, our TransFusion outperforms all competing non-ensembled methods on the nuScenes leaderboard at the time of submission. As shown in Table 1, our TransFusion-L already outperforms the state-of-the-art LiDAR-only methods by a significant margin (+5.2% mAP, +2.9% NDS) and even surpasses some multi-modality methods. We ascribe this performance gain to the relation modeling power of the transformer decoder as well as the proposed query initialization strategies, which are ablated in Sec. 5.3. Once the proposed fusion components are enabled, our TransFusion receives a remarkable performance boost (+3.4% mAP, +1.5% NDS) and outperforms all the previous methods, including FusionPainting [51], which uses extra data to train their segmentation sub-networks. Moreover, thanks to our soft-association mechanism, TransFusion is robust to inferior image conditions including degenerated image quality and sensor misalignment, as shown in the next section.

2 https://www.nuscenes.org/object-detection
Waymo Results. We report the performance of our model over all three classes on the Waymo validation set in Table 2. Our fusion strategy improves the mAPH of the pedestrian and cyclist classes by 0.3 and 1.5, respectively. We suspect two reasons for the relatively small improvement brought by the image components. First, the semantic information of images might have less impact on the coarse-grained categorization of Waymo. Second, the initial bounding boxes from the first decoder layer already have accurate locations, since the point clouds in Waymo are denser than those in nuScenes (see more discussions in the supplementary Sec. ??). Note that CenterPoint achieves a better performance with a multi-frame input and a second-stage refinement module. Such components are orthogonal to our method, and we leave a more powerful TransFusion for Waymo as future work. PointAugmenting achieves better performance than ours but relies on CenterPoint to get the predictions outside the camera FOV for full-region detection, making their system less flexible.

Method Vehicle Pedestrian Cyclist Overall
PointPillar [46] 62.5 50.2 59.9 57.6
PVRCNN [36] 64.8 46.7 - -
LiDAR-RCNN [15] 64.2 51.7 64.4 60.1
CenterPoint [57] 66.1 62.4 67.6 65.3
PointAugmenting [47] 62.2 64.6 73.3 66.7
TransFusion-L 65.1 63.7 65.9 64.9
TransFusion 65.1 64.0 67.4 65.5
Table 2. LEVEL 2 mAPH on the Waymo validation set. For CenterPoint, we report the performance of the single-frame one-stage model trained for 36 epochs.

Extend to Tracking. To further demonstrate the generalization capability, we evaluate our model on a 3D multi-object tracking (MOT) task by performing tracking-by-detection with the same tracking algorithms adopted by CenterPoint. We refer readers to the original paper [57] for details. As shown in Table 3, our model significantly outperforms CenterPoint and sets new state-of-the-art results on the leaderboard of nuScenes tracking.

Method AMOTA↑ TP↑ FP↓ FN↓ IDS↓
CenterPoint [57] 63.8 95877 18612 22928 760
EagerMOT [11] 67.7 93484 17705 24925 1156
AlphaTrack [61] 69.3 95851 18421 22996 718
TransFusion-L 68.6 95235 17851 23437 893
TransFusion 71.8 96775 16232 21846 944
Table 3. Comparison of the tracking results on the nuScenes test set. Please find detailed results here.3

3 https://www.nuscenes.org/tracking

5.2. Robustness against Inferior Image Conditions

We design three experiments to demonstrate the robustness of our proposed fusion module. Since the nuScenes test set only allows at most three submissions, all the experiments are conducted on the validation set. For fast iteration, we reduce the first-stage training to 12 epochs and remove the fade strategy. All the other parameters are the same as in the main experiments. To avoid overstatement, we additionally build two baseline LiDAR-camera detectors by equipping our TransFusion-L with two representative fusion methods on nuScenes: fusing LiDAR and image features by point-wise concatenation (denoted as CC) and the fusion strategy of PointAugmenting (denoted as PA).

Nighttime. We first split the validation set into daytime and nighttime based on the scene descriptions provided by nuScenes and show the performance gain under different conditions in Table 4. Our method brings a much larger performance gain during nighttime, where the worse lighting negatively affects the hard-association-based fusion strategies CC and PA.

Method Nighttime Daytime
TransFusion-L 49.2 60.3
CC 49.4 (+0.2) 63.4 (+3.1)
PA 51.0 (+1.8) 64.3 (+4.0)
TransFusion 55.2 (+6.0) 65.7 (+5.4)
Table 4. mAP breakdown over daytime and nighttime. We exclude categories that do not have any labeled samples.
Degenerated Image Quality. In Table 5, we randomly drop several images for each frame by setting the image features of those images to zero during inference. Since both CC and PA fuse LiDAR and image features in a tightly coupled way, their performance drops significantly when some images are not available during inference. In contrast, our TransFusion is able to maintain a high mAP in all cases. When all six images are unavailable, CC and PA suffer from 23.8% and 17.2% mAP degradation, respectively, while TransFusion still keeps the mAP at a competitive level of 61.7%. This advantage comes from the sequential design and the attentive fusion strategy, which first generates initial predictions based on LiDAR data and then only gathers useful information from the image features adaptively. Moreover, we could even directly disable the fusion module if the camera malfunction is known, such that the whole system could still work seamlessly in a LiDAR-only mode.

Method / # Dropped Images 0 1 3 6
CC 63.3 59.8 (-3.5) 50.9 (-12.4) 39.5 (-23.8)
PA 64.2 61.6 (-2.6) 55.4 (-8.8) 47.0 (-17.2)
TransFusion 65.6 65.1 (-0.5) 63.9 (-1.7) 61.7 (-3.9)
Table 5. mAP under different numbers of dropped images. The number in each bracket is the mAP drop from the standard input.
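This corruption is straightforward to reproduce at inference time; the sketch below zeroes the feature maps of randomly chosen camera views before fusion (the tensor name img_feats and the view count are illustrative, not taken from the released code).

```python
import torch

def drop_views(img_feats, num_drop, generator=None):
    """Zero out the feature maps of `num_drop` randomly chosen camera views.

    img_feats: (Nv, C, H, W) multiview image features; returns a copy with the
    selected views set to zero, emulating unavailable cameras at inference time.
    """
    nv = img_feats.shape[0]
    dropped = torch.randperm(nv, generator=generator)[:num_drop]
    out = img_feats.clone()
    out[dropped] = 0.0
    return out, dropped

# Example: nuScenes-style input with 6 views, dropping 3 of them.
feats = torch.randn(6, 128, 28, 50)
corrupted, dropped = drop_views(feats, num_drop=3)
print(dropped.tolist(), corrupted[dropped].abs().sum().item())  # chosen views are all-zero
```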
Sensor Misalignment. We evaluate different fusion methods under a setting where LiDAR and images are not well calibrated, following RoarNet [39]. Specifically, we randomly add a translation offset to the transformation matrix from the camera to the LiDAR sensor. As shown in Fig. 5, TransFusion achieves better robustness against the calibration error compared with the other fusion methods. When two sensors
[Figure 5: line plot of mAP (%) versus the camera-LiDAR translational discrepancy (0.0-1.0 m) for TransFusion, PA, and CC.]
Figure 5. mAP under sensor misalignment cases. The X axis refers to the translational discrepancy between two sensors.

Method mAP NDS Params (M) Latency (ms)
CenterPoint 57.4 65.2 8.54 117.2
TransFusion-L 60.0 66.8 7.96 114.9
CC 63.3 67.6 8.01 + 18.34 212.3
PA 64.2 68.7 13.9 + 18.34 288.2
w/o Fusion 61.6 67.4 9.08 + 18.34 215.0
w/o Guide 64.8 69.3 8.35 + 18.34 236.9
TransFusion 65.6 69.7 9.47 + 18.34 265.9
Table 7. Ablation of the proposed fusion components. 18.34 represents the parameter size of the 2D backbone. The latency is measured on an Intel Core i7 CPU and a Titan V100 GPU. For CenterPoint, we use re-implementations in MMDetection3D.
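The misalignment setting above can be simulated by perturbing the camera-to-LiDAR extrinsics before projecting query predictions onto the images; the sketch below adds a random translation offset of a given magnitude to a 4x4 transformation matrix (the matrix and function names are ours).

```python
import numpy as np

def perturb_extrinsics(cam_to_lidar, max_offset_m, rng=None):
    """Add a random translation offset (up to `max_offset_m` meters per axis)
    to a 4x4 camera-to-LiDAR transformation matrix."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = cam_to_lidar.copy()
    noisy[:3, 3] += rng.uniform(-max_offset_m, max_offset_m, size=3)
    return noisy

# Example: identity extrinsics perturbed by up to 1.0 m along each axis.
T = np.eye(4)
T_noisy = perturb_extrinsics(T, max_offset_m=1.0, rng=np.random.default_rng(0))
print(np.round(T_noisy[:3, 3], 3))  # translation discrepancy injected into the calibration
```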
References

[1] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. CVPR, 2020.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. DETR: End-to-end object detection with transformers. ECCV, 2020.
[3] Qi Chen, Lin Sun, Ernest C. H. Cheung, and A. Yuille. Every View Counts: Cross-view consistency in 3d object detection with hybrid-cylindrical-spherical voxelization. NeurIPS, 2020.
[4] Qi Chen, Lin Sun, Zhixin Wang, K. Jia, and A. Yuille. Object as Hotspots: An anchor-free 3d object detection approach via firing of hotspots. ECCV, 2020.
[5] Xiaozhi Chen, Huimin Ma, Jixiang Wan, B. Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. CVPR, 2017.
[6] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020.
[7] M. Everingham, L. Gool, Christopher K. I. Williams, J. Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2009.
[8] Lue Fan, Xuan Xiong, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. RangeDet: In defense of range view for lidar-based 3d object detection. ICCV, 2021.
[9] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of detr with spatially modulated co-attention. ICCV, 2021.
[10] Tengteng Huang, Zhe Liu, Xiwu Chen, and X. Bai. EPNet: Enhancing point features with image semantics for 3d object detection. ECCV, 2020.
[11] Aleksandr Kim, Aljosa Osep, and Laura Leal-Taixé. EagerMOT: 3d multi-object tracking via sensor fusion. ICRA, 2021.
[12] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L. Waslander. Joint 3d proposal generation and object detection from view aggregation. IROS, 2018.
[13] H. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
[14] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. CVPR, 2019.
[15] Zhichao Li, Feng Wang, and Naiyan Wang. LiDAR R-CNN: An efficient and universal 3d object detector. CVPR, 2021.
[16] Ming Liang, Binh Yang, Yun Chen, Rui Hu, and R. Urtasun. Multi-task multi-sensor fusion for 3d object detection. CVPR, 2019.
[17] Ming Liang, Binh Yang, Shenlong Wang, and R. Urtasun. Deep continuous fusion for multi-sensor 3d object detection. ECCV, 2018.
[18] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. ICCV, 2017.
[19] Zili Liu, Guodong Xu, Honghui Yang, Minghao Chen, Kuoliang Wu, Zheng Yang, Haifeng Liu, and Deng Cai. Suppress-and-refine framework for end-to-end 3d object detection. arXiv, 2021.
[20] Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-free 3d object detection via transformers. ICCV, 2021.
[21] Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi Feng, Xiaodan Liang, Hang Xu, and Chunjing Xu. Voxel transformer for 3d object detection. ICCV, 2021.
[22] Gregory P. Meyer, Jake Charland, Darshan Hegde, Ankita Gajanan Laddha, and Carlos Vallespi-Gonzalez. Sensor fusion for joint 3d object detection and semantic segmentation. CVPRW, 2019.
[23] Ishan Misra, Rohit Girdhar, and Armand Joulin. An end-to-end transformer model for 3D object detection. ICCV, 2021.
[24] Xuran Pan, Zhuofan Xia, Shiji Song, L. Li, and Gao Huang. 3d object detection with pointformer. CVPR, 2021.
[25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. NeurIPS-W, 2017.
[26] Jonah Philion and S. Fidler. Lift, Splat, Shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. ECCV, 2020.
[27] C. Qi, Xinlei Chen, O. Litany, and L. Guibas. ImVoteNet: Boosting 3d object detection in point clouds with image votes. CVPR, 2020.
[28] C. Qi, O. Litany, Kaiming He, and L. Guibas. Deep hough voting for 3d object detection in point clouds. ICCV, 2019.
[29] C. Qi, W. Liu, Chenxia Wu, Hao Su, and L. Guibas. Frustum pointnets for 3d object detection from rgb-d data. CVPR, 2018.
[30] C. Qi, Hao Su, Kaichun Mo, and L. Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. CVPR, 2017.
[31] Shaoqing Ren, Kaiming He, Ross B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 2015.
[32] Thomas Roddick and R. Cipolla. Predicting semantic map representations from images using pyramid occupancy networks. CVPR, 2020.
[33] Thomas Roddick, Alex Kendall, and R. Cipolla. Orthographic feature transform for monocular 3d object detection. BMVC, 2019.
[34] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. CVPR, 2020.
[35] Hualian Sheng, Sijia Cai, Yuan Liu, Bing Deng, Jianqiang Huang, Xiansheng Hua, and Min-Jian Zhao. Improving 3d object detection with channel-wise transformer. ICCV, 2021.
[36] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-voxel feature set abstraction for 3d object detection. CVPR, 2020.
[37] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3d object proposal generation and detection from point cloud. CVPR, 2019.
[38] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From Points to Parts: 3d object detection from point cloud with part-aware and part-aggregation network. TPAMI, 2021.
[39] Kiwoo Shin, Y. Kwon, and M. Tomizuka. RoarNet: A robust 3d object detection based on region approximation refinement. IV, 2019.
[40] Vishwanath A. Sindagi, Yin Zhou, and Oncel Tuzel. MVX-Net: Multimodal voxelnet for 3d object detection. ICRA, 2019.
[41] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. CVPR, 2021.
[42] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, P. Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, S. Ettinger, Maxim Krivokon, A. Gao, Aditya Joshi, Y. Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. CVPR, 2020.
[43] Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin F. Elsayed, Alex Bewley, Xiao Zhang, Cristian Sminchisescu, and Drago Anguelov. RSN: Range sparse net for efficient, accurate lidar 3d object detection. CVPR, 2021.
[44] Pei Sun, Rufeng Zhang, Yi Jiang, T. Kong, Chenfeng Xu, W. Zhan, M. Tomizuka, L. Li, Zehuan Yuan, C. Wang, and Ping Luo. Sparse R-CNN: End-to-end object detection with learnable proposals. CVPR, 2021.
[45] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[46] Sourabh Vora, Alex H. Lang, Bassam Helou, and Oscar Beijbom. PointPainting: Sequential fusion for 3d object detection. CVPR, 2020.
[47] Chunwei Wang, Chao Ma, Ming Zhu, and Xiaokang Yang. PointAugmenting: Cross-modal augmentation for 3d object detection. CVPR, 2021.
[48] Yue Wang, Alireza Fathi, Abhijit Kundu, David A. Ross, Caroline Pantofaru, Thomas A. Funkhouser, and Justin M. Solomon. Pillar-based object detection for autonomous driving. ECCV, 2020.
[49] Yue Wang and Justin Solomon. Object DGCNN: 3d object detection using dynamic graphs. NeurIPS, 2021.
[50] Liang Xie, Chao Xiang, Zhengxu Yu, Guodong Xu, Zheng Yang, Deng Cai, and Xiaofei He. PI-RCNN: An efficient multi-sensor 3d object detector with point-based attentive cont-conv fusion module. AAAI, 2020.
[51] Shaoqing Xu, Dingfu Zhou, Jin Fang, Junbo Yin, Bin Zhou, and Liangjun Zhang. FusionPainting: Multimodal fusion with adaptive attention for 3d object detection. ITSC, 2021.
[52] Yan Yan, Yuxing Mao, and B. Li. SECOND: Sparsely embedded convolutional detection. Sensors, 2018.
[53] Binh Yang, Wenjie Luo, and R. Urtasun. PIXOR: Real-time 3d object detection from point clouds. CVPR, 2018.
[54] Zetong Yang, Y. Sun, Shu Liu, and Jiaya Jia. 3DSSD: Point-based 3d single stage object detector. CVPR, 2020.
[55] Zetong Yang, Y. Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. STD: Sparse-to-dense 3d object detector for point cloud. ICCV, 2019.
[56] Z. Yao, Jiangbo Ai, Boxun Li, and Chi Zhang. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv, 2021.
[57] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. CVPR, 2021.
[58] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Multimodal virtual point 3d detection. NeurIPS, 2021.
[59] Jin Hyeok Yoo, Yeocheol Kim, Ji Song Kim, and J. Choi. 3D-CVF: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. ECCV, 2020.
[60] F. Yu, Dequan Wang, and Trevor Darrell. Deep layer aggregation. CVPR, 2018.
[61] Yihan Zeng, Chao Ma, Ming Zhu, Zhiming Fan, and Xiaokang Yang. Cross-modal 3d object detection and tracking for auto-driving. IROS, 2021.
[62] Wenwei Zhang, Zhe Wang, and Chen Change Loy. Multi-modality cut and paste for 3d object detection. arXiv, 2020.
[63] Lin Zhao, Hui Zhou, Xinge Zhu, Xiao Song, Hongsheng Li, and Wenbing Tao. LIF-Seg: Lidar and camera image fusion for 3d lidar semantic segmentation. arXiv, 2021.
[64] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. Iou loss for 2d/3d object detection. 3DV, 2019.
[65] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv, 2019.
[66] Yin Zhou, Pei Sun, Y. Zhang, Dragomir Anguelov, J. Gao, Tom Y. Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. MVF: End-to-end multi-view fusion for 3d object detection in lidar point clouds. CoRL, 2019.
[67] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3d object detection. CVPR, 2018.
[68] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3d object detection. arXiv, 2019.
[69] Xinge Zhu, Yuexin Ma, Tai Wang, Yan Xu, Jianping Shi, and Dahua Lin. SSN: Shape signature networks for multi-class object detection from point clouds. ECCV, 2020.
[70] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. ICLR, 2021.