0% found this document useful (0 votes)
63 views10 pages

TransFusion Robust LiDAR-Camera Fusion For 3D Object Detection With Transformers

The document presents TransFusion, a robust method for LiDAR-camera fusion for 3D object detection using transformers. It proposes a soft-association mechanism using a transformer decoder to handle poor image conditions. The first decoder layer predicts boxes from LiDAR and the second adaptively fuses object queries with image features using spatial and contextual relationships. Experiments show it achieves state-of-the-art performance and is robust to degraded images and calibration errors.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
63 views10 pages

TransFusion Robust LiDAR-Camera Fusion For 3D Object Detection With Transformers

The document presents TransFusion, a robust method for LiDAR-camera fusion for 3D object detection using transformers. It proposes a soft-association mechanism using a transformer decoder to handle poor image conditions. The first decoder layer predicts boxes from LiDAR and the second adaptively fuses object queries with image features using spatial and contextual relationships. Experiments show it achieves state-of-the-art performance and is robust to degraded images and calibration errors.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

Xuyang Bai1 Zeyu Hu1 Xinge Zhu2 Qingqiu Huang2 Yilun Chen2 Hongbo Fu3 Chiew-Lan Tai1
1 2 3
Hong Kong University of Science and Technology ADS, IAS BU, Huawei City University of Hong Kong
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 978-1-6654-6946-3/22/$31.00 ©2022 IEEE | DOI: 10.1109/CVPR52688.2022.00116

Abstract datasets with sparser point clouds, such as nuScenes [1] and
Waymo [42]. LiDAR-only methods are surely insufficient
LiDAR and camera are two important sensors for 3D ob- for robust 3D detection due to the sparsity of point clouds.
ject detection in autonomous driving. Despite the increas- For example, small or distant objects are difficult to detect
ing popularity of sensor fusion in this field, the robustness in LiDAR modality. In contrast, such objects are still clearly
against inferior image conditions, e.g., bad illumination visible and distinguishable in high-resolution images. The
and sensor misalignment, is under-explored. Existing fu- complementary roles of point clouds and images motivate
sion methods are easily affected by such conditions, mainly researchers to design detectors utilizing the best of the two
due to a hard association of LiDAR points and image pixels, worlds, i.e., multi-modal detectors.
established by calibration matrices. Existing LiDAR-camera fusion methods roughly fall
We propose TransFusion, a robust solution to LiDAR- into three categories: result-level, proposal-level, and point-
camera fusion with a soft-association mechanism to han- level. The result-level methods, including FPointNet [29]
dle inferior image conditions. Specifically, our TransFu- and RoarNet [39], use off-the-shelf 2D detectors to seed
sion consists of convolutional backbones and a detection 3D proposals, followed by a PointNet [30] for object lo-
head based on a transformer decoder. The first layer of the calization. The proposal-level fusion methods, including
decoder predicts initial bounding boxes from a LiDAR point MV3D [5] and AVOD [12], perform fusion at the region
cloud using a sparse set of object queries, and its second proposal level by applying RoIPool [31] in each modality
decoder layer adaptively fuses the object queries with use- for shared proposals. These coarse-grained fusion meth-
ful image features, leveraging both spatial and contextual ods show unsatisfactory results since rectangular regions
relationships. The attention mechanism of the transformer of interest (RoI) usually contain lots of background noise.
enables our model to adaptively determine where and what Recently, a majority of approaches have tried to do point-
information should be taken from the image, leading to a level fusion and achieved promising results. They first find
robust and effective fusion strategy. We additionally design a hard association between LiDAR points and image pix-
an image-guided query initialization strategy to deal with els based on calibration matrices, and then augment LiDAR
objects that are difficult to detect in point clouds. TransFu- features with the segmentation scores [46, 51] or CNN fea-
sion achieves state-of-the-art performance on large-scale tures [10, 22, 40, 47, 62] of the associated pixels through
datasets. We provide extensive experiments to demonstrate point-wise concatenation. Similarly, [16, 17, 50, 59] first
its robustness against degenerated image quality and cali- project a point cloud onto the bird’s eye view (BEV) plane
bration errors. We also extend the proposed method to the and then fuse the image features with the BEV pixels.
3D tracking task and achieve the 1st place in the leader- Despite the impressive improvements, these point-level
board of nuScenes tracking, showing its effectiveness and fusion methods suffer from two major problems, as shown
generalization capability. [code release] in Fig. 1. First, they simply fuse the LiDAR features and
image features through element-wise addition or concate-
nation, and thus their performance degrades seriously with
1. Introduction low-quality image features, e.g., images in bad illumina-
As one of the fundamental tasks in self-driving, 3D ob- tion conditions. Second, finding the hard association be-
ject detection aims to localize a set of objects in 3D space tween sparse LiDAR points and dense image pixels not only
and recognize their categories. Thanks to the accurate wastes many image features with rich semantic information,
depth information provided by LiDAR, early works such but also heavily relies on high-quality calibration between
as VoxelNet [67] and PointPillar [14] achieve reasonably two sensors, which is usually hard to acquire due to the in-
good results using only point clouds as input. However, herent spatial-temporal misalignment [63].
these LiDAR-only methods are generally surpassed by the To address the shortcomings of the previous fusion ap-
methods using both LiDAR and camera data on large-scale proaches, we introduce an effective and robust multi-modal

978-1-6654-6946-3/22/$31.00 ©2022 IEEE 1080


DOI 10.1109/CVPR52688.2022.00116
Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
4. We achieve the state-of-the-art 3D detection per-
formance on nuScenes and competitive results on
Waymo. We also extend our model to the 3D track-
ing task and achieve the 1st place in the leaderboard
of the nuScenes tracking challenge.

Figure 1. Left: An example of bad illumination conditions. Right:


Due to the sparsity of point clouds, the hard-association based fu- 2. Related Work
sion methods waste many image features and are sensitive to sen-
sor calibration, since the projected points may fall outside objects
LiDAR-only 3D Detection aims to predict 3D bounding
due to a small calibration error.
boxes of objects in given point clouds [3, 4, 27, 28, 38, 46,
detection framework in this paper. Our key idea is to repo- 53, 66, 68, 69]. Due to the unordered, irregular nature of
sition the focus of the fusion process, from hard-association point clouds, many 3D detectors first project them onto a
to soft-association, leading to the robustness against degen- regular grid such as 3D voxels [52, 67], pillars [14] or range
erated image quality and sensor misalignment. images [8, 43]. After that, standard 2D or 3D convolutions
Specifically, we design a sequential fusion method that are used to compute the features in the BEV plane, where
uses two transformer decoder layers as the detection head. objects are naturally separated, with their physical sizes pre-
To our best knowledge, we are the first to use transformer served. Other works [36, 37, 54, 55] directly operate on raw
for LiDAR-camera 3D detection. Our first decoder layer point clouds without quantization. The mainstream of 3D
leverages a sparse set of object queries to produce ini- detection head is based on anchor boxes [14, 67] follow-
tial bounding boxes from LiDAR features. Unlike input- ing the 2D counterparts, while [48,57] adopt a center-based
independent object queries in 2D [2, 44], we make the ob- representation for 3D objects, largely simplifying the 3D
ject queries input-dependent and category-aware so that the detection pipeline. Despite the popularity of adopting the
queries are enriched with better position and category infor- transformer architecture as a detection head in 2D [2], 3D
mation. Next, the second transformer decoder layer adap- detection models for outdoor scenarios mostly utilize the
tively fuses object queries with useful image features as- transformer for feature extraction [21, 24, 35]. However,
sociated by spatial and contextual relationships. We lever- the attention operation in each transformer layer requires
age a locality inductive bias by spatially constraining the a computation complexity of O(N 2 ) for N points, requir-
cross attention around the initial bounding boxes to help the ing a carefully designed memory reduction operation when
network better visit the related positions. Our fusion mod- handling LiDAR point clouds with millions of points per
ule not only provides rich semantic information to object frame. In contrast, our model retains an efficient convolu-
queries, but also is more robust to inferior image conditions tion backbone for feature extraction and leverages a trans-
since the association between LiDAR points and image pix- former decoder with a small set of object queries as the
els are established in a soft and adaptive way. Finally, to detection head, making the computation cost manageable.
handle objects that are difficult to detect in point clouds, The concurrent works [19, 20, 23] adopt transformer as a
we introduce an image-guided query initialization module detection head but focus on indoor scenarios and extending
to involve image guidance on the query initialization stage. these methods to outdoor scenes is non-trivial.
Overall, the corporation of these components significantly LiDAR-Camera 3D Detection has gained increasing atten-
improves the effectiveness and robustness of our LiDAR- tion due to the complementary roles of point clouds and im-
camera 3D detector. To summarize, our contributions are ages. Early works [5, 29, 39] adopt result-level or proposal-
fourfold: level fusion, where the fusion granularity is too coarse to
1. Our studies investigate the inherent difficulties of release the full potential of two modalities. Since Point-
LiDAR-camera fusion and reveal a crucial aspect to Painting [46] was proposed, the point-level fusion meth-
robust fusion, namely, the soft-association mechanism. ods [10,40,47] have shown great advantages and promising
2. We propose a novel transformer-based LiDAR-camera results. However, such methods are easily affected by the
fusion model for 3D detection, which performs fine- sensor misalignment due to the hard association between
grained fusion in an attentive manner and shows supe- points and pixels established by calibration matrices. More-
rior robustness against degenerated image quality and over, the simple point-wise concatenation ignores the qual-
sensor misalignment. ity of real data and contextual relationships between two
3. We introduce several simple yet effective adjustments modalities, and thus leads to degraded performance when
for object queries to boost the quality of initial bound- the image features are defective. In our work, we explore
ing box predictions for image fusion. An image- a more robust and effective fusion mechanism to mitigate
guided query initialization module is also designed to these limitations during LiDAR-camera fusion.
handle objects that are hard to detect in point clouds.

1081

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
LiDAR as ), + Initial Prediction

Query Initialization

Prediction FFN
3D Backbone
as , Transformer
Decoder Layer

LiDAR BEV Object Queries


Features

Camera Final Output


Image Guidance

Prediction FFN
2D Backbone
as ), + Transformer
Decoder Layer
with SMCA
Image Features
LiDAR-Camera Fusion

Figure 2. Overall pipeline of TransFusion. Our model relies on standard 3D and 2D backbones to extract LiDAR BEV feature map and
image feature map. Our detection head consists of two transformer decoder layers sequentially: (1) The first layer produces initial 3D
bounding boxes using a sparse set of object queries, initialized in a input-dependent and category-aware manner. (2) The second layer
attentively associates and fuses the object queries (with initial predictions) from the first stage with the image features, producing rich
texture and color cues for better detection results. A spatially modulated cross attention (SMCA) mechanism is introduced to involve a
locality inductive bias and help the network better attend to the related image regions. We additionally propose an image-guided query
initialization strategy to involve image guidance on LiDAR BEV. This strategy helps produce object queries that are difficult to detect in
the sparse LiDAR point clouds.

3. Methodology 3.2. Query Initialization


In this section, we present the proposed method TransFu- Input-dependent. The query positions in the seminal
sion for LiDAR-camera 3D object detection. As shown in works [2, 44, 70] are randomly generated or learned as net-
Fig. 2, given a LiDAR BEV feature map and an image fea- work parameters, regardless of the input data. Such input-
ture map from convolutional backbones, our transformer- independent query positions will take extra stages (decoder
based detection head first decodes object queries into ini- layers) for their models [2, 70] to learn the moving process
tial bounding box predictions using the LiDAR information, towards the real object centers. Recently, it has been ob-
and then performs LiDAR-camera fusion by attentively fus- served in 2D object detection [56] that with a better ini-
ing object queries with useful image features. Below we tialization of object queries, the gap between 1-layer struc-
will first provide the preliminary knowledge about a trans- ture and 6-layer structure could be bridged. Inspired by this
former architecture for detection and then present the detail observation, we propose an input-dependent initialization
of TransFusion. strategy based on a center heatmap to achieve competitive
performance using only one decoder layer.
3.1. Preliminary: Transformer for 2D Detection Specifically, given a d dimensional LiDAR BEV fea-
ture map FL ∈ RX×Y ×d , we first predict a class-specific
Transformer [45] has been widely used for 2D object de-
heatmap Ŝ ∈ RX×Y ×K , where X × Y describes the size
tection [9, 44, 56, 70] since DETR [2] was proposed. DETR
of the BEV feature map and K is the number of categories.
uses a CNN backbone to extract image features and a trans-
Then we regard the heatmap as X × Y × K object candi-
former architecture to convert a small set of learned embed-
dates and select the top-N candidates for all the categories
dings (called object queries) into a set of predictions. The
as our initial object queries. To avoid spatially too closed
follow-up works [44,56,70] further equip the object queries
queries, following [65], we select the local maximum ele-
with positional information 1 . The final predictions of boxes
ments as our object queries, whose values are greater than
are the relative offsets w.r.t. the query positions to reduce
or equal to their 8-connected neighbors. Otherwise, a large
optimization difficulty. We refer readers to the original pa-
number of queries are needed to cover the BEV plane. The
pers [2, 70] for more details. In our work, each object query
positions and features of the selected candidates are used to
contains a query position providing the localization of the
initialize the query positions and query features. In this way,
object and a query feature encoding instance information,
our initial object queries will locate at or close to the poten-
such as the box’s size, orientation, etc.
tial object centers, eliminating the need of multiple decoder
1 Slightly different concepts might be introduced, e.g., reference points layers [20, 23, 49] to refine the locations.
in Deformable-DETR [70] and proposal boxes in Sparse-RCNN [44]. Category-aware. Unlike their 2D projections on the image

1082

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
plane, the objects on the BEV plane are all in absolute scale object query with predicted bounding box. object query with predicted bounding box.

and has small scale variance among the same categories. To


leverage such properties for better multi-class detection, we
make the object queries category-aware by equipping each
query with a category embedding. Specifically, using the
category of each selected candidate (e.g. Ŝijk belonging to
the k-th category), we element-wisely sum the query feature
with a category embedding produced by linearly projecting
the one-hot category vector into a Rd vector. The category
embedding brings benefits in two aspects: on the one hand,
it serves as a useful side information when modelling the
object-object relations in the self-attention modules and the
Figure 3. The first row shows the input images and the predic-
object-context relations in the cross-attention modules. On
tions of object queries projected on the images, and the second
the other hand, during prediction, it could deliver valuable
row shows the cross-attention maps. Our fusion strategy is able
prior knowledge of the object, making the network focus to dynamically choose relevant image pixels and is not limited by
on intra-category variance and thus benefiting the property the number of LiDAR points. The two images are picked from
prediction. nuScenes and Waymo, respectively.

3.3. Transformer Decoder and FFN LiDAR points. When an object only contains a small num-
The decoder layer follows the design of DETR [23] and ber of LiDAR points, it can fetch only the same number
the detailed architecture is provided in the supplementary of image features, wasting the rich semantic information of
Sec. ??. The cross attention between object queries and high-resolution images. To mitigate this issue, we do not
the feature maps (either from point clouds or images) ag- fetch the multiview image features based on the hard asso-
gregates relevant context onto the object candidates, while ciation between LiDAR points and image pixels. Instead,
the self attention between object queries reasons pairwise we retain all the image features FC ∈ RNv ×H×W ×d as our
relations between different object candidates. The query memory bank, and use the cross-attention mechanism in the
positions are embedded into d-dimensional positional en- transformer decoder to perform feature fusion in a sparse-
coding with a Multilayer Perceptron (MLP), and element- to-dense and adaptive manner, as shown in Fig. 2.
wisely summed with the query features. This enables the SMCA for Image Feature Fusion. Multi-head attention is
network to reason about both context and position jointly. a popular mechanism to perform information exchange and
The N object queries containing rich instance informa- build a soft association between two sets of inputs, and it has
tion are then independently decoded into boxes and class been widely used for the feature matching task [34, 41]. To
labels by a feed-forward network (FFN). Following Center- mitigate the sensitivity towards sensor calibration and infe-
Point [57], our FFN predicts the center offset from the query rior image features brought by the hard-association strategy,
position as δx, δy, bounding box height as z, size l, w, h as we leverage the cross-attention mechanism to build the soft
log(l), log(w), log(h), yaw angle α as sin(α), cos(α) and association between LiDAR and images, enabling the net-
the velocity (if available) as vx , vy . We also predict a per- work to adaptively determine where and what information
class probability p̂ ∈ [0, 1]K for K semantic classes. Each should be taken from the images.
attribute is computed by a separate two-layer 1 × 1 convolu- Specifically, we first identify the specific image in which
tion. By decoding each object query into prediction in par- the object queries are located using previous predictions as
allel, we get a set of predictions {b̂t , p̂t }N well as the calibration matrices, and then perform cross at-
t as output, where
b̂t is the predicted bounding box for the i-th query. Fol- tention between the object queries and the corresponding
lowing [23], we adopt the auxiliary decoding mechanism, image feature map. However, as the LiDAR features and
which adds FFN and supervision after each decoder layer. image features are from completely different domains, the
Hence, we can have initial bounding box predictions from object queries might attend to visual regions unrelated to
the first decoder layer. We leverage such initial predictions the bounding box to be predicted, leading to a long train-
in the LiDAR-camera fusion module to constrain the cross ing time for the network to accurately identify the proper
attention, as explained in the next section. regions on images. Inspired by [9], we design a spatially
modulated cross attention (SMCA) module, which weighs
3.4. LiDAR-Camera Fusion the cross attention by a 2D circular Gaussian mask around
the projected 2D center of each query. The 2D Gaussian
Image Feature Fetching. Although impressive improve-
weight mask M is generated in a similar way as Center-
ment has been brought by point-level fusion methods [46, (i−cx )2 +(j−cy )2
47], their fusion quality is largely limited by the sparsity of Net [65], Mij = exp(− σr 2 ), where (i, j) is
the spatial indices of the weight mask M, (cx , cy ) is the 2D

1083

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
#
center computed by projecting the query prediction onto the Lidar BEV

Query Initialization
image plane, r is the radius of the minimum circumscribed Features
as
Collapse along
circle of the projected corners of the 3D bounding box, and height axis # $

Multi-Head
σ is the hyper-parameter to modulate the bandwidth of the ! maxpool as !, #
Attention
Gaussian distribution. Then this weight map is element- " "

wisely multiplied with the cross-attention map among all Image Features Fused BEV Features
Figure 4. We first condense the image features along the vertical
the attention heads. In this way, each object query only at-
dimension, and then project the features onto the BEV plane us-
tends to the related region around the projected 2D box, so
ing cross attention with the LiDAR BEV features. Each image is
that the network can learn where to select image features processed by a separate multi-head attention layer, which captures
based on the input LiDAR features better and faster. The the relation between image column and BEV locations.
visualization of the attention map is shown in Fig. 3. The
network typically tends to focus on the foreground pixels initialization strategy, which selects object queries leverag-
close to the object center and ignore the irrelevant pixels, ing both the LiDAR and camera information.
providing valuable semantic information for object classifi- Specifically, we generate a LiDAR-camera BEV feature
cation and bounding box regression. After SMCA, we use map FLC by projecting the image features FC onto the BEV
another FFN to produce the final bound box predictions us- plane through cross attention with LiDAR BEV features
ing the object queries containing both LiDAR and image FL . Inspired by [32], we use the multiview image features
information. collapsed along the height axis as the key-value sequence of
3.5. Label Assignment and Losses the attention mechanism, as shown in Fig. 4. The collaps-
ing operation is based on the observation that the relation
Following DETR [23], we find the bipartite matching be- between BEV locations and image columns can be estab-
tween the predictions and ground truth objects through the lished easily using camera geometry, and usually there is
Hungarian algorithm [13], where the matching cost is de- at most one object along each image column. Therefore,
fined by a weighted sum of classification, regression, and collapsing along the height axis can significantly reduce the
IoU cost: computation without losing critical information. Although
some fine-grained image features might be lost during this
C_{match} = \lambda _1 L_{cls}(p, \hat p) + \lambda _2 L_{reg}(b, \hat b) + \lambda _3 L_{iou}(b, \hat b), \label {eq:matching_cost} (1)
process, it already meets our need as only a hint on poten-
where Lcls is the binary cross entropy loss, Lreg is the L1 tial object positions is required. Afterward, similar to Sec.
loss between the predicted BEV centers and the ground- 3.2, we use FLC to predict the heatmap, which is averaged
truth centers (both normalized in [0, 1]), and Liou is the with the LiDAR-only heatmap Ŝ as the final heatmap ŜLC .
IoU loss [64] between the predicted boxes and ground-truth Using ŜLC to select and initialize the object queries, our
boxes. λ1 , λ2 , λ3 are the coefficients of the individual cost model is able to detect objects that are difficult to detect in
terms. We provide sensitivity analysis of these terms in the LiDAR point clouds.
supplementary Sec. ??. Since the number of predictions is Note that proposing a novel method to project the image
usually larger than that of GT boxes, the unmatched predic- features onto the BEV plane is beyond the scope of this pa-
tions are considered as negative samples. Given all matched per. We believe that our method could benefit from more
pairs, we compute a focal loss [18] for the classification research progress [26, 32, 33] in this direction.
branch. The bounding box regression is supervised by an
L1 loss for only positive pairs. For the heatmap predic- 4. Implementation Details
tion, we adopt a penalty-reduced focal loss following Cen-
terPoint [57]. The total loss is the weighted sum of losses Training. We implement our network in PyTorch [25] us-
for each component. We adopt the same label assignment ing the open-sourced MMDetection3D [6]. For nuScenes,
strategy and loss formulation for both decoder layers. we use the DLA34 [60] of the pretrained CenterNet as our
2D backbone and keep its weights frozen during training,
3.6. Image-Guided Query Initialization following [47]. We set the image size to 448 × 800, which
Since our object queries are currently selected using only performs comparably with full resolution (896 × 1600).
LiDAR features, it potentially leads to sub-optimality in VoxelNet [52, 67] is chosen as our 3D backbone. Our train-
terms of the detection recall. Empirically, our model al- ing consists of two stages: 1) We first train the 3D backbone
ready achieves high recall and shows superior performance with the first decoder layer and FFN for 20 epochs, which
over the baselines (Sec. 5). Nevertheless, to further lever- only needs the LiDAR point clouds as input and produces
age the ability of high-resolution images in detecting small the initial 3D bounding box predictions. We adopt the same
objects and make our algorithm more robust against sparse data augmentation and training schedules as prior LiDAR-
LiDAR point clouds, we propose an image-guided query only works [57, 68]. Note that we also find the copy-and-
paste augmentation strategy [52] benefits the convergence

1084

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
Method Modality Voxel Size (m) mAP NDS Car Truck C.V. Bus Trailer Barrier Motor. Bike Ped. T.C.
PointPillar [14] L (0.2, 0.2, 8) 40.1 55.0 76.0 31.0 11.3 32.1 36.6 56.4 34.2 14.0 64.0 45.6
CBGS [68] L (0.1, 0.1, 0.2) 52.8 63.3 81.1 48.5 10.5 54.9 42.9 65.7 51.5 22.3 80.1 70.9
CenterPoint [57] L (0.075, 0.075, 0.2) 60.3 67.3 85.2 53.5 20.0 63.6 56.0 71.1 59.5 30.7 84.6 78.4
PointPainting [46] LC (0.2, 0.2, 8) 46.4 58.1 77.9 35.8 15.8 36.2 37.3 60.2 41.5 24.1 73.3 62.4
3D-CVF [59] LC (0.05, 0.05, 0.2) 52.7 62.3 83.0 45.0 15.9 48.8 49.6 65.9 51.2 30.4 74.2 62.9
PointAugmenting [47] LC (0.075, 0.075, 0.2) 66.8 71.0 87.5 57.3 28.0 65.2 60.7 72.6 74.3 50.9 87.9 83.6
MVP [58] LC (0.075, 0.075, 0.2) 66.4 70.5 86.8 58.5 26.1 67.4 57.3 74.8 70.0 49.3 89.1 85.0
FusionPainting [51] LC (0.075, 0.075, 0.2) 68.1 71.6 87.1 60.8 30.0 68.5 61.7 71.8 74.7 53.5 88.3 85.0
TransFusion-L L (0.075, 0.075, 0.2) 65.5 70.2 86.2 56.7 28.2 66.3 58.8 78.2 68.3 44.2 86.1 82.0
TransFusion LC (0.075, 0.075, 0.2) 68.9 71.7 87.1 60.0 33.1 68.3 60.8 78.1 73.6 52.9 88.4 86.7
Table 1. Comparison with SOTA methods on the nuScenes test set. ‘C.V.’, ‘Ped.’, and ‘T.C.’ are short for construction vehicle, pedestrian,
and traffic cone, respectively. ‘L’ and ‘C’ represent LiDAR and Camera, respectively. The best results are in boldface (Best LiDAR-only
results are marked blue and best LC results are marked red). For FusionPainting [51], we report the results on the nuScenes website, which
are better than what they reported in their paper. Note that CenterPoint [57] and PointAugmenting [47] utilize double-flip testing while we
do not use any test time augmentation. Please find detailed results here.2

but could disturb the real data distribution, so we disable ing, consisting of 700, 150, and 150 scenes for train-
this augmentation for the last 5 epochs following [47] (they ing, validation, and testing, respectively. Each frame con-
called a fade strategy). 2) We then train the LiDAR-camera tains one point cloud and six calibrated images cover-
fusion and the image-guided query initialization module for ing the 360-degree horizontal FOV. For 3D detection, the
another 6 epochs. We find that this two-step training scheme main metrics are mean Average Precision (mAP) [7] and
performs better than joint training, since we can adopt more nuScenes detection score (NDS). The mAP is defined by
flexible augmentations for the first training stage. See sup- the BEV center distance instead of the 3D IoU, and the fi-
plementary Sec. ?? for the detailed hyper-parameters and nal mAP is computed by averaging over distance thresholds
settings on Waymo. of 0.5m, 1m, 2m, 4m across ten classes. NDS is a consol-
Testing. During inference, the final score is computed as the idated metric of mAP and other attribute metrics, includ-
geometric average of the heatmap score Ŝij and the classi- ing translation, scale, orientation, velocity, and other box
fication score p̂t . We use all the outputs as our final predic- attributes. Following CenterPoint [57], we set the voxel size
tions without Non-maximum Suppression (NMS) (see the to (0.075m, 0.075m, 0.2m).
effect of NMS in supplementary Sec. ??). It is noteworthy Waymo Open Dataset. This dataset consists of 798 scenes
that previous point-level fusion methods such as PointAug- for training and 202 scenes for validation. The official met-
menting [47] rely on two different models for camera FOV rics are mAP and mAPH (mAP weighted by heading accu-
and LiDAR-only regions if the cameras are not 360-degree racy). The mAP and mAPH are defined based on the 3D
cameras, because only points in the camera FOV could fetch IoU threshold of 0.7 for vehicles and 0.5 for pedestrians
the corresponding image features. In contrast, we use a sin- and cyclists. These metrics are further broken down into
gle model to deal with both camera FOV and LiDAR-only two difficulty levels: LEVEL1 for boxes with more than
regions, since object queries located outside camera FOV five LiDAR points and LEVEL2 for boxes with at least one
will directly ignore the fusion stage and the initial predic- LiDAR point. Unlike the 360-degree cameras in nuScenes,
tions from the first decoder layer will be a safeguard. the cameras in Waymo only cover around 250 degrees hor-
izontally. The voxel size is set to (0.1m, 0.1m, 0.15m).
5. Experiments
5.1. Main Results
In this section, we first make comparisons with the state-
of-the-art methods on nuScenes and Waymo. Then we con- nuScenes Results. We submitted our detection results to
duct extensive ablation studies to demonstrate the impor- the nuScenes evaluation server. Without any test time aug-
tance of each key component of TransFusion. Moreover, we mentation or model ensemble, our TransFusion outperforms
design two experiments to show the robustness of our Trans- all competing non-ensembled methods on the nuScenes
Fusion against inferior image conditions. Besides TransFu- leaderboard at the time of submission. As shown in Table 1,
sion, we also include a model variant, which is based on our TransFusion-L already outperforms the state-of-the-
the first training stage, i.e., producing the initial bounding art LiDAR-only methods by a significant margin (+5.2%
box predictions using only point clouds. We denote it as mAP, +2.9% NDS) and even surpasses some multi-modality
TransFusion-L and believe that it can serve as a strong base- methods. We ascribe this performance gain to the rela-
line for LiDAR-only detection. We provide the qualitative tion modeling power of the transformer decoder as well as
results in supplementary Sec. ??. the proposed query initialization strategies, which are ab-
nuScenes Dataset. The nuScenes dataset is a large-scale lated in Sec. 5.3. Once enabling the proposed fusion com-
autonomous-driving dataset for 3D detection and track- ponents, our TransFusion receives remarkable performance
boost (+3.4% mAP, +1.5% NDS) and outperforms all the
2 https://www.nuscenes.org/object-detection
previous methods, including FusionPainting [51], which

1085

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
Vehicle Pedestrian Cyclist Overall Nighttime Daytime
PointPillar [46] 62.5 50.2 59.9 57.6 TransFusion-L 49.2 60.3
PVRCNN [36] 64.8 46.7 - - CC 49.4 (+0.2) 63.4 (+3.1)
LiDAR-RCNN [15] 64.2 51.7 64.4 60.1
CenterPoint [57] 66.1 62.4 67.6 65.3
PA 51.0 (+1.8) 64.3 (+4.0)
PointAugmenting [47] 62.2 64.6 73.3 66.7 TransFusion 55.2 (+6.0) 65.7 (+5.4)
TransFusion-L 65.1 63.7 65.9 64.9 Table 4. mAP breakdown over daytime and nighttime. We exclude
TransFusion 65.1 64.0 67.4 65.5
categories that do no have any labeled samples.
Table 2. LEVEL 2 mAPH on Waymo validation set. For Center-
Point, we report the performance of single-frame one-stage model test set only allows at most three submissions, all the ex-
trained in 36 epochs. periments are conducted on the validation set. For fast it-
AMOTA↑ TP↑ FP↓ FN↓ IDS↓ eration, we reduce the first stage training to 12 epochs and
CenterPoint [57] 63.8 95877 18612 22928 760 remove the fade strategy. All the other parameters are the
EagerMOT [11] 67.7 93484 17705 24925 1156
AlphaTrack [61] 69.3 95851 18421 22996 718 same as the main experiments. To avoid overstatement, we
TransFusion-L 68.6 95235 17851 23437 893 additionally build two baseline LiDAR-camera detectors by
TransFusion 71.8 96775 16232 21846 944 equipping our TransFusion-L with two representative fu-
Table 3. Comparison of the tracking results on nuScenes test set. sion methods on nuScenes: fusing LiDAR and image fea-
Please find detailed results here.3 tures by point-wise concatenation (denoted as CC) and the
fusion strategy of PointAugmenting (denoted as PA).
uses extra data to train their segmentation sub-networks. Nighttime. We first split the validation set into daytime
Moreover, thanks to our soft-association mechanism, Trans- and nighttime based on scene descriptions provided by
Fusion is robust to inferior image conditions including de- nuScenes and show the performance gain under different
generated image quality and sensor misalignment, as shown situations in Table 4. Our method brings a much larger per-
in the next section. formance gain during nighttime, where the worse lighting
Waymo Results. We report the performance of our model negatively affects the hard-association based fusion strate-
over all three classes on Waymo validation set in Table 2. gies CC and PA.
Our fusion strategy improves the mAPH of pedestrian and Degenerated Image Quality. In Table 5, we randomly drop
cyclist classes by 0.3 and 1.5x, respectively. We suspect two several images for each frame by setting the image features
reasons for the relatively small improvement brought by the of such images to zero during inference. Since both CC
image components. First, the semantic information of im- and PA fuse LiDAR and image features in a tightly-coupled
ages might have less impact on the coarse-grained catego- way, their performance drops significantly when some im-
rization of Waymo. Second, the initial bounding boxes from ages are not available during inference. In contrast, our
the first decoder layer are already with accurate locations TransFusion is able to maintain a high mAP under all cases.
since the point clouds in Waymo are denser than those in When all the six images are not available, CC and PA suf-
nuScenes (see more discussions in supplementary Sec. ??). fer from 23.8% and 17.2% mAP degradation, respectively,
Note that CenterPoint achieves a better performance with while TransFusion still keeps the mAP at a competitive level
a multi-frame input and a second-stage refinement mod- of 61.7%. This advantage comes from the sequential design
ule. Such components are orthogonal to our method and we and the attentive fusion strategy, which first generates ini-
leave a more powerful TransFusion for Waymo as the future tial predictions based on LiDAR data and then only gathers
work. PointAugmenting achieves better performance than useful information from image features adaptively. More-
ours but relies on CenterPoint to get the predictions outside over, we could even directly disable the fusion module if
camera FOV for a full-region detection, making their sys- the camera malfunctioning is known, such that the whole
tem less flexible. system could still work seamlessly in a LiDAR-only mode.
Extend to Tracking. To further demonstrate the general-
ization capability, we evaluate our model in a 3D multi- # Dropped Images 0 1 3 6
object tracking (MOT) task by performing tracking-by- CC 63.3 59.8 (-3.5) 50.9 (-12.4) 39.5 (-23.8)
PA 64.2 61.6 (-2.6) 55.4 (-8.8) 47.0 (-17.2)
detection with the same tracking algorithms adopted by TransFusion 65.6 65.1 (-0.5) 63.9 (-1.7) 61.7 (-3.9)
CenterPoint. We refer readers to the original paper [57] Table 5. mAP under different numbers of dropped images. The
for details. As shown in Table 3, our model significantly number in each bracket is the mAP drop from the standard input.
outperforms CenterPoint and sets the new state-of-the-art
results on the leaderboard of nuScenes tracking. Sensor Misalignment. We evaluate different fusion meth-
ods under a setting where LiDAR and images are not well-
5.2. Robustness against Inferior Image Conditions calibrated following RoarNet [39]. Specifically, we ran-
domly add a translation offset to the transformation matrix
We design three experiments to demonstrate the robust-
from camera to LiDAR sensor. As shown in Fig. 5, Trans-
ness of our proposed fusion module. Since the nuScenes
Fusion achieves better robustness against the calibration er-
3 https://www.nuscenes.org/tracking ror compared with other fusion methods. When two sensors

1086

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
66 65.60 65.60 mAP NDS Params (M) Latency (ms)
65.58 65.49
65.33

65
65.11 CenterPoint 57.4 65.2 8.54 117.2
64.22
TransFusion-L 60.0 66.8 7.96 114.9
64.13
64 63.71
CC 63.3 67.6 8.01 + 18.34 212.3
PA 64.2 68.7 13.9 + 18.34 288.2
mAP (%)

63.32
63.12 63.11
63 w/o Fusion 61.6 67.4 9.08 + 18.34 215.0
62.56 62.53

62 61.86 61.89
w/o Guide 64.8 69.3 8.35 + 18.34 236.9
TransFusion 65.6 69.7 9.47 + 18.34 265.9
61.11
61 TransFusion
PA
CC
60.48 Table 7. Ablation of the proposed fusion components. 18.34 rep-
60 0.0 0.2 0.4 0.6 0.8 1.0 resents the parameter size of the 2D backbone. The latency is
Discrepancy (m)
Figure 5. mAP under sensor misalignment cases. The X axis refers measured on an Intel Core i7 CPU and a Titan V100 GPU. For
to the translational discrepancy between two sensors. CenterPoint, we use re-implementations in MMDetection3D.

ule (denoted as w/o Fusion) and the image-guided query


are misaligned by 1m, the mAP of our model only drops by initialization (denoted as w/o Guide). As shown in Table 7,
0.49%, while the mAP of PA and CC degrades by 2.33% the image feature fusion and image-guided query initializa-
and 2.85%, respectively. In our method, the calibration ma- tion bring 4.8% and 1.6% mAP gain, respectively. The for-
trix is only used for projecting the object queries onto im- mer provides more distinctive instance features, which are
ages, and the fusion module is not strict with the projected particularly critical for classification on nuScenes, where
locations since the attention mechanism could adaptively some categories are challenging to distinguish, such as
find the relevant image features around based on the con- trailer and construction vehicle. The latter affects less, since
text information. The insensitivity towards sensor calibra- TransFusion-L already has enough recall. We believe the
tion also enables the possibility to pipelining the 2D and 3D latter will be more useful when point clouds are sparser.
backbones such that the LiDAR features are fused with the Compared with other fusion methods, our fusion strategy
features from the previous images [46]. brings a larger performance gain with a modestly increas-
ing number of parameters and latency. To better understand
5.3. Ablation Studies where the improvements are from, we show the mAP break-
down on different subsets based on the range in Table 8. Our
We conduct ablation studies on the nuScenes validation
fusion method gives larger performance boost for distant re-
set to study the effectiveness of the proposed components.
gions where 3D objects are difficult to detect or classify in
C.A. I.D. #Layers #Epochs mAP NDS LiDAR modality.
a) ✓ ✓ 1 12 60.0 66.8
b) ✓ 1 12 54.3 63.9 <15m 15-30m >30m
c) ✓ ✓ 3 12 59.9 67.1 TransFusion-L 70.4 59.5 35.3
TransFusion 75.5 (+5.1) 66.9 (+7.4) 43.7 (+8.4)
d) ✓ 1 12 24.0 33.8
e) ✓ 3 12 28.3 43.4 Table 8. mAP breakdown over BEV distance between object cen-
f) ✓ 3 36 46.9 57.8 ter and ego vehicle in meters.
Table 6. Ablation of the query initialization module. C.A.:
category-aware; I.D.: input-dependent. 6. Conclusion
Query Initialization. In Table 6, we study how the query We have designed an effective and robust transformer-
initialization strategy affects the performance of the initial based LiDAR-camera 3D detection framework with a soft-
bounding box prediction. a) the first row is TransFusion-L. association mechanism to adaptively determine where and
b) when the category-embedding is removed, NDS drops what information should be taken from images. Our Trans-
to 63.9%. d)-f) shows the performance of the models Fusion sets the new state-of-the-art results on the nuScenes
trained without the input-dependent strategy. Specifically, detection and tracking leaderboards, and shows competi-
we make the query positions as a set of learnable parame- tive results on Waymo detection benchmark. The exten-
ters (N × 2) to capture the statistics of potential object lo- sive ablative experiments demonstrate the robustness of our
cations in the dataset. The model under this setting only method against inferior image conditions. We hope that our
achieves 33.8% NDS. Increasing the number of decoder work will inspire further investigation of LiDAR-camera fu-
layers or the number of training epochs boosts the perfor- sion for driving-scene perception, and the application of a
mance, but TransFusion-L still outperforms the model in (f) soft-association based fusion strategy to other tasks, such as
by 9.0% NDS. a), c): In contrast, with the proposed query 3D segmentation.
initialization strategy, our TransFusion-L does not require Acknowledgements. This work is supported by Hong
more decoder layers. Kong RGC (GRF 16206819, 16203518, T22-603/15N),
Fusion Components. To study how the image informa- Guangzhou Okay Information Technology with the project
tion benefits the detection results, we ablate the proposed GZETDZ18EG05, and City University of Hong Kong (No.
fusion components by removing the feature fusion mod- 7005729).

1087

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
References [19] Zili Liu, Guodong Xu, Honghui Yang, Minghao Chen,
Kuoliang Wu, Zheng Yang, Haifeng Liu, and Deng Cai.
[1] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Suppress-and-refine framework for end-to-end 3d object de-
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- tection. arXiv, 2021. 2
ancarlo Baldan, and Oscar Beijbom. nuScenes: A multi- [20] Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong.
modal dataset for autonomous driving. CVPR, 2020. 1 Group-free 3d object detection via transformers. ICCV,
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas 2021. 2, 3
Usunier, Alexander Kirillov, and Sergey Zagoruyko. DETR: [21] Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi
End-to-end object detection with transformers. ECCV, 2020. Feng, Xiaodan Liang, Hang Xu, and Chunjing Xu. Voxel
2, 3 transformer for 3d object detection. ICCV, 2021. 2
[3] Qi Chen, Lin Sun, Ernest C. H. Cheung, and A. Yuille. Every [22] Gregory P. Meyer, Jake Charland, Darshan Hegde,
View Counts: Cross-view consistency in 3d object detection Ankita Gajanan Laddha, and Carlos Vallespi-Gonzalez. Sen-
with hybrid-cylindrical-spherical voxelization. NeurIPS, sor fusion for joint 3d object detection and semantic segmen-
2020. 2 tation. CVPRW, 2019. 1
[4] Qi Chen, Lin Sun, Zhixin Wang, K. Jia, and A. Yuille. Object [23] Ishan Misra, Rohit Girdhar, and Armand Joulin. An End-
as Hotspots: An anchor-free 3d object detection approach via to-End Transformer Model for 3D Object Detection. ICCV,
firing of hotspots. ECCV, 2020. 2 2021. 2, 3, 4, 5
[5] Xiaozhi Chen, Huimin Ma, Jixiang Wan, B. Li, and Tian [24] Xuran Pan, Zhuofan Xia, Shiji Song, L. Li, and Gao Huang.
Xia. Multi-view 3d object detection network for autonomous 3d object detection with pointformer. CVPR, 2021. 2
driving. CVPR, 2017. 1, 2 [25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory
[6] MMDetection3D Contributors. MMDetection3D: Open- Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al-
MMLab next-generation platform for general 3D object ban Desmaison, Luca Antiga, and Adam Lerer. Automatic
detection. https://github.com/open- mmlab/ differentiation in pytorch. NeurIPS-W, 2017. 5
mmdetection3d, 2020. 5 [26] Jonah Philion and S. Fidler. Lift, Splat, Shoot: Encoding
[7] M. Everingham, L. Gool, Christopher K. I. Williams, J. images from arbitrary camera rigs by implicitly unprojecting
Winn, and Andrew Zisserman. The pascal visual object to 3d. ECCV, 2020. 5
classes (voc) challenge. IJCV, 2009. 6 [27] C. Qi, Xinlei Chen, O. Litany, and L. Guibas. ImVoteNet:
[8] Lue Fan, Xuan Xiong, Feng Wang, Naiyan Wang, and Boosting 3d object detection in point clouds with image
Zhaoxiang Zhang. RangeDet: In defense of range view for votes. CVPR, 2020. 2
lidar-based 3d object detection. ICCV, 2021. 2 [28] C. Qi, O. Litany, Kaiming He, and L. Guibas. Deep hough
[9] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, voting for 3d object detection in point clouds. ICCV, 2019.
and Hongsheng Li. Fast convergence of detr with spatially 2
modulated co-attention. ICCV, 2021. 3, 4 [29] C. Qi, W. Liu, Chenxia Wu, Hao Su, and L. Guibas. Frustum
[10] Tengteng Huang, Zhe Liu, Xiwu Chen, and X. Bai. EPNet: pointnets for 3d object detection from rgb-d data. CVPR,
Enhancing point features with image semantics for 3d object 2018. 1, 2
detection. ECCV, 2020. 1, 2 [30] C. Qi, Hao Su, Kaichun Mo, and L. Guibas. PointNet: Deep
[11] Aleksandr Kim, Aljosa Osep, and Laura Leal-Taixé. Ea- learning on point sets for 3d classification and segmentation.
gerMOT: 3d multi-object tracking via sensor fusion. ICRA, CVPR, 2017. 1
2021. 7 [31] Shaoqing Ren, Kaiming He, Ross B. Girshick, and J. Sun.
[12] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, Faster R-CNN: Towards real-time object detection with re-
and Steven L. Waslander. Joint 3d proposal generation and gion proposal networks. TPAMI, 2015. 1
object detection from view aggregation. IROS, 2018. 1 [32] Thomas Roddick and R. Cipolla. Predicting semantic map
[13] H. Kuhn. The hungarian method for the assignment problem. representations from images using pyramid occupancy net-
Naval Research Logistics Quarterly, 1955. 5 works. CVPR, 2020. 5
[14] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, [33] Thomas Roddick, Alex Kendall, and R. Cipolla. Ortho-
Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders graphic feature transform for monocular 3d object detection.
for object detection from point clouds. CVPR, 2019. 1, 2, 6 BMVC, 2019. 5
[15] Zhichao Li, Feng Wang, and Naiyan Wang. LiDAR R-CNN: [34] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz,
An efficient and universal 3d object detector. CVPR, 2021. 7 and Andrew Rabinovich. SuperGlue: Learning feature
[16] Ming Liang, Binh Yang, Yun Chen, Rui Hu, and R. Urta- matching with graph neural networks. CVPR, 2020. 4
sun. Multi-task multi-sensor fusion for 3d object detection. [35] Hualian Sheng, Sijia Cai, Yuan Liu, Bing Deng, Jianqiang
CVPR, 2019. 1 Huang, Xiansheng Hua, and Min-Jian Zhao. Improving 3d
[17] Ming Liang, Binh Yang, Shenlong Wang, and R. Urtasun. object detection with channel-wise transformer. ICCV, 2021.
Deep continuous fusion for multi-sensor 3d object detection. 2
ECCV, 2018. 1 [36] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping
[18] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-
and Piotr Dollár. Focal loss for dense object detection. ICCV, voxel feature set abstraction for 3d object detection. CVPR,
2017. 5 2020. 2, 7

1088

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.
[37] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointR- [53] Binh Yang, Wenjie Luo, and R. Urtasun. PIXOR: Real-time
CNN: 3d object proposal generation and detection from 3d object detection from point clouds. CVPR, 2018. 2
point cloud. CVPR, 2019. 2 [54] Zetong Yang, Y. Sun, Shu Liu, and Jiaya Jia. 3DSSD: Point-
[38] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, based 3d single stage object detector. CVPR, 2020. 2
and Hongsheng Li. From Points to Parts: 3d object detec- [55] Zetong Yang, Y. Sun, Shu Liu, Xiaoyong Shen, and Jiaya
tion from point cloud with part-aware and part-aggregation Jia. STD: Sparse-to-dense 3d object detector for point cloud.
network. TPAMI, 2021. 2 ICCV, 2019. 2
[39] Kiwoo Shin, Y. Kwon, and M. Tomizuka. RoarNet: A ro- [56] Z. Yao, Jiangbo Ai, Boxun Li, and Chi Zhang. Efficient
bust 3d object detection based on region approximation re- DETR: Improving end-to-end object detector with dense
finement. IV, 2019. 1, 2, 7 prior. arXiv, 2021. 3
[40] Vishwanath A. Sindagi, Yin Zhou, and Oncel Tuzel. MVX- [57] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-
Net: Multimodal voxelnet for 3d object detection. ICRA, based 3d object detection and tracking. CVPR, 2021. 2, 4, 5,
2019. 1, 2 6, 7
[41] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and [58] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Multi-
Xiaowei Zhou. LoFTR: Detector-free local feature matching modal virtual point 3d detection. NeurIPS, 2021. 6
with transformers. CVPR, 2021. 4
[59] Jin Hyeok Yoo, Yeocheol Kim, Ji Song Kim, and J. Choi.
[42] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
3D-CVF: Generating joint camera and lidar features using
Chouard, Vijaysai Patnaik, P. Tsui, James Guo, Yin Zhou,
cross-view spatial feature fusion for 3d object detection.
Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han,
ECCV, 2020. 1, 6
Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, S. Ettinger,
[60] F. Yu, Dequan Wang, and Trevor Darrell. Deep layer aggre-
Maxim Krivokon, A. Gao, Aditya Joshi, Y. Zhang, Jonathon
gation. CVPR, 2018. 5
Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability
in perception for autonomous driving: Waymo open dataset. [61] Yihan Zeng, Chao Ma, Ming Zhu, Zhiming Fan, and Xi-
CVPR, 2020. 1 aokang Yang. Cross-modal 3d object detection and tracking
[43] Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin F. Elsayed, for auto-driving. IROS, 2021. 7
Alex Bewley, Xiao Zhang, Cristian Sminchisescu, and Drago [62] Wenwei Zhang, Zhe Wang, and Chen Change Loy. Multi-
Anguelov. RSN: Range sparse net for efficient, accurate lidar modality cut and paste for 3d object detection. arXiv, 2020.
3d object detection. CVPR, 2021. 2 1
[44] Pei Sun, Rufeng Zhang, Yi Jiang, T. Kong, Chenfeng Xu, [63] Lin Zhao, Hui Zhou, Xinge Zhu, Xiao Song, Hongsheng Li,
W. Zhan, M. Tomizuka, L. Li, Zehuan Yuan, C. Wang, and and Wenbing Tao. LIF-Seg: Lidar and camera image fusion
Ping Luo. Sparse R-CNN: End-to-end object detection with for 3d lidar semantic segmentation. arXiv, 2021. 1
learnable proposals. CVPR, 2021. 2, 3 [64] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo
[45] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Yin, Yuchao Dai, and Ruigang Yang. Iou loss for 2d/3d ob-
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and ject detection. 3DV, 2019. 5
Illia Polosukhin. Attention is all you need. NeurIPS, 2017. [65] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Ob-
3 jects as points. arXiv, 2019. 3, 4
[46] Sourabh Vora, Alex H. Lang, Bassam Helou, and Oscar Bei- [66] Yin Zhou, Pei Sun, Y. Zhang, Dragomir Anguelov, J. Gao,
jbom. PointPainting: Sequential fusion for 3d object detec- Tom Y. Ouyang, James Guo, Jiquan Ngiam, and Vijay Va-
tion. CVPR, 2020. 1, 2, 4, 6, 7, 8 sudevan. MVF: End-to-end multi-view fusion for 3d object
[47] Chunwei Wang, Chao Ma, Ming Zhu, and Xiaokang Yang. detection in lidar point clouds. CoRL, 2019. 2
PointAugmenting: Cross-modal augmentation for 3d object [67] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning
detection. CVPR, 2021. 1, 2, 4, 5, 6, 7 for point cloud based 3d object detection. CVPR, 2018. 1, 2,
[48] Yue Wang, Alireza Fathi, Abhijit Kundu, David A. Ross, 5
Caroline Pantofaru, Thomas A. Funkhouser, and Justin M. [68] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and
Solomon. Pillar-based object detection for autonomous driv- Gang Yu. Class-balanced grouping and sampling for point
ing. In ECCV, 2020. 2 cloud 3d object detection. arXiv, 2019. 2, 5, 6
[49] Yue Wang and Justin Solomon. Object DGCNN: 3d object [69] Xinge Zhu, Yuexin Ma, Tai Wang, Yan Xu, Jianping Shi, and
detection using dynamic graphs. NeurIPS, 2021. 3 Dahua Lin. SSN: Shape signature networks for multi-class
[50] Liang Xie, Chao Xiang, Zhengxu Yu, Guodong Xu, Zheng object detection from point clouds. ECCV, 2020. 2
Yang, Deng Cai, and Xiaofei He. PI-RCNN: An efficient [70] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang,
multi-sensor 3d object detector with point-based attentive and Jifeng Dai. Deformable DETR: Deformable transform-
cont-conv fusion module. AAAI, 2020. 1 ers for end-to-end object detection. ICLR, 2021. 3
[51] Shaoqing Xu, Dingfu Zhou, Jin Fang, Junbo Yin, Bin Zhou,
and Liangjun Zhang. FusionPainting: Multimodal fusion
with adaptive attention for 3d object detection. ITSC, 2021.
1, 6
[52] Yan Yan, Yuxing Mao, and B. Li. SECOND: Sparsely em-
bedded convolutional detection. Sensors, 2018. 2, 5

1089

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on April 03,2024 at 01:18:25 UTC from IEEE Xplore. Restrictions apply.

You might also like