Dong et al. - 2021 - Radar Camera Fusion via Representation Learning in Autonomous Driving
Abstract
… in many practical autonomous driving and assisted driving systems.

Traditionally, radar-camera fusion is achieved by the combination of rule-based association algorithms and kinematic model-based tracking. The key is data association between radar and camera detections. The noisy and sparse nature of radar detections and the depth ambiguity of a mono camera make such an association problem very challenging. Traditionally, the association process is hand-crafted based on minimizing certain distance metrics along with some heuristic rules. It not only requires a large amount of engineering and tuning but is also hard to adapt to ever-growing data.

An emerging solution is to use learning-based methods to replace the rule-based radar-camera fusion. The latest advances focus on direct 3D object detection with the combined radar and camera data as the input [16, 25, 28]. These approaches all rely on LiDAR-based ground-truth to build the link between radar and camera. This is feasible on most public datasets such as nuScenes [4] and Waymo [29], but it cannot be applied to a large fleet of commercial autonomous vehicles, which are often equipped with only radars and cameras. In this study, we propose a scalable learning-based framework to associate radar and camera information without the costly LiDAR-based ground-truth. Our goal is to find representations of radar and camera detection results such that matched pairs are close and unmatched ones are far. We convert the detection results into image channels and combine them with the original image to feed into a convolutional neural network (CNN), namely AssociationNet. Training is performed on imperfect labels obtained from a traditional rule-based association method. A loss sampling mechanism is introduced to mitigate false labels. To further boost the performance, we guide the reasoning logic of AssociationNet by adding a novel ordinal loss. The proposed AssociationNet significantly outperforms the rule-based method through scene-dependent global reasoning.

Our main contributions are summarized as follows:

• We proposed a scalable learning-based radar-camera fusion framework without using ground-truth labels from LiDAR, which is suitable for building a low-cost, production-ready perception system for autonomous driving applications.

• We designed a loss sampling mechanism to alleviate the impact of label noise, and also invented an ordinal loss to enforce critical association logic into the model for performance enhancement.

• We developed a robust model via representation learning, which is capable of handling various challenging scenarios and outperforms the traditional rule-based algorithm by 11.6% in terms of the F1 score.
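The association idea described above, learning a representation for every radar pin and every camera detection so that matched pairs end up close in feature space and unmatched pairs end up far apart, can be sketched as follows. This is only an illustrative sketch, not the paper's AssociationNet or its exact training objective: the margin-based contrastive loss, the greedy nearest-neighbor matching, and the names pairwise_contrastive_loss, associate, margin, and max_dist are assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(radar_emb, box_emb, match, margin=2.0):
    """Pull matched radar/box embeddings together and push unmatched ones apart.

    radar_emb: (R, D) embeddings for radar pins
    box_emb:   (B, D) embeddings for camera bounding boxes
    match:     (R, B) float tensor, 1.0 where a pin and a box are associated
    """
    dist = torch.cdist(radar_emb, box_emb)              # (R, B) pairwise distances
    pos = match * dist.pow(2)                           # matched pairs: small distance
    neg = (1.0 - match) * F.relu(margin - dist).pow(2)  # unmatched pairs: at least `margin` apart
    return (pos + neg).mean()

def associate(radar_emb, box_emb, max_dist=1.0):
    """Greedy matching: each bounding box takes the nearest radar pin within max_dist."""
    dist = torch.cdist(radar_emb, box_emb)
    pairs = []
    for b in range(box_emb.shape[0]):
        r = int(dist[:, b].argmin())
        if float(dist[r, b]) < max_dist:
            pairs.append((r, b))
    return pairs
```

A push/pull objective of this kind is a standard way to make learned embeddings separable; the actual training signal in the paper additionally relies on the loss sampling mechanism and the ordinal loss discussed later.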
2. Related Work

2.1. Sensor Fusion

Traditionally, different sensory modules process their data separately. A downstream sensor fusion module augments the sensory outputs (typically detected objects) to form a more comprehensive understanding of the surroundings. Such an object-level fusion method is the mainstream approach [9, 17, 19, 12, 31] and is still widely used in many Advanced Driver Assistance Systems (ADAS). In object-level fusion, object detection is independently performed on each sensor, and the fusion algorithm combines such object detection results to create so-called global tracks for kinematic tracking [1].

Data association is the most critical and challenging task in object-level fusion. A precise association easily leads to 3D object detection and multiple-object tracking solutions [1, 5]. Traditional approaches tend to manually craft various distance metrics to represent the similarities between different sensory outputs. Distance minimization [9] and other heuristic rules are applied to find the associations. To handle the complexity and uncertainty, probabilistic models are also sometimes adopted in the association process [2].

2.2. Learning-Based Radar-Camera Fusion

Learning-based radar-camera fusion algorithms can be primarily categorized into three groups: data-level fusion, feature-level fusion, and object-level fusion. Data-level fusion and feature-level fusion combine the radar and camera information at the early stage [28, 14] and the middle stage [16, 7, 26], respectively, but both directly perform 3D object detection. Hence, they rely on LiDAR to provide ground-truth labels during training, which prohibits their usage on autonomous vehicles without LiDAR.

Learning-based object-level fusion remains under-explored due to the limited information contained in the detection results. In this study, our proposed method belongs to this category in that we focus on associating radar and camera detection results; thus, our method is more compatible with the traditional sensor fusion pipeline. On the other hand, our method also directly takes the raw camera image for further performance enhancement.

2.3. CNN for Heterogeneous Data

The tremendous success of CNNs on structured image data inspires their application to many other types of heterogeneous data, such as sensor parameters, point clouds, and association relationships between two groups of data [27]. To make such data compatible with a CNN, a popular approach is to adapt the heterogeneous data into a form of pseudo-images. Examples include encoding camera intrinsics into images with normalized coordinates and field-of-view maps [11], projecting radar data into the image plane to form new image channels [6, 28], and the various forms of projection-based LiDAR point-cloud representations [30, 23]. We adopted a similar approach in this study to handle the heterogeneous radar and camera outputs.
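The pseudo-image idea referenced above, projecting radar detections into the image plane and stacking them as additional channels next to the RGB image, could look roughly like the sketch below. The function name radar_to_image_channels, the single-pixel rasterization, and the choice of painted attributes are illustrative assumptions, not the paper's exact encoding.

```python
import numpy as np

def radar_to_image_channels(pins_xyz, attrs, cam_K, image_hw):
    """Rasterize radar pins into extra image channels.

    pins_xyz: (N, 3) pin positions already expressed in the camera coordinate frame
    attrs:    (N, C) per-pin attributes to paint (e.g., range, radial velocity)
    cam_K:    (3, 3) camera intrinsic matrix
    image_hw: (height, width) of the target image
    """
    h, w = image_hw
    channels = np.zeros((attrs.shape[1], h, w), dtype=np.float32)
    for p, a in zip(pins_xyz, attrs):
        if p[2] <= 0:                       # skip points behind the camera
            continue
        u, v, depth = cam_K @ p             # pinhole projection
        col, row = int(u / depth), int(v / depth)
        if 0 <= col < w and 0 <= row < h:
            channels[:, row, col] = a       # paint the pin's attributes at its pixel
    return channels                          # later stacked with the RGB channels
```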
2.4. Representation Learning

Representation learning has been considered the key to understanding complex environments and problems [3, 20, 18]. It has been widely used in many natural language processing tasks, such as word embedding [24], and many computer vision tasks, such as image classification [8], object detection [13], and keypoint matching [10]. In this study, we aim at learning a vector in a high-dimensional feature space as the representation of each object in the scene, in order to establish interactions between objects as well as enable global reasoning about the scene.

3. Problem Formulation

We use a front-facing camera and a front-facing millimeter-wave mid-range radar for the proposed radar-camera fusion, yet the approach can be easily generalized to 360-degree perception with a proper hardware setup. The camera intrinsics and the extrinsics of both sensors are obtained through offline calibration. The radar and camera operate asynchronously at 20 Hz and 10 Hz, respectively. The fields of view (FOVs) of the radar and camera are 120 degrees and 52 degrees, respectively. The camera is mounted under the windshield at 1.33 meters above the ground. The output of the camera sensor at each frame is an RGB image with a size of 1828 pixels (width) by 948 pixels (height), whereas the output of the radar sensor at each frame is a list of processed points with many attributes (conventionally referred to as radar pins). Since the radar used here performs internal clustering, each output radar pin is on the object level (yet the proposed fusion technique also applies to lower-level detections, e.g., radar locations). There are several tens of radar pins per frame, depending on the actual scene and traffic. The attributes of each radar pin are listed in Table 1. Two characteristics of the radar pins are noteworthy. First, we only consume the 2D position information in the Bird's-Eye View (BEV) without the elevation angle, due to the poor resolution and large measurement noise in the elevation dimension. Second, each radar pin corresponds either to a movable object (cars, cyclists, pedestrians, etc.) or to an interfering static structure such as a traffic sign, a street light, or a bridge.

In this study, we focus on associating the 2D bounding boxes detected from a camera image with the radar pins detected in the corresponding radar frame. With precise associations, many subsequent tasks like 3D object detection and tracking become much easier, if not trivial.
Table 1: The Features of Each Radar Pin

Feature         Explanation
object id       the id of the radar pin
obstacle prob   the probability of the existence of an obstacle detected by the radar pin
position x      the x coordinate of the detected obstacle's position in the radar frame
position y      the y coordinate of the detected obstacle's position in the radar frame
velocity x      the velocity of the detected obstacle along the x coordinate in the radar frame
velocity y      the velocity of the detected obstacle along the y coordinate in the radar frame
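For concreteness, the radar-pin attributes in Table 1 can be carried through the pipeline as a small record type. This is a hypothetical container whose field names simply mirror Table 1; it is not an interface defined in the paper.

```python
from dataclasses import dataclass

@dataclass
class RadarPin:
    """One object-level radar detection, mirroring the attributes listed in Table 1."""
    object_id: int
    obstacle_prob: float   # probability that the pin corresponds to a real obstacle
    position_x: float      # BEV x position in the radar frame
    position_y: float      # BEV y position in the radar frame
    velocity_x: float      # velocity along x in the radar frame
    velocity_y: float      # velocity along y in the radar frame
```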
The proposed method consists of a radar and camera data preprocessing step, a representation learning network, AssociationNet, and a post-processing step to extract representations and make associations. An overview of the method is shown in Fig. 2, and the details are explained in the following sections.

4.1. Radar and Camera Data Preprocessing

Temporal and spatial alignment is performed in the preprocessing stage. For each camera frame, we look for the nearest radar frame to perform data alignment. We align the nearest radar frame to the time instant of the camera frame by moving the radar pin locations forward or backward along the time axis under a constant-velocity assumption. After the temporal alignment, the radar pins are further transformed from the radar coordinate frame to the camera coordinate frame using the known extrinsics. All the attributes of the aligned radar pins are used in AssociationNet.

Each camera frame is first fed into a 2D object detection network to produce a list of 2D bounding boxes corresponding to the movable objects in the scene. The output attributes of each detected 2D bounding box are displayed in Table 2. Though the detector used in this study is an anchor-based RetinaNet [22], any 2D object detector will serve the purpose. After preprocessing, a list of temporally and spatially aligned radar pins and bounding boxes is ready for association.
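The preprocessing just described, propagating each radar pin to the camera timestamp under a constant-velocity assumption and then applying the radar-to-camera extrinsics, might look roughly like the following, reusing the hypothetical RadarPin record sketched after Table 1. The helper name align_pins, the homogeneous-transform convention, and the choice of z = 0 for BEV points are assumptions for illustration.

```python
import numpy as np

def align_pins(pins, dt, T_cam_from_radar):
    """Temporally and spatially align radar pins to a camera frame.

    pins:              list of RadarPin records (see the sketch after Table 1)
    dt:                camera timestamp minus radar timestamp, in seconds
    T_cam_from_radar:  (4, 4) homogeneous radar-to-camera extrinsic transform
    """
    aligned = []
    for p in pins:
        # Constant-velocity motion compensation to the camera time instant.
        x = p.position_x + p.velocity_x * dt
        y = p.position_y + p.velocity_y * dt
        # Radar pins carry only BEV positions, so use z = 0 before the rigid transform.
        xyz_radar = np.array([x, y, 0.0, 1.0])
        xyz_cam = T_cam_from_radar @ xyz_radar
        aligned.append((p.object_id, xyz_cam[:3]))
    return aligned
```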
… negative pairs are sampled for the loss calculation to alleviate pushing apart positive pairs by mistake. The number of sampled negative pairs is set equal to the number of positive ones at each frame.

Hence, we design an additional ordinal loss to enforce self-consistency between any two associations according to the ordinal rule.
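Read this way, the balanced sampling amounts to a simple down-sampling of negative pairs before the loss is computed. The sketch below is one possible implementation under that reading; sample_pairs and its signature are names introduced here, not the paper's API.

```python
import random

def sample_pairs(positive_pairs, negative_pairs, rng=random):
    """Keep all positive (radar pin, bounding box) pairs for a frame and randomly
    draw an equal number of negative pairs, so that noisy negative labels do not
    dominate the loss."""
    k = min(len(positive_pairs), len(negative_pairs))
    return positive_pairs, rng.sample(negative_pairs, k)
```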
Table 5: Comparison with the Rule-based Algorithm

Algorithm        Precision / Recall / F1
Rule-based       0.890 / 0.736 / 0.806
Learning-based   0.906 / 0.939 / 0.922
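As a quick sanity check on Table 5, the reported F1 values follow from the harmonic mean of precision and recall, and the 11.6% improvement quoted by the paper matches the absolute F1 gap between the two rows:

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

rule_based = f1(0.890, 0.736)       # ~0.806
learning_based = f1(0.906, 0.939)   # ~0.922
print(f"{rule_based:.3f} {learning_based:.3f} gap={learning_based - rule_based:.3f}")
# Prints: 0.806 0.922 gap=0.116, i.e., the 11.6% F1 improvement reported in the paper.
```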
5.5. Visualization

Examples of the predicted associations are shown in Fig. 6. Despite the multiple big trucks present in both examples, AssociationNet correctly predicted their associations, which demonstrates the robustness of the algorithm. On the other hand, in the second example, two bounding boxes are incorrectly associated: one bounding box has no predicted association, and the other is associated with a wrong radar pin. These two bounding boxes correspond to vehicles at very far range, and the mistakes are largely due to the small sizes of the objects in the camera image and the heavy occlusions.

6. Conclusion

In this work, we developed a scalable learning-based radar-camera fusion algorithm that does not use LiDAR for ground-truth label generation. Such a solution has many practical merits at the current technological stage, including low cost, low maintenance, high reliability, and, more importantly, readiness for mass production. We employed deep representation learning to tackle the challenging association problem, with the benefits of feature-level interaction and global reasoning. We also designed a loss sampling mechanism and a novel ordinal loss to mitigate the impact of label noise and to enforce critical human logic into the learning process. Although imperfect labels generated by a traditional rule-based algorithm were used to train the network, our proposed algorithm outperforms the rule-based teacher by 11.6% in terms of the F1 score.
References

[1] Michael Aeberhard, Stefan Schlichtharle, Nico Kaempchen, and Torsten Bertram. Track-to-track fusion with asynchronous sensors using information matrix fusion for surround environment perception. IEEE Transactions on Intelligent Transportation Systems, 13(4):1717-1726, 2012.
[2] Yaakov Bar-Shalom, Fred Daum, and Jim Huang. The probabilistic data association filter. IEEE Control Systems Magazine, 29(6):82-100, 2009.
[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
[5] Josip Ćesić, Ivan Marković, Igor Cvišić, and Ivan Petrović. Radar and stereo vision fusion for multitarget tracking on the special Euclidean group. Robotics and Autonomous Systems, 83:338-348, 2016.
[6] Simon Chadwick, Will Maddern, and Paul Newman. Distant vehicle detection using radar and vision. In 2019 International Conference on Robotics and Automation (ICRA), pages 8311-8317. IEEE, 2019.
[7] Shuo Chang, Yifan Zhang, Fan Zhang, Xiaotong Zhao, Sai Huang, Zhiyong Feng, and Zhiqing Wei. Spatial attention fusion for obstacle detection using mmWave radar and vision sensor. Sensors, 20(4):956, 2020.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597-1607. PMLR, 2020.
[9] Hyunggi Cho, Young-Woo Seo, BVK Vijaya Kumar, and Ragunathan Raj Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1836-1843. IEEE, 2014.
[10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224-236, 2018.
[11] Jose M. Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, and Javier Civera. CAM-Convs: Camera-aware multi-scale convolutions for single-view depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[12] Fernando Garcia, Pietro Cerri, Alberto Broggi, Arturo de la Escalera, and José María Armingol. Data fusion for overtaking vehicle detection based on radar and optical flow. In 2012 IEEE Intelligent Vehicles Symposium, pages 494-499. IEEE, 2012.
[13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[14] Xiao-peng Guo, Jin-song Du, Jie Gao, and Wei Wang. Pedestrian detection based on fusion of millimeter wave radar and vision. In Proceedings of the 2018 International Conference on Artificial Intelligence and Pattern Recognition, pages 38-42, 2018.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[16] Vijay John and Seiichi Mita. RVNet: Deep sensor fusion of monocular camera and radar for image-based obstacle detection in challenging environments. In Pacific-Rim Symposium on Image and Video Technology, pages 351-364. Springer, 2019.
[17] Naoki Kawasaki and Uwe Kiencke. Standard platform for sensor fusion on advanced driver assistance system using Bayesian network. In IEEE Intelligent Vehicles Symposium, 2004, pages 250-255. IEEE, 2004.
[18] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1920-1929, 2019.
[19] Dirk Langer and Todd Jochem. Fusing radar and vision for detecting, classifying and avoiding roadway obstacles. In Proceedings of the Conference on Intelligent Vehicles, pages 333-338. IEEE, 1996.
[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117-2125, 2017.
[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980-2988, 2017.
[23] Gregory P. Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K. Wellington. LaserNet: An efficient probabilistic 3D object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[24] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[25] Ramin Nabati and Hairong Qi. RRPN: Radar region proposal network for object detection in autonomous vehicles. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3093-3097. IEEE, 2019.
[26] Ramin Nabati and Hairong Qi. CenterFusion: Center-based radar and camera fusion for 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527-1536, 2021.
[27] Alejandro Newell and Jia Deng. Pixels to graphs by associative embedding. arXiv preprint arXiv:1706.07365, 2017.
[28] Felix Nobis, Maximilian Geisslinger, Markus Weber, Johannes Betz, and Markus Lienkamp. A deep learning-based radar and camera sensor fusion architecture for object detection. In 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF), pages 1-7. IEEE, 2019.
[29] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo Open Dataset, 2019.
[30] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Tom Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. arXiv preprint arXiv:2007.10323, 2020.
[31] Ziguo Zhong, Stanley Liu, Manu Mathew, and Aish Dubey. Camera radar fusion for increased reliability in ADAS applications. Electronic Imaging, 2018(17):258-1, 2018.