
Radar Camera Fusion via Representation Learning in Autonomous Driving

Xu Dong, Binnan Zhuang, Yunxiang Mao, Langechuan Liu†


XSense.ai
11010 Roselle Street, San Diego, CA 92121

Abstract

Radars and cameras are mature, cost-effective, and robust sensors that have been widely used in the perception stack of mass-produced autonomous driving systems. Due to their complementary properties, outputs from radar detection (radar pins) and camera perception (2D bounding boxes) are usually fused to generate the best perception results. The key to successful radar-camera fusion is accurate data association. The challenges in radar-camera association can be attributed to the complexity of driving scenes, the noisy and sparse nature of radar measurements, and the depth ambiguity of 2D bounding boxes. Traditional rule-based association methods are susceptible to performance degradation in challenging scenarios and to failure in corner cases. In this study, we propose to address radar-camera association via deep representation learning, in order to exploit feature-level interaction and global reasoning. Additionally, we design a loss sampling mechanism and a novel ordinal loss to overcome the difficulty of imperfect labeling and to enforce critical human-like reasoning. Despite being trained with noisy labels generated by a rule-based algorithm, our proposed method achieves an F1 score of 92.2%, which is 11.6% higher than that of the rule-based teacher. Moreover, this data-driven method also lends itself to continuous improvement via corner-case mining.

Figure 1: An illustration of the associations between radar detections (radar pins) and camera detections (2D bounding boxes). The context of the scene is shown in the top picture, with the image captured by the camera along with the detected bounding boxes and the projected radar pins (shown as numbered blue circles). The bottom picture adds red lines to highlight the association relationships between radar pins and bounding boxes. The tiny orange line in the middle denotes an uncertain association relationship, which is explained later.

1. Introduction

LiDAR, radar, and camera are the three main sensory modalities employed by the perception system of an autonomous driving vehicle. Though LiDAR-based 3D object detection is very popular in high-level autonomy, its wide adoption is still limited by several unsolved issues. First, LiDAR is prone to adverse conditions (e.g., rainy weather); second, current LiDAR systems still exhibit prohibitively high maintenance needs and cost; third, the mass production of LiDAR is not ready to meet the growing demand.

An automotive millimeter-wave radar can also provide a certain level of geometrical information, with relatively precise range and speed estimates. Moreover, as a sensor widely adopted in automobiles for decades, radar is relatively robust, low-cost, and low-maintenance. The fusion of radar and camera combines radar's geometrical information with the camera's appearance and semantic information, and it is still the mainstream perception solution in many practical autonomous driving and assisted driving systems.

† indicates corresponding author ([email protected])

Figure 2: An overview of AssociationNet. Process a illustrates how the radar pins and 2D bounding boxes are first projected into the camera image plane to produce a pseudo-image. Process b illustrates how the final pseudo-image is composed by concatenating all the features of radar pins, bounding boxes, and the original RGB camera image. The pseudo-image is then fed into a neural network to learn high-level semantic representations. Process c illustrates how the learned representation vectors for objects are finally extracted from the feature map generated in the last layer of the neural network.

Traditionally, radar-camera fusion is achieved by a combination of rule-based association algorithms and kinematic model-based tracking. The key is data association between radar and camera detections. The noisy and sparse nature of radar detections and the depth ambiguity of a mono camera make such association very challenging. Traditionally, the association process is hand-crafted, based on minimizing certain distance metrics along with some heuristic rules. It not only requires a large amount of engineering and tuning but is also hard to adapt to ever-growing data.

An emerging solution is to use learning-based methods to replace rule-based radar-camera fusion. The latest advances focus on direct 3D object detection with the combined radar and camera data as the input [16, 25, 28]. These approaches all rely on LiDAR-based ground truth to build the link between radar and camera. This is feasible on most public datasets such as nuScenes [4] and Waymo [29]. However, it cannot be applied to a large fleet of commercial autonomous vehicles, which are often equipped with only radars and cameras. In this study, we propose a scalable learning-based framework to associate radar and camera information without costly LiDAR-based ground truth.

Our goal is to find representations of radar and camera detection results such that matched pairs are close and unmatched ones are far apart. We convert the detection results into image channels and combine them with the original image to feed into a convolutional neural network (CNN), namely AssociationNet. Training is performed with imperfect labels obtained from a traditional rule-based association method. A loss sampling mechanism is introduced to mitigate false labels. To further boost the performance, we guide the reasoning logic of AssociationNet by adding a novel ordinal loss. The proposed AssociationNet significantly outperforms the rule-based method through scene-dependent global reasoning.

Our main contributions are summarized as follows:

• We proposed a scalable learning-based radar-camera fusion framework that does not require ground-truth labels from LiDAR, making it suitable for building a low-cost, production-ready perception system for autonomous driving applications.

• We designed a loss sampling mechanism to alleviate the impact of label noise, and also introduced an ordinal loss to enforce critical association logic in the model for performance enhancement.

• We developed a robust model via representation learning, which is capable of handling various challenging scenarios and outperforms the traditional rule-based algorithm by 11.6% in terms of the F1 score.
2. Related Work

2.1. Sensor Fusion

Traditionally, different sensory modules process their data separately. A downstream sensor fusion module aggregates the sensory outputs (typically detected objects) to form a more comprehensive understanding of the surroundings. Such object-level fusion is the mainstream approach [9, 17, 19, 12, 31] and is still widely used in many Advanced Driver Assistance Systems (ADAS). In object-level fusion, object detection is performed independently on each sensor, and the fusion algorithm combines the detection results to create so-called global tracks for kinematic tracking [1].

Data association is the most critical and challenging task in object-level fusion. A precise association readily leads to 3D object detection and multiple-object tracking solutions [1, 5]. Traditional approaches tend to manually craft various distance metrics to represent the similarities between different sensory outputs. Distance minimization [9] and other heuristic rules are then applied to find the associations. To handle the complexity and uncertainty, probabilistic models are also sometimes adopted in the association process [2].

2.2. Learning-Based Radar-Camera Fusion

Learning-based radar-camera fusion algorithms can be primarily categorized into three groups: data-level fusion, feature-level fusion, and object-level fusion. Data-level fusion and feature-level fusion combine the radar and camera information at the early stage [28, 14] and the middle stage [16, 7, 26], respectively, but both directly perform 3D object detection. Hence, they rely on LiDAR to provide ground-truth labels during training, which prohibits their use on autonomous vehicles without LiDAR.

Learning-based object-level fusion remains under-explored due to the limited information contained in the detection results. Our proposed method belongs to this category in that we focus on associating radar and camera detection results; thus, it is more compatible with the traditional sensor fusion pipeline. On the other hand, our method also directly takes the raw camera image as input for further performance enhancement.

2.3. CNN for Heterogeneous Data

The tremendous success of CNNs on structured image data has inspired their application to many other types of heterogeneous data, such as sensor parameters, point clouds, and association relationships between two groups of data [27]. To be compatible with CNNs, a popular approach is to adapt the heterogeneous data into the form of pseudo-images. Examples include encoding camera intrinsics into images with normalized coordinates and field-of-view maps [11], projecting radar data into the image plane to form new image channels [6, 28], and the various forms of projection-based LiDAR point-cloud representations [30, 23]. We adopt a similar approach in this study to handle the heterogeneous radar and camera outputs.

2.4. Representation Learning

Representation learning has been considered key to understanding complex environments and problems [3, 20, 18]. It has been widely used in many natural language processing tasks, such as word embedding [24], and in many computer vision tasks, such as image classification [8], object detection [13], and keypoint matching [10]. In this study, we aim to learn a vector in a high-dimensional feature space as the representation of each object in the scene, in order to establish interactions between objects and enable global reasoning about the scene.
3. Problem Formulation

We use a front-facing camera and a front-facing millimeter-wave mid-range radar for the proposed radar-camera fusion, yet the approach can be easily generalized to 360-degree perception with proper hardware setups. The camera intrinsics and the extrinsics of both sensors are obtained through offline calibration. The radar and camera operate asynchronously at 20 Hz and 10 Hz, respectively. The fields of view (FOVs) of the radar and camera are 120 degrees and 52 degrees, respectively. The camera is mounted under the windshield at 1.33 meters above the ground. The output of the camera sensor at each frame is an RGB image with a size of 1828 pixels (width) by 948 pixels (height), whereas the output of the radar sensor at each frame is a list of processed points with many attributes (conventionally referred to as radar pins). Since the radar used here performs internal clustering, each output radar pin is at the object level (yet the proposed fusion technique also applies to lower-level detections, e.g., radar locations). There are several tens of radar pins per frame, depending on the actual scene and traffic. The attributes of each radar pin are listed in Table 1. There are two noteworthy characteristics of the radar pins. First, we only consume the 2D position information in the Bird's-Eye View (BEV) without the elevation angle, due to poor resolution and large measurement noise in the elevation dimension. Second, each radar pin corresponds either to a movable object (cars, cyclists, pedestrians, etc.) or to an interfering static structure such as a traffic sign, a street light, or a bridge.

In this study, we focus on associating 2D bounding boxes detected from a camera image to radar pins detected in the corresponding radar frame. With precise associations, many subsequent tasks like 3D object detection and tracking become much easier, if not trivial.

Table 1: The Features of Each Radar Pin

  Feature         Explanation
  object id       the id of the radar pin
  obstacle prob   the probability that the obstacle detected by the radar pin actually exists
  position x      the x coordinate of the position of the detected obstacle in the radar frame
  position y      the y coordinate of the position of the detected obstacle in the radar frame
  velocity x      the velocity of the detected obstacle along the x coordinate in the radar frame
  velocity y      the velocity of the detected obstacle along the y coordinate in the radar frame

Table 2: The Features of Each 2D Bounding Box

  Feature         Explanation
  center x        the x coordinate, in the image plane, of the center of the bounding box
  center y        the y coordinate, in the image plane, of the center of the bounding box
  height          the height of the bounding box in the image plane
  width           the width of the bounding box in the image plane
  category        the category of the detected moving object, including sedan, suv, truck, bus, bicycle, tricycle, motorcycle, person, and unknown
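For concreteness, the two detection formats in Tables 1 and 2 can be thought of as the plain data containers sketched below; the class and field names are our own illustrative choices rather than identifiers from the authors' code.

```python
from dataclasses import dataclass

@dataclass
class RadarPin:
    """One object-level radar detection (Table 1); positions and velocities in the radar frame."""
    object_id: int
    obstacle_prob: float   # probability that the pin corresponds to a real obstacle
    position_x: float      # meters
    position_y: float      # meters
    velocity_x: float      # meters/second
    velocity_y: float      # meters/second

@dataclass
class BoundingBox2D:
    """One camera 2D detection (Table 2); coordinates in pixels on the image plane."""
    center_x: float
    center_y: float
    height: float
    width: float
    category: str          # e.g. "sedan", "truck", "person", "unknown"
```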
4. Methods

Our proposed method mainly consists of a preprocessing step to align the radar and camera data, a CNN-based deep representation learning network, AssociationNet, and a post-processing step to extract representations and make associations. An overview of the method is shown in Fig. 2, and the details are explained in the following sections.

4.1. Radar and Camera Data Preprocessing

Temporal and spatial alignment is performed in the preprocessing stage. For each camera frame, we look for the nearest radar frame to perform data alignment. We align the nearest radar frame to the time instant of the camera frame by moving the radar pin locations forward or backward along the time axis under a constant-velocity assumption. After the temporal alignment, the radar pins are further transformed from the radar coordinate frame to the camera coordinate frame using the known extrinsics. All the attributes of the aligned radar pins are used in AssociationNet.
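A minimal sketch of this alignment step is shown below. It assumes the radar-to-camera extrinsics are given as a rotation matrix and a translation vector, and that the missing elevation is simply set to zero; the function name and array layout are illustrative, not taken from the paper.

```python
import numpy as np

def align_radar_to_camera(pins_xy, pins_vxy, dt, R_radar_to_cam, t_radar_to_cam):
    """Temporally and spatially align radar pins to a camera frame.

    pins_xy:  (N, 2) BEV positions in the radar frame [m]
    pins_vxy: (N, 2) BEV velocities in the radar frame [m/s]
    dt:       camera timestamp minus radar timestamp [s]
    R_radar_to_cam, t_radar_to_cam: extrinsics from offline calibration
    """
    # Temporal alignment: propagate each pin to the camera time instant
    # under a constant-velocity assumption.
    xy_at_cam_time = pins_xy + pins_vxy * dt

    # Spatial alignment: lift the BEV points to 3D (no reliable elevation,
    # so z is set to 0) and apply the rigid radar-to-camera transform.
    xyz = np.concatenate([xy_at_cam_time, np.zeros((len(xy_at_cam_time), 1))], axis=1)
    xyz_cam = xyz @ R_radar_to_cam.T + t_radar_to_cam

    # Velocities are rotated only (a translation does not apply to vectors).
    vxyz = np.concatenate([pins_vxy, np.zeros((len(pins_vxy), 1))], axis=1)
    vxyz_cam = vxyz @ R_radar_to_cam.T
    return xyz_cam, vxyz_cam
```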
Each camera frame is first fed into a 2D object detection network to produce a list of 2D bounding boxes corresponding to the movable objects in the scene. The output attributes of each detected 2D bounding box are listed in Table 2. Though the network used in this study is an anchor-based RetinaNet [22], any 2D object detector will serve the purpose. After preprocessing, a list of temporally and spatially aligned radar pins and bounding boxes is ready for association.

4.2. Deep Association by Representation Learning

We employ AssociationNet to learn a semantic representation (or a descriptor) of each radar pin and each bounding box. Under such a representation, a matched pair of radar pin and bounding box will "look" similar, in the sense that the distance between their learned representations is small. An overview of the general process is shown in Fig. 2.

To leverage the powerful CNN architecture, we project each radar pin and 2D bounding box into the image plane to generate a pseudo-image, with each attribute occupying an independent channel. Specifically, each bounding box is assigned to the pixel location of its center. Each radar pin is assigned to the pixel location obtained by projecting its 3D location into the image plane using the camera intrinsics. The process is illustrated in Process a of Fig. 2. Next, we concatenate the raw RGB camera image with the corresponding pseudo-image to incorporate the rich pixel-level information. AssociationNet is then applied to perform representation learning.
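The following sketch illustrates one way to rasterize the detections into such a pseudo-image, assuming a standard pinhole projection with the intrinsic matrix K; the exact value encoding (raw attribute values written at a single pixel plus a binary location heatmap) is our assumption based on the channel list given in the next paragraph.

```python
import numpy as np

def build_pseudo_image(rgb, pins_xyz_cam, pin_attrs, boxes, K):
    """Scatter radar-pin and bounding-box attributes into image-aligned channels.

    rgb:          (H, W, 3) camera image, float in [0, 1]
    pins_xyz_cam: (N, 3) pin positions in the camera frame
    pin_attrs:    (N, 6) per-pin attributes (id, prob, x, y, vx, vy)
    boxes:        (M, 5) per-box attributes (cx, cy, h, w, category_id)
    K:            (3, 3) camera intrinsic matrix
    """
    H, W, _ = rgb.shape
    radar_ch = np.zeros((7, H, W), dtype=np.float32)  # 6 attributes + heatmap
    box_ch = np.zeros((4, H, W), dtype=np.float32)    # h, w, category + heatmap

    # Project each radar pin with the pinhole model.
    for attrs, xyz in zip(pin_attrs, pins_xyz_cam):
        x, y, z = K @ xyz
        if z <= 0:
            continue  # behind the camera
        px, py = int(round(x / z)), int(round(y / z))
        if 0 <= px < W and 0 <= py < H:
            radar_ch[:6, py, px] = attrs
            radar_ch[6, py, px] = 1.0  # location heatmap

    # Place each bounding box at its center pixel.
    for cx, cy, h, w, cat in boxes:
        px, py = int(round(cx)), int(round(cy))
        if 0 <= px < W and 0 <= py < H:
            box_ch[:, py, px] = (h, w, cat, 1.0)

    # Channels-first pseudo-image: 3 RGB + 7 radar + 4 box = 14 channels.
    return np.concatenate([rgb.transpose(2, 0, 1), radar_ch, box_ch], axis=0)
```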
any positive sample and maximize the distance between the
representation vectors of any negative sample. Based on The association labels used for supervising the learning pro-
such logic, we design loss functions according to the asso- cess are ultimately from the traditional rule-based method,
ciation ground-truth labels. We pull together the represen- and hence are far from 100% accurate. To mitigate the im-
tation vectors of positive samples with the following pull pact of the inaccurate labels, we first purify the labels by ap-
loss: plying some simple filters to remove low-confidence asso-
ciations, which increases the precision in the remaining as-
1 X
Lpull = max(0, khi1 − hi2 k − m1 ) (1) sociation labels at the cost of the undermined recall. During
npos the push loss calculation in the training of AssociationNet,
(i1 ,i2 )∈POS
instead of exhausting all negative pairs (a pair of a radar pin
1 Positions and velocities used here are under camera coordinate, as it and a bounding box that is not present in the association la-
is after the spatial alignment step in the preprocessing. bels), we only sample a fraction of those to be used for push
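Extracting the per-object representation vectors then reduces to indexing the output feature map at each object's pixel location, as in the PyTorch sketch below. The paper's output feature map has 128 channels while each descriptor is 64-dimensional; how the channels are split between radar pins and bounding boxes is not detailed, so this sketch simply returns the full vector at each location.

```python
import torch

def extract_representations(feature_map, pixel_locations):
    """Gather per-object representation vectors from the network output.

    feature_map:     (C, H, W) output of the last layer (C = 128 in the paper)
    pixel_locations: (K, 2) integer (row, col) location of each radar pin or box
    returns:         (K, C) one representation vector per object
    """
    rows = pixel_locations[:, 0]
    cols = pixel_locations[:, 1]
    # Advanced indexing picks one column of features per object, giving (C, K);
    # transpose to one row per object.
    return feature_map[:, rows, cols].t()
```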
The input pseudo-image contains seven radar pin channels, four bounding box channels, and three raw camera image RGB channels. The radar pin channels include object-id, obstacle-prob, position-x, position-y, velocity-x, velocity-y¹, and a heatmap to indicate the projected pixel location. The bounding box channels include height, width, category, and also a heatmap to indicate the pixel location. The output feature map contains 128 channels, resulting in a representation vector of dimension 64 for each radar pin and each bounding box.

¹ Positions and velocities used here are in the camera coordinate frame, since this follows the spatial alignment step in the preprocessing.

The obtained representation vector captures the semantic meaning of each radar pin and each bounding box in a high-dimensional space. If a radar pin and a bounding box come from the same object in the real world, we treat the pair as a positive sample; otherwise, it is considered a negative sample. We try to minimize the distance between the representation vectors of any positive sample and maximize the distance between the representation vectors of any negative sample. Based on this logic, we design loss functions according to the association ground-truth labels. We pull together the representation vectors of positive samples with the following pull loss:

  L_{pull} = \frac{1}{n_{pos}} \sum_{(i_1, i_2) \in POS} \max(0, \|h_{i_1} - h_{i_2}\| - m_1)    (1)

And we push apart the representation vectors of negative samples with the following push loss:

  L_{push} = \frac{1}{n_{neg}} \sum_{(i_1, i_2) \in NEG} \max(0, m_2 - \|h_{i_1} - h_{i_2}\|)    (2)

Here POS and NEG are the sets of positive and negative samples in each frame, respectively; n_{pos} and n_{neg} are the numbers of associations in POS and NEG, respectively; (i_1, i_2) denotes an association pair consisting of radar pin i_1 and bounding box i_2; h_{i_1} and h_{i_2} denote their learned representation vectors; and m_1 and m_2 are the thresholds on the desired distances between representations of positive and of negative associations, which were preset to 2.0 and 8.0 in our experiments.
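A PyTorch sketch of the two margin losses in Eqs. (1) and (2) is given below, assuming the representation vectors have already been gathered at the object locations; the margins default to the values quoted above (m_1 = 2.0, m_2 = 8.0).

```python
import torch

def pull_push_losses(h_pin, h_box, pos_pairs, neg_pairs, m1=2.0, m2=8.0):
    """Compute Eq. (1) and Eq. (2) for one frame.

    h_pin:     (P, D) representation vectors of the radar pins
    h_box:     (B, D) representation vectors of the bounding boxes
    pos_pairs: (n_pos, 2) index pairs (pin_idx, box_idx) of labeled positives
    neg_pairs: (n_neg, 2) index pairs of (sampled) negatives
    """
    def pair_dist(pairs):
        return torch.norm(h_pin[pairs[:, 0]] - h_box[pairs[:, 1]], dim=1)

    # Pull loss: positive pairs already closer than margin m1 incur no penalty.
    l_pull = torch.clamp(pair_dist(pos_pairs) - m1, min=0).mean()
    # Push loss: negative pairs already farther than margin m2 incur no penalty.
    l_push = torch.clamp(m2 - pair_dist(neg_pairs), min=0).mean()
    return l_pull, l_push
```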

During inference, we calculate the Euclidean distance between the representation vectors of all possible radar-pin and bounding-box pairs. If the distance falls below a certain threshold, the radar pin and the bounding box are considered a successful association. More details of the inference process are explained later.

4.2.1 Loss Sampling

The association labels used for supervising the learning process ultimately come from the traditional rule-based method, and hence are far from 100% accurate. To mitigate the impact of the inaccurate labels, we first purify the labels by applying some simple filters to remove low-confidence associations, which increases the precision of the remaining association labels at the cost of reduced recall. During the push loss calculation in the training of AssociationNet, instead of exhausting all negative pairs (a pair of a radar pin and a bounding box that is not present in the association labels), we only sample a fraction of them for the push loss calculation, to alleviate pushing apart positive pairs by mistake. The number of sampled negative pairs is set to be equal to the number of positive ones in each frame.
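The sampling step itself can be as simple as the sketch below, which draws negative (radar pin, bounding box) pairs uniformly from those absent in the labels; drawing one negative per positive corresponds to the 1:1 setting reported as best in Table 3. The helper name and the use of NumPy are our own.

```python
import numpy as np

def sample_negative_pairs(num_pins, num_boxes, pos_pairs, neg_per_pos=1.0, rng=None):
    """Sample negative (pin, box) pairs for the push loss.

    pos_pairs:   set of (pin_idx, box_idx) tuples present in the association labels
    neg_per_pos: number of sampled negatives per positive (1.0 in the best setting)
    """
    rng = rng or np.random.default_rng()
    all_pairs = [(p, b) for p in range(num_pins) for b in range(num_boxes)]
    negatives = [pair for pair in all_pairs if pair not in pos_pairs]
    n_sample = min(len(negatives), int(round(neg_per_pos * len(pos_pairs))))
    chosen = rng.choice(len(negatives), size=n_sample, replace=False)
    return [negatives[i] for i in chosen]
```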
4.2.2 Ordinal Loss

One particular kind of error made by AssociationNet is that it can violate a simple ordinal rule: given two pairs of associated radar pins and bounding boxes, it may associate the farther radar pin with the closer bounding box. To address this issue, an ordinal loss is introduced.

Denote the y coordinate of bounding box i's bottom edge as y_{max}^i and the bounding box's depth in the 3D world as d_b^i (which is supposedly the same as the depth of the associated radar pin, d_r^i). For any two bounding boxes in the same image we have the property:

  y_{max}^i > y_{max}^j \iff d_b^i > d_b^j    (3)

That is, the depth ordering of the objects in the 3D world can be read off from the relative vertical ordering of the bottom edges of the corresponding bounding boxes.

Hence, we design an additional ordinal loss to enforce self-consistency between any two associations according to this ordinal rule:

  L_{ord} = \frac{2}{\hat{n}_{pos}(\hat{n}_{pos} - 1)} \sum_{i, j \in \widehat{POS}} \sigma\left(-(d_r^i - d_r^j)(y_{max}^i - y_{max}^j)\right)    (4)

where \widehat{POS} denotes the set of predicted positive associations and \hat{n}_{pos} is the size of that set; i and j are two associations in \widehat{POS}; d_r^* denotes the depth of the radar pin in an association, and y_{max}^* denotes the y coordinate of the bounding box's bottom edge in an association; and \sigma is the sigmoid function, used to smooth the loss values.

Finally, the total loss is calculated as

  L_{tot} = L_{pull} + L_{push} + w_{ord} \cdot L_{ord},    (5)

where w_{ord} is an adjustable weight to balance the losses.
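A PyTorch sketch of Eq. (4) over the predicted positive associations of one frame follows; reading the double sum as running over unordered pairs, the 2/(\hat{n}_{pos}(\hat{n}_{pos} - 1)) factor is simply a mean over those pairs, which is how the sketch normalizes.

```python
import torch

def ordinal_loss(pred_pairs, pin_depth, box_y_max):
    """Ordinal loss of Eq. (4) over the predicted positive associations.

    pred_pairs: (n, 2) index pairs (pin_idx, box_idx) of predicted associations
    pin_depth:  (P,) depth of each radar pin, d_r
    box_y_max:  (B,) y coordinate of each bounding box's bottom edge, y_max
    """
    d = pin_depth[pred_pairs[:, 0]]   # depths of the associated pins
    y = box_y_max[pred_pairs[:, 1]]   # bottom-edge y of the associated boxes
    n = d.shape[0]
    if n < 2:
        return d.new_zeros(())

    # Pairwise consistency term for every unordered pair (i, j), i < j.
    idx_i, idx_j = torch.triu_indices(n, n, offset=1)
    penalty = torch.sigmoid(-(d[idx_i] - d[idx_j]) * (y[idx_i] - y[idx_j]))
    # Averaging over the n(n-1)/2 pairs matches the 2/(n(n-1)) factor in Eq. (4).
    return penalty.mean()
```

The total loss of Eq. (5) then combines this term with the pull and push losses; w_{ord} = 2.0 gives the best result in Table 4.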
Figure 5: An illustration of radar pins, bounding boxes, and their association relationships from the BEV perspective. This BEV image corresponds to the same scene as displayed in Fig. 1. Each grid cell in the image represents a 10-meter-by-10-meter square in physical space. The bounding boxes are represented as solid circles in this image. The location of a bounding box is estimated by the Inverse Projective Mapping (IPM) method from the bounding box's center, to provide a rough reference for its real 3D location. A truncated frustum accompanying each bounding box is also plotted, to better assist human curators in determining the association relationships.²

² The frustum is also calculated by the IPM method, from the two side edges of each bounding box. According to projective geometry, the real object detected by a bounding box has to lie within the bounding box's frustum, and hence so do the possibly matched radar pins. We truncate the frustums for ease of visualization. The widths of each frustum at the truncated positions are set to one meter and five meters, respectively. As the physical width of a vehicle is most likely to fall within this range, the possibly matched radar pins also tend to lie within the truncated frustum.

4.3. Training and Inference

AssociationNet was trained with a batch size of 48 frames on four NVIDIA GeForce RTX 2080 Ti GPUs. The SGD optimizer was used for a total of 10K iterations. The learning rate was set to 10^-4 initially and was then decreased by a factor of 10 at the end of 8K iterations and again at 9K iterations.

At inference time, the representation vectors of all radar pins and bounding boxes are first predicted using the trained model. An affinity matrix is then calculated, where each matrix element corresponds to the distance between the representations of a radar pin and a bounding box. In reality, each bounding box may be associated with multiple radar pins (this is usually the case when the corresponding vehicle is of large size, such as a trailer truck or a bus), while each radar pin can match at most one bounding box. As a result, we associate each radar pin to the bounding box with the smallest distance in the affinity matrix. Lastly, improbable associations with a distance larger than a threshold, which usually involve radar pins from interfering static objects, are filtered out. The whole inference process is summarized in Fig. 4.

Figure 4: An overview of the process of obtaining the final associations from the learned representation vectors.
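A sketch of this matching step is given below; the distance threshold is left as a free parameter, since its value is not quoted in the paper.

```python
import numpy as np

def associate(h_pin, h_box, dist_threshold):
    """Greedy radar-pin-to-box assignment from learned representations.

    h_pin: (P, D) radar pin representations
    h_box: (B, D) bounding box representations
    returns a list of (pin_idx, box_idx) associations
    """
    # Affinity matrix of pairwise Euclidean distances, shape (P, B).
    diff = h_pin[:, None, :] - h_box[None, :, :]
    affinity = np.linalg.norm(diff, axis=-1)

    matches = []
    for pin_idx in range(affinity.shape[0]):
        box_idx = int(np.argmin(affinity[pin_idx]))
        # Each pin matches at most one box; several pins may share a box.
        if affinity[pin_idx, box_idx] <= dist_threshold:
            matches.append((pin_idx, box_idx))
    return matches
```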
4.4. Evaluation

The predicted associations are compared against human-annotated ground-truth associations on the test dataset. We use precision, recall, and F1 score as the metrics for evaluating the performance.

In some very complicated scenes, correctly associating all radar pins and bounding boxes is very challenging even for human annotators. Therefore, we mark such plausible but dubious associations as "uncertain" in the evaluation process. An example is shown in Fig. 5. "Uncertain" associations are counted as neither positive nor negative, and they are excluded from both the true-positive and the false-positive counts.
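The metric computation with the exclusion of uncertain associations can be expressed with simple set bookkeeping, as sketched below; the set-based interface is our own framing of the rule described above.

```python
def association_metrics(predicted, ground_truth, uncertain):
    """Precision, recall, and F1 with 'uncertain' associations excluded.

    predicted, ground_truth, uncertain: sets of (pin_idx, box_idx) pairs
    """
    # Predictions that hit an uncertain label count neither for nor against.
    scored_preds = predicted - uncertain
    tp = len(scored_preds & ground_truth)
    fp = len(scored_preds - ground_truth)
    fn = len(ground_truth - predicted)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```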
5. Experiments and Discussion

5.1. Dataset

AssociationNet was trained and evaluated on an in-house dataset with 12 driving sequences collected by a testing fleet, consisting of 14.8 hours of driving in various scenarios, including highway, urban, and city roads. The radar and camera were synchronized at 10 Hz initially and further downsampled to 2 Hz in order to reduce the temporal correlation among adjacent frames. Eleven of the twelve sequences were used for training, with the remaining one held out for testing. This yields 104,314 synchronized radar and camera frames in the training dataset and 2,714 in the test dataset. For the training data, the association labels were generated by a traditional rule-based algorithm with additional filtering to increase precision. For the test data, we manually curated the labels with human annotators to obtain high-quality ground truth.

5.2. Effect of Loss Sampling

We studied the effect of loss sampling on AssociationNet's performance. Experiments were conducted with no sampling (meaning that all the negative pairs present in the labels are used for the push loss calculation) and with loss sampling at different sampling ratios. The sampling ratio is defined as the ratio between the number of positive pairs and the number of sampled negative pairs in each frame. The results are shown in Table 3. The best sampling ratio is 1:1, where the loss sampling mechanism boosts the performance by 1.1% in terms of the F1 score.

Table 3: The Effect of Loss Sampling

  Sample Ratio    Precision / Recall / F1
  no sampling     0.896 / 0.925 / 0.911
  1:2             0.901 / 0.931 / 0.916
  1:1             0.906 / 0.939 / 0.922
  2:1             0.899 / 0.933 / 0.915
  3:1             0.899 / 0.929 / 0.914

5.3. Effect of Ordinal Loss

The effect of the ordinal loss is shown in Table 4. The ordinal loss improves both precision and recall to some degree. With the optimal loss weight, the performance is boosted by 1.8% in terms of the F1 score.

Table 4: The Effect of Ordinal Loss

  Loss Weight w_ord    Precision / Recall / F1
  0.0                  0.897 / 0.912 / 0.904
  0.5                  0.897 / 0.923 / 0.910
  1.0                  0.899 / 0.931 / 0.915
  2.0                  0.906 / 0.939 / 0.922
  5.0                  0.889 / 0.918 / 0.903

5.4. Comparison with Rule-Based Algorithm

We compared the performance of AssociationNet with the traditional rule-based algorithm, as shown in Table 5. Notably, even though the rule-based algorithm was used to generate the association labels that supervised the training of AssociationNet, AssociationNet significantly outperforms the rule-based alternative. This demonstrates the inherent robustness of learning-based algorithms in handling complex scenarios.

Table 5: Comparison with Rule-based Algorithm

  Algorithm        Precision / Recall / F1
  Rule-based       0.890 / 0.736 / 0.806
  Learning-based   0.906 / 0.939 / 0.922

5.5. Visualization

Figure 6: Examples of AssociationNet predictions. Here, the red solid lines represent the true-positive associations, and the pink solid lines represent predicted positive associations that are labeled as uncertain in the ground truth. In the second example, the added green lines represent the false-positive predictions, and the added black lines represent the false-negative predictions. Also note that each bounding box on the left corresponds to a solid circle of the same color on the right.

Examples of the predicted associations are shown in Fig. 6. Despite multiple big trucks being present in both examples, AssociationNet correctly predicted their associations, which demonstrates the robustness of the algorithm. On the other hand, in the second example, two bounding boxes are incorrectly associated: one has no predicted association and the other is associated to a wrong radar pin. Both correspond to vehicles at very far range, and the mistakes are largely due to the small sizes of the objects in the camera image and the heavy occlusions.

6. Conclusion

In this work, we developed a scalable learning-based radar-camera fusion algorithm that does not rely on LiDAR for ground-truth label generation. Such a solution has many practical merits at the current technological stage, including low cost, low maintenance, high reliability, and, more importantly, readiness for mass production. We employed deep representation learning to tackle the challenging association problem, with the benefits of feature-level interaction and global reasoning. We also designed a loss sampling mechanism and a novel ordinal loss to mitigate the impact of label noise and to enforce critical human logic in the learning process. Although imperfect labels generated by a traditional rule-based algorithm were used to train the network, our proposed algorithm outperforms the rule-based teacher by 11.6% in terms of the F1 score.
References

[1] Michael Aeberhard, Stefan Schlichtharle, Nico Kaempchen, and Torsten Bertram. Track-to-track fusion with asynchronous sensors using information matrix fusion for surround environment perception. IEEE Transactions on Intelligent Transportation Systems, 13(4):1717–1726, 2012.

[2] Yaakov Bar-Shalom, Fred Daum, and Jim Huang. The probabilistic data association filter. IEEE Control Systems Magazine, 29(6):82–100, 2009.

[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.

[5] Josip Ćesić, Ivan Marković, Igor Cvišić, and Ivan Petrović. Radar and stereo vision fusion for multitarget tracking on the special euclidean group. Robotics and Autonomous Systems, 83:338–348, 2016.

[6] Simon Chadwick, Will Maddern, and Paul Newman. Distant vehicle detection using radar and vision. In 2019 International Conference on Robotics and Automation (ICRA), pages 8311–8317. IEEE, 2019.

[7] Shuo Chang, Yifan Zhang, Fan Zhang, Xiaotong Zhao, Sai Huang, Zhiyong Feng, and Zhiqing Wei. Spatial attention fusion for obstacle detection using mmwave radar and vision sensor. Sensors, 20(4):956, 2020.

[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

[9] Hyunggi Cho, Young-Woo Seo, BVK Vijaya Kumar, and Ragunathan Raj Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1836–1843. IEEE, 2014.

[10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.

[11] Jose M. Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, and Javier Civera. CAM-Convs: Camera-aware multi-scale convolutions for single-view depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[12] Fernando Garcia, Pietro Cerri, Alberto Broggi, Arturo de la Escalera, and José María Armingol. Data fusion for overtaking vehicle detection based on radar and optical flow. In 2012 IEEE Intelligent Vehicles Symposium, pages 494–499. IEEE, 2012.

[13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[14] Xiao-peng Guo, Jin-song Du, Jie Gao, and Wei Wang. Pedestrian detection based on fusion of millimeter wave radar and vision. In Proceedings of the 2018 International Conference on Artificial Intelligence and Pattern Recognition, pages 38–42, 2018.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[16] Vijay John and Seiichi Mita. RVNet: Deep sensor fusion of monocular camera and radar for image-based obstacle detection in challenging environments. In Pacific-Rim Symposium on Image and Video Technology, pages 351–364. Springer, 2019.

[17] Naoki Kawasaki and Uwe Kiencke. Standard platform for sensor fusion on advanced driver assistance system using bayesian network. In IEEE Intelligent Vehicles Symposium, 2004, pages 250–255. IEEE, 2004.

[18] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1920–1929, 2019.

[19] Dirk Langer and Todd Jochem. Fusing radar and vision for detecting, classifying and avoiding roadway obstacles. In Proceedings of Conference on Intelligent Vehicles, pages 333–338. IEEE, 1996.

[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[23] Gregory P. Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K. Wellington. LaserNet: An efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[24] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[25] Ramin Nabati and Hairong Qi. RRPN: Radar region proposal network for object detection in autonomous vehicles. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3093–3097. IEEE, 2019.

[26] Ramin Nabati and Hairong Qi. CenterFusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2021.

[27] Alejandro Newell and Jia Deng. Pixels to graphs by associative embedding. arXiv preprint arXiv:1706.07365, 2017.

[28] Felix Nobis, Maximilian Geisslinger, Markus Weber, Johannes Betz, and Markus Lienkamp. A deep learning-based radar and camera sensor fusion architecture for object detection. In 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF), pages 1–7. IEEE, 2019.

[29] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset, 2019.

[30] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Tom Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. arXiv preprint arXiv:2007.10323, 2020.

[31] Ziguo Zhong, Stanley Liu, Manu Mathew, and Aish Dubey. Camera radar fusion for increased reliability in ADAS applications. Electronic Imaging, 2018(17):258-1, 2018.
