AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles

ABSTRACT
Traffic cameras are among the most ubiquitous traffic facilities, providing high coverage of complex, accident-prone road sections such as intersections. This work leverages traffic cameras to improve the perception and localization performance of autonomous vehicles at intersections. In particular, vehicles can expand their range of perception by matching the images captured by traffic cameras and on-vehicle cameras. Moreover, a traffic camera can match its images to an existing high-definition map (HD map) to derive centimeter-level locations of the vehicles in its field of view. To this end, we propose AutoMatch, a novel system for real-time image registration, which is a key enabling technology for traffic camera-assisted perception and localization of autonomous vehicles. Our key idea is to leverage landmark keypoints of distinctive structures, such as ground signs at intersections, to facilitate image registration between traffic cameras and HD maps or vehicles. By exploiting the strong structural characteristics of ground signs, AutoMatch can extract very few but precise landmark keypoints for registration, which effectively reduces the communication/compute overhead. We implement AutoMatch on a testbed consisting of a self-built autonomous car, drones for surveying and mapping, and real traffic cameras. In addition, we collect two new multi-view traffic image datasets at intersections, which contain images from 220 real operational traffic cameras in 22 cities. Experimental results show that AutoMatch achieves pixel-level image registration accuracy within 88 milliseconds, and delivers an 11.7× improvement in accuracy, 1.4× speedup in compute time, and 17.1× data transmission saving over existing approaches.

CCS CONCEPTS
• Computer systems organization → Sensor networks.

KEYWORDS
Image registration, Vehicle-infrastructure cooperative sensing, Infrastructure-assisted autonomous driving, Edge computing

ACM Reference Format:
Yuze He, Li Ma, Jiahe Cui, Zhenyu Yan, Guoliang Xing, Sen Wang, Qintao Hu, and Chen Pan. 2022. AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles. In The 20th ACM Conference on Embedded Networked Sensor Systems (SenSys '22), November 6–9, 2022, Boston, MA, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3560905.3568519

1 INTRODUCTION
In this work, we leverage traffic cameras to assist two fundamental applications of autonomous driving (see Fig. 1): 1) Boosting vehicle perception. Vehicles will be able to see beyond obstacle occlusions and expand their range of perception by taking advantage of traffic cameras, which are typically mounted a few meters above the ground and hence provide a much wider and almost unobscured field of view. Specifically, by matching the images captured by a traffic camera and by itself, an autonomous vehicle can complement and expand its field of view and improve its situational awareness. 2) High-precision vehicle localization. A traffic camera can match its images to existing high-definition global maps (HD maps) to derive the centimeter-level locations of the vehicles in its view. This process can be implemented by the infrastructure or the cloud, and hence significantly lowers the requirements on the vehicle's compute/localization capabilities. These two applications provide autonomous vehicles with boosted perception and high-precision localization, which greatly improves the accuracy and reliability of vehicles' downstream tasks in complex intersection environments, including path planning, decision making, and vehicle control. In this work, we focus on leveraging traffic cameras at intersections for three reasons. First, intersections are more accident-prone than other road sections. Second, intersections have highly complex structures, introducing unique challenges for autonomous driving. Third, to date, most traffic cameras are installed around intersections [16, 51].

The key technology that enables both of the above applications is real-time, high-precision image registration, which refers to the process of finding the homography between two image coordinate systems. In the above two applications, images from traffic cameras are registered with those from either vehicles or HD maps. Through
We discuss the collection of two datasets and the system implementation in Sections 5 and 6, respectively. Section 7 shows the experimental results, and Section 8 concludes the paper.

2 RELATED WORK
Image Registration. Image registration aims to find correspondences between two images and has many applications in autonomous driving, such as camera calibration [26], Simultaneous Localization and Mapping (SLAM) [46], and Structure from Motion (SfM) [27, 55]. A typical registration pipeline consists of three stages: keypoint detection, description generation, and keypoint matching. Both classical [6, 25, 35, 39, 42] and learning-based [53, 69] methods detect points of interest throughout a whole image. However, they are not applicable to complex and diverse intersection scenarios, since there is no guarantee that meaningful keypoints for registration can be extracted in the first step of image registration. The feature descriptors are extracted from a local patch centered around each keypoint to capture higher-level information and generate robust and precise representations of the keypoints. However, they may suffer from ambiguity when there is repetitive content, which is common in traffic scenarios. Moreover, these descriptors are usually represented as large feature vectors, which incur significant communication overhead and are ill-suited for traffic camera-assisted autonomous driving. The final step, keypoint matching, matches the two keypoints in the two input images that have the most similar descriptors. Nearest neighbor [45] and fast approximate nearest neighbor [44] algorithms are two representative methods, but they perform poorly when encountering too many outlier keypoints. Tracking-based matching methods are widely adopted in visual SLAM and can achieve real-time performance. However, they work well only for two similar images, such as neighboring frames of a video. Recent works use Graph Neural Networks (GNNs) [52] and transformers [30] to boost matching performance in challenging cases. Nevertheless, methods based on the above three-stage pipeline require a certain similarity of scale and perspective between the two images, while the images to be matched in traffic camera-assisted autonomous driving usually exhibit significant scale and viewpoint differences as well as highly repetitive content.
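For concreteness, the following minimal sketch illustrates this classical three-stage pipeline with OpenCV's SIFT detector/descriptor, brute-force nearest-neighbor matching with a ratio test, and RANSAC-based homography fitting. The thresholds and the function name are illustrative assumptions and do not correspond to any specific baseline configuration in this paper.

import cv2
import numpy as np

def classical_register(img_a: np.ndarray, img_b: np.ndarray):
    sift = cv2.SIFT_create()
    # Stage 1 + 2: detect keypoints and compute a 128-D descriptor for each.
    kp_a, desc_a = sift.detectAndCompute(img_a, None)
    kp_b, desc_b = sift.detectAndCompute(img_b, None)

    # Stage 3: nearest-neighbor matching with a ratio test to drop ambiguous
    # matches (repetitive content still produces outliers).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc_a, desc_b, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]

    # Fit a homography with RANSAC from the surviving correspondences.
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, inlier_mask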
Landmark Detection. The goal of landmark detection is to localize a group of pre-defined landmarks on objects with semantically meaningful structures. For example, facial landmark detectors [49, 59, 72] predict 5, 20 or 68 fiducial points outlining the face boundary, eyes, nose, and mouth. Body keypoint detectors [11, 15, 47, 66] detect 14 or 17 keypoints indicating shoulders, wrists, etc. Unlike general keypoint detectors that extract keypoints in an indiscriminate manner, landmark detectors "recognize" the semantic parts of an object by exploiting shape patterns such as symmetry and spatial relationships. Surprisingly, the use of landmark detectors for image registration has not been well explored despite the following advantages: (i) robustness to noise and outliers caused by similar low-level image appearances, as the shape and structure information provide constraints on each landmark; (ii) unlike general keypoint descriptors, where each descriptor is represented as a one-dimensional feature vector, landmarks are more interpretable and discriminative. However, a critical shortcoming of landmark detectors is that the predicted landmark locations are usually less accurate than those of general keypoint detectors and cannot achieve pixel-level registration precision. In this work, we address this key issue by integrating a general keypoint detector to refine the detected landmarks, obtaining landmark keypoints with precise locations.

Cooperative Infrastructure-Vehicle or Vehicle-Vehicle Perception and Localization. To improve the perception performance of autonomous vehicles, Arnold et al. [3] propose a cooperative 3D object detection scheme in which several infrastructure sensors are used for multi-view simultaneous 3D object detection. Zhang et al. [71] propose an edge-assisted multi-vehicle perception system called EMP, where connected and autonomous vehicles' (CAVs') individual point clouds are optimally partitioned and merged to form a complete point cloud with higher resolution. In [43], cameras and LiDARs are leveraged to assist the localization of autonomous vehicles. Fascista et al. [21] propose to localize vehicles using angle-of-arrival estimation of beacons from several infrastructure nodes. Different from these studies, in which the infrastructure or CAVs have known, accurate poses, i.e., position and orientation, we focus on leveraging existing traffic cameras with unknown poses to assist autonomous vehicles in both perception and localization through real-time image registration.

3 BACKGROUND, APPLICATIONS AND CHALLENGES
In this section, we first review the autonomous driving perception and localization technologies available today, which motivates our approach. We then present the two applications of AutoMatch for assisting autonomous driving at intersections. Finally, the challenges addressed in the design of AutoMatch are discussed.

3.1 Perception/Localization of Autonomous Driving
Like human drivers, autonomous vehicles must know where they are on the road (localization) and which objects are in their surroundings (perception). Perception and localization are essential for autonomous vehicles to make accurate and reliable decisions for vehicle control. Due to their mission-critical nature, autonomous driving imposes stringent requirements on the accuracy and delay of perception and localization [64].

Mainstream autonomous driving platforms typically use a combination of sensors such as cameras, LiDARs, radars, GNSS/IMUs, and odometers for high-precision perception and localization [37]. Specifically, vehicles consume incoming camera images or LiDAR point clouds to detect and track obstacles such as moving vehicles and people. Then the free navigable space is identified to ensure that the vehicle does not collide with moving objects. However, on-vehicle sensors have a limited field of view, and perception is often obscured by surrounding objects, which may unavoidably cause traffic accidents. To achieve high-precision localization, many commercial vehicles, such as the Google and Uber cars, use an a priori mapping approach [28, 60], which consists of pre-driving specific roads, collecting detailed 3D point clouds, and generating high-precision maps. Vehicles can store such maps or download them from the cloud. Localization is then performed by matching the current sensor data with HD maps. However, the
Figure 4: Framework of our image registration approach for traffic camera-assisted autonomous driving.

Figure 8: Illustration of the unified landmark template (a) and some examples of ground signs that can be modeled using this template (b).

Figure 9: Illustration of the Landmark-guided NMS method for combining the landmark detector and the general keypoint detector.
propose to refine the result of landmark detection using a general keypoint detector. To this end, a Landmark-guided NMS algorithm is proposed to integrate both detectors and extract the final landmark keypoints, where the landmarks serve as guidance for picking the keypoints to achieve more accurate landmark keypoint localization. Such an approach enables both accurate and highly robust landmark keypoint extraction despite various interferences on ground sign appearances. We now discuss each component of the landmark keypoint extractor in detail.

4.3.1 Landmark Detector. We design a new landmark detector based on a real-time, state-of-the-art facial landmark detector, PFLD [24]. We zero-pad the ground sign patches before feeding them into the landmark detector to meet the aspect ratio requirement. To be able to generate landmarks with different templates, we design a unified landmark template, as shown in Fig. 8. All categories of ground signs are stacked together with similar components merged, which results in a template with 4 components and a total of 22 landmarks. Each landmark has its own ID number, which implicitly encodes rich semantic information. The neural network predicts the pixel locations of all 22 landmarks. The output landmarks of each ground sign class constitute a subset of these components; e.g., the turning-left sign contains component 2, with a total of 7 landmarks. To achieve this, we define a binary mask M with a length of 22 for each category of ground sign to mask out unused landmarks. The mask is predefined and determined by the class of the ground sign. We then define the training loss as follows:

\mathcal{L} := \frac{1}{|M| N} \sum_{m=1}^{|M|} \sum_{n=1}^{N} M_m^n \left\| \mathbf{p}_m^n - \hat{\mathbf{p}}_m^n \right\|_2^2    (1)

where |M| = 22 is the total number of landmarks, the subscript m indicates the m-th point, and N denotes the batch size. p and p̂ are the ground-truth and predicted locations of each landmark, respectively. This masked loss means that only the landmarks that fall into the current ground sign's category contribute to the training loss. The same mask operation is performed at the inference stage, where only landmarks belonging to the category of the current ground sign are picked, and other landmarks are discarded.

To train the landmark detector, we crop the ground sign bounding boxes from the training dataset mentioned in Section 4.2. Then we resize and zero-pad them into patches of size 224 × 224 and label the landmarks on them. During training, we also add a small random perturbation of homography transformations to each patch to augment the training examples.
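For concreteness, a minimal PyTorch sketch of the masked loss in Eq. (1) is given below; the tensor shapes and the function name are our illustrative assumptions rather than details from the paper's implementation.

import torch

def masked_landmark_loss(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # pred, gt: predicted / ground-truth landmark locations, shape (N, 22, 2),
    #           where N is the batch size and 22 is the number of template landmarks.
    # mask:     binary mask of shape (N, 22); 1 for landmarks belonging to the
    #           current ground sign's category, 0 for unused landmarks.
    sq_err = ((pred - gt) ** 2).sum(dim=-1)   # squared L2 distance per landmark, shape (N, 22)
    masked = mask * sq_err                    # only masked-in landmarks contribute
    # Average over the 22 landmark slots and the batch, as in Eq. (1).
    return masked.sum() / (mask.shape[1] * mask.shape[0])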
4.3.2 Landmark-guided NMS. Despite its robustness, the main limitation of the landmark detector is that the detected landmarks do not fall precisely on the corners of the ground sign (see the green points in Fig. 5). To address this issue, we use a general keypoint detector to boost positioning accuracy. We adopt the widely used general keypoint detector SuperPoint [17], a fast and lightweight model that computes accurate keypoint locations and generates a keypoint response heatmap of the same size as the input. Each pixel of the heatmap corresponds to the probability that the pixel is a keypoint. The training process is similar to the one in [17]. The difference is that our synthetic dataset only consists of structures with corners, such as quadrilaterals, triangles, lines, and stars, which strengthens the detection of corner-like keypoints. The synthetic dataset is rendered on-the-fly, and no example is seen by the network twice.

We now have the landmarks from the landmark detector and the keypoint heatmap from the general keypoint detector. Landmarks capture the global structure and provide guidance for the positions of the final landmark keypoints. By exploiting this property of landmarks, we look for the maximum response of the keypoint heatmap around each landmark to fine-tune the position of the landmarks and obtain the final landmark keypoints. As a result, the final landmark keypoints not only inherit the landmarks' expression of the global structure but also precisely localize the corner points. Specifically, as shown in Fig. 9, we first generate a Gaussian distribution map centered at each landmark and multiply this Gaussian map with the keypoint heatmap pixel-wise. The pixel with the maximum value in the resulting map is selected as the final landmark keypoint (û, v̂). This operation filters out keypoints far away from the landmark and allows the final landmark keypoints to have both rich semantics and accurate locations. Formally, this can be expressed as:

(\hat{u}, \hat{v}) = \operatorname*{arg\,max}_{(u,v)} G(u, v) \cdot H(u, v),    (2)

where

G(u, v) = \exp\left( -\left( \frac{(u - u_o)^2}{2\sigma^2} + \frac{(v - v_o)^2}{2\sigma^2} \right) \right)    (3)

is a Gaussian distribution centered on a landmark (u_o, v_o) and H(u, v) represents the keypoint heatmap from the general keypoint detector.
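As an illustration, a minimal NumPy sketch of this refinement step is shown below; the Gaussian width sigma, the function name, and the array convention (heatmap indexed as [v, u]) are our assumptions, not values from the paper.

import numpy as np

def refine_landmark(heatmap: np.ndarray, landmark: tuple, sigma: float = 4.0) -> tuple:
    # heatmap:  keypoint response heatmap H(u, v) from the general keypoint
    #           detector, indexed as heatmap[v, u] (row, column).
    # landmark: detected landmark (u_o, v_o) from the landmark detector.
    h, w = heatmap.shape
    u_o, v_o = landmark
    # Gaussian map G(u, v) centered on the landmark, same size as the heatmap (Eq. 3).
    v_grid, u_grid = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    g = np.exp(-((u_grid - u_o) ** 2 + (v_grid - v_o) ** 2) / (2.0 * sigma ** 2))
    # The pixel maximizing G * H is the final landmark keypoint (Eq. 2).
    v_hat, u_hat = np.unravel_index(np.argmax(g * heatmap), heatmap.shape)
    return int(u_hat), int(v_hat)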
4.4 Group RANSAC
After the previous modules of our pipeline, we now have the ground sign bounding boxes A and B in the two input images, as well as the
among which only 31% are used for registration, compared to that
of almost 100% for other baselines. The three baselines demonstrate
high communication overhead since they extract massive keypoints
and heavy descriptors for each keypoint. Besides, the bandwidth
requirement of AutoMatch is as low as 72 Kbps, which can be easily
supported by the current LTE network.
Datasets                         Methods     Reproj. error   Run time     MMA
Traffic camera-vehicle dataset   SIFT        218.256 px      7.440 s      17.58%
                                 SuperGlue   74.579 px       0.143 s      47.13%
                                 COTR        91.587 px       174.730 s    40.77%
                                 D2-Net      77.003 px       1.543 s      29.23%
                                 AutoMatch   2.986 px        0.043 s      96.01%
Traffic camera-HD map dataset    SIFT        143.476 px      0.629 s      12.39%
                                 SuperGlue   49.106 px       0.125 s      49.74%
                                 COTR        68.402 px       67.713 s     35.22%
                                 D2-Net      61.284 px       0.921 s      21.16%
                                 AutoMatch   4.215 px        0.088 s      92.83%
Figure 19: Registration results between an HD map and a traffic camera image in the real traffic scene.
Figure 20: Qualitative results of ablation study. Note that the LD only + Group RANSAC tends to detect inaccurate landmark locations, as highlighted in blue.

RANSAC runs at least two orders of magnitude faster than the SOTA matching algorithm SuperGlue. We also visualize the matches in Fig. 20. We can see that without the guidance of landmarks, GKD only + SuperGlue and GKD only + NN produce many noisy and indiscriminative keypoints and further lead to numerous false matches, which is consistent with the quantitative results in Table 5. On the other hand, if we only use the landmark detector (LD only + Group RANSAC), although the landmarks are correctly matched, as highlighted in blue in Fig. 20, they suffer from inaccurate locations, which causes performance degradation. By contrast, our Full model predicts accurate structured keypoint locations and matches all of them correctly by combining the benefits of the general keypoint detector and the landmark detector. Besides, it is also worth noticing that the performance of GKD only + SuperGlue is significantly better than that of SuperGlue in Table 4. They share the same pipeline, with the only difference being that GKD only + SuperGlue works on bounding boxes instead of the whole image, which validates our core idea of focusing on key structures instead of the whole image.

REFERENCES
[1] n.d. Nvidia TensorRT. https://developer.nvidia.com/tensorrt.
[2] n.d. Open Neural Network Exchange. https://onnx.ai/.
[3] Eduardo Arnold, Mehrdad Dianati, Robert de Temple, and Saber Fallah. 2020. Cooperative perception for 3D object detection in driving scenarios using infrastructure sensors. IEEE Transactions on Intelligent Transportation Systems (2020).
[4] OpenDroneMap Authors. 2020. ODM - A command line toolkit to generate maps, point clouds, 3D models and DEMs from drone, balloon or kite images. https://github.com/OpenDroneMap/ODM.
[5] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. 2016. Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1. 3.
[6] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. SURF: Speeded up robust features. In European Conference on Computer Vision. Springer, 404–417.
[7] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
[8] Gary Bradski and Adrian Kaehler. 2008. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc.
[9] Matthew Brown, Gang Hua, and Simon Winder. 2010. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2010), 43–57.
[10] Andrew Burnes. 2019. Introducing GeForce RTX SUPER Graphics Cards: Best In Class Performance, Plus Ray Tracing. https://www.nvidia.com/en-us/geforce/news/geforce-rtx-20-series-super-gpus/.
[11] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7291–7299.
[12] Long Chen, Shaobo Lin, Xiankai Lu, Dongpu Cao, Hangbin Wu, Chi Guo, Chun Liu, and Fei-Yue Wang. 2021. Deep neural network based vehicle and pedestrian detection for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems 22, 6 (2021), 3234–3246.
[13] Christopher Choy, Wei Dong, and Vladlen Koltun. 2020. Deep global registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2514–2523.
[14] Christopher Choy, Jaesik Park, and Vladlen Koltun. 2019. Fully convolutional geometric features. In ICCV.
[15] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L Yuille, and Xiaogang Wang. 2017. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1831–1840.
[16] British Columbia. 2019. Where intersection safety cameras are located. https://www2.gov.bc.ca/gov/content/transportation/driving-and-cycling/roadsafetybc/intersection-safety-cameras/where-the-cameras-are.
[17] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2018. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 224–236.
[18] Jingming Dong and Stefano Soatto. 2015. Domain-size pooling in local descriptors: DSP-SIFT. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5097–5106.
[19] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. 2019. D2-Net: A trainable CNN for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8092–8101.
[20] G. Elbaz, T. Avraham, and A. Fischer. 2017. 3D point cloud registration for localization using a deep neural network auto-encoder. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2472–2481. https://doi.org/10.1109/CVPR.2017.265
[21] Alessio Fascista, Giovanni Ciccarese, Angelo Coluccia, and Giuseppe Ricci. 2017. Angle of arrival-based cooperative positioning for smart vehicles. IEEE Transactions on Intelligent Transportation Systems 19, 9 (2017), 2880–2892.
[22] Martin A. Fischler and Robert C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 6 (June 1981), 381–395. https://doi.org/10.1145/358669.358692
[23] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3354–3361.
[24] Xiaojie Guo, Siyuan Li, Jinke Yu, Jiawan Zhang, Jiayi Ma, Lin Ma, Wei Liu, and Haibin Ling. 2019. PFLD: A practical facial landmark detector. arXiv preprint arXiv:1902.10859 (2019).
[25] Chris Harris, Mike Stephens, et al. 1988. A combined corner and edge detector. In Alvey Vision Conference. Citeseer, 10–5244.
[26] Richard Hartley and Andrew Zisserman. 2003. Multiple View Geometry in Computer Vision. Cambridge University Press.
[27] Jared Heinly, Johannes L Schonberger, Enrique Dunn, and Jan-Michael Frahm. 2015. Reconstructing the world* in six days* (as captured by the Yahoo 100 million image dataset). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3287–3295.
[28] INSIDER. 2016. Here's why self-driving cars can't handle bridges. http://www.businessinsider.com/autonomous-cars-bridges-2016-8.
[29] Mahdi Javanmardi, Ehsan Javanmardi, Yanlei Gu, and Shunsuke Kamijo. 2017. Towards high-definition 3D urban mapping: Road feature-based registration of mobile mapping systems and aerial imagery. Remote Sensing 9, 10 (2017), 975.
[30] Wei Jiang, Eduard Trulls, Jan Hosang, Andrea Tagliasacchi, and Kwang Moo Yi. 2021. COTR: Correspondence transformer for matching across images. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6207–6217.
[31] Jialin Jiao. 2018. Machine learning assisted high-definition map creation. In 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), Vol. 1. IEEE, 367–373.
[32] Felix Kam and Henrik Mellin. 2019. Different frequencies of maneuver replanning on autonomous vehicles.
[33] I Karls and M Mueck. 2018. Networking vehicles to everything: Evolving automotive solutions.
[34] S. Kuutti, S. Fallah, K. Katsaros, M. Dianati, F. Mccullough, and A. Mouzakitis. 2018. A survey of the state-of-the-art localization techniques and their potentials for autonomous vehicle applications. IEEE Internet of Things Journal 5, 2 (2018), 829–846. https://doi.org/10.1109/JIOT.2018.2812300
[35] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[36] Jiayi Ma, Xingyu Jiang, Aoxiang Fan, Junjun Jiang, and Junchi Yan. 2021. Image matching from handcrafted to deep features: A survey. International Journal of Computer Vision 129, 1 (2021), 23–79.
[37] Juliette Marais, Cyril Meurie, Dhouha Attia, Yassine Ruichek, and Amaury Flancquart. 2014. Toward accurate localization in guided transport: Combining GNSS data and imaging information. Transportation Research Part C: Emerging Technologies 43 (2014), 188–197.
[38] Iaroslav Melekhov, Aleksei Tiulpin, Torsten Sattler, Marc Pollefeys, Esa Rahtu, and Juho Kannala. 2019. DGC-Net: Dense geometric correspondence network. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1034–1042.
[39] Krystian Mikolajczyk and Cordelia Schmid. 2004. Scale & affine invariant interest point detectors. International Journal of Computer Vision 60, 1 (2004), 63–86.
[40] Krystian Mikolajczyk and Cordelia Schmid. 2004. Scale & affine invariant interest point detectors. International Journal of Computer Vision 60, 1 (2004), 63–86.
[41] Krystian Mikolajczyk and Cordelia Schmid. 2005. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 10 (2005), 1615–1630.
[42] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and L Van Gool. 2005. A comparison of affine region detectors. International Journal of Computer Vision 65, 1 (2005), 43–72.
[43] Yanghui Mo, Peilin Zhang, Zhijun Chen, and Bin Ran. 2021. A method of vehicle-infrastructure cooperative perception based vehicle state information fusion using improved Kalman filter. Multimedia Tools and Applications (2021), 1–18.
[44] Marius Muja and David G Lowe. 2009. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1) 2, 331-340 (2009), 2.
[45] Marius Muja and David G Lowe. 2014. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 11 (2014), 2227–2240.
[46] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. 2015. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31, 5 (2015), 1147–1163.
[47] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer, 483–499.
[48] NVIDIA. 2022. Hardware for self-driving cars. https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/.
[49] Giuseppe Palestra, Adriana Pettinicchio, Marco Del Coco, Pierluigi Carcagnì, Marco Leo, and Cosimo Distante. 2015. Improved performance in facial expression recognition using 32 geometric features. In International Conference on Image Analysis and Processing. Springer, 518–528.
[50] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
[51] Radenso. 2021. What's the difference between traffic cameras, red light cameras, and speed cameras? https://radenso.com/blogs/radar-university/what-s-the-difference-between-traffic-cameras-red-light-cameras-and-speed-cameras.
[52] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4938–4947.
[53] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. 2017. Quad-networks: Unsupervised learning to rank for interest point detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1822–1830.
[54] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. 2017. Quad-networks: Unsupervised learning to rank for interest point detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1822–1830.
[55] Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4104–4113.
[56] Heiko G Seif and Xiaolong Hu. 2016. Autonomous driving in the iCity—HD maps as a key challenge of the automotive industry. Engineering 2, 2 (2016), 159–162.
[57] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. 2015. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision. 118–126.
[58] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[59] Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2013. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3476–3483.
[60] The New York Times. 2017. Building a road map for the self-driving car. https://www.nytimes.com/2017/03/02/automobiles/wheels/selfdriving-cars-gps-maps.html.
[61] Prune Truong, Martin Danelljan, Luc V Gool, and Radu Timofte. 2020. GOCor: Bringing globally optimized correspondence volumes into your neural network. Advances in Neural Information Processing Systems 33 (2020), 14278–14290.
[62] Prune Truong, Martin Danelljan, and Radu Timofte. 2020. GLU-Net: Global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6258–6268.
[63] Federal Highway Administration, U.S. Department of Transportation. 2002. United States Pavement Markings. https://mutcd.fhwa.dot.gov/services/publications/fhwaop02090/index.htm.
[64] Jessica Van Brummelen, Marie O'Brien, Dominique Gruyer, and Homayoun Najjaran. 2018. Autonomous vehicle perception: The technology of today and tomorrow. Transportation Research Part C: Emerging Technologies 89 (2018), 384–406.
[65] Harsha Vardhan. 2017. HD Maps: New age maps powering autonomous vehicles. Geospatial World 22 (2017).
[66] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4724–4732.
[67] Ron Weinstein. 2005. RFID: A technical overview and its application to the enterprise. IT Professional 7, 3 (2005), 27–33.
[68] Andi Zang, Runsheng Xu, Zichen Li, and David Doria. 2017. Lane boundary extraction from satellite imagery. In Proceedings of the 1st ACM SIGSPATIAL Workshop on High-Precision Maps and Intelligent Applications for Autonomous Vehicles. 1–8.
[69] Linguang Zhang and Szymon Rusinkiewicz. 2018. Learning to detect features in texture images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6325–6333.
[70] Linguang Zhang and Szymon Rusinkiewicz. 2018. Learning to detect features in texture images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6325–6333.
[71] Xumiao Zhang, Anlan Zhang, Jiachen Sun, Xiao Zhu, Y Ethan Guo, Feng Qian, and Z Morley Mao. 2021. EMP: Edge-assisted multi-vehicle perception. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking. 545–558.
[72] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. 2013. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 386–391.