0% found this document useful (0 votes)
19 views15 pages

AutoMatch SenSys22

Uploaded by

axegamingyt2020
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views15 pages

AutoMatch SenSys22

Uploaded by

axegamingyt2020
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

AutoMatch: Leveraging Traffic Camera to Improve Perception

and Localization of Autonomous Vehicles


Yuze He† , Li Ma‡ , Jiahe Cui§ , Zhenyu Yan† , Guoliang Xing† , Sen Wang ¶ , Qintao Hu ¶ , Chen Pan$
† The
Chinese University of Hong Kong, Hong Kong SAR, China
‡ The
Hong Kong University of Science and Technology, Hong Kong SAR, China
§ School of Computer Science and Engineering, Beihang University, Beijing, China
¶ 2012 Lab, Huawei Technologies, Shenzhen, China
$ Smart Car Solutions Business Unit, Huawei Technologies, Hangzhou, China

Email: [email protected], [email protected], [email protected], [email protected], [email protected],


[email protected], [email protected], [email protected]

ABSTRACT KEYWORDS
Traffic camera is one of the most ubiquitous traffic facilities, provid- Image registration, Vehicle-infrastructure cooperative sensing, Infrastructure-
ing high coverage of complex, accident-prone road sections such as assisted autonomous driving, Edge computing
intersections. This work leverages traffic cameras to improve the ACM Reference Format:
perception and localization performance of autonomous vehicles Yuze He† , Li Ma‡ , Jiahe Cui§ , Zhenyu Yan† , Guoliang Xing† , Sen Wang ¶ ,
at intersections. In particular, vehicles can expand their range of Qintao Hu ¶ , Chen Pan$ . 2022. AutoMatch: Leveraging Traffic Camera to
perception by matching the images captured by both the traffic cam- Improve Perception and Localization of Autonomous Vehicles. In The 20th
eras and on-vehicle cameras. Moreover, a traffic camera can match ACM Conference on Embedded Networked Sensor Systems (SenSys ’22), No-
its images to an existing high-definition map (HD map) to derive vember 6–9, 2022, Boston, MA, USA. ACM, New York, NY, USA, 15 pages.
centimeter-level location of the vehicles in its field of view. To this https://doi.org/10.1145/3560905.3568519
end, we propose AutoMatch - a novel system for real-time image
registration, which is a key enabling technology for traffic camera- 1 INTRODUCTION
assisted perception and localization of autonomous vehicles. Our In this work, we leverage traffic cameras to assist two fundamental
key idea is to leverage landmark keypoints of distinctive structures applications of autonomous driving (see Fig. 1): 1) Boosting vehicle
such as ground signs at intersections to facilitate image registration perception. Vehicles will be able to see beyond obstacle occlusions
between traffic cameras and HD maps or vehicles. By leveraging and expand their range of perception by taking advantage of the
the strong structural characteristics of ground signs, AutoMatch can traffic cameras, which are typically mounted a few meters above
extract very few but precise landmark keypoints for registration, ground and hence provide a much wider and almost unobscured
which effectively reduces the communication/compute overhead. field of view. Specifically, by matching the images captured by
We implement AutoMatch on a testbed consisting of a self-built both traffic cameras and itself, an autonomous vehicle can com-
autonomous car, drones for surveying and mapping, and real traffic plement and expand its field of view and improve its situational
cameras. In addition, we collect two new multi-view traffic image awareness. 2) High-precision vehicle localization. A traffic camera
datasets at intersections, which contain images from 220 real oper- can match its images to existing high-definition global maps (HD
ational traffic cameras in 22 cities. Experimental results show that maps) to derive centimeter-level location of the vehicles in its view.
AutoMatch achieves pixel-level image registration accuracy within This process can be implemented by the infrastructure or cloud
88 milliseconds, and delivers an 11.7× improvement in accuracy, and hence significantly lower the requirements on the vehicle’s
1.4× speedup in compute time, and 17.1× data transmission saving compute/localization capabilities. Such two applications provide
over existing approaches. autonomous vehicles with boosted perception and high-precision
localization, which greatly improves the accuracy and reliability of
CCS CONCEPTS vehicles’ downstream tasks at complex intersection environments,
• Computer systems organization → Sensor networks. including path planning, decision making, and vehicle control. In
this work, we focus on leveraging traffic cameras at intersections
*Corresponding author. because of the following three reasons. First, intersections are more
accident-prone than other road sections. Second, intersections have
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed highly complex structures, introducing unique challenges for au-
for profit or commercial advantage and that copies bear this notice and the full citation tonomous driving. Third, to date, most traffic cameras are installed
on the first page. Copyrights for components of this work owned by others than ACM around intersections [16, 51].
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a The key technology that enables both above applications is real-
fee. Request permissions from [email protected]. time high-precision image registration, which refers to the process
SenSys ’22, November 6–9, 2022, Boston, MA, USA of finding the homography between two image coordinate systems.
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9886-2/22/11. . . $15.00 In the above two applications, images from traffic cameras are
https://doi.org/10.1145/3560905.3568519 registered with those from either vehicles or HD maps. Through
SenSys ’22, November 6–9, 2022, Boston, MA, USA Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing and Daqi Xie.

onboard camera and traffic camera. Then, we propose a novel land-


AutoMatch mark keypoint extractor to robustly and accurately locate very few
HD map
on cloud or landmark keypoints of ground signs. The novelty of our design
infrastructure
(x, y, z) lies in the integration of a landmark detector and a general key-
High precision centimeter- point detector. In this paper, we refer to the points extracted by
level location
the general keypoint detector as keypoints, the points extracted by
Traffic camera image AutoMatch
on vehicle the landmark detector as landmarks, and the points extracted by
the landmark keypoint extractor as landmark keypoints. Motivated
Third person view Perception fusion by the fact that most ground signs have a dominant structural pat-
Vehicle camera image
tern (e.g., arrows), we develop a new landmark detector to find
Figure 1: Two applications of AutoMatch. 1) The vehicle’s perception is
boosted by fusing the perception information of the traffic camera and structurally meaningful landmarks of ground signs and refine them
2) The vehicle’s high-precision location is derived from a traffic camera using a general keypoint detector to achieve sub-pixel accuracy of
image and an HD map. the landmark keypoint location. The landmark keypoint extractor
registration, the raw data or high-level information such as detec- greatly improves the robustness of image registration by elimi-
tion results of one image can be transformed and merged into the nating noisy and irrelevant points. At last, we design an efficient
coordinate system of the other image. There are three challenges in keypoint matching algorithm based on the detected ground signs
high-precision image registration involving traffic camera images. and their landmark keypoints from the two images.
First, these images are taken in dramatically varied conditions (e.g., To summarize, fundamentally different from the current image
viewpoints, scales and view angles), which poses great challenges registration methods in the computer vision literature, our sys-
to high-precision registration. Second, to support autonomous driv- tem offers several key advantages: (i) AutoMatch is robust to envi-
ing, two images need to establish pixel-level correspondences in ronments including different types of intersections, traffic signs,
real-time (e.g., within tens of milliseconds), which requires the roadside trees and buildings around the intersections, since our
image registration method to be computationally efficient. Third, approach only focuses on distinctive structures and filter out unim-
due to the significant dynamics and limited bandwidth between portant information that may affect the accuracy of matching. (ii)
infrastructures and vehicles, the amount of data sharing required AutoMatch is computationally efficient and memory friendly, which
for registration should be as small as possible. is crucial for practical deployment in real-world traffic scenarios.
Although there exist methods for image registration in the com- AutoMatch achieves this by only processing small image patches
puter vision literature [5, 6, 9, 17, 18, 25, 30, 38, 61, 62], they are not and extracting few but semantically rich landmark keypoints for
specifically designed for traffic scenarios and yield unsatisfactory registration. In contrast, existing approaches require processing
performance in latency, robustness, and accuracy, making them the whole image or extracting massive keypoints. (iii) AutoMatch
ill-suited for infrastructure-assisted autonomous driving. Most cur- significantly reduces the communication overhead between infras-
rent image registration techniques [5, 6, 9, 17, 18, 25] first extract a tructure and vehicle for the registration, since it only requires the
large amount of keypoints throughout two images and then match infrastructure to share with the vehicle a small number of landmark
them to register two images. Other methods [30, 38, 61, 62] directly keypoints extracted from static structures independent of traffic
find the correspondences of two images in an end-to-end manner dynamics.
by leveraging deep learning techniques. The former requires the We implemented AutoMatch on a real testbed consisting of a
transmission of hundreds or thousands of infrastructure keypoints self-built autonomous car, a survey drone for mapping, and real
and features for each frame between infrastructure and vehicle, traffic cameras. In addition, we collect two new multi-view traffic
whose excessive communication overhead poses a major challenge image datasets, which correspond to the perception and localization
in meeting the stringent real-time requirement of autonomous driv- of traffic camera-assisted autonomous driving, respectively. The
ing applications. The latter usually requires a large DNN model to first dataset contains 1,136 image pairs from 48 traffic cameras of 19
achieve accurate transformation from in the wild images, which intersections and onboard cameras of vehicles. The second dataset
incurs excessive compute overhead on the vehicle and hence is contains images from 172 traffic cameras of 32 intersections in 21
ill-suited for real-time autonomous driving. Therefore, there still cities and the corresponding high-resolution maps. Experiments
remains a major gap between the vision of traffic camera-assisted show that AutoMatch is able to extend the vehicle’s field of view
autonomous driving and the capabilities of current image registra- by 72.9% on average, with an average image registration error of
tion technologies. 3 pixels, which delivers an 11.65× improvement in registration
To tackle these challenges, we propose AutoMatch - a novel sys- accuracy compared with the state-of-the-art. Besides, AutoMatch
tem that accurately registers image pairs from different views in leverages traffic cameras to provide high-precision localization for
real-time to support traffic camera-assisted autonomous driving autonomous vehicles with an error of less than 20 cm. Moreover,
at intersections. Our key idea is to extract landmark keypoints AutoMatch only requires the traffic camera to share the data with
of salient structures at intersections to facilitate image registra- the vehicle at a rate of 72 Kbps. Lastly, AutoMatch achieves an end-
tion. Specifically, AutoMatch first detects and extracts ground signs, to-end system latency within 88 ms, which enables real-time image
which are the most common semantic objects at intersections and registration for autonomous vehicles.
distinctive structures shared by the images from both vehicle’s The rest of this paper is organized as follows: Section 2 intro-
duces related work. Section 3 presents the background, applications,
and challenges. In Section 4, we describe the design of AutoMatch.
AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles SenSys ’22, November 6–9, 2022, Boston, MA, USA

We discuss the collection of two datasets and the system imple- usually less accurate compared to general keypoint detectors, and
mentation in Sections 5 and 6, respectively. Section 7 shows the can not achieve pixel-level registration precision. In this work, we
experiment results and Section 8 concludes the paper. address this key issue by integrating the general keypoint detector
to refine the detected landmarks to obtain landmark keypoints with
precise locations.
2 RELATED WORK Cooperative Infrastructure-Vehicle or Vehicle-Vehicle Per-
Image Registration. Image registration aims to find correspon- ception and Localization. To improve the perception performance
dence between two images and has many applications in autonomous of autonomous vehicles, Arnold et al. [3] propose a cooperative 3D
driving such as camera calibration [26], Simultaneous Localization object detection scheme, where several infrastructure sensors are
and Mapping (SLAM) [46], and Structure from Motion (SfM) [27, 55]. used for multi-view simultaneous 3D object detection. Zhang et
A typical registration pipeline consists of three stages, keypoint al. [71] propose an edge-assisted multi-vehicle perception system
detection, description generation, and keypoint matching. Both clas- called EMP, where connected and autonomous vehicles’ (CAVs’) in-
sical [6, 25, 35, 39, 42] and learning-based [53, 69] methods detect dividual point clouds are optimally partitioned and merged to form
points of interest throughout a whole image. However, they are not a complete point cloud with a higher resolution. In [43], cameras
applicable to complex and diverse intersection scenarios since there and LiDARs are leveraged to assist the localization of autonomous
is no guarantee that meaningful keypoints for registration can be vehicles. Fascista et al. [21] propose to localize vehicles using the
extracted at the first step of image registration. The feature descrip- angle of arrival estimation of beacons from several infrastructure
tors are extracted from a local patch centered around each keypoint nodes. Different from these studies where infrastructures or CAVs
to capture higher-level information and generate robust and precise have known accurate pose, i.e., position and orientation, we focus
representations for keypoints. However, they may suffer from ambi- on leveraging existing traffic cameras with unknown poses to assist
guity when there are repetitive contents that are common in traffic autonomous vehicles in both perception and localization through
scenarios. Moreover, these descriptors are usually represented as real-time image registration.
large-sized feature vectors, which incur significant communication
overhead and are ill-suited for traffic camera-assisted autonomous 3 BACKGROUND, APPLICATIONS AND
driving. The final step, keypoint matching, matches two keypoints CHALLENGES
in the two input images that have the most similar descriptors.
In this section, we first review autonomous driving perception
Nearest neighbor [45] and fast approximation nearest neighbor
and localization technologies available today, which motivates our
[44] algorithms are two representative methods, but they perform
approach. We then present the two applications of AutoMatch for as-
poorly when encountering too many outlier keypoints. Tracking-
sisting autonomous driving at intersections. Finally, the challenges
based matching methods are widely adopted in visual SLAM and
addressed in the design of AutoMatch are discussed.
can achieve real-time performance. However, they work well only
for two similar images, such as the neighboring frames of a video.
Recent works use Graph Neural Networks (GNN) [52] and trans-
3.1 Perception/Localization of Autonomous
formers [30] to boost the matching performance for challenging Driving
cases. Nevertheless, the methods based on the above three-stage Like human drivers, autonomous vehicles must know where they
pipeline require a certain similarity of scales and perspectives of are on the road (localization) and which objects are in the sur-
two images, while the images to be matched in the traffic camera- roundings (perception). Perception and localization are essential
assisted autonomous driving usually have significant scale and for autonomous vehicles to make accurate and reliable decisions
viewpoint differences as well as overly repeated contents. for vehicle control. Due to the mission-critical nature, autonomous
Landmark Detection. The goal of landmark detection is to driving imposes stringent requirements on the accuracy and delay
localize a group of pre-defined landmarks on objects with seman- of perception and localization [64].
tically meaningful structures. For example, facial landmark detec- Mainstream autonomous driving platforms typically use a com-
tors [49, 59, 72] predict 5, 20 or 68 fiducial points, outlining the bination of sensors such as cameras, LiDARs, radars, GNSS/IMUs,
face boundaries, eye, nose and mouth. Body keypoint detectors and odometers for high-precision perception and localization [37].
[11, 15, 47, 66] detect 14 or 17 keypoints, indicating shoulders, Specifically, vehicles consume incoming camera images or LiDAR
wrists, etc. Unlike general keypoint detectors that extract keypoints point clouds to detect and track obstacles such as moving vehi-
in an indiscriminate manner, landmark detectors “recognize” the cles and people within. Then the free navigable space is identified
semantic part of the object by exploiting the shape pattern, like to ensure that the vehicle does not collide with moving objects.
symmetry and spatial relationships. Surprisingly, the use of land- However, on-vehicle sensors have a limited field of view, and the
mark detectors for image registration has not been well explored perception will often be obscured by surrounding objects, which
despite the following advantages: (i) Robustness to noise and out- may unavoidably cause traffic accidents. To achieve high-precision
liers caused by similar low-level image appearances, as the shape localization, many commercial vehicles, such as the Google and
and structured information provide constraints to each landmark. Uber cars, use a priori mapping approach [28, 60], which consists
(ii) Unlike general keypoint descriptors where each descriptor is rep- of pre-driving specific roads, collecting detailed 3D point clouds,
resented as a one-dimensional feature vector, landmarks are more and generating high precision maps. Vehicles can store such maps
interpretable and discriminative. However, a critical shortcoming or download them from the cloud. Localization is then performed
of landmark detectors is that the predicted landmark location is by matching the current sensor data with HD maps. However, the
SenSys ’22, November 6–9, 2022, Boston, MA, USA Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing and Daqi Xie.

camera has a broader field of view and is less prone to occlusion


Perception Information
(Object Bounding Boxes)
than vehicles. Therefore, autonomous vehicles can leverage the
Merge into
Vehicle’s View perception information from the traffic camera to achieve more
comprehensive scene perception.
Traffic Camera Note that when there are multiple traffic cameras at one inter-
Image
Image section, AutoMatch can process the images from all cameras that
Registration may benefit the vehicle one by one and identify the one that is the
Vehicle Image
most useful to the vehicle. This “naive” design is lightweight since
Experiments (Section 7) show that the added computational over-
head and communication overhead are extremely low compared to
registration with one camera.
Boosting Vehicle Perception Centimeter-level localization via image registration between
traffic cameras and HD maps. The second application of Au-
Figure 2: Illustration of leveraging traffic camera to boost vehicle percep-
tion.
toMatch is the image registration between the traffic camera image
and an HD map. Fig. 3(a) shows a traffic camera image and an HD
large size of HD maps, the high latency in transmissions between map. An HD map is a highly accurate map where each pixel in it
the cloud and the vehicle, and the low updating frequency of HD corresponds to a precise world position. HD maps are usually con-
maps pose significant barriers to wide adoption in practice [56]. structed using drones [29, 68] or map data collection cars equipped
with high precision sensors (e.g. LiDARs, digital cameras and RTK
GPS) [33, 65]. HD maps are able to achieve centimeter-level preci-
3.2 Applications of AutoMatch sion [31, 65]. Fig. 3(b) shows the image registration result between
Boosting vehicle perception via image registration between the HD map and the traffic camera image, which establishes a dense
traffic cameras and vehicles. The first application of AutoMatch correspondence between the pixels in the traffic camera image and
is real-time image registration between the traffic camera image and the points in the HD map. Given this correspondence, we could de-
vehicle image. Image registration establishes the transformation rive the 3D world position for each pixel in the traffic camera image,
between the two image coordinate systems of the traffic camera establishing a highly lightweight local map for the traffic camera,
and the vehicle, so that the vehicle can directly utilize the percep- which is about the size of an image (around 1 MB). As a result, 1)
tion information in the traffic camera image. The scene perception we can easily find a vehicle’s world position if the vehicle is in the
information shared from the traffic camera to the vehicle can be traffic camera’s field of view; 2) the vehicle doesn’t have to match its
the entire image (with all the details of the scene) or abstract se- sensor data with HD maps for localization which saves significant
mantic information (such as object bounding boxes). To actually compute overhead. Note that the image registration between the
achieve such benefits, the data transmission volume for registra- traffic camera image and the HD map can be a one-time offline task.
tion needs to be small enough due to the limited communication Once the registration is completed, the local map of the traffic cam-
bandwidth between the traffic camera and the vehicle. Besides, the era is established. The local map is only related to the pose of the
end-to-end traffic camera-vehicle image registration delay needs to traffic camera, and hence remains unchanged as long as the traffic
be within tens of milliseconds to meet the real-time requirements camera is still. To count for possible camera pose changes, the local
of autonomous driving. map can be updated by periodically performing image registration
In practice, infrastructures and vehicles need to independently between the traffic camera image and the HD map. Specifically, the
extract points in their images for registration. Different from other traffic camera detects vehicles in view, derives, and broadcasts the
image registration approaches [5, 17, 19, 35, 57] that would need world positions of these vehicles. Each vehicle obtains not only its
the infrastructure to extract points in real-time, AutoMatch allows own position but also the positions of other vehicles nearby, which
the infrastructure to extract points less frequently. This is because is useful for downstream autonomous driving tasks such as path
AutoMatch extracts static points in the scene backgrounds, which planning and collision avoidance. Vehicle identification is needed
remain unchanged most of the time. Once extracted, the infras- in this application. The infrastructure can use vehicle attributes
tructure then periodically broadcasts the points and the perception such as color and type, or other techniques such as license plate
information (object bounding boxes) extracted on its own coordi- recognition (LPR) [58] or RFID [67] to distinguish different vehicles.
nate. When a vehicle enters the intersection, it first receives the In order to meet the requirement of high-precision localization for
points from infrastructure and then matches them with the points autonomous driving, HD maps and traffic camera images need to
extracted from its own image to calculate the transformation. Then be matched with pixel-level accuracy so that the localization error
the vehicle could merge the bounding boxes from the infrastructure can be suppressed within centimeter-level [34].
into its field of view. Experiments (Section 7.2.2) show that this
process typically takes a data-sharing rate of only 72 Kbps. Fig. 2
shows two typical images from a driving vehicle and a traffic cam- 3.3 Challenges
era. Blue and green boxes show the perceived objects in the views Despite the promising applications, the design of AutoMatch faces
of the traffic camera and the vehicle, respectively. Due to the occlu- several major challenges in practice. First, there often exists sig-
sion, the vehicle cannot see the remaining 13 vehicles (blue boxes) nificant scale and viewpoint gaps between the image pairs in the
while they are visible to the traffic camera. In contrast, the traffic aforementioned two applications. The reason is that the working
AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles SenSys ’22, November 6–9, 2022, Boston, MA, USA

shown in Fig. 5 as regions of interest (ROIs) and extract landmark


keypoints inside each ROI. We then match the corresponding ROIs
and landmark keypoints in the two input images to complete the
registration. We focus on ground signs because they are: 1) usually
required to present at intersections to show vehicle movements
[63]; 2) sufficiently discriminative to serve as target structures for
matching and less repetitive compared with other structures like
crosswalk lines or lane lines; and 3) static structures and hence
lead to a low compute overhead on infrastructure. This is because
the points extracted from ground signs on the infrastructure side
remain largely unchanged, which can be updated less frequently.
Accurately detecting the landmark keypoints of ground signs
plays an important role in the performance of image registration.
However, this is highly challenging due to a variety of imperfections
Figure 3: A traffic camera image and an HD map generated from aerial in real-world settings: incompleteness caused by the limited field of
images taken by a survey drone. The registration result can be used to
localize vehicles from the traffic camera image. view, occlusion of vehicles or other objects, stains caused by oil or
water blobs, uneven lighting caused by shadows of trees or vehicles,
positions and orientations of the drone that constructs the HD map, or confusion with other objects such as speed bumps or manhole
the traffic camera, and the vehicle are usually different. Drones covers. These imperfections make the keypoint extraction from
generally shoot vertically at a height of around one hundred me- ground signs error-prone, as the yellow points shown in Fig. 5. On
ters from the ground. Traffic cameras are generally installed at a the other hand, humans can exploit prior knowledge of ground signs
distance of about 10 m to shoot obliquely downward. On-vehicle to robustly extract the locations of keypoints. This inspires us to
cameras are usually installed at a height of about 1.5 m above the apply the idea of landmark detectors to address the challenges faced
ground and are almost parallel to the ground. As a result, the re- by general keypoint detection. Landmark detection learns the prior
sultant differences in scale, rotation, and viewpoint between two shape and appearance of structured objects to localize a group of pre-
images will result in poor performance for existing image registra- defined points. There are numerous landmark detectors designed
tion methods [19, 36], which is consistent with our experimental to locate the landmarks on human faces (e.g., eye corners, mouth
results (Section 7.4). Second, images captured in traffic scenes often corners, etc.) or bodies (e.g., shoulders, wrists, etc.).
contain a large number of repeated textures such as crosswalk lines, However, despite the robustness, existing landmark detectors
lane lines, etc., which unavoidably lead to similar keypoint patches can not be directly used in our image registration method for the
and ambiguous keypoint features, resulting in a large number of following two reasons. First, although all faces/bodies have the
false matches [36]. Third, existing image registration methods incur same landmark template, there are different categories of ground
significant compute and communication overhead. They usually signs, and each category has a different landmark template. We
extract a large number of keypoints for every frame and describe thus design a novel unified landmark template applicable to all
them in the form of large-size feature vectors, which would need categories of ground signs. Second, landmark detectors can result
to be transmitted from infrastructure to vehicle. Moreover, existing in unsatisfactory landmark localization accuracy (shown as green
image registration methods typically have high compute overhead, points in Fig. 5). To address this issue, we propose a new module, i.e.,
which cannot be used in autonomous driving scenarios with strin- the landmark keypoint extractor, to integrate the landmark detector
gent real-time constraints such as tens of milliseconds of delay. with the general keypoint detector to benefit from both methods:
robustness from the landmark detector and pixel-level localization
4 DESIGN OF AUTOMATCH accuracy from the general keypoint detector (see the red points in
Fig. 5). One additional benefit of the landmark keypoint extractor is
4.1 Motivation and Overview that the following landmark keypoint matching stage can be highly
Our design objective is to achieve pixel-level image registration in computationally efficient. Since all landmark keypoints inside two
real-time with low communication overhead under challenging traf- ground signs can be easily matched once the ground signs are
fic camera-assisted autonomous driving scenarios. As a result, the matched (see Fig. 6), we only need to match the ground signs in
image registration results of AutoMatch can assist the perception the image, which reduces the search space to a large extent. Such
and localization of autonomous vehicles, which can benefit various efficiency of landmark keypoint matching lies in the fact that the
downstream tasks for autonomous driving such as accident alarm- descriptors of the landmark keypoints are implicitly encoded into
ing, route planning, etc. Moreover, in practice, most traffic cameras the class of the ground sign and the index from the template point
are installed around intersections [16, 51]. Our key idea is to utilize set.
landmark keypoints of domain-specific structures to match image The system architecture of AutoMatch is shown in Fig. 4. We
pairs. Focusing on distinctive structures instead of the whole image first detect the regions of interest (ROI) in both images (Section 4.2).
helps to mitigate the adverse effects of large perspective variations Then these ROIs are fed into a novel landmark keypoint extrac-
on image registration and eliminate the ambiguity caused by the tor to extract landmark keypoints (Section 4.3), which contains a
repeated contents. It also leads to high compute efficiency because landmark detection branch, a general keypoint detection branch,
less data is being processed. We select ground signs such as those
SenSys ’22, November 6–9, 2022, Boston, MA, USA Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing and Daqi Xie.

Figure 4: Framework of our image registration approach for traffic camera-assisted autonomous driving.

Landmark keypoint extractor

Landmark detector Landmark-


Coarse Landmarks guided
For each ROI
NMS

Accurate landmark
… General keypoint keypoints
ROIs detector Keypoint heatmap

Figure 7: Design of the landmark keypoint extractor.

widely used in embedded sensing applications, to jointly detect the


Figure 5: Points detected by a general keypoint detector (yellow) [17], bounding boxes of all ground signs in each image and classify each
the landmark detector (green) and AutoMatch (red). Note the challenging
conditions caused by occlusion, incompleteness, uneven lighting, and
sign into one of the seven categories. After detection, we crop the
stains. ground signs according to the detected bounding boxes, and each
ground sign will be processed independently in the subsequent
steps as shown in Fig. 7.
Our training dataset consists of two parts: 1/6 images of two
self-collected datasets (Section 5) and the raw images of the “City”
category in the autonomous driving dataset KITTI [23]. We anno-
tate the bounding boxes and classes of the ground signs in images
and finetune the YOLOv4 model in this dataset. Note that ground
signs in different countries and regions may be slightly different,
Figure 6: Illustrations of the landmark keypoint correspondences between and the ground sign dataset in our method can be updated accord-
two matched ground signs. ingly, which will not affect the generality and performance of our
as well as a newly designed Landmark-guided Non-Maximum Sup- method.
pression (Landmark-guided NMS) module to fuse the two detection
results to obtain accurate landmark keypoints (see Fig. 7). Lastly, 4.3 Landmark Keypoint Extractor
the landmark keypoint matching module (Section 4.4) based on a
newly proposed Group RANSAC algorithm matches the ROIs and The landmark keypoint extractor is designed to extract the land-
landmark keypoints extracted from previous steps. mark keypoints in each ground sign patch in the presence of the
challenges illustrated in Fig. 5. The design of this module is mo-
tivated by the fact that general keypoint detection methods [6,
4.2 Ground Sign Detector 25, 35, 40, 54, 70] usually consider low-level local features, which
Given input images, we first locate the region of interest and ignore will inevitably be affected by imperfections of ground signs and
the unrelated regions to improve the robustness and computational hence lead to noisy and unpredictable keypoints (see the yellow
efficiency. We focus on ground signs because they are commonly points in Fig. 5). In contrast, we propose to extract landmarks fol-
present in complex traffic sections like intersections. Moreover, lowing a pre-defined landmark template. However, unlike facial
most traffic cameras are installed at busy intersections [16, 51]. We landmark detection, which only has a single template, every class
note that our approach can be easily extended to detect other traffic of ground signs has a unique shape. Therefore, We design a uni-
ground markers. We carefully categorize ground signs into seven fied landmark template for all ground signs (shown in Fig. 8),
classes: going straight, turning left, turning right, going straight or which allows AutoMatch to reuse the existing landmark detection
left, going straight or right, turning around, and turning around pipeline. Moreover, since landmark detection can only roughly lo-
or left. We employ YOLOv4 [7], a real-time object detection model calize each landmark but cannot achieve sub-pixel accuracy, we
AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles SenSys ’22, November 6–9, 2022, Boston, MA, USA

Figure 8: Illustration of the unified landmark template (a) and some ex-
amples of ground signs that can be modeled using this template (b).
Figure 9: Illustration of the Landmark-guided NMS method for combining
propose to refine the result of landmark detection using a general the landmark detector and the general keypoint detector.
keypoint detector. To this end, a Landmark-guided NMS algo-
rithm is proposed to integrate both detectors to extract the final 4.3.2 Landmark-guided NMS. Despite robustness, the main lim-
landmark keypoints, where the landmarks serve as guidance for itation of the landmark detector is that the detected landmarks
picking the keypoints to achieve more accurate landmark keypoint do not fall precisely on the corners of the ground sign (see green
localization. Such an approach enables both accurate and highly points in Fig. 5). To address this issue, we use the general keypoint
robust landmark keypoint extraction despite various interferences detector to boost positioning accuracy. We adopt the widely-used
on ground sign appearances. We now discuss each component of general keypoint detector SuperPoint [17], a fast and lightweight
the landmark keypoint extractor in detail. model that computes accurate keypoint locations, which generates
a keypoint response heatmap of the same size as the input. Each
4.3.1 Landmark Detector. We design a new landmark detector pixel of the heatmap corresponds to the probability of the pixel
based on a real-time state-of-the-art facial landmark detector PFLD that is a keypoint. The training process is similar to the one in
[24]. We zero-pad the ground sign patches before feeding them [17]. The difference is that our synthetic dataset only consists of
into the landmark detector to meet the aspect ratio requirement. To structures with corners such as quadrilaterals, triangles, lines, and
be able to generate landmarks with different templates, we design stars, which strengthens the detection of corner-like keypoints. The
a unified landmark template as shown in Fig. 8. All categories of synthetic dataset is rendered on-the-fly, and no example is seen by
ground signs are stacked together with similar components merged, the network twice.
which results in a template with 4 components and a total number We now have the landmarks from the landmark detector and the
of 22 landmarks. Each landmark has its own ID number, which keypoint heatmap from the general keypoint detector. Landmarks
implicitly encodes rich semantic information. The neural network capture the global structure and provide guidance for the positions
will predict the pixel locations of all 22 landmarks. The output of final landmark keypoints. By exploiting this property of land-
landmarks of each ground sign class constitute a subset of these marks, we look for the maximum response of the keypoint heatmap
components, e.g., the turning left sign contains component 2, with around each landmark, to fine-tune the position of landmarks for
a total of 7 landmarks. To achieve this, we define a binary mask 𝑀 the final landmark keypoints. As a result, the final landmark key-
with a length of 22 for each category of ground sign to mask out points not only inherit the landmarks’ expression of the global
unused landmarks. The mask is predefined and determined by the structure but also precisely localize the corner points. Specifically,
class of the ground sign. We then define the training loss as follows: as shown in Fig. 9, we first generate a Gaussian distribution map
centered at each landmark and multiply this Gaussian map with the
|𝑀 | 𝑁
keypoint heatmap pixel-wisely. The pixel with the maximum value
1 ∑︁ ∑︁ 𝑛 𝑛 𝑛 2 in the map is selected as the final landmark keypoint (𝑢, ˆ 𝑣).
ˆ This
L := 𝑀 p − p̂𝑚 2 (1)
|𝑀 |𝑁 𝑚=1 𝑛=1 𝑚 𝑚 operation filters out the keypoints far away from the landmark and
allows the final landmark keypoints to have both rich semantics
where |𝑀 | = 22 is the total number of landmarks and the subscript and accurate locations. Formally, this can be expressed as:
𝑚 indicates the 𝑚-th point. 𝑁 denotes the batch size. p and p̂ are the ˆ 𝑣)
(𝑢, ˆ = argmax𝐺 (𝑢, 𝑣) · 𝐻 (𝑢, 𝑣), (2)
ground truth and predicted locations of each landmark, respectively. (𝑢,𝑣)
This masked loss means that only the landmarks that fall into the where
current ground sign’s category will contribute to the training loss.  
(𝑢 − 𝑢𝑜 ) 2 (𝑣 − 𝑣𝑜 ) 2

The same mask operation is performed in the inference stage, where 𝐺 (𝑢, 𝑣) = exp − + (3)
2𝜎 2 2𝜎 2
only landmarks belonging to the category of the current ground
sign are picked, and other landmarks are discarded. is a Gaussian distribution centered on a landmark (𝑢𝑜 , 𝑣𝑜 ) and
To train the landmark detector, we crop the ground sign bound- 𝐻 (𝑢, 𝑣) represents the keypoint heatmap from the general keypoint
ing boxes from the training dataset mentioned in Section 4.2. Then detector.
we resize and zero-pad them into patches of size 224 × 224 and then
label the landmarks on them. During training, we also add a small 4.4 Group RANSAC
random perturbation of homography transformations to each patch After the previous modules of our pipeline, we now have the ground
to augment the training examples. sign bounding boxes A and B in the two input images, as well as the
SenSys ’22, November 6–9, 2022, Boston, MA, USA Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing and Daqi Xie.

landmark keypoints belonging to each bounding box. To calculate


the final homography H, we need to find all the inlier correspon-
dence between the landmark keypoints of the two images. We
develop a fast landmark keypoint matching algorithm based on the
traditional Random Sample Consensus (RANSAC) [22] algorithm.
Unlike the classical RANSAC that randomly samples matched point
pairs, we sample pairs of bounding boxes that have the same class.
This is motivated by the fact that two matched signs must belong
(a)The self-built car equipped (b)The DJI drone used
to the same class and share the same landmark template (see Fig. 6). with sensors and computing unit. for constructing HD maps.
We name our method Group RANSAC, where the landmark key- Figure 10: Devices used in the system implementation.
points in a template are matched as a group. Specifically, we first
randomly samples two bounding box pairs (A1, B1 ) and (A2, B2 ) Table 1: Summary of two datasets.
from A, and B, respectively, so that the classes of each pair are the # inter- # traffic # vehicle # HD # image
Datasets
same, i.e., 𝐶𝑙𝑎𝑠𝑠 (A𝑖 ) = 𝐶𝑙𝑎𝑠𝑠 (B𝑖 ) for 𝑖 = 1, 2. We can now easily sections cameras images maps pairs
obtain the landmark keypoint correspondences from the bound- Traffic camera-
ing box pairs since the landmark keypoints of a bounding box are 19 48 4544 - 4544
vehicle dataset
arranged in a fixed order as shown in Fig. 8. We then estimate Traffic camera-
32 172 - 32 172
the homography matrix H using all the corresponding landmark HD map dataset
keypoint pairs obtained from the bounding box pairs. We check
the correctness of the estimated H by counting the total number
of inlier landmark keypoint pairs. Two landmark keypoints are
defined as inlier point pairs if 1) they belong to bounding boxes of
the same class, and 2) the reprojection error using H is smaller than
a threshold. When the number of inlier landmark keypoint pairs is
larger than a threshold, we finalize the algorithm by re-estimating
H using all of the inlier landmark keypoint pairs. Otherwise, the Figure 11: Two examples of the collected traffic camera images with dif-
current bounding box pairs are false matches, and we repeat all ferent road types, road widths, and lighting conditions.
of the above steps to continue searching for correct bounding box
pairs. image pairs from 48 traffic cameras at 19 intersections. For the traf-
fic camera-HD map dataset, we use DJI drones to capture the aerial
images of intersections at a speed of 9 m/s, and then generate HD
maps of these intersections with centimeter-level accuracy using
5 TESTBED AND DATASETS the drone image processing software ODM [4]. We also collect
We built a real-world testbed consisting of existing traffic cameras images captured by the traffic cameras at these intersections. In
at intersections, DJI drones, and a self-built autonomous car (see total, we collected traffic camera-HD map image pairs from 172
Fig. 10). DJI drones are equipped with Ultra HD Lenses (Fig. 10(b)) traffic cameras of 32 intersections in 21 cities. In addition, we label
for generating HD maps of intersections. Our self-built car (Fig. 10(a)) the corresponding points for each image pair in both two datasets
is equipped with a small computing unit with an Intel Core i7 CPU manually to provide the ground-truth homography.
and multiple sensors, including two Pointgrey CM3-U3 cameras The collected images in the two datasets cover diverse and com-
and three LiDARs (a Robosense RS32, a Robosense RS16, and a plex traffic scenarios with different road types (i.e., crossroads,
Livox AVIA). In this work, we use one camera of the car to cap- T-junctions, highway entrances and exits), road widths (3 lanes to
ture images. As there is no dataset consisting of multi-view image 12 lanes), road conditions (new or old, under construction or not),
pairs at intersections, i.e., traffic camera-HD map image pairs and and lighting conditions (day and dusk). Some examples of traffic
traffic camera-vehicle image pairs, we collect two new multi-view camera images are shown in Fig. 11. The private information such
intersection image datasets for traffic camera-assisted autonomous as street names, image acquisition timestamps, and license plates
driving. One dataset is the traffic camera-vehicle dataset, which are removed by an independent third-party organization. This data
is collected for the evaluation of traffic camera-assisted vehicle collection is approved by the governing department of the city,
perception. The other dataset is the traffic camera-HD map dataset, and the study is approved by the ethics committee of the authors’
which is collected for the evaluation of traffic camera-assisted vehi- institutes.
cle localization. We summarize our two datasets in Table 1. Below
we describe the data acquisition process of each dataset in detail.
For the traffic camera-vehicle dataset, we manipulate our self-
built autonomous car at a speed of 8 m/s to collect images of the
vehicle’s view around a city’s intersections. Vehicle images are col-
lected by the camera mounted on the car, which is about 1.5 m above
the ground. Meanwhile, we collect the images of traffic cameras at
these intersections. In total, we collected 4544 traffic camera-vehicle
AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles SenSys ’22, November 6–9, 2022, Boston, MA, USA

6 SYSTEM IMPLEMENTATION AND 7.1 Evaluation Metrics


EXPERIMENT SETUP 7.1.1 Perception range gain and Field of View (FoV) gain. In or-
This section introduces the system implementations of the two ap- der to measure how much autonomous vehicles can benefit from
plications, i.e., traffic camera-assisted perception and traffic camera- AutoMatch in perception, we define two application-level metrics,
assisted localization. In the first application, i.e., the traffic camera- the perception range gain and the FoV gain. We project the vehicle
assisted perception, we set up infrastructures and vehicles for per- image to the traffic camera image using the ground truth homog-
ception fusion. We install an NVIDIA Jetson TX2 as the computing raphy, and then calculate the two metrics in the traffic camera
unit on 48 traffic cameras to collect and store the camera images image coordinate. We quantify the two metrics in pixels instead of
at 25 fps. We implement AutoMatch on a laptop and use it as the physical distances because 2D images cannot represent distances
computing platform at the vehicle end. The laptop is equipped in the real world. Perception range gain is the increased ratio in
with an Intel i7-9750H CPU and an NVIDIA RTX2060 Super GPU, distance before and after the image registration. It is defined as
whose computing capability lags far behind that of the mainstream (𝐿𝑡𝑟𝑎𝑓 /𝐿𝑝𝑟𝑜 𝑗 − 1) × 100%, where 𝐿𝑡𝑟𝑎𝑓 and 𝐿𝑝𝑟𝑜 𝑗 are the lengths (in
computing platforms for autonomous driving, such as NVIDIA the vehicle’s heading direction) of the traffic camera image and the
DRIVE AGX Pegasus [10, 48]1 . We collect the images from the projected vehicle image in pixels, respectively. FoV gain is the in-
vehicle camera at 30 fps and store them on the laptop for offline creased ratio in area, which is defined as (𝑁𝑡𝑟𝑎𝑓 /𝑁𝑝𝑟𝑜 𝑗 − 1) × 100%,
processing. Moreover, we use an 802.11ac WiFi router for wireless where 𝑁𝑡𝑟𝑎𝑓 and 𝑁𝑝𝑟𝑜 𝑗 are the total pixel numbers of the traffic
communication between the Jetson TX2 and the laptop to simu- camera image and projected vehicle image, respectively.
late the communication between the traffic camera and the vehicle. 7.1.2 RRE, RTE, and localization error. To evaluate the performance
The data transfer takes place through UDP broadcasting, which of AutoMatch in assisting the localization of autonomous vehicles,
transmits the infrastructure key points and perception information we first measure how accurate the traffic camera can localize ve-
(object bounding boxes). The data transmission frequency is set to hicles in world coordinate, which is equivalent to measuring the
2 Hz which is consistent with the frequency of decision-making on accuracy of the dense correspondence between pixels in the traffic
autonomous vehicles [32]. We simply discard extra frames that are camera image and those in the HD map. Specifically, we measure
not used for communication. The second application, i.e., traffic localization error, the distance between the localized world position
camera-assisted localization, requires image registration between of the vehicle in the constructed traffic camera local map and the
traffic camera images and HD maps. This application is an offline ground truth position of the vehicle in the HD map. This metric
task and can be implemented by running AutoMatch with traffic reflects the accuracy of the local map. Another metric to measure
camera images and HD maps inputs. the performance of traffic camera-assisted vehicle localization is
We train the ground sign detector and the landmark keypoint the accuracy of traffic camera pose estimation. Only if the pose
detector with PyTorch [50] using the two datasets (Section 5) on a estimation of the traffic camera itself is accurate can it accurately
server equipped with Intel Xeon Silver 4210 CPU and one Nvidia locate the vehicles in its field of view. The traffic camera pose can
RTX2080Ti GPU. The training of the two detectors takes around 20 be derived based on the homography between the image and the
hours in total. The implementing details can be found in Section HD map. We adopt two metrics - the relative rotation error (RRE)
4. For inference, we export the trained models in ONNX format and the relative translational error (RTE) used in [13, 14, 20] to
[2] using TensorRT [1] on Jetson TX2. For a brand-new region, the evaluate the errors of the estimated traffic camera poses. RRE is
ground sign detector and the landmark keypoint detector need to be defined as:
retrained or fine-tuned. Therefore the training overhead is roughly 𝐸𝑅 = |𝜃 | + |𝜙 | + |𝜓 | 
the same as the overhead we mentioned earlier. Considering the (4)
(𝜃, 𝜙,𝜓 ) = 𝐹 𝑅𝑇−1 𝑅𝐸
training is a one-time offline task, the overhead is reasonable in
this setting. where 𝑅𝑇 and 𝑅𝐸 are the rotation matrices decomposed from the
ground-truth homography and the estimated homography, respec-
7 EVALUATION tively. 𝐹 (·) transforms a rotation matrix to three Euler angles (𝜃, 𝜙,𝜓 ).
RRE is the sum of the absolute differences in three Euler angles. RTE
In this section, we first define evaluation metrics in Section 7.1.
is defined as: 𝐸𝑇 = ∥𝑡𝑇 − 𝑡𝐸 ∥ 2 , where 𝑡𝑇 and 𝑡𝐸 are the translation
Then, we present an end-to-end evaluation of AutoMatch in Sec-
vectors decomposed from the ground-truth homography and the
tion 7.2. Next, we show application-level results in Section 7.3,
estimated homography, respectively.
which show that AutoMatch can not only significantly extend the
vehicle’s perception range but also provide vehicles with high- 7.1.3 Reprojection error and MMA. To compare the image regis-
precision localization. In addition, we compare the performance tration performance with other algorithms, we follow the same
of AutoMatch with other methods on two real-world multi-view methodology in [19, 41], which computes the reprojection error
intersection image datasets in Section 7.4. Finally, we conduct an and the mean matching accuracy (MMA). Reprojection error is the
ablation study to validate the effectiveness of our method in Sec- Euclidean distance between the observed image point 𝑝 and the
tion 7.5. image point 𝑝 ′ reprojected from the other image. It reflects the
accuracy of the estimated homography transformation. MMA is
the average percentage of correct keypoint matches per image pair.
1 NVIDIA DRIVE AGX Pegasus can achieve 320 TOPS (trillion operations per second) A keypoint match is considered correct if its reprojection error
of computing capability, while that of NVIDIA GeForce RTX 2060 is only 14 TOPS. estimated using the ground truth homography is below a given
SenSys ’22, November 6–9, 2022, Boston, MA, USA Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing and Daqi Xie.

Table 2: The communication overhead of different methods for boosting


vehicle perception.
Data size for Overall shared Bandwidth
Methods
registration data size needed
SIFT 73.8 KB 76.9 KB 1.2 Mbps
SuperGlue 121.2 KB 124.3 KB 2.0 Mbps
D2-Net 31.7 MB 31.7 MB 507.2 Mbps
AutoMatch 1.4 KB 4.5 KB 72 Kbps
AutoMatch is able to register the images from traffic cameras and
Figure 12: The delay and reprojection error in an end-to-end evaluation
experiment where our vehicle passes by roadside traffic cameras. vehicles in real time. We can also see that AutoMatch achieves pixel-
level image registration between traffic camera images and vehicle
images. The reprojection error can be a bit larger when the vehicle
is far away from the traffic camera or when it drives away from
the traffic camera. We also evaluate the performance of AutoMatch
when the vehicle goes at different speeds. Fig. 13 shows that when
the vehicle’s speed is 16 m/s, the reprojection error is similar to
that at 8 m/s. Fig. 14 shows the sensing distance gain vs. distance
between the vehicle and the traffic camera. The result shows that
when the vehicle has not yet entered the camera’s field of view, the
perception range can be increased by around 65%. This is because
Figure 13: The reprojection errors in end-to-end evaluation experiments the perception improvement is more significant when the overlap of
where the vehicle goes at different speeds.
the two fields of view is small. The perception range gain becomes
stable at around 45% when the vehicle enters the traffic camera’s
field of view.

7.2.2 Communication overhead. In the following evaluation, we


compare AutoMatch with four image registration algorithms. These
baselines have the same settings as AutoMatch: Input two images
and output the homography between the two images. (i) SIFT [35],
a traditional and the most widely used image registration algo-
rithm; (ii) SuperGlue [52], an algorithm based on Graph Neural
Figure 14: Sensing distance gain in the process of the vehicle gradually Network(GNN) proposed recently and also one of the state-of-the-
approaching the traffic camera. art (SOTA) image registration algorithms; (iii) COTR [30], the latest
registration algorithm based on transformer; (iv) D2-Net [19], a
threshold. This metric measures: 1) the repeatability of the key-
typical CNN-based algorithm. The implementation of SIFT is from
points: the same points in the two images need to be detected. 2) the
OpenCV [8]. For the other three baselines, we used the codes pub-
distinguishability of the keypoints detected: two different keypoints
lished by the authors and adjusted the parameters to yield the best
that look similar should not be confused as one. 3) the quality of
performance in our datasets.
the matching algorithm.
We evaluate the communication overhead of AutoMatch for traf-
fic camera-vehicle image registration by comparing it with these
7.2 End-to-End System Evaluation baselines. Communication is a simple channel from the traffic cam-
7.2.1 Delay and error. To evaluate the end-to-end system perfor- era to the vehicle. The total data to be shared consists of two
mance of AutoMatch, we implement our system as described in parts: one used for registration, i.e., the extracted keypoints and the
Section 6. We take a typical process where a vehicle passes by an keypoint descriptors, and the other is the perception information,
intersection as an example. The laptop continuously receives data which is implemented as the object bounding boxes. We do not
from the Jetson TX2 and registers the traffic camera image with its compare with baseline COTR since it registers two images in an
own image. We record the results of the image registrations and end-to-end manner, which requires the infrastructure to directly
then calculate the reprojection errors, the perception range gains, share raw images. We evaluate the average data volume used for
and the end-to-end delay. Note that the end-to-end delay includes registration and the overall average data volume shared between
both communication delay and processing time. The three metrics traffic cameras and vehicles. Besides, we also evaluate the com-
demonstrate the registration accuracy, the perception improvement, munication bandwidth needed for each method. The frequency of
and the real-time performance of AutoMatch. Since delay is a major data broadcasting from the traffic camera is set to 2 Hz as discussed
concern in autonomous driving, we also report the maximum delay in Section 5. Table 2 shows the evaluation results on the traffic
among multiple experiments. camera-vehicle dataset. It can be seen that AutoMatch reduces the
Fig. 12 shows that the maximum end-to-end delay is 82 ms, which data volume for registration and the overall shared data volume by
is faster than the processing speed (typically 100 ms per image) of about 53× and 17× compared with SIFT baselines. AutoMatch only
mainstream image-based visual tasks [12]. The results show that needs to transmit 4.5 KB data per frame to boost vehicle perception,
AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles SenSys ’22, November 6–9, 2022, Boston, MA, USA

Table 3: Traffic camera pose estimation results of baselines and AutoMatch


on the traffic camera-HD map dataset.

Methods RRE RTE


SIFT 102.71° 236.75 cm
SuperGlue 31.25° 42.11 cm
COTR 79.43° 57.43 cm
D2-Net 68.22° 52.39 cm
Figure 15: Histograms of perception range gain and FoV gain before and
after registration.
AutoMatch 2.41° 9.57 cm

among which only 31% are used for registration, compared to that
of almost 100% for other baselines. The three baselines demonstrate
high communication overhead since they extract massive keypoints
and heavy descriptors for each keypoint. Besides, the bandwidth
requirement of AutoMatch is as low as 72 Kbps, which can be easily
supported by the current LTE network.

7.3 Application-Level Results


In this section, We first evaluate how much application-level percep- Figure 16: A color-coded localization error map.
tion and localization benefits AutoMatch can bring to autonomous
vehicles using real traffic datasets. This evaluation supports our
claims that: (i) AutoMatch implements and extends the vehicle’s
perception to areas that cannot be seen without the traffic camera-
vehicle image registration; (ii) AutoMatch accurately constructs
the local map of the traffic camera by matching the traffic camera
image with an HD map, which enables the traffic camera to localize (a) (b)
vehicles in its view. Then we discuss the robustness of AutoMatch
to different lighting conditions and traffic conditions.
7.3.1 Boosting vehicle perception. For each traffic camera-vehicle
image pair, we calculate the perception range gain and FoV gain of
the vehicle after image registration. As the perception range gain
and FoV gain vary under different situations (e.g., different scenes,
(c) (d)
the relative position between the vehicle and the traffic camera), Figure 17: Illustration of the robustness of AutoMatch. (a) and (b) show the
we instead plot the distributions of these two metrics in Fig. 15. result of AutoMatch in a dimly lit evening with a ground sign completely
It can be seen that AutoMatch significantly improves autonomous obscured by a white vehicle. (c) and (d) show the result in bright daytime
with two ground signs partially obscured by two black vehicles.
vehicles’ perception range by an average of 47.6%, and increases
the vehicle’s FoV by an average of 72.9%. In the best case, the FoV error map, the localization error is relatively large, because each
of the vehicle can be more than doubled. pixel at that region typically occupies more than 15 cm in world
space.
7.3.2 High-precision vehicle localization. We present localization
evaluation by comparing it with the four image registration al- 7.3.3 Performance under varied lighting and traffic conditions. Next,
gorithms introduced in Section 7.2.2. Table 3 shows the average we use two typical results in the traffic camera-vehicle datasets
RRE and RTE scores of AutoMatch and four baselines on the traffic to discuss the robustness of our system under different lighting
camera-HD map dataset. The average RREs of all four baselines conditions and traffic conditions. Heavy traffic may cause differ-
are more than 30◦ , while AutoMatch only generates 2.41◦ RRE. The ent degrees of occlusions of ground signs. Fig. 17 show two traffic
average RTEs of baselines are larger than 42 cm while AutoMatch camera-vehicle image pairs captured in the dimly lit evening and
is less than 10 cm. The large RREs and RTEs from baselines intro- bright daytime respectively. Results show that AutoMatch can work
duce non-trivial challenges to localizing autonomous vehicles. In well in different lighting conditions. This is because we apply data
contrast, AutoMatch outperforms the four baselines by 7.7% and augmentation techniques such as brightness level changes, motion
22.73% ∼ 4.04% in average RREs and RTEs, respectively. blur, and homography warps in the training process to improve
We then calculate the localization error of AutoMatch for lo- AutoMatch’s robustness to lighting and viewpoint changes. For
calizing autonomous vehicles. To visualize the results intuitively, the case where the ground sign is fully or partially occluded, we
we visualize a localization error map in Fig. 16, which shows the also show two examples in Fig. 17(a,c). In Fig. 17(a), a ground sign
localization error when a vehicle appears in different positions in is completely occluded by a white vehicle. There are also ground
the camera’s field of view. In other words, the error map shows the signs that are not visible from the traffic camera and the vehicle
accuracy of the local map. We can see that the localization error is camera at the same time. Results show that our method success-
smaller than 20 cm in 70% of the region. Note that at the top of the fully registers the images. This is achieved by the fact that the
SenSys ’22, November 6–9, 2022, Boston, MA, USA Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing and Daqi Xie.

Table 4: Results of different registration algorithms on the two real traffic


datasets.

Reproj.
Datasets Methods Run time MMA
error
SIFT 218.256 px 7.440 s 17.58%
SuperGlue 74.579 px 0.143 s 47.13%
Traffic camera-
COTR 91.587 px 174.730 s 40.77%
vehicle dataset
D2-Net 77.003 px 1.543 s 29.23%
AutoMatch 2.986 px 0.043 s 96.01%
SIFT 143.476 px 0.629 s 12.39%
SuperGlue 49.106 px 0.125 s 49.74%
Traffic camera-
COTR 68.402 px 67.713 s 35.22%
HD map dataset
D2-Net 61.284 px 0.921 s 21.16%
AutoMatch 4.215 px 0.088 s 92.83%

key structures allows us to detect landmark keypoints that have


high overlap rates in both images, and the explicit semantics of the
landmark keypoints allows us to match them easily. For run time,
AutoMatch is 1.42 to 4063 times faster than other baselines. The run
time of COTR is significantly longer than the other three baselines
due to the use of transformer architecture. We can also see that
AutoMatch’s performance on the traffic camera-HD map dataset
Figure 18: Qualitative results of the four baselines and AutoMatch in a real
is slightly worse than that on the traffic camera-vehicle dataset.
traffic scene. AutoMatch detects fewer keypoints (yellow) while estimates This is reasonable because HD maps have much higher resolutions
all correct matches (green) without false matches (red). than vehicle images, i.e., 7900 × 7900 vs. 1920 × 1080, and cover
a broader range. High resolution naturally leads to numerically
Group RANSAC algorithm maximizes the matches between the sets
larger reprojection errors as the reprojection error is evaluated on
of ground signs in the two images without requiring them to be
the pixel. Broader range results in more matching ground sign pairs
identical. In Fig. 17(c), two ground signs are partially obscured by
between HD maps and traffic camera images, which further leads to
vehicles. However, our method still successfully estimates the lo-
a longer search time for the landmark keypoint matching module,
cations of occluded landmark keypoints thanks to the landmark
and finally results in a longer run time.
keypoint extractor, which encodes the structure prior of the ground
Fig. 19 shows some qualitative image registration results on
sign. In conclusion, the proposed system is robust to a certain level
the traffic camera-HD map dataset, which shows that AutoMatch
of occlusions or incompleteness.
achieves more precise image registration results compared to other
7.4 Performance Comparison baselines (see the anastomosis of crossroads). Note that AutoMatch
not only focuses on the ground sign structures nearby but also
We present extensive performance evaluations by comparing with
manages to match ground signs at distance, which further improves
the same four baselines in Section 7.3.2 on the two datasets. We
the registration accuracy.
visualize a typical example of the registration result in a real traffic
scene in Fig. 18. We can see that all baselines produce hundreds or
even thousands of keypoints but can only correctly match a few 7.5 Ablation study
of them. The reason is that the baselines tend to extract keypoints We validate our landmark keypoint extractor with an ablation study.
on the roadside or distant buildings and trees, which are indistin- The ablation aims to prove the effectiveness of our design of inte-
guishable from each other or even not co-visible in both images. grating the landmark detector with the general keypoint detector.
This not only makes the registration inefficient but also produces We compare our full landmark keypoint extractor (Full) with abla-
less accurate results due to lots of false matches. On the other hand, tions that with only the landmark detector (LD only) or only the
AutoMatch only focuses on landmark keypoints of ground signs general keypoint detector (GKD only) to generate the final key-
and matches them accurately thanks to our landmark keypoint points. Other modules of our pipeline are kept unchanged. Note
extractor. that when we extract keypoints using only the general keypoint
Table 4 shows the numeric results of AutoMatch and other base- detector, the keypoints are unstructured and thus can not be used
lines on the two real traffic datasets, where we report the reprojec- in the proposed Group RANSAC. Therefore, we adopt SuperGlue and
tion error, MMA, and run time. For reprojection error, AutoMatch Nearest Neighbor search (NN ) [45] as the keypoint matching meth-
is at most a quarter of the most accurate baseline. For MMA, all ods when we experiment the GKD only. We report reprojection
four baselines are less than 50%. In contrast, AutoMatch achieves error, MMA, keypoint detection run time, and keypoint matching
more than 90% correct matches in both datasets. This is because the run time on the traffic camera-vehicle dataset at Table. 5.
baselines tend to detect many irrelevant keypoints, thus lowering We can see that while being slightly slower than others in terms
the distinguishability of the keypoints and increasing the difficulty of keypoint detection run time, our Full model achieves the small-
of keypoint matching. While in AutoMatch, focusing on common est reprojection error and highest MMA. And the proposed Group
AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles SenSys ’22, November 6–9, 2022, Boston, MA, USA

Figure 19: Registration results between an HD map and a traffic camera image in the real traffic scene.

Table 5: Quantitative results of ablation study.


Methods Reproj. Detection Matching
MMA
Detector Matching error time (ms) time (ms)
LD only Group RANSAC 6.53 px 92.60% 37.91 0.23
GKD only SuperGlue 17.65 px 72.01% 36.24 24.61
GKD only NN 53.48 px 57.89% 36.51 0.54
Full Group RANSAC 2.99 px 96.01% 43.24 0.21

8 CONCLUSION AND FUTURE WORK


In conclusion, we present AutoMatch, the first system that matches
traffic camera-vehicle image pairs or traffic camera-HD map im-
age pairs at pixel-level accuracy with low communication/compute
overhead in real-time, which is a key technology for leveraging
traffic camera for assisting the perception and localization of au-
tonomous driving. Extensive evaluations on two self-collected datasets
show that AutoMatch outperforms SOTA baselines in robustness,
accuracy, and efficiency. In the future, we will extend our approach
to integrate the perceptions of multiple cameras which are typically
installed in different directions around a road intersection. We will
also study how to leverage such results to assist the perception and
localization of autonomous vehicles.

REFERENCES
[1] n.d.. Nvidia TENSORRT. https://developer.nvidia.com/tensorrt.
Figure 20: Qualitative results of ablation study. Note that the LD only + [2] n.d.. Open Neural Network Exchange. https://onnx.ai/.
Group RANSAC tends to detect inaccurate landmark locations as high- [3] Eduardo Arnold, Mehrdad Dianati, Robert de Temple, and Saber Fallah. 2020.
lighted in blue. Cooperative perception for 3D object detection in driving scenarios using in-
frastructure sensors. IEEE Transactions on Intelligent Transportation Systems
(2020).
RANSAC achieves at least two orders of magnitude faster than the [4] OpenDroneMap Authors. 2020. ODM - A command line toolkit to generate
SOTA matching algorithm SuperGlue. We also visualize the matches maps, point clouds, 3D models and DEMs from drone, balloon or kite images.
in Fig. 20. We can see that without the guidance of landmarks, GKD https://github.com/OpenDroneMap/ODM.
[5] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. 2016.
only + SuperGlue and GKD only + NN produce many noisy and indis- Learning local feature descriptors with triplets and shallow convolutional neural
criminative keypoints and further lead to numerous false matches, networks.. In Bmvc, Vol. 1. 3.
which are consistent with the quantitative results in Table 5. On [6] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust
features. In European conference on computer vision. Springer, 404–417.
the other hand, if we only use the landmark detector (LD only + [7] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. Yolov4:
Group RANSAC), although the landmarks are correctly matched, as Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934
highlighted in blue in Fig. 20, they suffer from inaccurate location, (2020).
[8] Gary Bradski and Adrian Kaehler. 2008. Learning OpenCV: Computer vision with
which causes performance degradation. By contrast, our Full model the OpenCV library. " O’Reilly Media, Inc.".
predicts accurate structured keypoint locations and matches all of [9] Matthew Brown, Gang Hua, and Simon Winder. 2010. Discriminative learning
them correctly by combining the benefits of the general keypoint of local image descriptors. IEEE transactions on pattern analysis and machine
intelligence 33, 1 (2010), 43–57.
detector and the landmark detector. Besides, it is also worth notic- [10] Andrew Burnes. 2019. Introducing GeForce RTX SUPER Graphics Cards: Best In
ing that the performance of GKD only + SuperGlue is significantly Class Performance, Plus Ray Tracing. https://www.nvidia.com/en-us/geforce/
news/geforce-rtx-20-series-super-gpus/.
better than the SuperGlue in Table 4. They share the same pipeline
[11] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-
with the only difference being that the GKD only + SuperGlue works person 2d pose estimation using part affinity fields. In Proceedings of the IEEE
on bounding boxes instead of the whole image, which validates our conference on computer vision and pattern recognition. 7291–7299.
[12] Long Chen, Shaobo Lin, Xiankai Lu, Dongpu Cao, Hangbin Wu, Chi Guo, Chun
core idea of focusing on key structures instead of the whole image. Liu, and Fei-Yue Wang. 2021. Deep neural network based vehicle and pedestrian
detection for autonomous driving: A survey. IEEE Transactions on Intelligent
Transportation Systems 22, 6 (2021), 3234–3246.
SenSys ’22, November 6–9, 2022, Boston, MA, USA Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing and Daqi Xie.

[13] Christopher Choy, Wei Dong, and Vladlen Koltun. 2020. Deep global registra- [38] Iaroslav Melekhov, Aleksei Tiulpin, Torsten Sattler, Marc Pollefeys, Esa Rahtu,
tion. In Proceedings of the IEEE/CVF conference on computer vision and pattern and Juho Kannala. 2019. Dgc-net: Dense geometric correspondence network. In
recognition. 2514–2523. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE,
[14] Christopher Choy, Jaesik Park, and Vladlen Koltun. 2019. Fully Convolutional 1034–1042.
Geometric Features. In ICCV. [39] Krystian Mikolajczyk and Cordelia Schmid. 2004. Scale & affine invariant interest
[15] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L Yuille, and Xiaogang point detectors. International journal of computer vision 60, 1 (2004), 63–86.
Wang. 2017. Multi-context attention for human pose estimation. In Proceedings [40] Krystian Mikolajczyk and Cordelia Schmid. 2004. Scale & affine invariant interest
of the IEEE conference on computer vision and pattern recognition. 1831–1840. point detectors. International journal of computer vision 60, 1 (2004), 63–86.
[16] BRITISH COLUMBIA. 2019. Where intersection safety cameras are lo- [41] Krystian Mikolajczyk and Cordelia Schmid. 2005. A performance evaluation of
cated. https://www2.gov.bc.ca/gov/content/transportation/driving-and-cycling/ local descriptors. IEEE transactions on pattern analysis and machine intelligence
roadsafetybc/intersection-safety-cameras/where-the-cameras-are. 27, 10 (2005), 1615–1630.
[17] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2018. Superpoint: [42] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri
Self-supervised interest point detection and description. In Proceedings of the Matas, Frederik Schaffalitzky, Timor Kadir, and L Van Gool. 2005. A comparison
IEEE conference on computer vision and pattern recognition workshops. 224–236. of affine region detectors. International journal of computer vision 65, 1 (2005),
[18] Jingming Dong and Stefano Soatto. 2015. Domain-size pooling in local descriptors: 43–72.
DSP-SIFT. In Proceedings of the IEEE conference on computer vision and pattern [43] Yanghui Mo, Peilin Zhang, Zhijun Chen, and Bin Ran. 2021. A method of vehicle-
recognition. 5097–5106. infrastructure cooperative perception based vehicle state information fusion
[19] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko using improved kalman filter. Multimedia Tools and Applications (2021), 1–18.
Torii, and Torsten Sattler. 2019. D2-net: A trainable cnn for joint description and [44] Marius Muja and David G Lowe. 2009. Fast approximate nearest neighbors with
detection of local features. In Proceedings of the IEEE/cvf conference on computer automatic algorithm configuration. VISAPP (1) 2, 331-340 (2009), 2.
vision and pattern recognition. 8092–8101. [45] Marius Muja and David G Lowe. 2014. Scalable nearest neighbor algorithms
[20] G. Elbaz, T. Avraham, and A. Fischer. 2017. 3D Point Cloud Registration for for high dimensional data. IEEE transactions on pattern analysis and machine
Localization Using a Deep Neural Network Auto-Encoder. In 2017 IEEE Conference intelligence 36, 11 (2014), 2227–2240.
on Computer Vision and Pattern Recognition (CVPR). 2472–2481. https://doi.org/ [46] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. 2015. ORB-
10.1109/CVPR.2017.265 SLAM: a versatile and accurate monocular SLAM system. IEEE transactions on
[21] Alessio Fascista, Giovanni Ciccarese, Angelo Coluccia, and Giuseppe Ricci. 2017. robotics 31, 5 (2015), 1147–1163.
Angle of arrival-based cooperative positioning for smart vehicles. IEEE Transac- [47] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks
tions on Intelligent Transportation Systems 19, 9 (2017), 2880–2892. for human pose estimation. In European conference on computer vision. Springer,
[22] Martin A. Fischler and Robert C. Bolles. 1981. Random Sample Consensus: A 483–499.
Paradigm for Model Fitting with Applications to Image Analysis and Automated [48] NVIDIA. 2022. HARDWARE FOR SELF-DRIVING CARS. https://www.nvidia.
Cartography. Commun. ACM 24, 6 (June 1981), 381–395. https://doi.org/10.1145/ com/en-us/self-driving-cars/drive-platform/hardware/.
358669.358692 [49] Giuseppe Palestra, Adriana Pettinicchio, Marco Del Coco, Pierluigi Carcagnì,
[23] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for au- Marco Leo, and Cosimo Distante. 2015. Improved performance in facial expression
tonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on recognition using 32 geometric features. In International Conference on Image
Computer Vision and Pattern Recognition. IEEE, 3354–3361. Analysis and Processing. Springer, 518–528.
[24] Xiaojie Guo, Siyuan Li, Jinke Yu, Jiawan Zhang, Jiayi Ma, Lin Ma, Wei Liu, and [50] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Haibin Ling. 2019. PFLD: A practical facial landmark detector. arXiv preprint Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019.
arXiv:1902.10859 (2019). Pytorch: An imperative style, high-performance deep learning library. Advances
[25] Chris Harris, Mike Stephens, et al. 1988. A combined corner and edge detector. in neural information processing systems 32 (2019).
In Alvey vision conference. Citeseer, 10–5244. [51] radenso. 2021. What’s the difference between traffic cameras, red light cameras,
[26] Richard Hartley and Andrew Zisserman. 2003. Multiple view geometry in computer and speed cameras? https://radenso.com/blogs/radar-university/what-s-the-
vision. Cambridge university press. difference-between-traffic-cameras-red-light-cameras-and-speed-cameras.
[27] Jared Heinly, Johannes L Schonberger, Enrique Dunn, and Jan-Michael Frahm. [52] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi-
2015. Reconstructing the world* in six days*(as captured by the yahoo 100 novich. 2020. Superglue: Learning feature matching with graph neural networks.
million image dataset). In Proceedings of the IEEE conference on computer vision In Proceedings of the IEEE/CVF conference on computer vision and pattern recogni-
and pattern recognition. 3287–3295. tion. 4938–4947.
[28] INSIDER. 2016. Here’s why self-driving cars can’t handle bridges. <http://www. [53] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys.
businessinsider.com/autonomous-cars-bridges-2016-8. 2017. Quad-networks: unsupervised learning to rank for interest point detection.
[29] Mahdi Javanmardi, Ehsan Javanmardi, Yanlei Gu, and Shunsuke Kamijo. 2017. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Towards high-definition 3D urban mapping: Road feature-based registration of 1822–1830.
mobile mapping systems and aerial imagery. Remote Sensing 9, 10 (2017), 975. [54] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys.
[30] Wei Jiang, Eduard Trulls, Jan Hosang, Andrea Tagliasacchi, and Kwang Moo 2017. Quad-networks: unsupervised learning to rank for interest point detection.
Yi. 2021. Cotr: Correspondence transformer for matching across images. In In Proceedings of the IEEE conference on computer vision and pattern recognition.
Proceedings of the IEEE/CVF International Conference on Computer Vision. 6207– 1822–1830.
6217. [55] Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion
[31] Jialin Jiao. 2018. Machine learning assisted high-definition map creation. In 2018 revisited. In Proceedings of the IEEE conference on computer vision and pattern
IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), recognition. 4104–4113.
Vol. 1. IEEE, 367–373. [56] Heiko G Seif and Xiaolong Hu. 2016. Autonomous driving in the iCity—HD maps
[32] Felix Kam and Henrik Mellin. 2019. Different frequencies of maneuver replanning as a key challenge of the automotive industry. Engineering 2, 2 (2016), 159–162.
on autonomous vehicles. [57] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and
[33] I Karls and M Mueck. 2018. Networking vehicles to everything. Evolving auto- Francesc Moreno-Noguer. 2015. Discriminative learning of deep convolutional
motive solutions. feature point descriptors. In Proceedings of the IEEE international conference on
[34] S. Kuutti, S. Fallah, K. Katsaros, M. Dianati, F. Mccullough, and A. Mouzakitis. 2018. computer vision. 118–126.
A Survey of the State-of-the-Art Localization Techniques and Their Potentials [58] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks
for Autonomous Vehicle Applications. IEEE Internet of Things Journal 5, 2 (2018), for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
829–846. https://doi.org/10.1109/JIOT.2018.2812300 [59] Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2013. Deep convolutional network
[35] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. cascade for facial point detection. In Proceedings of the IEEE conference on computer
International journal of computer vision 60, 2 (2004), 91–110. vision and pattern recognition. 3476–3483.
[36] Jiayi Ma, Xingyu Jiang, Aoxiang Fan, Junjun Jiang, and Junchi Yan. 2021. Image [60] The N.Y. Times. 2017. Building a road map for the self-driving car. <https://www.
matching from handcrafted to deep features: A survey. International Journal of nytimes.com/2017/03/02/automobiles/wheels/selfdriving-cars-gps-maps.html.
Computer Vision 129, 1 (2021), 23–79. [61] Prune Truong, Martin Danelljan, Luc V Gool, and Radu Timofte. 2020. GOCor:
[37] Juliette Marais, Cyril Meurie, Dhouha Attia, Yassine Ruichek, and Amaury Flanc- Bringing globally optimized correspondence volumes into your neural network.
quart. 2014. Toward accurate localization in guided transport: Combining GNSS Advances in Neural Information Processing Systems 33 (2020), 14278–14290.
data and imaging information. Transportation Research Part C: Emerging Tech- [62] Prune Truong, Martin Danelljan, and Radu Timofte. 2020. GLU-Net: Global-local
nologies 43 (2014), 188–197. universal network for dense flow and correspondences. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition. 6258–6268.
AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles SenSys ’22, November 6–9, 2022, Boston, MA, USA

[63] Federal Highway Administration U.S. Department of Transportation. 2002. United Workshop on High-Precision Maps and Intelligent Applications for Autonomous
States Pavement Markings. https://mutcd.fhwa.dot.gov/services/publications/ Vehicles. 1–8.
fhwaop02090/index.htm. [69] Linguang Zhang and Szymon Rusinkiewicz. 2018. Learning to detect features
[64] Jessica Van Brummelen, Marie O’Brien, Dominique Gruyer, and Homayoun in texture images. In Proceedings of the IEEE conference on computer vision and
Najjaran. 2018. Autonomous vehicle perception: The technology of today and pattern recognition. 6325–6333.
tomorrow. Transportation research part C: emerging technologies 89 (2018), 384– [70] Linguang Zhang and Szymon Rusinkiewicz. 2018. Learning to detect features
406. in texture images. In Proceedings of the IEEE conference on computer vision and
[65] Harsha Vardhan. 2017. HD Maps: New age maps powering autonomous vehicles. pattern recognition. 6325–6333.
Geospatial world 22 (2017). [71] Xumiao Zhang, Anlan Zhang, Jiachen Sun, Xiao Zhu, Y Ethan Guo, Feng Qian, and
[66] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convo- Z Morley Mao. 2021. Emp: Edge-assisted multi-vehicle perception. In Proceedings
lutional pose machines. In Proceedings of the IEEE conference on Computer Vision of the 27th Annual International Conference on Mobile Computing and Networking.
and Pattern Recognition. 4724–4732. 545–558.
[67] Ron Weinstein. 2005. RFID: a technical overview and its application to the [72] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. 2013. Extensive
enterprise. IT professional 7, 3 (2005), 27–33. facial landmark localization with coarse-to-fine convolutional network cascade.
[68] Andi Zang, Runsheng Xu, Zichen Li, and David Doria. 2017. Lane boundary In Proceedings of the IEEE international conference on computer vision workshops.
extraction from satellite imagery. In Proceedings of the 1st ACM SIGSPATIAL 386–391.

You might also like