Image Patch-Matching with Graph-Based Learning in Street Scenes
Rui She, Qiyu Kang, Sijie Wang, Wee Peng Tay, Senior Member, IEEE,
Yong Liang Guan, Senior Member, IEEE, Diego Navarro Navarro, and Andreas Hartmannsgruber

arXiv:2311.04617v1 [cs.CV] 8 Nov 2023

Abstract—Matching landmark patches from a real-time image captured by an on-vehicle camera with landmark patches in an image database plays an important role in various computer perception tasks for autonomous driving. Current methods focus on local matching for regions of interest and do not take into account spatial neighborhood relationships among the image patches, which typically correspond to objects in the environment. In this paper, we construct a spatial graph with the graph vertices corresponding to patches and edges capturing the spatial neighborhood information. We propose a joint feature and metric learning model with graph-based learning. We provide a theoretical basis for the graph-based loss by showing that the information distance between the distributions conditioned on matched and unmatched pairs is maximized under our framework. We evaluate our model using several street-scene datasets and demonstrate that our approach achieves state-of-the-art matching results.

Index Terms—Image patch-matching, graph neural network, Kullback-Leibler divergence, information distance maximization, visual place recognition

Fig. 1. Landmark patch-matching using spatial graphs in street scenes and its potential applications. (The figure shows roadside semantic object patches and their spatial information/relative locations feeding an image patch-matching model with graph-based learning; listed applications include vehicle self-localization using landmark maps, place recognition, depth estimation, odometry estimation, 3D reconstruction, and SLAM.)
This work is supported by the Singapore Ministry of Education Academic Research Fund Tier 2 grant MOE-T2EP20220-0002 and the RIE2020 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). The first two authors, R. She and Q. Kang, contributed equally to this work.
R. She, Q. Kang, S. Wang, W. P. Tay and Y. L. Guan are with the Continental-NTU Corporate Lab, Nanyang Technological University, Singapore 639798 (emails: {rui.she@;wang1679@e.;qiyu.kang@;wptay@;eylguan@}ntu.edu.sg).
D. N. Navarro and A. Hartmannsgruber are with Continental Automotive Singapore Pte. Ltd., 80 Boon Keng Road, 339780, Singapore (emails: {diego.navarro.navarro;andreas.hartmannsgruber}@continental.com).

I. INTRODUCTION

As a critical and fundamental technique in visual perception, image matching is widely used in many applications, such as image retrieval [1] and vehicle re-identification [2]. Conceptually, the target of a matching task is to solve the similarity correspondence problem for contents from an image pair [3]–[5]. In landmark-based street-scene applications, semantic objects such as traffic signs, traffic lights and road-side poles [6]–[8] often serve as landmarks. The correspondence between the landmark patches captured at different locations may be further utilized as cornerstones to solve other problems, including loop-closure detection in simultaneous localization and mapping (SLAM) [7], [9], place recognition [10], [11], multi-view camera relocalization [12], landmark-LiDAR vehicle relocalization [6], [13], and landmark-based odometry estimation [8].

In traditional image patch-matching methods, handcrafted local features using pixel statistics or gradient information, such as SIFT [14], SURF [15], HOG [16] and ORB [17], are used. The similarity of a feature pair is commonly computed using different predefined metrics, like the L2 distance and cosine distance. Moreover, a circular pattern with an adjustable radius is exploited in BRISK [18] and FREAK [19], which provides more efficient neighborhood information for computing relevant pixel statistics. However, these handcrafted features are not robust to viewpoint changes, varying illuminations and transformations. Consequently, the matching performance for methods based on such handcrafted local features is often unstable [20].

With the rapid development of artificial intelligence techniques, deep learning methods, such as convolutional neural networks (CNNs), are widely used in image matching [21]–[23]. In this case, high-dimensional features are exploited to replace handcrafted features in image representations. In the joint feature and metric learning method [24]–[26], the representations and similarity metrics are combined in an end-to-end learning framework, in which high-level features of the images are extracted, and their similarities are learned simultaneously. The feature descriptor learning method [20], [27]–[30] focuses on high-level feature learning and tries to keep matched samples close and unmatched samples far from each other in the corresponding feature space. The similarity is computed using a predefined similarity metric. In these approaches, matching is based on learning feature representations of each image separately and does not exploit the relationships between objects in the images. Recent keypoint-based learning methods such as D2-Net [31], ASLFeat [32] and SuperGlue [33] perform the point-level correspondence based on the detected keypoints and their descriptors for the input images, which can also be used for the image matching task [34].
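For concreteness, the traditional pipeline described above (handcrafted descriptors compared under a predefined metric) can be sketched as follows; the file names, the ratio-test constant and the keypoint-count threshold below are arbitrary assumptions for illustration and are not part of the proposed method.

```python
# Illustrative sketch: classical patch matching with a handcrafted descriptor
# (SIFT) and a predefined metric (L2 distance). File names are placeholders.
import cv2

patch_a = cv2.imread("patch_a.png", cv2.IMREAD_GRAYSCALE)
patch_b = cv2.imread("patch_b.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, desc_a = sift.detectAndCompute(patch_a, None)
kp_b, desc_b = sift.detectAndCompute(patch_b, None)

good = []
if desc_a is not None and desc_b is not None:
    # Brute-force matcher with the predefined L2 metric; Lowe's ratio test
    # keeps only distinctive keypoint correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    for pair in matcher.knnMatch(desc_a, desc_b, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])

# A simple decision rule: declare a patch-level match if enough keypoints agree.
print("matched keypoints:", len(good), "-> match" if len(good) >= 10 else "-> no match")
```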
Fig. 2. Landmark patches matching in two full-sized images sampled from the Oxford Radar RobotCar dataset. The matched landmark patches are labeled with the same colored bounding boxes, while the white bounding box indicates that the landmark patch in one image has no matched pair in the other image. Green lines indicate the constructed graph edges in our model.

Unlike other image patch-matching tasks, rich spatial information for landmark patches is often available. For example, lamp posts along a road are usually spaced at equal intervals, and their relative locations with respect to (w.r.t.) each other in the environment provide additional information for the matching task. In special street scenes like the downtown or central business district (CBD), landmark patch-matching has advantages over conventional pixel-/point-level matching due to the presence of dynamic objects, such as vehicles and pedestrians. These dynamic objects captured by the vehicular cameras may have more matched pixels across different frames than static landmarks. However, matching these objects is useless or even harmful for tasks such as place recognition. To mitigate this issue, in this work, we perform the patch-matching task based on static landmarks such as traffic lights, traffic signs, poles, and windows.

Inspired by graph-level representation learning [35], [36], we propose to construct a graph for the neighborhood of an image patch and use graph-level representations to enrich the landmark patch embedding. We identify each landmark patch as a vertex of a graph and find the K-nearest neighbors based on estimated spatial information. In the literature, there exist various spatial information estimation techniques like structure-from-motion (SfM) [37], monocular or stereo depth estimation [38] and optical attenuation masks [39]. In this paper, for the sake of illustration, we choose an off-the-shelf monocular depth estimation method from [38] to estimate the landmark spatial relations. However, any other spatial estimation or augmented ranging sensors like LiDAR or depth cameras can also be utilized in our framework. We form a clique whose vertex embeddings are learnable via a graph neural network (GNN) [40], [41]. This graph is utilized in our proposed patch-matching framework for object information characterization. The final matching score is an average of the graph and vertex embedding similarity.

We also introduce two landmark patch-matching datasets derived from the street-scene KITTI dataset [42] and the Oxford Radar RobotCar dataset [43]. Our paper focuses on matching image patches of specific static roadside objects from two full-sized images taken by cameras onboard vehicles. See Fig. 1 for an illustration. More specifically, we focus on static roadside objects including traffic lights, signs, lamp posts, and even windows on a building facade. This is because in most landmark-based applications, other transient static objects, like parked cars, are inappropriate landmarks or do not have sufficient distinctive features. Due to complex environmental conditions like dynamic element occlusion, e.g., due to pedestrians, vehicles, or the scene viewpoint changing (especially when turning at sharp corners or traversing a stretch in opposite directions), the landmark patches may have dramatic differences in appearance. We refer the readers to the supplementary material for more details on the landmark patch-matching datasets' preparation. For a concrete illustration, some examples of matched or unmatched landmark patches are presented in Fig. 2.

Our contributions are summarized as follows:
• We propose a landmark patch-matching method with graph-based learning for vehicles in street scenes, which extends the feature representation approach used in traditional image patch-matching tasks and incorporates spatial relationship information.
• We analyze the fundamental principle and properties of the proposed graph-based loss function from an information-theoretic perspective.
• We introduce two landmark patch-matching datasets, which contain challenging street-view landmark patches captured in an autonomous driving environment.
• We empirically demonstrate that our method achieves state-of-the-art performance on the landmark patch-matching task when compared to various other benchmarks.

The rest of this paper is organized as follows. In Section II, related works are discussed. Our model and framework are introduced in Section III, where we also provide a theoretical analysis of our graph-based loss. We present experimental results in Section IV and conclude the paper in Section V. The proofs for all lemmas or propositions proposed in this paper are provided in the appendices.

II. RELATED WORKS

Since deep learning-based methods play dominant roles in the image-matching problem, we only discuss deep learning-based works here. Deep learning-based methods include feature descriptor learning, joint feature and metric learning, as well as keypoint-based correspondence learning.

Feature descriptor learning. High-level features of an image are first extracted using a neural network like a CNN so that matched samples are close while unmatched samples are distant under a similarity metric, which is chosen to be a feature distance function. In many models [22], [23], pairwise or triplet loss is used to train the neural networks. To improve performance, in [44], a regularization is designed by maximizing the spread of local feature descriptors over the descriptor space, from which a better embedding for image-level features is obtained. To ensure many samples are accessible to the descriptor network within a few epochs, L2-Net [20] uses a progressive sampling strategy. Furthermore, HardNet [27] is designed to fully utilize the hard negative samples by making the closest positive sample far away from the closest negative sample in a batch.
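As an illustration of the descriptor-learning objectives discussed above (and not the loss used in this paper), the following sketch implements a triplet-style objective with in-batch hardest-negative mining in the spirit of HardNet-type losses; the shapes, margin and variable names are assumptions.

```python
# Illustrative sketch: triplet margin loss with in-batch hardest-negative
# mining for descriptor learning. `anchors`/`positives` are assumed to be
# L2-normalized descriptor batches of matched patches, shape (B, D).
import torch
import torch.nn.functional as F

def hardest_negative_triplet_loss(anchors: torch.Tensor,
                                  positives: torch.Tensor,
                                  margin: float = 1.0) -> torch.Tensor:
    # Pairwise L2 distances between every anchor and every positive descriptor.
    dists = torch.cdist(anchors, positives, p=2)             # (B, B)
    pos = dists.diag()                                        # matched pairs
    # Mask out the matched pair, then take the closest non-matching descriptor.
    eye = torch.eye(len(anchors), device=anchors.device, dtype=torch.bool)
    neg = dists.masked_fill(eye, float("inf")).min(dim=1).values
    return F.relu(margin + pos - neg).mean()

# Example with random descriptors standing in for CNN outputs.
a = F.normalize(torch.randn(8, 128), dim=1)
p = F.normalize(a + 0.05 * torch.randn(8, 128), dim=1)
print(hardest_negative_triplet_loss(a, p).item())
```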
The reference [28] overcomes the hard sample learning issue by use of exponential Siamese and triplet losses, which naturally pay more attention to hard samples and less attention to easy ones. SOSNet is studied in [45] to learn better local descriptors, where the second-order similarity (SOS) is introduced into the loss function as a regularization. Moreover, [29] designs two second-order components, i.e., the second-order spatial information and the second-order descriptor space similarity, to achieve feature map re-weighting and global descriptor learning, respectively. The paper [46] proposes topology consistent descriptors (TCDesc) based on neighborhood information of descriptors, which can be combined with other methods via the triplet loss.

Joint feature and metric learning. In joint feature and metric learning, the similarity metric is not predefined and is instead set as a trainable network together with the feature extraction network. In this case, the matching task is regarded as a binary classification task by resorting to the similarity metric network with a classification loss function. As a classical method, MatchNet proposed by [24] extracts high-level features by using deep CNNs and measures the feature similarity using fully connected (FC) layers. To compare the different network architectures for the matching task, several networks, including SiameseNet, Pseudo-SiameseNet and the 2-channel network, are investigated in [21], [47]. The 2-channel network merges the two images into a 2-channel image to achieve faster convergence. The SiameseNet and Pseudo-SiameseNet both use two branches based on the same structure to extract high-dimensional features, with and without shared weights respectively. Using the normalized cross-correlation (NCC) as a metric, [25] proposes NCC-Net, which utilizes robust matching layers to measure the similarity of feature pairs. To tackle cross-spectral image matching, AFD-Net is proposed by [26] to aggregate multi-level feature differences, which can strengthen the discrimination of the network.

Keypoint-based correspondence learning. In keypoint-based correspondence learning, the main procedure is to construct neural networks to perform keypoint detection and description and to measure or learn the keypoints' similarity for matching inference. For instance, LIFT [48] is designed based on a united deep network architecture where keypoints are detected in the first network, the orientation for cropped regions is estimated in the second network, and the feature description is performed in the third network. Here, the Euclidean distance is used to measure the similarity of features. The SuperPoint approach [49] introduces a self-supervised domain adaptation framework named Homographic Adaptation into interest point detection and description. The D2-Net [31] makes use of a single CNN to perform dense feature description and detection simultaneously, where the detection, instead of being based on low-level image structures, is postponed to the high-level structures, which are also used for image descriptions. Based on the D2-Net backbone architecture, ASLFeat [32] is equipped with three lightweight effective modifications, which give better local shape estimation and more accurate keypoint localization. The above methods all measure the point-level correspondence based on Euclidean distances. On the other hand, SuperGlue [33] is designed using attention GNNs and the Sinkhorn algorithm for keypoint-based feature matching. LoFTR [50] achieves accurate semi-dense matches with Transformers including self- and cross-attention layers. Generally speaking, all the above keypoint-based correspondence learning methods can be used to perform the image matching task with further operations on the keypoint matching scores [33].

To improve image matching performance, spatial information is used in [33], [50], [51] through spatial verification, graph learning, and cross attention. In spatial verification, spatial information is usually used for the transformation calibration w.r.t. the keypoints or objects, as well as a correspondence auxiliary for direction or location w.r.t. the objects of interest [51]. This can introduce global information to improve local correspondence. In particular, transformation optimization methods like RANSAC [52], fast spatial measure (FSM) [53], Hough pyramid matching (HPM) [54] and pairwise geometric matching (PGM) [55] can filter out weak correspondences for keypoints or local features obtained by key feature detection and descriptors such as SIFT [14], SURF [15], and ORB [17]. The region-based or object-based verification methods, such as Objects in Scene to Objects in Scene (OS2OS) [51] and block-based image matching [56], make use of the relative positions of local patches to refine the whole image matching. Different from the above approaches, our method uses distance-based spatial information for the neighborhood graph construction, rather than for transformation correction or weak correspondence filtering.

Graph learning methods such as SuperGlue [33], GLMNet [57], and the joint graph learning and matching network (GLAM) [58] are exploited to represent local features based on the neighborhood graphs for keypoints. The graphs are constructed based on the detected keypoints or the corresponding features, and GNNs are used to learn graph representations. These methods achieve more robust and stable representations for the corresponding features based on spatial information.

Different from the above methods, our approach focuses on the neighborhood information based on landmark distances, which is used for patch-level, rather than point-level, representation and not used to filter weak or invalid correspondences. Moreover, we also adopt GNNs to represent the patch-level neighborhood graphs, which is demonstrated to be beneficial for the landmark patch-matching task.

III. LANDMARK PATCH-MATCHING WITH GRAPH-BASED LEARNING

In this section, we first introduce our graph-based learning framework to find matched landmark patch pairs that are extracted from two images taken from on-vehicle cameras. The images may be taken from different perspectives, and our framework can also identify those patches that are unmatched. Fig. 2 shows examples of matched and unmatched landmark patch pairs. We then discuss the theoretical basis for our graph-based learning approach.
Fig. 3. VGIDM: landmark patch-matching with graph-based learning. The Resnet f shown in the framework is a shared network serving as the feature descriptor function f to extract high-dimensional features from patches. Likewise, the discriminator d is also shared to make a decision for the vertex-to-graph correspondence. The model takes as input a pair of image patches that correspond to street-scene landmarks. (Legend of the diagram: convolution layer, max pooling, average pooling, bilinear layer, sigmoid function, GNN block.)

A. Framework Overview

Similar to other patch-matching datasets like the multi-view stereo (MVS) dataset [59] and the DTU dataset [60], in our work, the landmark patches are extracted from the full-sized images and the matching ground truths are established using 3D points. More specifically, the landmark patches are extracted using well-known object detection techniques like Faster R-CNN [61]. To distinguish the full-sized images from the landmark patches, we use the term frame to denote the full-sized image from which the patches are extracted. We refer the readers to Section IV-A for more details on the preparation of the landmark patch-matching datasets.

We assume that the spatial information (i.e., approximate relative distances between landmark objects) of landmark patches is available. The spatial information can be obtained from range estimation methods like the monocular depth estimation networks [62]–[64] in both the training and the testing phases. To construct a graph, we let the landmark patches of a frame be vertices of the graph. For each patch or vertex x, we find the K nearest neighbors in terms of spatial locations as indicated by the observed spatial information. An example of the constructed graph is shown in Fig. 2. For the vertex x, we form a complete graph or clique with its K nearest neighbors found. Let G^x denote this neighborhood graph.

Our image patch-matching framework is illustrated in Fig. 3. In this framework, the inputs for the model are image patches obtained by semantic or instance segmentation methods, e.g., Mask R-CNN [65], or object detection methods, e.g., Faster R-CNN [61]. These two kinds of methods can extract objects of interest such as traffic lights and traffic signs from image frames. Two main modules, Resnet [66] and a GNN, are respectively used for image feature extraction and neighborhood graph embedding, where the GNN can be the graph attention network (GAT) [40], graph convolutional network (GCN) [67], GraphSAGE [68] or any other GNN architecture. Given the vertex and graph embedding features from our model, we maximize the empirical information distance between the cases where patches are matched and unmatched. We call our image patch-matching approach Vertex-Graph-learning-and-Information-Distance-Maximization (VGIDM). The details are given as follows.

B. Model Details

Our objective is to determine if two landmark patches from different frames are matched with each other. In VGIDM, the feature extraction module f first learns embeddings for the input landmark patch as well as the patches in its neighborhood graph. The model then makes use of a learnable graph embedding module g to produce the neighborhood graph-level and vertex-level readout features. Finally, it uses a decision-making module to compute the matching classification.

Feature extraction for patches. We use the Resnet f to extract high-dimensional features for each landmark patch x. The Resnet output is denoted by f(x) ∈ R^n. Recall that for a patch x, we form a neighborhood graph G^x.
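The neighborhood-graph construction of Section III-A can be sketched as follows; the pinhole back-projection, the choice K = 3, and the helper names are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of the neighborhood-graph construction: each landmark patch
# of a frame gets an estimated 3D location (e.g., back-projected from a
# monocular depth map using camera intrinsics fx, fy, cx, cy), and a clique is
# formed over a patch and its K spatially nearest landmark patches.
import numpy as np

def backproject(u: float, v: float, depth: float, fx, fy, cx, cy) -> np.ndarray:
    """Pixel (u, v) with estimated depth -> 3D point in the camera frame."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def knn_clique(locations: np.ndarray, center: int, k: int = 3):
    """Undirected edges of the clique on patch `center` and its K nearest
    landmark patches (vertices are patch indices within the frame)."""
    dists = np.linalg.norm(locations - locations[center], axis=1)
    dists[center] = np.inf                      # exclude the patch itself
    neighbors = np.argsort(dists)[:k]
    vertices = [center, *neighbors.tolist()]
    return {(min(a, b), max(a, b)) for a in vertices for b in vertices if a != b}

# Toy example: five landmark locations (metres, camera frame).
locs = np.array([[0, 0, 10], [1, 0, 11], [5, 0, 30], [0, 1, 10.5], [8, 2, 40]], float)
print(knn_clique(locs, center=0, k=3))
```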
For the graph G^x, applying f to each node in G^x, we have {f(x′)}_{x′∈G^x}.

Embedding representation for the neighborhood graph and its vertices. The graph G^x is input to a GNN network g to obtain a graph-level embedding representation g(G^x). Specifically, the vertex features (f(x′))_{x′∈G^x} ∈ R^{|G^x|×n} are updated via the GNN, which consists of several layers of neighborhood aggregation and node update [40], [41], followed by some activation functions and a final pooling layer. The vertex embeddings (ρ(x′))_{x′∈G^x} ∈ R^{|G^x|×n} are obtained from the last graph convolutional/attentional layer of the GNN, while the graph-level embedding representation g(G^x) ∈ R^n is obtained as the output of the last pooling layer.

Compared to f(x) ∈ R^n, which extracts features directly from the patch x, ρ(x) ∈ R^n learns a feature embedding with additional information from its neighborhood, while g(G^x) ∈ R^n learns an embedding for the surrounding environment itself.

Correspondence comparison. Suppose that x and y are landmark patches from two different frames. If x and y are patches for the same real-world object, we say that they are matched and denote this event as x ↔ y. Otherwise, they are unmatched, denoted as x ↮ y. For any patch pair (x, y), we denote the matching ground truth label as 1{x↔y}, where 1{·} is the indicator function. In order to compare the correspondence between the patch pair (x, y), we design a decision-making mechanism based on the patch features. For the two patches x and y, we respectively obtain f(x) and f(y) as the features from the Resnet, ρ(x) and ρ(y) as the vertex-level embedding features, and g(G^x) and g(G^y) as the graph-level embedding features from the GNN network.

Let the ensemble vertex embedding for a patch x be

φ(x) = ρ(x) ∥ f(x), (1)

and the neighborhood graph embedding for G^x be

ψ(G^x) = g(G^x) ∥ φ(x) = g(G^x) ∥ ρ(x) ∥ f(x), (2)

where ∥ is the concatenation operation.

The ensemble vertex feature for x and the graph embedding for G^y are input to a discriminator d consisting of a bilinear layer of the form

d(a, b) = σ(a⊤ M b), (3)

where M ∈ R^{n×m} is a trainable matrix and σ(·) denotes the sigmoid function. In particular, the matrix M is designed as

M = [ 0  M12  0 ; M21  M22  M23 ], (4)

where M12, M21, M22 and M23 serve as the matrix blocks with learnable parameters and 0 denotes the zero matrix. The specifically designed block matrix (4) restricts the comparison between the features. Inputting (φ(x), ψ(G^y)) to the discriminator d, we have

d(φ(x), ψ(G^y)) = σ( [ρ(x)⊤ f(x)⊤] M [g(G^y); ρ(y); f(y)] ) (5)
               = σ( ρ(x)⊤ M12 ρ(y) + f(x)⊤ M21 g(G^y) + f(x)⊤ M22 ρ(y) + f(x)⊤ M23 f(y) ). (6)

The first term ρ(x)⊤ M12 ρ(y) in (6) is used to compare the vertex embeddings of x and y obtained from the GNN. This emphasizes the domain part of the embedding. The second term f(x)⊤ M21 g(G^y) and third term f(x)⊤ M22 ρ(y) are used for the comparison of the vertex x and the neighborhood graph of y. This helps to constrain the GNN learning. The last term f(x)⊤ M23 f(y) is used to compare the Resnet features for the two vertices, which updates the Resnet training. The same procedure is performed analogously for φ(y) = ρ(y) ∥ f(y) and ψ(G^x) = g(G^x) ∥ φ(x).

The learnable discriminator d(φ(x), ψ(G^y)) from (6) utilizes the ensemble vertex embedding φ(x) and the neighborhood graph embedding ψ(G^y). The vertex-level embedding ρ(x) and graph-level embedding g(G^x) contain information from the vertex feature f(x) (the output of the Resnet) due to the incorporation of neighborhood information from the GNN. In the case of a large number of frames, the neighborhood graphs can be quite different as they typically consist of vertices from different frames. As a result, the embeddings ρ(x), g(G^x) and ρ(y), g(G^y) can have different features to some degree even if x ↔ y. Therefore, it may be appropriate to use the original vertex feature f(x) to constrain the graph learning for the vertex-level embedding ρ(y) and the graph-level embedding g(G^y). The comparisons between f(x) and ρ(y) or g(G^y) can emphasize the principal component of the learned graph features. When vertices x and y are matched, ρ(y) and g(G^y) essentially contain the information of f(x). Therefore, comparing f(x) with ρ(y) and g(G^y) can introduce more information with neighborhood characteristics for the matching process.

Loss function and matching score. Let M be a training set consisting of patch pairs (x, y). Define the graph-based learning objective function as L_empID given in (8), which depends on the discriminator d in (6):

L_empID = (1/|M|) Σ_{(x,y)∈M} { 1{x↔y} · (1/2)·( log[d(φ(x), ψ(G^y))] + log[d(φ(y), ψ(G^x))] )
          + 1{x↮y} · (1/2)·( log[1 − d(φ(x), ψ(G^y))] + log[1 − d(φ(y), ψ(G^x))] ) } (7)
        = (1/2)·(1/|M|) Σ_{(x,y)∈M} { 1{x↔y} log[d(φ(x), ψ(G^y))] + 1{x↮y} log[1 − d(φ(x), ψ(G^y))] }   (=: L_empID−1)
        + (1/2)·(1/|M|) Σ_{(x,y)∈M} { 1{y↔x} log[d(φ(y), ψ(G^x))] + 1{y↮x} log[1 − d(φ(y), ψ(G^x))] }   (=: L_empID−2). (8)

We show that L_empID is the empirical version of an information distance between the distributions conditioned on matched and unmatched pairs in Proposition 1. We set our overall loss as

min_{φ,ψ,d} {−L_empID}, (9)

to maximize the information distance. In the testing phase, the final matching score is given by

S_match(x, y) = [ d(φ(x), ψ(G^y)) + d(φ(y), ψ(G^x)) ] / 2, (10)

and the prediction function for whether there is a match is given by

A^test_S(x, y) = 1 if S_match(x, y) > Γ, and 0 otherwise, (11)

where Γ is a predefined threshold. A decision “1” indicates that x and y are matched and “0” otherwise.
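A minimal PyTorch sketch of the decision module defined by (3) through (11) is given below; the dimensions, initialisation and function names are illustrative assumptions rather than the exact implementation, and only the block layout of M and the form of the objective follow the description above.

```python
# Sketch of the block-structured bilinear discriminator (3)-(6), the empirical
# objective L_empID (7)-(8) whose negative is the training loss (9), and the
# symmetric matching score/decision (10)-(11).
import torch
import torch.nn as nn

class BlockBilinearDiscriminator(nn.Module):
    """d(phi(x), psi(G^y)) with M = [[0, M12, 0], [M21, M22, M23]]."""
    def __init__(self, n: int = 512):
        super().__init__()
        self.M12 = nn.Parameter(torch.empty(n, n))
        self.M21 = nn.Parameter(torch.empty(n, n))
        self.M22 = nn.Parameter(torch.empty(n, n))
        self.M23 = nn.Parameter(torch.empty(n, n))
        for p in (self.M12, self.M21, self.M22, self.M23):
            nn.init.xavier_uniform_(p)

    def forward(self, rho_x, f_x, g_Gy, rho_y, f_y):
        # Equation (6), evaluated for a batch of pairs; all inputs are (B, n).
        score = (torch.einsum("bi,ij,bj->b", rho_x, self.M12, rho_y)
                 + torch.einsum("bi,ij,bj->b", f_x, self.M21, g_Gy)
                 + torch.einsum("bi,ij,bj->b", f_x, self.M22, rho_y)
                 + torch.einsum("bi,ij,bj->b", f_x, self.M23, f_y))
        return torch.sigmoid(score)

def emp_id_objective(d, x_feats, y_feats, labels, eps: float = 1e-7):
    """L_empID averaged over a batch; the training loss is its negative (9).
    x_feats = (rho_x, f_x, g_Gx), y_feats = (rho_y, f_y, g_Gy); labels is a
    (B,) tensor with 1 for matched pairs and 0 otherwise."""
    rho_x, f_x, g_Gx = x_feats
    rho_y, f_y, g_Gy = y_feats
    s_xy = d(rho_x, f_x, g_Gy, rho_y, f_y)       # d(phi(x), psi(G^y))
    s_yx = d(rho_y, f_y, g_Gx, rho_x, f_x)       # d(phi(y), psi(G^x))
    pos = 0.5 * (torch.log(s_xy + eps) + torch.log(s_yx + eps))
    neg = 0.5 * (torch.log(1 - s_xy + eps) + torch.log(1 - s_yx + eps))
    return torch.mean(labels * pos + (1 - labels) * neg)

def matching_decision(d, x_feats, y_feats, gamma: float = 0.5):
    """Symmetric score S_match (10) and the thresholded decision (11)."""
    rho_x, f_x, g_Gx = x_feats
    rho_y, f_y, g_Gy = y_feats
    s = 0.5 * (d(rho_x, f_x, g_Gy, rho_y, f_y) + d(rho_y, f_y, g_Gx, rho_x, f_x))
    return s, (s > gamma).long()

# Toy usage with random stand-ins for the Resnet/GNN embeddings (B = 4, n = 512).
d = BlockBilinearDiscriminator(512)
x_feats = tuple(torch.randn(4, 512) for _ in range(3))
y_feats = tuple(torch.randn(4, 512) for _ in range(3))
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = -emp_id_objective(d, x_feats, y_feats, labels)
score, decision = matching_decision(d, x_feats, y_feats)
print(float(loss), score.shape, decision.tolist())
```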
C. Theoretical Basis

In this subsection, we discuss the theoretical basis for the graph-based learning objective function L_empID defined in (8). To make the analysis tractable, we assume that patch pairs (x, y) are randomly generated from a distribution P. Let E be the expectation operator. We start with a simplifying assumption as follows.

Assumption 1. φ(x) and ψ(G^y) are continuous random variables induced from P.

In practice, due to the chosen activation functions used in the Resnet f and the GNN network g, their outputs typically satisfy the continuity requirement of Assumption 1. In our analysis, the discriminator d is assumed to be a general function without necessarily having the form (3). Let A be the set of all possible (φ(x), ψ(G^y)), where ∫_A p(φ(x), ψ(G^y)) d(φ(x), ψ(G^y)) = 1 and p : A → R₊ is a probability density whose set of discontinuities has Lebesgue measure zero.

For any given landmark patches x and y, we assume that P(x ↔ y) > 0 and P(x ↮ y) > 0. The probability densities of (φ(x), ψ(G^y)) conditioned on x ↔ y and x ↮ y are denoted by p(φ(x), ψ(G^y) | x ↔ y) and p(φ(x), ψ(G^y) | x ↮ y), respectively.¹

We discuss only L_empID−1 in (8) since L_empID−2 is symmetrical to it. The expectation form of L_empID−1 is given by

L_ID = L_ID(φ, ψ, d) = E[ 1{x↔y} log d(φ(x), ψ(G^y)) ] + E[ 1{x↮y} log(1 − d(φ(x), ψ(G^y))) ]. (12)

In minimizing the loss in (9), in the asymptotic regime |M| → ∞, we aim at max_{φ,ψ,d} L_ID. Let D(· ∥ ·) denote the Kullback-Leibler (KL) divergence.

Proposition 1 (Relationship with KL divergence). Suppose Assumption 1 holds. For a vertex embedding φ and a neighborhood graph embedding ψ, let L^{d*}_ID(φ, ψ) = max_d L_ID(φ, ψ, d), where d* is the corresponding optimal discriminator. Then

D( p(φ(x), ψ(G^y) | x ↔ y) ∥ p(φ(x), ψ(G^y) | x ↮ y) ) (13)
  ≥ (1 / P(x ↔ y)) · [ L^{d*}_ID(φ, ψ) + H_b(P(x ↔ y)) ], (14)

where H_b(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function.

Proof. See Appendix A.

Remark 1. Proposition 1 suggests that maximizing L_ID over (φ, ψ, d) helps to distinguish between the matched and unmatched patch pairs since their conditional distributions are forced to be very different in terms of the KL divergence.

We next consider how the graph-based learning objective function L_ID in (12) is influenced by perturbations in the discriminator d.

Proposition 2 (Effect of discriminator perturbation). Suppose Assumption 1 holds. Let ε be a sufficiently small perturbation to the discriminator d. Then |L_ID(φ, ψ, d + ε) − L_ID(φ, ψ, d)| = O(ε). Furthermore, we have |max_d L_ID(φ, ψ, d + ε) − max_d L_ID(φ, ψ, d)| = O(ε²).

Proof. See Appendix B.

In the following, we consider how the GNN embedding of the neighborhood graph G^x of a vertex x affects the matching effectiveness under further assumptions. For two landmark patches x and y, if their neighborhood graphs G^x and G^y have vertices corresponding to the same set of objects, i.e., the patch and spatial information procedure identifies the same objects as the neighbors of x and y, we write G^x ↔ G^y.

Assumption 2. The ranges of φ(·) and ψ(·) are finite sets. The embeddings φ(x) and φ(y) for landmark patches x and y are the same if x ↔ y. If furthermore G^x ↔ G^y, then ψ(G^x) = ψ(G^y).

¹ Here we abuse notation p(φ(x), ψ(G^y) | x ↔ y) to denote the conditional probability density of (φ(x), ψ(G^y)) given that x and y are matched. This is to avoid the cluttered notation p_{(φ(x),ψ(G^y))|x↔y}(·, ·).
While the Resnet f and the GNN block g are in general continuous functions of their inputs, Assumption 2 can be satisfied by restricting to a finite number of objects of interest in the environment, assuming that frames are captured from approximately the same perspectives (e.g., from an on-vehicle camera of a vehicle traveling along a fixed road) so that landmark patches of the same object are within a certain similarity distance of each other. Finally, the outputs of f and g can be quantized into discrete ranges, which implies φ and ψ have finite sets of ranges. For the same object o in the environment but under two different frames F1 and F2, Assumption 2 says that the outputs from the embedding φ are the same for the two frames. This implicitly assumes that φ is robust to perturbations in its input. Furthermore, the outputs of the embedding ψ are also the same if the patch and spatial information are noiseless.

Proposition 3. Suppose Assumption 2 holds, and x and y are landmark patches of frames F1 and F2 (based on the same environment), respectively. Let m(x ↔ y) = P(G^x ↔ G^y | x ↔ y) and m(x ↮ y) = P(G^x ↔ G^y | x ↮ y). Then we have

∥p(φ(x), ψ(G^y) | x ↔ y) − p(φ(x), ψ(G^y) | x ↮ y)∥_TV
  ≥ min_{x↔y} m(x ↔ y) · Σ_{(φ(x),ψ(G^y))∈B} { p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) − p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) }
  + min_{x↔y} m(x ↔ y) − max_{x↮y} m(x ↮ y) − 1, (15)

where B = { (φ(x), ψ(G^y)) : p(φ(x), ψ(G^y) | x ↔ y) ≥ p(φ(x), ψ(G^y) | x ↮ y) }. Here, ∥·∥_TV denotes the total variation distance, and min_{x↔y} and max_{x↮y} denote minimization over all matched patch pairs (x, y) and maximization over all unmatched patch pairs, respectively.

Proof. See Appendix C.

In the ideal case where the patch and spatial information are noiseless, we have min_{x↔y} m(x ↔ y) = 1 and max_{x↮y} m(x ↮ y) = 0. Then the right-hand side of (15) simplifies to

Σ_{(φ(x),ψ(G^y))∈B} { p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) − p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) }. (16)

In this case, we also have p(φ(x), ψ(G^y) | x ↔ y) = p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) and p(φ(x), ψ(G^y) | x ↮ y) = p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y). Furthermore, from Assumption 2, any (φ(x), ψ(G^y)) such that p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) > 0 implies that p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) = 0. These probability measures are thus mutually singular and have a total variation distance of 1. Therefore, in the ideal case, the model perfectly distinguishes between p(φ(x), ψ(G^y) | x ↔ y) and p(φ(x), ψ(G^y) | x ↮ y).

IV. EXPERIMENTS

A. Datasets

As there are no existing standard datasets for street-scene landmark patch-matching, we introduce in this paper two new datasets: the Landmark KITTI dataset and the Landmark Oxford dataset,² which are derived from the street-scene KITTI dataset [42] and the Oxford Radar RobotCar dataset [43], respectively.

Both datasets contain image frames and LiDAR scans captured from onboard cameras and Velodyne LiDAR sensors. The landmark patches are extracted from the full-sized image frames using the object detection neural network Faster R-CNN [61]. To facilitate detection efficacy, we manually label several street-scene compact landmark objects, including traffic lights, traffic signs, poles, and facade windows, for the sampled frames. The labels are used to train Faster R-CNN, which is used to produce landmark object detections for the image frames. The detected landmarks in bounding boxes are then used to obtain the landmark patches for our matching experiments, with some intentionally included background, as shown in Fig. 4 for example.

Fig. 4. (a) and (b) are landmark patch samples (displayed with intentionally included background) from the KITTI dataset and the Oxford Radar RobotCar dataset, respectively.

To establish the patch-matching ground truth, we use the vehicle locations and collected LiDAR scans to build a 3D LiDAR reference map, similar to the operations in [69]. The 3D reference map is used to determine the landmark locations by projecting the 3D LiDAR points to the image frames. The LiDAR points reflected from the landmark patch are read out to get the global locations of the corresponding landmark objects. We then compute the L2 distance of each landmark patch pair from two frames to determine the patch-matching ground truth. Some details of the two landmark patch-matching datasets are introduced as follows. More dataset preparation details are given in the supplementary material.

Landmark KITTI Dataset. The KITTI dataset³ contains street-scene image frames and their corresponding LiDAR point clouds collected in Karlsruhe, Germany. We use the object labels provided by [70] to detect landmark patches for all frames, including traffic lights, traffic signs and poles. An example is shown in Fig. 5. We do not include windows as landmarks in this dataset due to the lack of labels. Furthermore, to avoid “trivial matchings” between consecutive images, a minimum difference of 2 m between the image frames is also set. The aforementioned operations are performed to obtain the landmark patch-matching ground truth by projecting the 3D LiDAR scans to the image frames. Finally, 1500 frames are selected for landmark patch-matching experiments. The dataset is randomly split into training and testing sets, with a ratio of around 2 : 1.

² https://github.com/AI-IT-AVs/Landmark_patch_datasets
³ http://www.cvlibs.net/datasets/kitti
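The ground-truth labelling described above can be sketched as follows; the centroid aggregation of LiDAR returns, the 1.0 m tolerance and the helper names are assumptions for illustration rather than the exact dataset pipeline.

```python
# Sketch of the patch-matching ground truth: each landmark patch gets a global
# 3D location from the LiDAR points projected into its bounding box, and a
# patch pair from two frames is labelled as matched when the L2 distance
# between their landmark locations is small.
import numpy as np

def landmark_location(lidar_points_in_box: np.ndarray) -> np.ndarray:
    """Global landmark location as the centroid of the LiDAR returns inside
    the patch's bounding box (points given as an (N, 3) array, world frame)."""
    return lidar_points_in_box.mean(axis=0)

def match_label(loc_a: np.ndarray, loc_b: np.ndarray, tol: float = 1.0) -> int:
    """1 if the two patches correspond to the same physical landmark, else 0."""
    return int(np.linalg.norm(loc_a - loc_b) < tol)

# Toy example: two patches whose LiDAR returns cluster around the same pole.
pts_a = np.array([[12.0, 3.1, 1.9], [12.1, 3.0, 2.2], [11.9, 3.2, 2.0]])
pts_b = np.array([[12.2, 3.0, 2.1], [12.0, 3.1, 2.0]])
print(match_label(landmark_location(pts_a), landmark_location(pts_b)))  # 1
```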
In both training and testing, we select frame pairs that are captured at locations with relative distances of not more than 25 m to ensure the presence of common landmarks.

Fig. 5. A semantic segmentation image and its corresponding real image, both with bounding box labels, from the KITTI dataset.

Landmark Oxford Dataset. The Oxford Radar RobotCar dataset⁴ contains image frames and LiDAR scans captured on the streets in Oxford, UK. We manually label landmarks including traffic lights, traffic signs, poles, and facade windows for 500 sampled frames. An example is shown in Fig. 6. We then train Faster R-CNN to obtain the landmarks for all 29,687 frames. To avoid “trivial matchings” between consecutive images, a minimum difference of 2 m between the image frames is also set. Finally, 3000 frames are selected for landmark patch-matching experiments. The remaining steps are similar to those for the Landmark KITTI dataset.

Fig. 6. Examples of the ground truth landmark bounding box labels for the Oxford Radar RobotCar dataset.

B. Experimental Details

Baseline Methods. We compare VGIDM with several baseline methods, including MatchNet [24], SiameseNet [47], HardNet [27], SOSNet [45], D2-Net [31], ASLFeat [32], SuperGlue [33] and LoFTR [50]. MatchNet and SiameseNet are regarded as joint feature and metric learning methods, combining deep CNNs and an FC layer to learn features and their metrics. The decision-making process for the matching task is based on the output of the FC layer. HardNet and SOSNet focus on similarity measures to distinguish the learned high-dimensional features, where the feature descriptors are almost all based on deep CNNs consisting of several convolution layers with batch normalization (BN) or rectified linear units (ReLUs). In testing, the Euclidean distance between the output patch features is used for the decision-making. D2-Net, ASLFeat, SuperGlue and LoFTR are based on keypoint correspondence and perform the matching task according to the ratio of matched keypoints among the whole set of keypoints. In this regard, a patch pair with a large enough ratio of matched keypoints is regarded as a match.

Model Setting. We use Resnet18 [66] for the feature descriptor f, with output feature dimension 512 after 17 convolution layers. In VGIDM, we choose several GNNs for the neighborhood graph embedding, including GAT [40], GCN [67] and GraphSAGE [68]. When using GAT, the network contains 2 GAT blocks with the exponential linear unit (ELU). For each GAT block, we use 4 attention heads, which compute 512 hidden features in total. As for GCN and GraphSAGE, they both contain 2 corresponding blocks with ReLU, where there are 512 hidden features in each block. Further details of our model architecture are provided in the supplementary material. The Adam optimizer is selected with a learning rate of 0.0001 to train the model by minimizing its corresponding loss in (9). The number of training epochs is 150 for all datasets.

VGIDM with Image Depth Estimation. To test VGIDM in the case where precise depth information like that provided by LiDAR is unavailable, we construct neighborhood graphs using estimated image depth and with different GNNs in the backbone. Specifically, we include an image depth estimation method called the Monocular Depth Prediction Module proposed in [38]. Based on the image depth estimation, we can obtain the rough relative locations of the landmarks in the street scenes and use them to construct a neighborhood graph for each landmark in the test procedure. The depth estimation performance is provided in the supplementary material. The estimated image pixel depths are transformed to 3D locations w.r.t. the camera using its intrinsic matrix. We then use the estimated locations to test VGIDM. In this depth estimation method, the pre-trained ResNeXt101 model from [38] is utilized in our experiments, and the images are from the two landmark datasets. We extract the predicted depth points from the static roadside landmarks, including traffic lights, traffic signs, and poles, to compute the locations of the objects. Therefore, we can construct the neighborhood graphs and test the VGIDM.

Implementation. For a given sequence of street-scene frames captured by a vehicular camera, we perform the following training steps: i) We use object detection methods like Faster R-CNN [61] to extract landmark patches for each frame. The landmarks include traffic lights, traffic signs, poles, and windows. ii) We manually label matching landmark patches. To determine the global locations of these landmarks, we combine vehicular Global Positioning System (GPS) information with data from LiDAR or stereo cameras. With this information, we are able to establish the ground truth for the matching landmark patches between two frames captured at the same location. iii) We take the global locations of landmarks to construct the neighborhood graph for each landmark patch based on K-NN. iv) We train the VGIDM using landmark patch pairs with ground-truth labels. The details of VGIDM with the training loss and test score are given in Section III-B.

During testing, we perform steps i and iii as above, but in step iii, we create neighborhood graphs by estimating the relative locations of landmarks using a stereo camera or a depth estimation method, which replaces the need for GPS and LiDAR information. The ground truth for computing the testing performance is found based on GPS and LiDAR information. The remaining steps are the same as those used during training.

⁴ http://ori.ox.ac.uk/datasets/radar-robotcar-dataset
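The graph embedding module described in "Model Setting" can be illustrated with a minimal sketch; PyTorch Geometric is an assumed implementation choice, the layer sizes follow the stated GAT configuration (2 blocks, 4 heads, 512 hidden features in total, ELU), and everything else (names, the toy clique, the max-pooling readout) is illustrative.

```python
# Minimal sketch of the neighborhood-graph embedding module g: two GAT blocks
# with ELU, a max-pooling readout for the graph embedding g(G^x), and the last
# attentional layer's node outputs as the vertex embeddings rho(x').
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_max_pool

class GraphEmbedder(torch.nn.Module):
    def __init__(self, in_dim: int = 512, hidden: int = 512, heads: int = 4):
        super().__init__()
        # Each block outputs `hidden` features in total across its heads.
        self.gat1 = GATConv(in_dim, hidden // heads, heads=heads)
        self.gat2 = GATConv(hidden, hidden // heads, heads=heads)

    def forward(self, x, edge_index, batch):
        x = F.elu(self.gat1(x, edge_index))
        rho = F.elu(self.gat2(x, edge_index))      # vertex embeddings rho(.)
        g = global_max_pool(rho, batch)            # graph embedding g(G)
        return rho, g

# Toy usage: one clique of 4 patches with 512-dimensional Resnet features.
feats = torch.randn(4, 512)
edges = torch.tensor([[0, 0, 0, 1, 1, 2, 1, 2, 3, 2, 3, 3],
                      [1, 2, 3, 2, 3, 3, 0, 0, 0, 1, 1, 2]])
rho, g = GraphEmbedder()(feats, edges, torch.zeros(4, dtype=torch.long))
print(rho.shape, g.shape)  # torch.Size([4, 512]) torch.Size([1, 512])
```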

TABLE I
MATCHING PERFORMANCE ON THE LANDMARK KITTI DATASET. THE BEST AND THE SECOND-BEST RESULT FOR EACH CRITERION ARE HIGHLIGHTED IN RED AND BLUE RESPECTIVELY.

Methods Precision Recall F1-Score AUC


MatchNet [24] 0.9039 ± 0.0027 0.9483 ± 0.0105 0.9255 ± 0.0050 0.8229 ± 0.0055
SiameseNet [47] 0.7953 ± 0.0124 0.8960 ± 0.0208 0.8426 ± 0.0159 0.8328 ± 0.0162
HardNet [27] 0.9041 ± 0.0016 0.9562 ± 0.0177 0.9294 ± 0.0093 0.8261 ± 0.0088
SOSNet [45] 0.9042 ± 0.0015 0.9563 ± 0.0160 0.9294 ± 0.0083 0.8261 ± 0.0080
D2-Net [31] 0.9115 ± 0.0031 0.8789 ± 0.0131 0.8949 ± 0.0076 0.8115 ± 0.0082
ASLFeat [32] 0.9189 ± 0.0022 0.9008 ± 0.0082 0.9098 ± 0.0048 0.8312 ± 0.0057
SuperGlue [33] 0.9067 ± 0.0039 0.9125 ± 0.0123 0.9096 ± 0.0072 0.8155 ± 0.0093
LoFTR [50] 0.9069 ± 0.0025 0.9243 ± 0.0110 0.9154 ± 0.0059 0.8197 ± 0.0064
VGIDM (GAT) [ours] 0.9425 ± 0.0020 0.9733 ± 0.0050 0.9577 ± 0.0026 0.8977 ± 0.0038
VGIDM (GCN) [ours] 0.9340 ± 0.0027 0.9753 ± 0.0076 0.9543 ± 0.0044 0.8847 ± 0.0063
VGIDM (GraphSAGE) [ours] 0.9464 ± 0.0042 0.9653 ± 0.0129 0.9557 ± 0.0073 0.9007 ± 0.0098

Fig. 7. Examples of matched and mismatched pairs from the Landmark KITTI dataset. A green or red box indicates a correct or incorrect prediction result, respectively. “GT” stands for ground truth. (The figure shows six example patch pairs with the ground truth and the predictions of MatchNet, SiameseNet, HardNet, SOSNet, D2-Net, ASLFeat, SuperGlue, LoFTR and the VGIDM variants with GAT, GCN and GraphSAGE.)

C. Performance Evaluation

Performance on the Landmark KITTI Dataset. Table I summarizes the test performance of models trained with 150 training epochs on the Landmark KITTI dataset. The evaluation uses statistics, including the mean value and standard deviation, from 5 experiments. From Table I, we observe that VGIDM (with GAT, GCN or GraphSAGE) outperforms the other baseline methods under all four criteria, with a slight performance difference among these VGIDM models. This implies that graph-based learning makes a positive difference in matching efficiency. Moreover, we observe that VGIDM with GAT has a more stable performance than the other methods. Several examples of the matching prediction are shown in Fig. 7.

Performance on the Landmark Oxford Dataset. From Table II, we observe that the VGIDM variants with different GNNs outperform the other benchmark methods on almost all measures. Since the Oxford Radar RobotCar and KITTI datasets have different image qualities and are collected in different street scenes, the performances on both datasets are different. From Tables I and II, we also observe that nearly all the methods have better performance on the Landmark KITTI dataset compared with the Landmark Oxford dataset. This may be caused by more similarity among the window patches in the Landmark Oxford dataset, which makes distinguishing them more difficult. A few matching prediction examples from the Landmark Oxford dataset are shown in Fig. 8.

Performance Analysis. The VGIDM variants (with GAT, GCN or GraphSAGE) not only make use of landmark patch information but also the neighborhood information in the decision-making process for the matching task. Other feature descriptor learning as well as joint feature and metric learning methods, such as MatchNet, SiameseNet, HardNet and SOSNet, depend only on the individual image patch rather than the neighborhood relationships. An erroneous match can happen between patches from two similar but distinct objects. VGIDM mitigates this error by using the neighborhood information. However, VGIDM requires more computing resources for neighborhood graph processing.

On the other hand, keypoint-based learning methods such as D2-Net, ASLFeat, SuperGlue and LoFTR suffer from the low pixel resolution of the image patches. As a landmark can be far away from the camera on the vehicle, its corresponding image patch can be small. As a result, it is more likely for these models to make mistakes in the matching decision.

TABLE II
MATCHING PERFORMANCE ON THE LANDMARK OXFORD DATASET. THE BEST AND THE SECOND-BEST RESULT FOR EACH CRITERION ARE HIGHLIGHTED IN RED AND BLUE RESPECTIVELY.

Methods Precision Recall F1-Score AUC


MatchNet [24] 0.8742 ± 0.0068 0.9589 ± 0.0047 0.9146 ± 0.0025 0.7723 ± 0.0119
SiameseNet [47] 0.7210 ± 0.0086 0.8968 ± 0.0109 0.7992 ± 0.0076 0.7748 ± 0.0088
HardNet [27] 0.8762 ± 0.0011 0.9533 ± 0.0094 0.9131 ± 0.0048 0.7747 ± 0.0047
SOSNet [45] 0.8763 ± 0.0010 0.9544 ± 0.0086 0.9136 ± 0.0044 0.7752 ± 0.0043
D2-Net [31] 0.8032 ± 0.0033 0.9005 ± 0.0084 0.8491 ± 0.0052 0.6194 ± 0.0078
ASLFeat [32] 0.8729 ± 0.0036 0.9048 ± 0.0073 0.8886 ± 0.0035 0.7548 ± 0.0062
SuperGlue [33] 0.8639 ± 0.0054 0.8747 ± 0.0077 0.8692 ± 0.0049 0.7305 ± 0.0099
LoFTR [50] 0.8515 ± 0.0020 0.9837 ± 0.0060 0.9129 ± 0.0029 0.7346 ± 0.0045
VGIDM (GAT) [ours] 0.9052 ± 0.0047 0.9517 ± 0.0044 0.9278 ± 0.0040 0.8263 ± 0.0092
VGIDM (GCN) [ours] 0.8918 ± 0.0051 0.9515 ± 0.0077 0.9206 ± 0.0037 0.8025 ± 0.0087
VGIDM (GraphSAGE) [ours] 0.8938 ± 0.0046 0.9648 ± 0.0052 0.9279 ± 0.0045 0.8104 ± 0.0096

Fig. 8. Examples of matched and mismatched pairs from the Landmark Oxford dataset. A green or red box indicates a correct or incorrect prediction result, respectively. “GT” stands for ground truth. (The figure shows six example patch pairs with the ground truth and the predictions of the same methods as in Fig. 7.)

Cross-Validation. To evaluate the generalization capability of VGIDM, we train VGIDM on the Landmark Oxford dataset and test it on the Landmark KITTI dataset. From Table III and Table I, we observe that the VGIDM variants still outperform the other baselines in all metrics. From Table III and Table II, when we train on the Landmark KITTI dataset and test on the Landmark Oxford dataset, VGIDM outperforms the other baselines in precision and AUC. Since the Landmark Oxford dataset contains windows as landmarks while the Landmark KITTI dataset does not, the test performance on the Landmark Oxford dataset deteriorates more significantly. In Tables I and II, the baselines D2-Net, ASLFeat, SuperGlue and LoFTR do not perform training and testing on the same dataset. Since there is no point-level ground truth for our landmark patches, we adopt pre-trained models for these baselines from the literature [31]–[33], [50]. The other baselines are trained and tested on the same datasets.

D. Ablation Study

Feature Pair Discrimination. We perform ablation studies on different feature pairs for the matching task, shown in Table IV. The feature pair comparisons include d(f(x), f(y)) for Resnet features, d(ρ(x), ρ(y)) for vertex embeddings, d(φ(x), φ(y)) for ensemble vertex embeddings, as well as d(ψ(G^x), ψ(G^y)) for neighborhood graph embeddings, with the learnable layer d given by (3) as the metric. For each feature comparison, we train the corresponding model for 150 epochs and select the optimal test result for the matching task based on the Landmark Oxford dataset. From Table IV, we observe that our proposed vertex-to-graph comparison outperforms the other feature pairs in most metrics. The feature pairs containing graph information, like ψ(G^x) and ψ(G^y), have an advantage over those based only on vertex information, like f(x) and f(y). This demonstrates the benefit of utilizing graph information.

Discriminator Function. We evaluate the effectiveness of the learnable discriminator d by comparing it with other discriminator functions. Specifically, we replace the learnable discriminator d with either cosine similarity or the L2 distance in (6). We evaluate the patch-matching task on the Landmark Oxford dataset. From Table V, we observe that the proposed learnable discriminator d outperforms the other discriminators, which is likely due to the adaptability of neural networks to different feature dimensions.
11

TABLE III
C ROSS -VALIDATION ON THE L ANDMARK KITTI DATASET OR L ANDMARK OXFORD DATASET USING THE TRAINED MODEL BASED ON THE L ANDMARK
OXFORD DATASET OR L ANDMARK KITTI DATASET, RESPECTIVELY. T HE “B EST IN TABLE I” OR “B EST IN TABLE II” METHOD REFERS TO THE
BEST- PERFORMING BASELINE OUT OF D2-N ET, ASLF EAT, S UPER G LUE AND L O FTR FROM TABLE I OR TABLE II.

Cross-Validation Methods Precision Recall F1 -Score AUC


VGIDM (GAT) 0.9238 ± 0.0020 0.9047 ± 0.0039 0.9141 ± 0.0027 0.8403 ± 0.0041
Oxford dataset (Train)
VGIDM (GCN) 0.9084 ± 0.0019 0.9579 ± 0.0074 0.9325 ± 0.0041 0.8341 ± 0.0050
& KITTI dataset (Test)
VGIDM (GraphSAGE) 0.9255 ± 0.0029 0.9483 ± 0.0102 0.9368 ± 0.0061 0.8597 ± 0.0080
Baselines tested on KITTI dataset Best in Table I 0.9189 ± 0.0022 0.9243 ± 0.0110 0.9154 ± 0.0059 0.8312 ± 0.0057
VGIDM (GAT) 0.9022 ± 0.0031 0.8930 ± 0.0040 0.8976 ± 0.0024 0.8013 ± 0.0051
KITTI dataset (Train)
VGIDM (GCN) 0.8909 ± 0.0043 0.8845 ± 0.0124 0.8877 ± 0.0074 0.7799 ± 0.0096
& Oxford dataset (Test)
VGIDM (GraphSAGE) 0.8918 ± 0.0055 0.9098 ± 0.0064 0.9007 ± 0.0043 0.7893 ± 0.0099
Baselines tested on Oxford dataset Best in Table II 0.8729 ± 0.0036 0.9837 ± 0.0060 0.9129 ± 0.0029 0.7548 ± 0.0062

TABLE IV
ABLATION STUDY FOR FEATURE PAIR DISCRIMINATION.

Feature Comparison       Precision   Recall    F1-Score   AUC
d(f(x), f(y))            0.8817      0.9146    0.8979     0.7733
d(ρ(x), ρ(y))            0.7772      0.9720    0.8637     0.5680
d(φ(x), φ(y))            0.8819      0.9560    0.9175     0.7860
d(ψ(G^x), ψ(G^y))        0.9029      0.9427    0.9224     0.8193
d(φ(x), ψ(G^y))          0.9097      0.9533    0.9310     0.8347

From Table V, we observe that the proposed learnable discriminator d outperforms the other discriminators, which is likely due to the adaptability of neural networks to different feature dimensions.

TABLE V
ABLATION STUDY FOR DIFFERENT DISCRIMINATOR FUNCTIONS.

Feature Discriminator    Precision   Recall    F1-Score   AUC
Cosine similarity        0.9007      0.8707    0.8854     0.7913
L2 distance              0.7277      0.8800    0.7966     0.5540
Learnable d              0.9097      0.9533    0.9310     0.8347

TABLE VI
MATCHING PERFORMANCE OF DIFFERENT METHODS BASED ON SPATIAL NEIGHBORHOOD INFORMATION.

Methods                                Precision   Recall    F1-Score   AUC
SIFT [71]+RANSAC [52] (+neighbors)     0.8965      0.9467    0.9209     0.8093
MatchNet [24] (+neighbors)             0.9023      0.9600    0.9302     0.8240
SuperGlue [33] (+neighbors)            0.9225      0.9200    0.9212     0.8440
LoFTR [50] (+neighbors)                0.9192      0.9413    0.9302     0.8467
VGIDM [ours]                           0.9280      0.9627    0.9450     0.8693

Spatial Neighborhood Information. We investigate whether the performance improvement of VGIDM is mainly due to the spatial neighborhood information. To do this, we introduce the neighborhood graphs used in VGIDM to the other baseline methods. Specifically, for a given vertex, we sort its neighbors according to increasing distances from it. We then use each baseline method to compare not only the vertex pair but also the pairs of their corresponding neighbors with the same sort order. Then, we calculate the average of the predicted scores for the vertex pair and its neighbor pairs. Finally, we decide whether there is a match based on a threshold, which is a hyperparameter tuned separately to achieve the best performance for each baseline. Table VI shows results on the Landmark KITTI dataset, where the best test performance for each baseline model is selected. We compare with VGIDM (GraphSAGE), which is trained on the Landmark Oxford dataset. We observe that including neighborhood information generally improves the performance of every baseline, but VGIDM still outperforms them. This indicates that the neighborhood graph feature representation in VGIDM has advantages in the patch-matching task. As shown in Table VII, which presents the inference runtime for one pair of frames (with around twenty patch pairs for comparison), the computational complexity of VGIDM is lower than that of most baselines, except MatchNet.
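The neighbor-augmented evaluation protocol used for the baselines in Table VI can be sketched as follows; base_score stands for any pairwise matcher (e.g., one of the baselines above) and is an assumption of this sketch, as is the layout of the neighbor lists.

import numpy as np

def neighbor_augmented_score(patch_a, patch_b, neighbors_a, neighbors_b, base_score):
    # neighbors_a / neighbors_b: lists of (spatial_distance, patch) measured from patch_a / patch_b
    na = [p for _, p in sorted(neighbors_a, key=lambda t: t[0])]  # sort by increasing distance
    nb = [p for _, p in sorted(neighbors_b, key=lambda t: t[0])]
    scores = [base_score(patch_a, patch_b)]
    for pa, pb in zip(na, nb):             # compare neighbor pairs with the same sort order
        scores.append(base_score(pa, pb))
    return float(np.mean(scores))          # compared against a per-baseline tuned threshold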

TABLE VII
INFERENCE RUNTIME COMPARISON FOR DIFFERENT METHODS ON THE LANDMARK KITTI DATASET.

Methods             SIFT+RANSAC (+neighbors)   MatchNet (+neighbors)   SuperGlue (+neighbors)   LoFTR (+neighbors)   VGIDM (GraphSAGE)
Inference runtime   0.7035s                    0.1591s                 3.8072s                  4.3740s              0.2013s

E. Computational Complexity

To evaluate the runtime performance, we test VGIDM on an NVIDIA RTX A5000 GPU. Table VIII shows the inference runtime (mean time for one pair of frames in the testing phase) for the VGIDM variants with different GNNs. Specifically, given a pair of frames (i.e., full-size images), an average of around twenty patch pairs are compared, which takes less than 0.25 seconds. The time taken is acceptable for practical applications, such as place recognition and autonomous driving. Moreover, the numbers of parameters for the VGIDM networks with GAT, GCN and GraphSAGE are 12.16M, 12.16M and 12.66M, respectively.

TABLE VIII
INFERENCE RUNTIME OF VGIDM ON THE LANDMARK OXFORD DATASET.

Methods             VGIDM (GAT)   VGIDM (GCN)   VGIDM (GraphSAGE)
Inference Runtime   0.2330s       0.1953s       0.2092s

F. Further Possible Applications

1) Application of VGIDM in Visual Place Recognition: A possible application of VGIDM is visual place recognition. We apply our local patch-matching to obtain global frame matching to determine if two frames show the same place. In visual place recognition, we construct a bipartite graph with edges being the scores output by our network for all landmark patch pairs, which is used to construct a matching score matrix. Then, similar to the Optimal Matching Layer in [33], by appending learnable dustbin scores to the score matrix, the Sinkhorn algorithm is used to output the partial assignment. Finally, we obtain frame-matching results by summing up the elements in the matching score matrix with the weights from the partial assignment. The details are shown in Fig. 9, where GAT is chosen for the GNN part in VGIDM.

Fig. 9. The diagram of visual place recognition with VGIDM.

We compare VGIDM with MatchNet [24], NetVLAD [72], SuperGlue [33], and LoFTR [50] under the place recognition task with around 600 pairs of place images from the KITTI dataset. The two places contained in each image frame pair are regarded as the same if the distance between the camera locations is less than 10 meters. To recognize the same places more accurately, the thresholds of the matching results for place image pairs are set to higher recall levels. The results and example outputs are shown in Table IX and Fig. 10, respectively. From Table IX, it is observed that all the methods perform well, with VGIDM having a slight advantage. The reason may be that there exist obvious differences among the image pairs that are not from the same place. However, unlike MatchNet, NetVLAD, SuperGlue, and LoFTR, which require the full image, VGIDM can perform place recognition by using only landmark patches and their spatial relationships. Moreover, the landmark patches based on static objects are more stable than the keypoints based on edges or corners and are not affected by noisy image pixels from non-persistent objects or surroundings.

TABLE IX
VISUAL PLACE RECOGNITION PERFORMANCE IN KITTI DATA.

Methods     MatchNet   NetVLAD   SuperGlue   LoFTR    VGIDM
F1-Score    0.9668     0.9702    0.9694      0.9711   0.9719
Accuracy    0.9360     0.9424    0.9408      0.9440   0.9452

Fig. 10. Several examples of place recognition on the KITTI dataset. The prediction "1" indicates the frames are from the same place, while "0" indicates they are from different places. A green box indicates a correct prediction result while a red box an incorrect one.

We conduct cross-validation experiments for the visual place recognition task. Specifically, we use VGIDM (GAT) trained on the KITTI dataset, and compare it with baselines for inference on the Oxford dataset. We use 3000 frames in the experiments. From Table X, we observe that VGIDM is superior to the baselines. This suggests that VGIDM has good generalization ability. We include a few examples in Fig. 11.

TABLE X
CROSS-VALIDATION PERFORMANCE FOR VISUAL PLACE RECOGNITION ON THE OXFORD DATASET.

Methods     MatchNet   NetVLAD   SuperGlue   LoFTR    VGIDM
F1-Score    0.9010     0.9069    0.9190      0.9207   0.9266
Accuracy    0.8273     0.8303    0.8563      0.8607   0.8680

Fig. 11. Several examples of place recognition on the Oxford dataset where the model is trained on the KITTI dataset. The prediction "1" or "0" indicates the frames are from the same or different places. A green or red box indicates a correct or incorrect prediction result.
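The frame-level aggregation in this place recognition pipeline can be summarized as: run VGIDM on all landmark patch pairs of two frames to obtain a score matrix, apply Sinkhorn normalization with a dustbin for unmatched patches, and sum the scores weighted by the resulting soft assignment. The snippet below is a simplified sketch of this idea, not the exact Optimal Matching Layer of [33]; the dustbin value and iteration count are illustrative.

import torch

def frame_matching_score(S, dustbin=0.0, iters=20):
    # S: (m, n) patch-pair score matrix produced by VGIDM for two frames
    m, n = S.shape
    A = torch.full((m + 1, n + 1), float(dustbin), dtype=S.dtype)
    A[:m, :n] = S                              # append a dustbin row and column
    P = torch.exp(A)
    for _ in range(iters):                     # Sinkhorn iterations: alternate row/column normalization
        P = P / P.sum(dim=1, keepdim=True)
        P = P / P.sum(dim=0, keepdim=True)
    W = P[:m, :n]                              # soft partial assignment between real patches
    return (W * S).sum()                       # thresholded to decide whether the frames match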

2) Application of VGIDM in Stereo Depth Estimation of Landmarks: Similar to the steps described in Section IV-A, we obtain the landmarks from the full-sized frames captured from both the left and right stereo cameras. Different from the experiment settings in Section IV-A, where the matching is performed for landmark patches in image frames captured at different locations, here we use VGIDM to perform the matching between landmark patches captured at the same location but from different cameras. We split 3000 pairs of stereo frames into training and testing sets with a ratio of around 2:1. During testing, we set a high similarity threshold of 0.9 to prevent false positive matching. Since the landmark objects we have chosen have regular shapes, the original narrow landmark object detection boxes (without the intentionally added background to form the landmark patches) are sufficiently accurate to locate the landmarks in the frames. For each of the matched landmark objects in the left and right frames, we compute the pixel disparity on the center line (the average of the left and right sides) of the original bounding box. The depth of the pixels on the landmark bounding box middle lines can then be calculated using the camera focal length and the baseline distance between the stereo cameras. Following the diagram in Fig. 12, the coarse monocular depth is improved to a more accurate stereo depth, as shown in Table XI. The vanilla Monocular Depth [38] only achieves 14.22 m RMSE. After applying VGIDM, we can improve the depth estimation accuracy to 0.86 m RMSE. In contrast, the current state-of-the-art DIFFNet [62] only achieves 4.45 m RMSE in stereo depth estimation.

TABLE XI
ACCURACY OF DEPTH ESTIMATION.

Method      Monocular Depth [38]   DIFFNet [62]   VGIDM [ours]
RMSE (m)    14.22                  4.45           0.86

Fig. 12. Depth estimation: from coarse monocular depth to fine stereo depth.

However, a direct application of VGIDM can only output the estimated depth for sparse pixels (only for the chosen landmarks). To improve general stereo depth estimation, it is possible to incorporate VGIDM into existing methods, e.g., using VGIDM's accurately estimated stereo landmark depths as calibration for other general stereo depth estimation algorithms such as DIFFNet [62], HR-Depth [63], and CADepth [64].

VGIDM can serve as a module in various learning-based localization techniques, such as Detect-SLAM [73], EAO-SLAM [74], and other semantic SLAM systems with object-level data association [75], [76]. These techniques underscore the importance of landmark patch matching, which is central to VGIDM in real-world applications. Consequently, our approach has crucial implications for diverse applications, demonstrating its versatility and effectiveness.

V. CONCLUSION

We have developed an image patch-matching model VGIDM that incorporates spatial information of the landmark patches through graph-based learning. We provided a theoretical basis for our approach. Extensive experiments demonstrate that our approach outperforms the current state-of-the-art baselines, which do not take into account the spatial relationships between patches. Our framework indicates that such spatial information can be beneficial to landmark patch-matching in street scenes. In future work, it is of interest to incorporate a greater variety of objects as landmarks to adapt VGIDM to more diverse street scenes and generalize to more datasets. To achieve this, we can train VGIDM on landmarks from a wider range of classes. As our method is better suited for matching tasks in scenarios with sufficient landmarks or static objects, pixel-level matching methods can serve a complementary role in scenarios with fewer landmarks. Additionally, combining our patch-level matching method with point-level methods shows promise in achieving more accurate pixel-level or sub-pixel-level matching. To this end, we can use our method as a post-processing step to emphasize the keypoints with more attention or to filter out weak correspondences. We can further generalize VGIDM to multi-view camera relocalization [12] to estimate camera poses by determining the matched landmarks in multiple image frames. Furthermore, the matched landmarks can serve as anchor points or interest regions to aid other applications. In LiDAR relocalization [6], [13] or LiDAR odometry estimation, by restricting the LiDAR points to those from landmarks matched using VGIDM, many outlier points can be removed, leading to better estimation accuracy.

APPENDIX A
PROOF OF PROPOSITION 1

From (12), we have

L_{ID} = P(x \leftrightarrow y)\,\mathbb{E}[\log d(\phi(x), \psi(\mathcal{G}^y)) \mid x \leftrightarrow y] + P(x \nleftrightarrow y)\,\mathbb{E}[\log(1 - d(\phi(x), \psi(\mathcal{G}^y))) \mid x \nleftrightarrow y]   (17)
       = \int_{\mathcal{A}} \Big\{ P(x \leftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \leftrightarrow y) \log d(\phi(x), \psi(\mathcal{G}^y)) + P(x \nleftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y) \log(1 - d(\phi(x), \psi(\mathcal{G}^y))) \Big\}\, \mathrm{d}(\phi(x), \psi(\mathcal{G}^y)),   (18)

where P(x \leftrightarrow y) and P(x \nleftrightarrow y) are the matched and unmatched probabilities. From (18), it is clear that L_{ID} is concave in d(\phi(x), \psi(\mathcal{G}^y)) for every (x, y). Taking the first derivatives and setting them to zero, we obtain

d^*(\phi(x), \psi(\mathcal{G}^y)) = \frac{P(x \leftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \leftrightarrow y)}{p(\phi(x), \psi(\mathcal{G}^y))}.   (19)

Substituting (19) into (18), we have

L_{ID}^{d^*} = P(x \leftrightarrow y)\,\mathbb{E}\!\left[\log \frac{p(\phi(x), \psi(\mathcal{G}^y) \mid x \leftrightarrow y)}{p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)} \,\middle|\, x \leftrightarrow y\right] + \mathbb{E}\!\left[\log \frac{p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)}{p(\phi(x), \psi(\mathcal{G}^y))}\right] - H_b(P(x \leftrightarrow y)).   (20)

Applying Jensen's inequality to the second term on the right-hand side of (20), we have

\mathbb{E}\!\left[\log \frac{p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)}{p(\phi(x), \psi(\mathcal{G}^y))}\right] \le \log \mathbb{E}\!\left[\frac{p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)}{p(\phi(x), \psi(\mathcal{G}^y))}\right] = \log \int_{\mathcal{A}} p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)\, \mathrm{d}(\phi(x), \psi(\mathcal{G}^y)) = 0.   (21)

Therefore, from (20), we have

L_{ID}^{d^*} \le P(x \leftrightarrow y)\, D\big(p(\phi(x), \psi(\mathcal{G}^y) \mid x \leftrightarrow y) \,\big\|\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)\big) - H_b(P(x \leftrightarrow y)).   (22)

Rearranging the inequality completes the proof.

APPENDIX B
PROOF OF PROPOSITION 2

Let \tilde{d} = d + \varepsilon. Similar to (17) in the proof of Proposition 1, it is easy to see

L_{ID}(\phi, \psi, \tilde{d}) = P(x \leftrightarrow y)\,\mathbb{E}[\log(d(\phi(x), \psi(\mathcal{G}^y)) + \varepsilon) \mid x \leftrightarrow y] + P(x \nleftrightarrow y)\,\mathbb{E}[\log(1 - d(\phi(x), \psi(\mathcal{G}^y)) - \varepsilon) \mid x \nleftrightarrow y],   (23)

where the notations are the same as those in (17). According to Taylor's series expansion theorem [77], we have

L_{ID}(\phi, \psi, \tilde{d}) - L_{ID}(\phi, \psi, d) = \varepsilon \left\{ P(x \leftrightarrow y)\,\mathbb{E}\!\left[\frac{1}{d(\phi(x), \psi(\mathcal{G}^y))} \,\middle|\, x \leftrightarrow y\right] - P(x \nleftrightarrow y)\,\mathbb{E}\!\left[\frac{1}{1 - d(\phi(x), \psi(\mathcal{G}^y))} \,\middle|\, x \nleftrightarrow y\right] \right\} - \frac{\varepsilon^2}{2} \left\{ P(x \leftrightarrow y)\,\mathbb{E}\!\left[\frac{1}{(d(\phi(x), \psi(\mathcal{G}^y)))^2} \,\middle|\, x \leftrightarrow y\right] + P(x \nleftrightarrow y)\,\mathbb{E}\!\left[\frac{1}{(1 - d(\phi(x), \psi(\mathcal{G}^y)))^2} \,\middle|\, x \nleftrightarrow y\right] \right\} + o(\varepsilon^2).   (24)

Furthermore, by substituting d(\phi(x), \psi(\mathcal{G}^y)) = d^*(\phi(x), \psi(\mathcal{G}^y)) given in (19) into (24), we have

L_{ID}(\phi, \psi, \tilde{d}^*) - L_{ID}(\phi, \psi, d^*) = -\frac{\varepsilon^2}{2} \int_{\mathcal{A}} \frac{\big(P(x \leftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \leftrightarrow y) + P(x \nleftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)\big)^3}{P(x \leftrightarrow y)\, P(x \nleftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \leftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)}\, \mathrm{d}(\phi(x), \psi(\mathcal{G}^y)) + o(\varepsilon^2),   (25)

and the proof is complete.

APPENDIX C
PROOF OF PROPOSITION 3

We have

\|p(\phi(x), \psi(\mathcal{G}^y) \mid x \leftrightarrow y) - p(\phi(x), \psi(\mathcal{G}^y) \mid x \nleftrightarrow y)\|_{TV}
= \sum_{(\phi(x), \psi(\mathcal{G}^y)) \in \mathcal{B}} \Big\{ m(x \leftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \leftrightarrow \mathcal{G}^y, x \leftrightarrow y) + (1 - m(x \leftrightarrow y))\, p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \nleftrightarrow \mathcal{G}^y, x \leftrightarrow y) - m(x \nleftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \leftrightarrow \mathcal{G}^y, x \nleftrightarrow y) - (1 - m(x \nleftrightarrow y))\, p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \nleftrightarrow \mathcal{G}^y, x \nleftrightarrow y) \Big\}
\ge \sum_{(\phi(x), \psi(\mathcal{G}^y)) \in \mathcal{B}} \Big\{ \min_{x \leftrightarrow y} m(x \leftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \leftrightarrow \mathcal{G}^y, x \leftrightarrow y) - \max_{x \nleftrightarrow y} m(x \nleftrightarrow y)\, p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \leftrightarrow \mathcal{G}^y, x \nleftrightarrow y) - p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \nleftrightarrow \mathcal{G}^y, x \nleftrightarrow y) \Big\}   (26)
\ge \min_{x \leftrightarrow y} m(x \leftrightarrow y) \sum_{(\phi(x), \psi(\mathcal{G}^y)) \in \mathcal{B}} \Big\{ p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \leftrightarrow \mathcal{G}^y, x \leftrightarrow y) - p(\phi(x), \psi(\mathcal{G}^y) \mid \mathcal{G}^x \nleftrightarrow \mathcal{G}^y, x \nleftrightarrow y) \Big\} + \min_{x \leftrightarrow y} m(x \leftrightarrow y) - \max_{x \nleftrightarrow y} m(x \nleftrightarrow y) - 1,   (27)

where the inequality (26) follows from 0 \le m(x \leftrightarrow y) \le 1 and 0 \le m(x \nleftrightarrow y) \le 1. The proof is now complete.

REFERENCES

[1] X. Zhang, S. Wang, Z. Li, and S. Ma, "Landmark image retrieval by jointing feature refinement and multimodal classifier learning," IEEE Trans. Cybern., vol. 48, no. 6, pp. 1682–1695, Jun. 2017.
[2] J. Zhu, H. Zeng, J. Huang, S. Liao, Z. Lei, C. Cai, and L. Zheng, "Vehicle re-identification using quadruple directional deep learning features," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 1, pp. 410–420, Mar. 2019.
[3] S. Wang, X. Guo, Y. Tie, L. Qi, and L. Guan, "Deep local feature descriptor learning with dual hard batch construction," IEEE Trans. Image Process., vol. 29, pp. 9572–9583, Oct. 2020.
[4] D. Quan, S. Wang, Y. Li, B. Yang, H. Ning, J. Chanussot, H. Biao, and L. Jiao, "Multi-relation attention network for image patch matching," IEEE Trans. Image Process., vol. 30, pp. 7127–7142, Aug. 2021.
[5] S. Liao and A. C. Chung, "Nonrigid brain MR image registration using uniform spherical region descriptor," IEEE Trans. Image Process., vol. 21, no. 1, pp. 157–169, Jun. 2011.
[6] N. Engel et al., "DeepLocalization: Landmark-based self-localization with deep neural networks," in Proc. IEEE Intell. Transp. Syst. Conf., 2019, pp. 926–933.
[7] Y. Wang, Y. Qiu, P. Cheng, and X. Duan, "Robust loop closure detection integrating visual-spatial-semantic information via topological graphs and CNN features," J. Remote Sensing, vol. 12, no. 23, p. 3890, Oct. 2020.
[8] Z. Zhu, T. Oskiper, S. Samarasekera, R. Kumar, and H. S. Sawhney, "Ten-fold improvement in visual odometry using landmark matching," in Proc. IEEE Int. Conf. Comput. Vision, 2007, pp. 1–8.
[9] J. Vincent, M. Labbé, J. S. Lauzon, F. Grondin, P. M. Comtois-Rivet, and F. Michaud, "Dynamic object tracking and masking for visual SLAM," in Proc. IEEE Int. Conf. Intell. Robots Syst., 2020, pp. 4974–4979.
[10] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, "Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free," in Proc. Robot.: Sci. Syst., 2017, pp. 5702–5708.
[11] S. Hausler, S. Garg, M. Xu, M. Milford, and T. Fischer, "Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2021, pp. 14 141–14 152.
[12] F. Xue, X. Wu, S. Cai, and J. Wang, "Learning multi-view camera relocalization with graph neural networks," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2020, pp. 11 372–11 381.
[13] J. Zhang and S. Singh, "Visual-lidar odometry and mapping: Low-drift, robust, and fast," in Proc. IEEE Int. Conf. Robot. Automat., 2015, pp. 2174–2181.
[14] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Jan. 2004.
[15] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," J. Comput. Vision Image Understanding, vol. 110, no. 3, pp. 346–359, Jun. 2008.
[16] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2008, pp. 1–8.

[17] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vision, 2011, pp. 2564–2571.
[18] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Proc. IEEE Int. Conf. Comput. Vision, 2011, pp. 2548–2555.
[19] A. Alahi, R. Ortiz, and P. Vandergheynst, "Freak: Fast retina keypoint," in Proc. IEEE Int. Conf. Comput. Vision, 2012, pp. 510–517.
[20] Y. Tian, B. Fan, and F. Wu, "L2-net: Deep learning of discriminative patch descriptor in euclidean space," in Proc. IEEE Int. Conf. Comput. Vision, 2017, pp. 661–669.
[21] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proc. IEEE Int. Conf. Comput. Vision, 2015, pp. 4353–4361.
[22] V. Kumar BG, G. Carneiro, and I. Reid, "Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions," in Proc. IEEE Int. Conf. Comput. Vision, 2016, pp. 5385–5394.
[23] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, "Learning local feature descriptors with triplets and shallow convolutional neural networks," in Proc. British Mach. Vision Conf., 2016, pp. 1–11.
[24] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg, "MatchNet: Unifying feature and metric learning for patch-based matching," in Proc. IEEE Int. Conf. Comput. Vision, 2015, pp. 3279–3286.
[25] A. Subramaniam, P. Balasubramanian, and A. Mittal, "NCC-net: Normalized cross correlation based deep matcher with robustness to illumination variations," in Proc. IEEE Winter Conf. Appl. Comput. Vision, 2018, pp. 1944–1953.
[26] D. Quan, X. Liang, S. Wang, S. Wei, Y. Li, H. Ning, and L. Jiao, "AFD-Net: Aggregated feature difference learning for cross-spectral image patch matching," in Proc. IEEE Int. Conf. Comput. Vision, 2019, pp. 3017–3026.
[27] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas, "Working hard to know your neighbor's margins: Local descriptor learning loss," in Proc. Conf. Neural Inform. Process. Syst., 2017, pp. 1–10.
[28] S. Wang, Y. Li, X. Liang, D. Quan, B. Yang, S. Wei, and L. Jiao, "Better and faster: Exponential loss for image patch matching," in Proc. IEEE Int. Conf. Comput. Vision, 2019, pp. 4812–4821.
[29] T. Ng, V. Balntas, Y. Tian, and K. Mikolajczyk, "SOLAR: Second-order loss and attention for image retrieval," in Proc. Eur. Conf. Comput. Vision, 2020, pp. 253–270.
[30] Y. Miao, Z. Lin, X. Ma, G. Ding, and J. Han, "Learning transformation-invariant local descriptors with low-coupling binary codes," IEEE Trans. Image Process., vol. 30, pp. 7554–7566, Aug. 2021.
[31] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, "D2-net: A trainable CNN for joint description and detection of local features," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2019, pp. 8092–8101.
[32] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan, "ASLFeat: Learning local features of accurate shape and localization," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2020, pp. 6589–6598.
[33] P. E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, "SuperGlue: Learning feature matching with graph neural networks," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2020, pp. 4938–4947.
[34] M. Amiri and H. R. Rabiee, "RASIM: A novel rotation and scale invariant matching of local image interest points," IEEE Trans. Image Process., vol. 20, no. 12, pp. 3580–3591, May 2011.
[35] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, "Deep graph infomax," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–13.
[36] F. Y. Sun, J. Hoffman, V. Verma, and J. Tang, "InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization," in Proc. Int. Conf. Learn. Representations, 2020, pp. 1–13.
[37] J. L. Schonberger and J. M. Frahm, "Structure-from-motion revisited," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2016, pp. 4104–4113.
[38] W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen, "Learning to recover 3D scene shape from a single image," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2021, pp. 204–213.
[39] H. Farid and E. P. Simoncelli, "A differential optical range camera," in Proc. Annu. Meeting Optical Soc. Amer., 1996, pp. 1–10.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–12.
[41] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 4–24, Mar. 2020.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Int. Conf. Comput. Vision, 2012, pp. 3354–3361.
[43] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, "The Oxford Radar RobotCar Dataset: A radar extension to the Oxford RobotCar Dataset," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 6433–6438.
[44] X. Zhang, F. X. Yu, S. Kumar, and S. F. Chang, "Learning spread-out local feature descriptors," in Proc. IEEE Int. Conf. Comput. Vision, 2017, pp. 4595–4603.
[45] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, "SOSNet: Second order similarity regularization for local descriptor learning," in Proc. IEEE Int. Conf. Comput. Vision, 2019, pp. 11 016–11 025.
[46] H. Pan, Y. Chen, Z. He, F. Meng, and N. Fan, "TCDesc: Learning topology consistent descriptors for image matching," IEEE Trans. Circuits Syst. Video Technol., vol. 521, pp. 436–444, Aug. 2021.
[47] I. Melekhov, J. Kannala, and E. Rahtu, "Siamese network features for image matching," in Proc. IEEE Int. Conf. Pattern Recognit., 2016, pp. 378–383.
[48] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "LIFT: Learned invariant feature transform," in Proc. Eur. Conf. Comput. Vision, 2016, pp. 467–483.
[49] D. DeTone, T. Malisiewicz, and A. Rabinovich, "Superpoint: Self-supervised interest point detection and description," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit. Workshops, 2018, pp. 224–236.
[50] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, "LoFTR: Detector-free local feature matching with transformers," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2021, pp. 8922–8931.
[51] J. Brogan, A. Bharati, D. Moreira, A. Rocha, K. W. Bowyer, P. J. Flynn, and W. J. Scheirer, "Fast local spatial verification for feature-agnostic large-scale image retrieval," IEEE Trans. Image Process., vol. 30, pp. 6892–6905, 2021.
[52] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[53] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2007, pp. 1–8.
[54] Y. Avrithis and G. Tolias, "Hough pyramid matching: Speeded-up geometry re-ranking for large scale image retrieval," Int. J. Comput. Vision, vol. 107, pp. 1–19, 2014.
[55] X. Li, M. Larson, and A. Hanjalic, "Pairwise geometric matching for large-scale object retrieval," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2015, pp. 5153–5161.
[56] Y. Wang, R. Zhao, L. Liang, X. Zheng, Y. Cen, and S. Kan, "Block-based image matching for image retrieval," J. Vis. Commun. Image Representation, vol. 74, p. 102998, 2021.
[57] B. Jiang, P. Sun, and B. Luo, "GLMNet: Graph learning-matching convolutional networks for feature matching," Pattern Recognit., vol. 121, p. 108167, 2022.
[58] H. Liu, T. Wang, Y. Li, C. Lang, Y. Jin, and H. Ling, "Joint graph learning and matching for semantic feature correspondence," Pattern Recognit., vol. 134, p. 109059, 2023.
[59] S. Winder, G. Hua, and M. Brown, "Picking the best daisy," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2009, pp. 178–185.
[60] H. Aanæs, A. L. Dahl, and K. Steenstrup Pedersen, "Interesting interest points," Int. J. Comput. Vision, vol. 97, no. 1, pp. 18–35, Jun. 2012.
[61] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2016.
[62] H. Zhou, D. Greenwood, and S. Taylor, "Self-supervised monocular depth estimation with internal feature fusion," arXiv preprint arXiv:2110.09482, 2021.
[63] X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and Y. Yuan, "HR-Depth: High resolution self-supervised monocular depth estimation," arXiv preprint arXiv:2012.07356, 2020.
[64] J. Yan, H. Zhao, P. Bu, and Y. Jin, "Channel-wise attention-based network for self-supervised monocular depth estimation," in Proc. Int. Conf. 3D Vision, 2021, pp. 464–473.
[65] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vision, 2017, pp. 2961–2969.
[66] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Int. Conf. Comput. Vision, 2016, pp. 770–778.

[67] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. Int. Conf. Learn. Representations, 2017, pp. 1–14.
[68] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation
learning on large graphs,” in Proc. Conf. Neural Inform. Process. Syst.,
2017, pp. 1025–1035.
[69] M. A. Uy and G. H. Lee, “PointNetVLAD: Deep point cloud based
retrieval for large-scale place recognition,” in Proc. IEEE Int. Conf.
Comput. Vision Pattern Recognit., 2018, pp. 4470–4479.
[70] H. Alhaija, S. Mustikovela, L. Mescheder, A. Geiger, and C. Rother,
“Augmented reality meets computer vision: Efficient data generation for
urban driving scenes,” Int. J. Compu. Vision, vol. 126, no. 9, pp. 961–972,
Mar. 2018.
[71] D. G. Lowe, “Object recognition from local scale-invariant features.” Int.
J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.
[72] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD:
CNN architecture for weakly supervised place recognition,” in Proc.
IEEE Int. Conf. Comput. Vision Pattern Recognit., 2016, pp. 5297–5307.
[73] F. Zhong, S. Wang, Z. Zhang, and Y. Wang, “Detect-SLAM: Making
object detection and slam mutually beneficial,” in Proc. IEEE Winter
Conf. Appl. Comput. Vision, 2018, pp. 1001–1010.
[74] Y. Wu, Y. Zhang, D. Zhu, Y. Feng, S. Coleman, and D. Kerr, “EAO-
SLAM: Monocular semi-dense object slam based on ensemble data
association,” in Proc. IEEE Int. Conf. Intell. Robots Syst., 2020, pp.
4966–4973.
[75] Z. Qian, K. Patath, J. Fu, and J. Xiao, “Semantic SLAM with autonomous
object-level data association,” in Proc. IEEE Int. Conf. Robot. Automat.,
2021, pp. 11 203–11 209.
[76] X. Lin, J. Ruan, Y. Yang, L. He, Y. Guan, and H. Zhang, “Robust data
association against detection deficiency for semantic slam,” IEEE Trans.
Autom. Sci. Eng., 2023.
[77] R. H. Crowell and W. E. Slesnick, Calculus with Analytic Geometry.
WW Norton, 1968.

S UPPLEMENTARY M ATERIAL
A. Landmark Datasets for Image Patch Matching
In this section, we present two landmark patch matching
datasets,5 named the Landmark KITTI Dataset and the Land-
mark Oxford Dataset, derived from the street scene KITTI
dataset and the Oxford Radar RobotCar Dataset respectively.
We first briefly introduce the two original public datasets both
of which contain image frames and LiDAR scans captured
from onboard cameras and Velodyne LiDAR sensors. The
KITTI dataset is a public dataset6 with multi-sensor data for
autonomous driving. It contains street scene image frames and
their corresponding LiDAR point clouds, which are captured in
Karlsruhe, Germany, using the Point Grey Flea 2 (FL2-14S3C-
C) Camera and Velodyne HDL-64E Laserscanner, respectively.
The frame resolution is 1241 × 376 pixels. The Oxford Radar
RobotCar dataset7 contains image frames and LiDAR scans
captured on the streets in Oxford, UK, by the Point Grey
Grasshopper2 (GS2-FW-14S5C-C) Camera and Velodyne HDL-32E Laserscanner, respectively. The resolution of each frame in this dataset is 1280 × 960 pixels.

Fig. 13. Several examples of ground truth landmark bounding box labels based on semantic segmentation masks in the KITTI dataset. Left: semantic segmentation images with bounding box labels. Right: real images with bounding box labels.
We extract the landmark object patches from the full-sized
image frames of the two original street scene datasets using an
object detection neural network. In the literature on landmark-
based applications, Edge Boxes are used to detect a bounding
box around a patch that contains a large number of internal
contours compared to the number of contours exiting from the
box, which indicates the presence of an object in the enclosed
patch. DeepLabV3+ is used to extract significant landmark
regions. However, all of the aforementioned patch extraction or
landmark detection approaches are not stable when removing
dynamic objects and many noisy regions are presented. By
contrast, in our datasets, we use Faster R-CNN as the stable
landmark object detector to locate the region of interest for
static roadside objects including traffic lights, traffic signs,
poles, and facade windows. To facilitate the detection efficacy,
we manually labeled those objects using the frames from the
street scene KITTI dataset and the Oxford Radar RobotCar
dataset. In Faster R-CNN, we choose Resnet50 with Feature
Pyramid Network (FPN) as the backbone, which is already
pretrained on the Imagenet dataset. During training, we use
Adam optimizer with learning rate 0.0002 and weight decay
0.0001 to train the detector for 50 epochs. The training batch
size is set as 2 and random horizontal flipping is used for data augmentation.

Fig. 14. Several examples of the ground truth landmark bounding box labels for the Oxford Radar RobotCar dataset.
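The detector fine-tuning setup described above can be sketched as follows (torchvision assumed). Note that torchvision's default detection weights are COCO-pretrained, whereas we start from an ImageNet-pretrained backbone; NUM_CLASSES (landmark classes plus background) and the training-loop details are illustrative.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 5  # background + traffic light, traffic sign, pole, window (illustrative)

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # ResNet50 backbone with FPN
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-4)

def train_step(images, targets):
    # images: list of image tensors; targets: list of {"boxes", "labels"} dicts
    model.train()
    loss_dict = model(images, targets)  # torchvision detectors return a dict of losses in train mode
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)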
We next introduce our landmark patch matching datasets. For both the Landmark KITTI dataset and the Landmark Oxford dataset, the full-sized image frames are captured by stereo cameras, and we only use the left frames to extract landmark patches. The details, such as the landmark object bounding box labels and the patch matching ground truth, are described separately for each dataset as follows.

Landmark KITTI Dataset. The segmentation labels are semantic segmentation masks. To perform landmark object detection, we need to first convert the semantic segmentation labels to object bounding box labels. We use skimage.measure.label to label connected regions for pixel classes including traffic lights, traffic signs and poles. See Fig. 13 for an example. In some rare cases, multiple poles may overlap and the connected region algorithm outputs an inaccurate bounding box. We manually exclude these overlapped objects in the generated bounding box labels. As mentioned above, Faster R-CNN trained using the labels is used to produce the object detection results for all the other unlabeled frames contained in the dataset.

5 https://github.com/AI-IT-AVs/Landmark_patch_datasets
6 http://www.cvlibs.net/datasets/kitti/
7 http://ori.ox.ac.uk/datasets/radar-robotcar-dataset
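The conversion from semantic segmentation masks to landmark bounding boxes described above can be sketched with scikit-image as follows; the class ids and mask layout are illustrative assumptions.

import numpy as np
from skimage.measure import label, regionprops

def masks_to_boxes(seg, class_ids=(17, 18, 19)):
    # seg: (H, W) array of per-pixel class ids; returns a list of (class_id, (x_min, y_min, x_max, y_max))
    boxes = []
    for cid in class_ids:  # e.g., traffic light, traffic sign, pole
        regions = label(seg == cid, connectivity=2)  # connected regions of this class
        for props in regionprops(regions):
            min_r, min_c, max_r, max_c = props.bbox
            boxes.append((cid, (min_c, min_r, max_c, max_r)))
    return boxes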

TABLE XII
NEURAL NETWORK MODELS AND THE PARAMETERS IN THE IMAGE PATCH MATCHING FRAMEWORK.

Mapping   Models          Layers (model parameters)                                                        Dimension of Outputs
f         Resnet          Resnet18 (without the last FC layer, with 17 convolution layers)                 512
g         GAT /           GAT block 1 (4 attention heads, 4 × 128 hidden features & ELU);                  512
          GCN /           GAT block 2 (4 attention heads, 4 × 128 hidden features & ELU) /
          GraphSAGE       GCN block 1 (512 hidden features & ReLU);
                          GCN block 2 (512 hidden features & ReLU) /
                          GraphSAGE block 1 ([512, 512] hidden features, ReLU & BatchNorm);
                          GraphSAGE block 2 (512 hidden features)
d         Discriminator   Bilinear layer (four 512 × 512 hidden partitioned matrices & Sigmoid function)   1
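To show how the components in Table XII fit together, the following is a schematic sketch of one VGIDM variant (GAT) in PyTorch, assuming PyTorch Geometric for the graph layers; layer sizes follow the table, while data handling, training, and the partitioned bilinear discriminator are simplified.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18
from torch_geometric.nn import GATConv

class VGIDMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights="DEFAULT")
        self.f = nn.Sequential(*list(backbone.children())[:-1])  # ResNet18 without the final FC -> 512-d
        self.gat1 = GATConv(512, 128, heads=4)                   # 4 heads x 128 = 512-d output
        self.gat2 = GATConv(512, 128, heads=4)
        self.d = nn.Bilinear(512, 512, 1)                        # simplified single bilinear discriminator

    def patch_features(self, patches):                           # patches: (N, 3, H, W) landmark crops
        return self.f(patches).flatten(1)                        # (N, 512) vertex features f(x)

    def graph_features(self, x, edge_index):                     # neighborhood graph over patch features
        h = F.elu(self.gat1(x, edge_index))
        return F.elu(self.gat2(h, edge_index))                   # (N, 512) graph features psi(G)

    def match_score(self, phi_x, psi_gy):
        return torch.sigmoid(self.d(phi_x, psi_gy)).squeeze(-1)  # score in [0, 1] for each feature pair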

We project the surrounding LiDAR points onto the image frame plane using the intrinsic camera matrix and the extrinsic camera matrix. Here, we have used the sensors' information (i.e., vehicle global ground truth locations) to accumulate the collected LiDAR scans to build the 3D LiDAR reference map. Due to the limited LiDAR field of view, a single LiDAR scan may not have any LiDAR point corresponding to some landmarks. To avoid this, we build a unified 3D LiDAR reference map similar to that in PointNetVLAD. Based on the 3D reference map, the LiDAR points reflected from the landmark patch are read out to obtain the global locations of the corresponding landmark objects. We apply DBSCAN to filter out some outlier points and obtain compact landmark objects. We then compute the L2 distance of each landmark patch pair from two frames to determine the patch matching ground truth. We have also gone through all the frames manually to remove or correct a few noisy landmark objects. Finally, for each detected landmark object, we intentionally expand its bounding box by 15 pixels on each side to include some background information. See Fig. 15 for an example.

Landmark Oxford Dataset. To build the Landmark Oxford dataset, we manually labeled landmarks including traffic lights, traffic signs, poles, and facade windows for 500 frames. See Fig. 14 for examples. Compared with the Landmark KITTI Dataset, we additionally include the window class in this dataset. (Window labels are not available for the Landmark KITTI Dataset yet. We will enrich the Landmark KITTI Dataset with window labels in future work.) We then train Faster R-CNN to obtain the landmarks for all 29,687 frames. Similar operations are performed to obtain the final landmark patches with matching ground truths. See Fig. 15 for some landmark patch examples.

Fig. 15. (a) and (b) are examples of landmark patch pairs from the Landmark KITTI dataset and the Landmark Oxford Dataset, respectively.

B. Detailed Model Parameters

The details of the model settings mentioned in Section IV-B of the paper are provided in Table XII.

C. Monocular Depth Estimation for VGIDM

In our work, we assume the spatial information of the segmentation is available to construct the neighborhood graphs in VGIDM. In Section IV of the paper, we perform evaluations on the two landmark datasets, where the Monocular Depth Prediction Module is used to obtain the spatial relationships among landmark patches contained in full-size images. The reported AbsRel of this depth estimation method is around 14 meters. Fig. 16 shows a few examples of the predicted depth. We observe that many objects in the predicted depth visualization are well distinguished from their surroundings.

Fig. 16. Image depth estimation results from the Monocular Depth Prediction Module for the KITTI dataset and the Oxford Radar RobotCar Dataset.
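As a sketch of how such per-patch depth estimates can be turned into the spatial relationships used for the neighborhood graphs, each patch center can be back-projected with the camera intrinsics and nearby landmarks connected; the intrinsics and the value of k below are placeholders.

import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    # pixel (u, v) with estimated depth -> 3D point in the camera frame
    z = depth
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def knn_edges(points, k=3):
    # points: (N, 3) landmark positions; returns directed edges to the k nearest neighbors
    points = np.asarray(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    edges = []
    for i in range(len(points)):
        for j in np.argsort(dists[i])[:k]:
            edges.append((i, int(j)))
    return edges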
