Image Patch-Matching with Graph-Based Learning

Abstract—… perception tasks for autonomous driving. Current methods focus on local matching for regions of interest and do not take into account spatial neighborhood relationships among the image patches, which typically correspond to objects in the environment. In this paper, we construct a spatial graph with the graph vertices corresponding to patches and edges capturing the spatial neighborhood information. We propose a joint feature and metric learning model with graph-based learning. We provide a theoretical basis for the graph-based loss by showing that the information distance between the distributions conditioned on matched and unmatched pairs is maximized under our framework. We evaluate our model using several street-scene datasets and demonstrate that our approach achieves state-of-the-art matching results.

Index Terms—Image patch-matching, graph neural network, Kullback-Leibler divergence, information distance maximization, visual place recognition

This work is supported by the Singapore Ministry of Education Academic Research Fund Tier 2 grant MOE-T2EP20220-0002 and the RIE2020 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
The first two authors, R. She and Q. Kang, contributed equally to this work.
R. She, Q. Kang, S. Wang, W. P. Tay and Y. L. Guan are with the Continental-NTU Corporate Lab, Nanyang Technological University, Singapore 639798 (emails: {rui.she@; wang1679@e.; qiyu.kang@; wptay@; eylguan@}ntu.edu.sg).
D. N. Navarro and A. Hartmannsgruber are with Continental Automotive Singapore Pte. Ltd., 80 Boon Keng Road, 339780, Singapore (emails: {diego.navarro.navarro; andreas.hartmannsgruber}@continental.com).

Fig. 1. Landmark patch-matching using spatial graphs in street scenes and its potential applications (e.g., vehicle self-localization using landmark maps, place recognition, depth estimation, odometry estimation, 3D reconstruction, and SLAM).
I. INTRODUCTION

As a critical and fundamental technique in visual perception, image matching is widely used in many applications, such as image retrieval [1] and vehicle re-identification [2]. Conceptually, the target of a matching task is to solve the similarity correspondence problem for contents from an image pair [3]–[5]. In landmark-based street-scene applications, semantic objects such as traffic signs, traffic lights and road-side poles [6]–[8] often serve as landmarks. The correspondence between the landmark patches captured at different locations may be further utilized as a cornerstone to solve other problems, including loop-closure detection in simultaneous localization and mapping (SLAM) [7], [9], place recognition [10], [11], multi-view camera relocalization [12], landmark-LiDAR vehicle relocalization [6], [13], and landmark-based odometry estimation [8].

In traditional image patch-matching methods, handcrafted local features based on pixel statistics or gradient information, such as SIFT [14], SURF [15], HOG [16] and ORB [17], are used. The similarity of a feature pair is commonly computed using predefined metrics, like the L2 distance and cosine distance. Moreover, a circular pattern with an adjustable radius is exploited in BRISK [18] and FREAK [19], which provides more efficient neighborhood information for computing relevant pixel statistics. However, these handcrafted features are not robust to viewpoint changes, varying illumination and transformations. Consequently, the matching performance of methods based on such handcrafted local features is often unstable [20].

With the rapid development of artificial intelligence techniques, deep learning methods, such as convolutional neural networks (CNNs), are widely used in image matching [21]–[23]. In this case, high-dimensional learned features replace handcrafted features in image representations. In joint feature and metric learning methods [24]–[26], the representations and similarity metrics are combined in an end-to-end learning framework, in which high-level features of the images are extracted and their similarities are learned simultaneously. Feature descriptor learning methods [20], [27]–[30] focus on high-level feature learning and try to keep matched samples close and unmatched samples far from each other in the corresponding feature space. The similarity is computed using a predefined similarity metric. In these approaches, matching is based on learning feature representations of each image separately and does not exploit the relationships between objects in the images. Recent keypoint-based learning methods such as D2-Net [31], ASLFeat [32] and SuperGlue [33] perform point-level correspondence based on the detected keypoints and their descriptors for the input images, which can also be used for the image matching task [34].

Fig. 2. Landmark patch matching in two full-sized images sampled from the Oxford Radar RobotCar dataset. The matched landmark patches are labeled with the same colored bounding boxes, while a white bounding box indicates that the landmark patch in one image has no matched pair in the other image. Green lines indicate the constructed graph edges in our model.

Unlike other image patch-matching tasks, rich spatial information for landmark patches is often available. For example, lamp posts along a road are usually spaced at equal intervals, and their relative locations with respect to (w.r.t.) each other in the environment provide additional information for the matching task. In special street scenes like the downtown or central business district (CBD), landmark patch-matching has advantages over conventional pixel-/point-level matching due to the presence of dynamic objects, such as vehicles and pedestrians. These dynamic objects captured by the vehicular cameras may have more matched pixels across different frames than static landmarks. However, matching these objects is useless or even harmful for tasks such as place recognition. To mitigate this issue, in this work, we perform the patch-matching task based on static landmarks such as traffic lights, traffic signs, poles, and windows.

Inspired by graph-level representation learning [35], [36], we propose to construct a graph for the neighborhood of an image patch and use graph-level representations to enrich the landmark patch embedding. We identify each landmark patch as a vertex of a graph and find the K-nearest neighbors based on estimated spatial information. In the literature, there exist various spatial information estimation techniques like structure-from-motion (SfM) [37], monocular or stereo depth estimation [38] and optical attenuation masks [39]. In this paper, for the sake of illustration, we choose an off-the-shelf monocular depth estimation method from [38] to estimate the landmark spatial relations. However, any other spatial estimation method or ranging sensor, such as LiDAR or a depth camera, can also be utilized in our framework. We form a clique whose vertex embeddings are learnable via a graph neural network (GNN) [40], [41]. This graph is utilized in our proposed patch-matching framework for object information characterization. The final matching score is an average of the graph and vertex embedding similarities.

We also introduce two landmark patch-matching datasets derived from the street-scene KITTI dataset [42] and the Oxford Radar RobotCar dataset [43]. Our paper focuses on matching image patches of specific static roadside objects from two full-sized images taken by cameras onboard vehicles. See Fig. 1 for an illustration. More specifically, we focus on static roadside objects including traffic lights, signs, lamp posts, and even windows on a building facade. This is because, in most landmark-based applications, other transient static objects, like parked cars, are inappropriate landmarks or do not have sufficiently distinctive features. Due to complex environmental conditions, such as occlusion by dynamic elements (e.g., pedestrians and vehicles) or changes in the scene viewpoint (especially when turning at sharp corners or traversing a stretch in opposite directions), the landmark patches may have dramatic differences in appearance. We refer the readers to the supplementary material for more details on the landmark patch-matching datasets' preparation. For a concrete illustration, some examples of matched and unmatched landmark patches are presented in Fig. 2.

Our contributions are summarized as follows:
• We propose a landmark patch-matching method with graph-based learning for vehicles in street scenes, which extends the feature representation approach used in traditional image patch-matching tasks and incorporates spatial relationship information.
• We analyze the fundamental principle and properties of the proposed graph-based loss function from an information-theoretic perspective.
• We introduce two landmark patch-matching datasets, which contain challenging street-view landmark patches captured in an autonomous driving environment.
• We empirically demonstrate that our method achieves state-of-the-art performance on the landmark patch-matching task when compared to various other benchmarks.

The rest of this paper is organized as follows. In Section II, related works are discussed. Our model and framework are introduced in Section III, where we also provide a theoretical analysis of our graph-based loss. We present experimental results in Section IV and conclude the paper in Section V. The proofs of all lemmas and propositions in this paper are provided in the appendices.

II. RELATED WORKS

Since deep learning-based methods play dominant roles in the image-matching problem, we only discuss deep learning-based works here. Deep learning-based methods include feature descriptor learning, joint feature and metric learning, as well as keypoint-based correspondence learning.

Feature descriptor learning. High-level features of an image are first extracted using a neural network like a CNN so that matched samples are close while unmatched samples are distant under a similarity metric, which is chosen to be a feature distance function. In many models [22], [23], a pairwise or triplet loss is used to train the neural networks. To improve performance, [44] designs a regularization that maximizes the spread of local feature descriptors over the descriptor space, from which a better embedding for image-level features is obtained. To ensure many samples are accessible to the descriptor network within a few epochs, L2-Net [20] uses a progressive sampling strategy.
Furthermore, HardNet [27] is designed to fully utilize hard negative samples by making the closest positive sample far away from the closest negative sample in a batch. The reference [28] overcomes the hard-sample learning issue by using exponential Siamese and triplet losses, which naturally pay more attention to hard samples and less attention to easy ones. SOSNet [45] learns better local descriptors by introducing second-order similarity (SOS) into the loss function as a regularization. Moreover, [29] designs two second-order components, i.e., second-order spatial information and second-order descriptor space similarity, to achieve feature map re-weighting and global descriptor learning, respectively. The paper [46] proposes topology consistent descriptors (TCDesc) based on neighborhood information of descriptors, which can be combined with other methods via the triplet loss.

Joint feature and metric learning. In joint feature and metric learning, the similarity metric is not predefined but is instead a trainable network learned together with the feature extraction network. In this case, the matching task is treated as a binary classification task by resorting to the similarity metric network with a classification loss function. As a classical method, MatchNet [24] extracts high-level features using deep CNNs and measures feature similarity using fully connected (FC) layers. To compare different network architectures for the matching task, several networks, including SiameseNet, Pseudo-SiameseNet and the 2-channel network, are investigated in [21], [47]. The 2-channel network merges the two images into a 2-channel image to achieve faster convergence. SiameseNet and Pseudo-SiameseNet both use two branches with the same structure to extract high-dimensional features, with and without shared weights, respectively. Using the normalized cross-correlation (NCC) as a metric, [25] proposes NCC-Net, which utilizes robust matching layers to measure the similarity of feature pairs. To tackle cross-spectral image matching, AFD-Net is proposed in [26] to aggregate multi-level feature differences, which strengthens the discrimination capability of the network.

Keypoint-based correspondence learning. In keypoint-based correspondence learning, the main procedure is to construct neural networks that perform keypoint detection and description and to measure or learn the keypoints' similarity for matching inference. For instance, LIFT [48] is designed as a unified deep network architecture in which keypoints are detected in the first network, the orientation for cropped regions is estimated in the second network, and the feature description is performed in the third network. Here, the Euclidean distance is used to measure the similarity of features. The SuperPoint approach [49] introduces a self-supervised domain adaptation framework named Homographic Adaptation into interest point detection and description. D2-Net [31] makes use of a single CNN to perform dense feature description and detection simultaneously, where the detection, instead of being based on low-level image structures, is postponed to the high-level structures, which are also used for image description. Based on the D2-Net backbone architecture, ASLFeat [32] is equipped with three lightweight yet effective modifications, which provide better local shape estimation and more accurate keypoint localization. The above methods all measure point-level correspondence based on Euclidean distances. On the other hand, SuperGlue [33] is designed using attention GNNs and the Sinkhorn algorithm for keypoint-based feature matching. LoFTR [50] achieves accurate semi-dense matches with Transformers including self- and cross-attention layers. Generally speaking, all the above keypoint-based correspondence learning methods can be used to perform the image matching task with further operations on the keypoint matching scores [33].

To improve image matching performance, spatial information is used in [33], [50], [51] through spatial verification, graph learning, and cross attention. In spatial verification, spatial information is usually used for transformation calibration w.r.t. the keypoints or objects, as well as a correspondence auxiliary for direction or location w.r.t. the objects of interest [51]. This can introduce global information to improve local correspondence. In particular, transformation optimization methods like RANSAC [52], fast spatial measure (FSM) [53], Hough pyramid matching (HPM) [54] and pairwise geometric matching (PGM) [55] can filter out weak correspondences for keypoints or local features obtained by key feature detectors and descriptors such as SIFT [14], SURF [15], and ORB [17]. Region-based or object-based verification methods, such as Objects in Scene to Objects in Scene (OS2OS) [51] and block-based image matching [56], make use of the relative positions of local patches to refine whole-image matching. Different from the above approaches, our method uses distance-based spatial information for neighborhood graph construction, rather than for transformation correction or weak-correspondence filtering.

Graph learning methods such as SuperGlue [33], GLMNet [57], and the joint graph learning and matching network (GLAM) [58] are exploited to represent local features based on neighborhood graphs for keypoints. The graphs are constructed based on the detected keypoints or the corresponding features, and GNNs are used to learn graph representations. These methods achieve more robust and stable representations for the corresponding features based on spatial information.

Different from the above methods, our approach focuses on neighborhood information based on landmark distances, which is used for patch-level, rather than point-level, representation and is not used to filter weak or invalid correspondences. Moreover, we also adopt GNNs to represent the patch-level neighborhood graphs, which is demonstrated to be beneficial for the landmark patch-matching task.

III. LANDMARK PATCH-MATCHING WITH GRAPH-BASED LEARNING

In this section, we first introduce our graph-based learning framework to find matched landmark patch pairs that are extracted from two images taken from on-vehicle cameras. The images may be taken from different perspectives, and our framework can also identify those patches that are unmatched. Fig. 2 shows examples of matched and unmatched landmark patch pairs. We then discuss the theoretical basis for our graph-based learning approach.
Fig. 3. VGIDM: landmark patch-matching with graph-based learning. The Resnet f shown in the framework is a shared network serving as the feature descriptor function f to extract high-dimensional features from patches. Likewise, the discriminator d is also shared to make a decision for the vertex-to-graph correspondence. The model takes as input a pair of image patches that correspond to street-scene landmarks.
A. Framework Overview

Similar to other patch-matching datasets like the multi-view stereo (MVS) dataset [59] and the DTU dataset [60], in our work, the landmark patches are extracted from the full-sized images and the matching ground truths are established using 3D points. More specifically, the landmark patches are extracted using well-known object detection techniques like Faster R-CNN [61]. To distinguish the full-sized images from the landmark patches, we use the term frame to denote the full-sized image from which the patches are extracted. We refer the readers to Section IV-A for more details on the preparation of the landmark patch-matching datasets.

We assume that the spatial information (i.e., approximate relative distances between landmark objects) of landmark patches is available. The spatial information can be obtained from range estimation methods like the monocular depth estimation networks [62]–[64] in both the training and the testing phases. To construct a graph, we let the landmark patches of a frame be the vertices of the graph. For each patch or vertex x, we find the K nearest neighbors in terms of spatial locations as indicated by the observed spatial information. An example of the constructed graph is shown in Fig. 2. For the vertex x, we form a complete graph, or clique, with its K nearest neighbors. Let G^x denote this neighborhood graph.

Our image patch-matching framework is illustrated in Fig. 3. In this framework, the inputs to the model are image patches obtained by semantic or instance segmentation methods, e.g., Mask R-CNN [65], or object detection methods, e.g., Faster R-CNN [61]. These two kinds of methods can extract objects of interest such as traffic lights and traffic signs from image frames. Two main modules, a Resnet [66] and a GNN, are respectively used for image feature extraction and neighborhood graph embedding, where the GNN can be the graph attention network (GAT) [40], graph convolutional network (GCN) [67], GraphSAGE [68] or any other GNN architecture. Given the vertex and graph embedding features from our model, we maximize the empirical information distance between the cases where patches are matched and unmatched. We call our image patch-matching approach Vertex-Graph-learning-and-Information-Distance-Maximization (VGIDM). The details are given as follows.
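For concreteness, the clique construction described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the landmark identifiers, the estimated 3D locations and the value of K are assumed inputs.

import numpy as np

def build_neighborhood_cliques(locations, K=3):
    """For each landmark patch (vertex), find its K nearest neighbours in terms of
    estimated spatial location and return the clique (complete subgraph) formed by
    the vertex and those neighbours, as a list of undirected edges.

    locations: (N, 3) array of estimated landmark positions for one frame."""
    locations = np.asarray(locations, dtype=float)
    N = locations.shape[0]
    # Pairwise Euclidean distances between landmark locations.
    dist = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
    cliques = {}
    for v in range(N):
        order = np.argsort(dist[v])
        neighbours = [int(u) for u in order if u != v][:K]
        members = [v] + neighbours
        # Complete graph (clique) on the vertex and its K nearest neighbours.
        edges = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
        cliques[v] = {"members": members, "edges": edges}
    return cliques

# Example: five landmarks with rough 3D positions (in metres).
if __name__ == "__main__":
    pts = np.array([[0, 0, 10], [2, 0, 11], [4, 1, 12], [20, 0, 30], [21, 1, 31]])
    print(build_neighborhood_cliques(pts, K=2)[0])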
B. Model Details

Our objective is to determine if two landmark patches from different frames are matched with each other. In VGIDM, the feature extraction module f first learns embeddings for the input landmark patch as well as the patches in its neighborhood graph. The model then makes use of a learnable graph embedding module g to produce the neighborhood graph-level and vertex-level readout features. Finally, it uses a decision-making module to compute the matching classification.

Feature extraction for patches. We use the Resnet f to extract high-dimensional features for each landmark patch x. The Resnet output is denoted by f(x) ∈ R^n. Recall that for a patch x, we form a neighborhood graph G^x. Applying f to each node in G^x, we have {f(x′)}_{x′∈G^x}.

Embedding representation for the neighborhood graph and its vertices. The graph G^x is input to a GNN g to obtain a graph-level embedding representation g(G^x). Specifically, the vertex features (f(x′))_{x′∈G^x} ∈ R^{|G^x|×n} are updated via the GNN, which consists of several layers of neighborhood aggregation and node update [40], [41], followed by some activation functions and a final pooling layer. The vertex embeddings (ρ(x′))_{x′∈G^x} ∈ R^{|G^x|×n} are obtained from the last graph convolutional/attentional layer of the GNN, while the graph-level embedding representation g(G^x) ∈ R^n is obtained as the output of the last pooling layer.

Compared to f(x) ∈ R^n, which extracts features directly from the patch x, ρ(x) ∈ R^n learns a feature embedding with additional information from its neighborhood, while g(G^x) ∈ R^n learns an embedding for the surrounding environment itself.

Correspondence comparison. Suppose that x and y are landmark patches from two different frames. If x and y are patches for the same real-world object, we say that they are matched and denote this event as x ↔ y. Otherwise, they are unmatched, denoted as x ↮ y. For any patch pair (x, y), we denote the matching ground truth label as 1{x↔y}, where 1{·} is the indicator function. To compare the correspondence between the patch pair (x, y), we design a decision-making mechanism based on the patch features. For the two patches x and y, we respectively obtain f(x) and f(y) as the features from the Resnet, ρ(x) and ρ(y) as the vertex-level embedding features, and g(G^x) and g(G^y) as the graph-level embedding features from the GNN.

Let the ensemble vertex embedding for a patch x be

φ(x) = ρ(x) ∥ f(x)    (1)

and the neighborhood graph embedding for G^x be

ψ(G^x) = g(G^x) ∥ φ(x) = g(G^x) ∥ ρ(x) ∥ f(x),    (2)

where ∥ is the concatenation operation.

The ensemble vertex feature for x and the graph embedding for G^y are input to a discriminator d consisting of a bilinear layer of the form

d(a, b) = σ(a⊤ × M × b),    (3)

where M ∈ R^{n×m} is a trainable matrix and σ(·) denotes the sigmoid function. In particular, the matrix M is designed as

M = [ 0    M12  0
      M21  M22  M23 ],    (4)

where M12, M21, M22 and M23 are matrix blocks with learnable parameters and 0 denotes the zero matrix. This specifically designed block matrix (4) restricts the comparison between the features. Inputting (φ(x), ψ(G^y)) into the discriminator d, we have

d(φ(x), ψ(G^y)) = σ( [ρ(x)⊤  f(x)⊤] × M × [g(G^y); ρ(y); f(y)] )    (5)
= σ( ρ(x)⊤ M12 ρ(y) + f(x)⊤ M21 g(G^y) + f(x)⊤ M22 ρ(y) + f(x)⊤ M23 f(y) ).    (6)

The first term ρ(x)⊤ M12 ρ(y) in (6) compares the vertex embeddings of x and y obtained from the GNN. This emphasizes the domain part of the embedding. The second term f(x)⊤ M21 g(G^y) and third term f(x)⊤ M22 ρ(y) compare the vertex x with the neighborhood graph of y. This helps to constrain the GNN learning. The last term f(x)⊤ M23 f(y) compares the Resnet features of the two vertices, which updates the Resnet training. The same procedure is performed analogously for φ(y) = ρ(y) ∥ f(y) and ψ(G^x) = g(G^x) ∥ φ(x).

The learnable discriminator d(φ(x), ψ(G^y)) in (6) utilizes the ensemble vertex embedding φ(x) and the neighborhood graph embedding ψ(G^y). The vertex-level embedding ρ(x) and graph-level embedding g(G^x) contain information from the vertex feature f(x) (the output of the Resnet) due to the incorporation of neighborhood information by the GNN. In the case of a large number of frames, the neighborhood graphs can be quite different, as they typically consist of vertices from different frames. As a result, the embeddings ρ(x), g(G^x) and ρ(y), g(G^y) can have somewhat different features even if x ↔ y. Therefore, it may be appropriate to use the original vertex feature f(x) to constrain the graph learning for the vertex-level embedding ρ(y) and the graph-level embedding g(G^y). The comparisons between f(x) and ρ(y) or g(G^y) can emphasize the principal component of the learned graph features. When vertices x and y are matched, ρ(y) and g(G^y) essentially contain the information of f(x). Therefore, comparing f(x) with ρ(y) and g(G^y) introduces more information with neighborhood characteristics into the matching process.
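The block-structured bilinear discriminator of (3)–(6) can be sketched as follows. This is a minimal PyTorch illustration under our reading of (4); the random initialization and the batching convention are our own choices rather than the authors' implementation.

import torch
import torch.nn as nn

class BlockBilinearDiscriminator(nn.Module):
    """Discriminator d(phi(x), psi(G^y)) of (3)-(6): a bilinear form whose trainable
    matrix M has the block structure in (4), so only the four comparisons in (6)
    contribute to the score."""

    def __init__(self, n):
        super().__init__()
        # One trainable n-by-n block per term in (6).
        self.M12 = nn.Parameter(torch.randn(n, n) / n ** 0.5)  # rho(x) vs rho(y)
        self.M21 = nn.Parameter(torch.randn(n, n) / n ** 0.5)  # f(x) vs g(G^y)
        self.M22 = nn.Parameter(torch.randn(n, n) / n ** 0.5)  # f(x) vs rho(y)
        self.M23 = nn.Parameter(torch.randn(n, n) / n ** 0.5)  # f(x) vs f(y)

    def forward(self, f_x, rho_x, f_y, rho_y, g_Gy):
        # Batched version of (6); every input has shape (B, n).
        score = (torch.einsum("bi,ij,bj->b", rho_x, self.M12, rho_y)
                 + torch.einsum("bi,ij,bj->b", f_x, self.M21, g_Gy)
                 + torch.einsum("bi,ij,bj->b", f_x, self.M22, rho_y)
                 + torch.einsum("bi,ij,bj->b", f_x, self.M23, f_y))
        return torch.sigmoid(score)

def phi(rho_x, f_x):
    # Ensemble vertex embedding (1): phi(x) = rho(x) || f(x).
    return torch.cat([rho_x, f_x], dim=-1)

def psi(g_Gx, rho_x, f_x):
    # Neighborhood graph embedding (2): psi(G^x) = g(G^x) || rho(x) || f(x).
    return torch.cat([g_Gx, rho_x, f_x], dim=-1)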
Loss function and matching score. Let M be a training set consisting of patch pairs (x, y). Define the graph-based learning objective function L_empID, which depends on the discriminator d in (6), as

L_empID = (1/|M|) Σ_{(x,y)∈M} { 1{x↔y} · (1/2) [ log d(φ(x), ψ(G^y)) + log d(φ(y), ψ(G^x)) ]
          + 1{x↮y} · (1/2) [ log(1 − d(φ(x), ψ(G^y))) + log(1 − d(φ(y), ψ(G^x))) ] }    (7)
        = (1/2) · (1/|M|) Σ_{(x,y)∈M} { 1{x↔y} log d(φ(x), ψ(G^y)) + 1{x↮y} log(1 − d(φ(x), ψ(G^y))) }
          + (1/2) · (1/|M|) Σ_{(x,y)∈M} { 1{y↔x} log d(φ(y), ψ(G^x)) + 1{y↮x} log(1 − d(φ(y), ψ(G^x))) },    (8)

where the first and second halves of (8) are denoted by L_empID−1 and L_empID−2, respectively. We show in Proposition 1 that L_empID is the empirical version of an information distance between the distributions conditioned on matched and unmatched pairs. We set our overall loss as

min_{φ,ψ,d} {−L_empID},    (9)

to maximize the information distance.

In the testing phase, the final matching score is given by

S_match(x, y) = [ d(φ(x), ψ(G^y)) + d(φ(y), ψ(G^x)) ] / 2.    (10)
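A minimal PyTorch sketch of the empirical objective (7)–(8), the overall loss (9) and the test-time score (10)–(11) follows; the discriminator outputs are placeholders and the clamping constant is our own numerical safeguard rather than part of the paper.

import torch

def emp_id_loss(d_xy, d_yx, label, eps=1e-7):
    """Negative of L_empID in (8), cf. (9).
    d_xy = d(phi(x), psi(G^y)) and d_yx = d(phi(y), psi(G^x)) for a batch of pairs;
    label = 1 for matched pairs (x <-> y) and 0 otherwise."""
    d_xy = d_xy.clamp(eps, 1 - eps)
    d_yx = d_yx.clamp(eps, 1 - eps)
    matched = label * 0.5 * (torch.log(d_xy) + torch.log(d_yx))
    unmatched = (1 - label) * 0.5 * (torch.log(1 - d_xy) + torch.log(1 - d_yx))
    L_empID = (matched + unmatched).mean()
    return -L_empID  # minimising this maximises L_empID, cf. (9)

def match_decision(d_xy, d_yx, threshold=0.5):
    """Matching score (10) and thresholded decision (11); the threshold plays the role of Gamma."""
    s_match = 0.5 * (d_xy + d_yx)
    return s_match, (s_match > threshold).long()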
The prediction function for whether there is a match is then given by

A_S^test(x, y) = 1 if S_match(x, y) > Γ, and 0 otherwise,    (11)

where Γ is a predefined threshold. A decision "1" indicates that x and y are matched and "0" otherwise.

C. Theoretical Basis

In this subsection, we discuss the theoretical basis for the graph-based learning objective function L_empID defined in (8). To make the analysis tractable, we assume that patch pairs (x, y) are randomly generated from a distribution P. Let E denote the expectation operator. We start with a simplifying assumption.

Assumption 1. φ(x) and ψ(G^y) are continuous random variables induced from P.

In practice, due to the activation functions used in the Resnet f and the GNN g, their outputs typically satisfy the continuity requirement of Assumption 1.

In our analysis, the discriminator d is assumed to be a general function without necessarily having the form (3). Let A be the set of all possible (φ(x), ψ(G^y)), where ∫_A p(φ(x), ψ(G^y)) d(φ(x), ψ(G^y)) = 1 and p : A → R+ is a probability density whose set of discontinuities has Lebesgue measure zero.

For any given landmark patches x and y, we assume that P(x ↔ y) > 0 and P(x ↮ y) > 0. The probability densities of (φ(x), ψ(G^y)) conditioned on x ↔ y and x ↮ y are denoted by p(φ(x), ψ(G^y) | x ↔ y) and p(φ(x), ψ(G^y) | x ↮ y), respectively.¹

We discuss only L_empID−1 in (8) since L_empID−2 is symmetrical to it. The expectation form of L_empID−1 is given by

L_ID = L_ID(φ, ψ, d) = E[ 1{x↔y} log d(φ(x), ψ(G^y)) ] + E[ 1{x↮y} log(1 − d(φ(x), ψ(G^y))) ].    (12)

In minimizing the loss in (9), in the asymptotic regime |M| → ∞, we aim at max_{φ,ψ,d} L_ID. Let D(· ∥ ·) denote the Kullback-Leibler (KL) divergence.

Proposition 1 (Relationship with KL divergence). Suppose Assumption 1 holds. For a vertex embedding φ and a neighborhood graph embedding ψ, let L_ID^{d*}(φ, ψ) = max_d L_ID(φ, ψ, d), where d* is the corresponding optimal discriminator. Then

D( p(φ(x), ψ(G^y) | x ↔ y) ∥ p(φ(x), ψ(G^y) | x ↮ y) )    (13)
≥ (1 / P(x ↔ y)) [ L_ID^{d*}(φ, ψ) + H_b(P(x ↔ y)) ],    (14)

where H_b(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function.

Proof. See Appendix A.

Remark 1. Proposition 1 suggests that maximizing L_ID over (φ, ψ, d) helps to distinguish between the matched and unmatched patch pairs, since their conditional distributions are forced to be very different in terms of the KL divergence.

¹ Here we abuse notation and write p(φ(x), ψ(G^y) | x ↔ y) for the conditional probability density of (φ(x), ψ(G^y)) given that x and y are matched. This avoids the cluttered notation p_{(φ(x),ψ(G^y)) | x↔y}(·, ·).
function without necessarily having the form (3). discriminator d.
R Let A be the set of all possible (φ(x), ψ(G y )) where
A
p(φ(x), ψ(G y ))d(φ(x), ψ(G y )) = 1, p : A 7→ R+ is a Proposition 2 (Effect of discriminator perturbation). Suppose
probability density whose set of discontinuities has Lebesgue Assumption 1 holds. Let ε be a sufficiently small perturbation to
measure zero. the discriminator d. Then, |LID (φ, ψ, d + ε) − LID (φ, ψ, d)| =
For any given landmark patches x and y, we assume that O(ε). Furthermore, we have | maxd LID (φ, ψ, d + ε) − maxd
2
P(x ↔ y) > 0 and P(x ↮ y) > 0. The probability densities of LID (φ, ψ, d)| = O(ε ).
y
(φ(x), ψ(G )) conditioned on x ↔ y and x ↮ y are denoted Proof. See Appendix B.
by p(φ(x), ψ(G y ) | x ↔ y) and p(φ(x), ψ(G y ) | x ↮ y),
respectively.1 In the following, we consider how the GNN embedding of
x
We discuss only LempID−1 in (8) since LempID−2 is sym- the neighborhood graph G of a vertex x affects the matching
metrical to it. The expectation form of LempID−1 is given effectiveness under further assumptions.
by For two landmark patches x and y, if their neighborhood
graphs G x and G y have vertices corresponding to the same set
LID = LID (φ, ψ, d) of objects, i.e., the patch and spatial information procedure
= E 1{x↔y} log d(φ(x), ψ(G y ))
identifies the same objects as the neighbors of x and y, we
x y
+ E 1{x↮y} log(1 − d(φ(x), ψ(G y ))) . (12) write G ↔ G .
Assumption 2. The ranges of φ(·) and ψ(·) are finite sets. The
1 Here we abuse notations p(φ(x), ψ(G y )|x ↔ y) to denote the conditional
probability density of (φ(x), ψ(G y )) given that x and y are matched. This is embedding φ(x) = φ(y) for landmark patches x and y are the
to avoid the cluttered notation p(φ(x),ψ(G y ))|x↔y (·, ·). same if x ↔ y. If furthermore G x ↔ G y , then ψ(G x ) = ψ(G y ).
While the Resnet f and the GNN block g are in general continuous functions of their inputs, Assumption 2 can be satisfied by restricting to a finite number of objects of interest in the environment and assuming that frames are captured from approximately the same perspectives (e.g., from an on-vehicle camera of a vehicle traveling along a fixed road), so that landmark patches of the same object are within a certain similarity distance of each other. Finally, the outputs of f and g can be quantized into discrete ranges, which implies that φ and ψ have finite ranges. For the same object o in the environment but under two different frames F1 and F2, Assumption 2 says that the outputs of the embedding φ are the same for the two frames. This implicitly assumes that φ is robust to perturbations in its input. Furthermore, the outputs of the embedding ψ are also the same if the patch and spatial information are noiseless.

Proposition 3. Suppose Assumption 2 holds, and x and y are landmark patches of frames F1 and F2 (based on the same environment), respectively. Let m(x ↔ y) = P(G^x ↔ G^y | x ↔ y) and m(x ↮ y) = P(G^x ↔ G^y | x ↮ y). Then we have

∥p(φ(x), ψ(G^y) | x ↔ y) − p(φ(x), ψ(G^y) | x ↮ y)∥_TV
≥ min_{x↔y} m(x ↔ y) Σ_{(φ(x),ψ(G^y))∈B} { p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) − p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) }
  + min_{x↔y} m(x ↔ y) − max_{x↮y} m(x ↮ y) − 1,    (15)

where B = { (φ(x), ψ(G^y)) : p(φ(x), ψ(G^y) | x ↔ y) ≥ p(φ(x), ψ(G^y) | x ↮ y) }. Here, ∥·∥_TV denotes the total variation distance, and min_{x↔y} and max_{x↮y} denote minimization over all matched patch pairs (x, y) and maximization over all unmatched patch pairs, respectively.

Proof. See Appendix C.

In the ideal case where the patch and spatial information are noiseless, we have min_{x↔y} m(x ↔ y) = 1 and max_{x↮y} m(x ↮ y) = 0. Then the right-hand side of (15) simplifies to

Σ_{(φ(x),ψ(G^y))∈B} { p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) − p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) }.    (16)

In this case, we also have p(φ(x), ψ(G^y) | x ↔ y) = p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) and p(φ(x), ψ(G^y) | x ↮ y) = p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y). Furthermore, from Assumption 2, any (φ(x), ψ(G^y)) such that p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) > 0 implies that p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) = 0. These probability measures are thus mutually singular and have a total variation distance of 1. Therefore, in the ideal case, the model perfectly distinguishes between p(φ(x), ψ(G^y) | x ↔ y) and p(φ(x), ψ(G^y) | x ↮ y).

IV. EXPERIMENTS

A. Datasets

As there are no existing standard datasets for street-scene landmark patch-matching, we introduce two datasets: the Landmark KITTI dataset and the Landmark Oxford dataset,² which are derived from the street-scene KITTI dataset [42] and the Oxford Radar RobotCar dataset [43], respectively. Both datasets contain image frames and LiDAR scans captured from onboard cameras and Velodyne LiDAR sensors. The landmark patches are extracted from the full-sized image frames using the object detection neural network Faster R-CNN [61]. To facilitate detection efficacy, we manually label several street-scene compact landmark objects, including traffic lights, traffic signs, poles, and facade windows, for the sampled frames. The labels are used to train Faster R-CNN, which is then used to detect landmark objects in the image frames. The detected landmarks in bounding boxes are then used to obtain the landmark patches for our matching experiments, with some intentionally included background, as shown in Fig. 4.

Fig. 4. (a) and (b) are landmark patch samples (displayed with intentionally included background) from the KITTI dataset and the Oxford Radar RobotCar dataset, respectively.

To establish the patch-matching ground truth, we use the vehicle locations and collected LiDAR scans to build a 3D LiDAR reference map, similar to the operations in [69]. The 3D reference map is used to determine the landmark locations by projecting the 3D LiDAR points onto the image frames. The LiDAR points reflected from a landmark patch are read out to obtain the global location of the corresponding landmark object. We then compute the L2 distance of each landmark patch pair from two frames to determine the patch-matching ground truth. Some details of the two landmark patch-matching datasets are introduced as follows. More dataset preparation details are given in the supplementary material.

Landmark KITTI Dataset. The KITTI dataset³ contains street-scene image frames and their corresponding LiDAR point clouds collected in Karlsruhe, Germany. We use the object labels provided by [70] to detect landmark patches for all frames, including traffic lights, traffic signs and poles. An example is shown in Fig. 5. We do not include windows as landmarks in this dataset due to the lack of labels. Furthermore, to avoid "trivial matchings" between consecutive images, a minimum difference of 2 m between the image frames is also set. The aforementioned operations are performed to obtain the landmark patch-matching ground truth by projecting the 3D LiDAR scans onto the image frames. Finally, 1500 frames are selected for landmark patch-matching experiments.

² https://github.com/AI-IT-AVs/Landmark_patch_datasets
The dataset is randomly split into training and testing sets with a ratio of around 2 : 1. In both training and testing, we select frame pairs that are captured at locations with relative distances of not more than 25 m to ensure the presence of common landmarks.

Fig. 5. A semantic segmentation image and its corresponding real image, both with bounding box labels, from the KITTI dataset.

Landmark Oxford Dataset. The Oxford Radar RobotCar dataset⁴ contains image frames and LiDAR scans captured on the streets of Oxford, UK. We manually label landmarks including traffic lights, traffic signs, poles, and facade windows for 500 sampled frames. An example is shown in Fig. 6. We then train Faster R-CNN to obtain the landmarks for all 29,687 frames. To avoid "trivial matchings" between consecutive images, a minimum difference of 2 m between the image frames is also set. Finally, 3000 frames are selected for landmark patch-matching experiments. The remaining steps are similar to those for the Landmark KITTI dataset.

Fig. 6. Examples of the ground truth landmark bounding box labels for the Oxford Radar RobotCar dataset.

⁴ http://ori.ox.ac.uk/datasets/radar-robotcar-dataset

B. Experimental Details

Baseline Methods. We compare VGIDM with several baseline methods, including MatchNet [24], SiameseNet [47], HardNet [27], SOSNet [45], D2-Net [31], ASLFeat [32], SuperGlue [33] and LoFTR [50]. MatchNet and SiameseNet are joint feature and metric learning methods, combining deep CNNs and an FC layer to learn features and their metrics. The decision for the matching task is based on the output of the FC layer. HardNet and SOSNet focus on similarity measures to distinguish the learned high-dimensional features, where the feature descriptors are almost all based on deep CNNs consisting of several convolution layers with batch normalization (BN) or rectified linear units (ReLUs). In testing, the Euclidean distance between the output patch features is used for the decision-making. D2-Net, ASLFeat, SuperGlue and LoFTR are based on keypoint correspondence and perform the matching task according to the ratio of matched keypoints among the whole set of keypoints. In this regard, a patch pair with a large enough ratio of matched keypoints is regarded as a match.

Model Setting. We use Resnet18 [66] for the feature descriptor f, with output feature dimension 512 after 17 convolution layers. In VGIDM, we choose several GNNs for the neighborhood graph embedding, including GAT [40], GCN [67] and GraphSAGE [68]. When using GAT, the network contains 2 GAT blocks with the exponential linear unit (ELU). For each GAT block, we use 4 attention heads, which compute 512 hidden features in total. As for GCN and GraphSAGE, they both contain 2 corresponding blocks with ReLU, where there are 512 hidden features in each block. Further details of our model architecture are provided in the supplementary material. The Adam optimizer is used with a learning rate of 0.0001 to train the model by minimizing the loss in (9). The number of training epochs is 150 for all datasets.
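A minimal PyTorch Geometric sketch consistent with the GAT setting above (two GAT blocks, 4 attention heads each, 512 hidden features in total, ELU activations, max-pooling readout); the per-head width and the readout wiring are our assumptions rather than the exact released architecture.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_max_pool

class NeighborhoodGAT(torch.nn.Module):
    """GNN g: two GAT blocks with 4 heads each (4 x 128 = 512 hidden features).
    The last attentional layer yields the vertex embeddings rho(.), and a
    max-pooling readout yields the graph-level embedding g(G^x)."""

    def __init__(self, in_dim=512, hidden=128, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)          # 512 -> 4*128
        self.gat2 = GATConv(hidden * heads, hidden, heads=heads)  # 512 -> 4*128

    def forward(self, x, edge_index, batch):
        # x: (num_vertices, 512) Resnet features f(.) of the patches in the graph.
        h = F.elu(self.gat1(x, edge_index))
        rho = F.elu(self.gat2(h, edge_index))   # vertex embeddings rho(.)
        g = global_max_pool(rho, batch)         # graph-level embedding g(G^x)
        return rho, g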
VGIDM with Image Depth Estimation. To test VGIDM in the case where precise depth information, like that provided by LiDAR, is unavailable, we construct neighborhood graphs using estimated image depth, with different GNNs in the backbone. Specifically, we include an image depth estimation method, the Monocular Depth Prediction Module proposed in [38]. Based on the estimated image depth, we can obtain the rough relative locations of the landmarks in the street scenes and use them to construct a neighborhood graph for each landmark in the test procedure. The depth estimation performance is provided in the supplementary material. The estimated image pixel depths are transformed to 3D locations w.r.t. the camera using its intrinsic matrix. We then use the estimated locations to test VGIDM. In this depth estimation method, the pre-trained ResNeXt101 model from [38] is utilized in our experiments, and the images are from the two landmark datasets. We extract the predicted depth points from the static roadside landmarks, including traffic lights, traffic signs, and poles, to compute the locations of the objects. Therefore, we can construct the neighborhood graphs and test VGIDM.
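The conversion from an estimated pixel depth to a 3D location in the camera frame uses the standard pinhole back-projection with the intrinsic matrix; a minimal sketch follows, with placeholder intrinsics.

import numpy as np

def backproject(u, v, depth, K):
    """Back-project pixel (u, v) with estimated depth (metres along the optical
    axis) to a 3D point in the camera frame: X = depth * K^{-1} [u, v, 1]^T."""
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

# Example with placeholder intrinsics (focal lengths and principal point in pixels).
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 180.0],
              [0.0, 0.0, 1.0]])
print(backproject(640.0, 200.0, 12.5, K))  # approximate landmark location (x, y, z)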
Implementation. For a given sequence of street-scene frames captured by a vehicular camera, we perform the following training steps: i) We use object detection methods like Faster R-CNN [61] to extract landmark patches for each frame. The landmarks include traffic lights, traffic signs, poles, and windows. ii) We manually label matching landmark patches. To determine the global locations of these landmarks, we combine vehicular Global Positioning System (GPS) information with data from LiDAR or stereo cameras. With this information, we are able to establish the ground truth for the matching landmark patches between two frames captured at the same location. iii) We take the global locations of landmarks to construct the neighborhood graph for each landmark patch based on K-NN. iv) We train VGIDM using landmark patch pairs with ground-truth labels. The details of the VGIDM training loss and test score are given in Section III-B.

During testing, we perform steps i and iii as above, but in step iii, we create neighborhood graphs by estimating the relative locations of landmarks using a stereo camera or a depth estimation method, which replaces the need for GPS and LiDAR information. The ground truth for computing the testing performance is based on GPS and LiDAR information. The remaining steps are the same as those used during training.
TABLE I
MATCHING PERFORMANCE ON THE LANDMARK KITTI DATASET. THE BEST AND THE SECOND-BEST RESULT FOR EACH CRITERION ARE HIGHLIGHTED IN RED AND BLUE, RESPECTIVELY.

Fig. 7. Examples of matched and mismatched pairs from the Landmark KITTI dataset. A green or red box indicates a correct or incorrect prediction result, respectively. "GT" stands for ground truth.
C. Performance Evaluation

Performance on Landmark KITTI Dataset. Table I summarizes the test performance of models trained for 150 epochs on the Landmark KITTI dataset. The evaluation uses statistics, including the mean value and standard deviation, from 5 experiments. From Table I, we observe that VGIDM (with GAT, GCN or GraphSAGE) outperforms the other baseline methods under all four criteria, with only a slight performance difference among the VGIDM variants. This implies that graph-based learning makes a positive difference in matching efficiency. Moreover, we observe that VGIDM with GAT has a more stable performance than the other methods. Several examples of the matching predictions are shown in Fig. 7.

Performance on Landmark Oxford Dataset. From Table II, we observe that the VGIDM variants with different GNNs outperform the other benchmark methods on almost all measures. Since the Oxford Radar RobotCar and KITTI datasets have different image qualities and are collected in different street scenes, the performances on the two datasets differ. From Tables I and II, we also observe that nearly all the methods perform better on the Landmark KITTI dataset than on the Landmark Oxford dataset. This may be caused by greater similarity among the window patches in the Landmark Oxford dataset, which makes distinguishing them more difficult. A few matching prediction examples from the Landmark Oxford dataset are shown in Fig. 8.

Performance Analysis. The VGIDM variants (with GAT, GCN or GraphSAGE) make use not only of landmark patch information but also of neighborhood information in the decision-making process for the matching task. Other feature descriptor learning as well as joint feature and metric learning methods, such as MatchNet, SiameseNet, HardNet and SOSNet, depend only on the individual image patch rather than on neighborhood relationships. An erroneous match can happen between patches from two similar but distinct objects. VGIDM mitigates this error by using the neighborhood information. However, VGIDM requires more computing resources for neighborhood graph processing.

On the other hand, keypoint-based learning methods such as D2-Net, ASLFeat, SuperGlue and LoFTR suffer from low
TABLE II
MATCHING PERFORMANCE ON THE LANDMARK OXFORD DATASET. THE BEST AND THE SECOND-BEST RESULT FOR EACH CRITERION ARE HIGHLIGHTED IN RED AND BLUE, RESPECTIVELY.

Fig. 8. Examples of matched and mismatched pairs from the Landmark Oxford dataset. A green or red box indicates a correct or incorrect prediction result, respectively. "GT" stands for ground truth.
TABLE III
CROSS-VALIDATION ON THE LANDMARK KITTI DATASET OR LANDMARK OXFORD DATASET USING THE TRAINED MODEL BASED ON THE LANDMARK OXFORD DATASET OR LANDMARK KITTI DATASET, RESPECTIVELY. THE "BEST IN TABLE I" OR "BEST IN TABLE II" METHOD REFERS TO THE BEST-PERFORMING BASELINE OUT OF D2-NET, ASLFEAT, SUPERGLUE AND LOFTR FROM TABLE I OR TABLE II.

TABLE V
ABLATION STUDY FOR DIFFERENT DISCRIMINATOR FUNCTIONS.

Discriminator       Precision  Recall  F1-Score  AUC
Cosine similarity   0.9007     0.8707  0.8854    0.7913
L2 distance         0.7277     0.8800  0.7966    0.5540
Learnable d         0.9097     0.9533  0.9310    0.8347

TABLE VI
MATCHING PERFORMANCE OF DIFFERENT METHODS BASED ON SPATIAL NEIGHBORHOOD INFORMATION.

Methods                                 Precision  Recall  F1-Score  AUC
SIFT [71]+RANSAC [52] (+neighbors)      0.8965     0.9467  0.9209    0.8093
MatchNet [24] (+neighbors)              0.9023     0.9600  0.9302    0.8240
SuperGlue [33] (+neighbors)             0.9225     0.9200  0.9212    0.8440
LoFTR [50] (+neighbors)                 0.9192     0.9413  0.9302    0.8467
VGIDM [ours]                            0.9280     0.9627  0.9450    0.8693

Spatial Neighborhood Information. We investigate whether the performance improvement of VGIDM is mainly due to the spatial neighborhood information. To do this, we introduce the neighborhood graphs used in VGIDM to the other baseline methods. Specifically, for a given vertex, we sort its neighbors according to increasing distance from it. We then use each baseline method to compare not only the vertex pair but also the pairs of their corresponding neighbors with the same sort order. Then, we calculate the average of the predicted scores for the vertex pair and its neighbor pairs. Finally, we decide whether there is a match based on a threshold, which is a hyperparameter tuned separately to achieve the best performance for each baseline. Table VI shows results on the Landmark KITTI dataset, where the best test performance for each baseline model is selected.
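A minimal sketch of how the "+neighbors" variants above are evaluated, following the procedure just described: each baseline's pairwise score is averaged with the scores of the distance-sorted neighbour pairs before thresholding. The scoring function is a placeholder for any baseline matcher.

import numpy as np

def neighbor_augmented_score(score_fn, x, y, x_neighbors, y_neighbors):
    """score_fn(a, b) -> scalar matching score of a baseline method.
    x_neighbors and y_neighbors hold the two vertices' neighbour patches,
    each list already sorted by increasing spatial distance."""
    scores = [score_fn(x, y)]
    for xn, yn in zip(x_neighbors, y_neighbors):  # same sort order on both sides
        scores.append(score_fn(xn, yn))
    return float(np.mean(scores))

def decide(score, threshold):
    # The threshold is tuned separately for each baseline.
    return int(score > threshold)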
E. Computational Complexity

To evaluate the runtime performance, we test VGIDM on an NVIDIA RTX A5000 GPU. Table VIII shows the inference runtime (the mean time for one pair of frames in the testing phase) for the VGIDM variants with different GNNs. Specifically, given a pair of frames (i.e., full-sized images), an average of around twenty patch pairs are compared, which takes less than 0.25 seconds. The time taken is acceptable for practical applications, such as place recognition and autonomous driving. Moreover, the number of parameters of the VGIDM networks with GAT, GCN and GraphSAGE is 12.16M, 12.16M and 12.66M, respectively.

TABLE VIII
INFERENCE RUNTIME OF VGIDM ON THE LANDMARK OXFORD DATASET.

Methods            VGIDM (GAT)  VGIDM (GCN)  VGIDM (GraphSAGE)
Inference Runtime  0.2330s      0.1953s      0.2092s

Methods            SIFT+RANSAC (+neighbors)  MatchNet (+neighbors)  SuperGlue (+neighbors)  LoFTR (+neighbors)  VGIDM (GraphSAGE)
Inference runtime  0.7035s                   0.1591s                3.8072s                 4.3740s             0.2013s

F. Further Possible Applications

1) Application of VGIDM in Visual Place Recognition: A possible application of VGIDM is visual place recognition. We apply our local patch-matching to obtain global frame matching to determine if two frames show the same place. In visual place recognition, we construct a bipartite graph with edges
(Figure: visual place-recognition pipeline — landmark patch extraction, patch pairs with their neighborhood information, weight scores {w_ij} with a dustbin, the Sinkhorn algorithm, and the output results, together with example predictions and their ground truth labels.)
Rearranging the inequality completes the proof.

APPENDIX B
PROOF OF PROPOSITION 2

Let d̃ = d + ε. Similar to (17) in the proof of Proposition 1, it is easy to see

L_ID(φ, ψ, d̃) = P(x ↔ y) E[ log(d(φ(x), ψ(G^y)) + ε) | x ↔ y ] + P(x ↮ y) E[ log(1 − d(φ(x), ψ(G^y)) − ε) | x ↮ y ],    (23)

where the notations are the same as those in (17). According to Taylor's series expansion theorem [77], we have

L_ID(φ, ψ, d̃) − L_ID(φ, ψ, d)
= ε { P(x ↔ y) E[ 1 / d(φ(x), ψ(G^y)) | x ↔ y ] − P(x ↮ y) E[ 1 / (1 − d(φ(x), ψ(G^y))) | x ↮ y ] }
  − (ε²/2) { P(x ↔ y) E[ 1 / (d(φ(x), ψ(G^y)))² | x ↔ y ] + P(x ↮ y) E[ 1 / (1 − d(φ(x), ψ(G^y)))² | x ↮ y ] } + o(ε²).    (24)

Furthermore, by substituting d(φ(x), ψ(G^y)) = d*(φ(x), ψ(G^y)) given in (19) into (24), we obtain (25) and the proof is complete.

APPENDIX C
PROOF OF PROPOSITION 3

We have

∥p(φ(x), ψ(G^y) | x ↔ y) − p(φ(x), ψ(G^y) | x ↮ y)∥_TV
= Σ_{(φ(x),ψ(G^y))∈B} { m(x ↔ y) p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y)
  + (1 − m(x ↔ y)) p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↔ y)
  − m(x ↮ y) p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↮ y)
  − (1 − m(x ↮ y)) p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) }
≥ Σ_{(φ(x),ψ(G^y))∈B} { min_{x↔y} m(x ↔ y) · p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y)
  − max_{x↮y} m(x ↮ y) p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↮ y)
  − p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) }    (26)
≥ min_{x↔y} m(x ↔ y) · Σ_{(φ(x),ψ(G^y))∈B} { p(φ(x), ψ(G^y) | G^x ↔ G^y, x ↔ y) − p(φ(x), ψ(G^y) | G^x ↮ G^y, x ↮ y) }
  + min_{x↔y} m(x ↔ y) − max_{x↮y} m(x ↮ y) − 1,    (27)

where the inequality (26) follows from 0 ≤ m(x ↔ y) ≤ 1 and 0 ≤ m(x ↮ y) ≤ 1. The proof is now complete.

REFERENCES
[1] X. Zhang, S. Wang, Z. Li, and S. Ma, "Landmark image retrieval by jointing feature refinement and multimodal classifier learning," IEEE Trans. Cybern., vol. 48, no. 6, pp. 1682–1695, Jun. 2017.
[2] J. Zhu, H. Zeng, J. Huang, S. Liao, Z. Lei, C. Cai, and L. Zheng, "Vehicle re-identification using quadruple directional deep learning features," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 1, pp. 410–420, Mar. 2019.
[3] S. Wang, X. Guo, Y. Tie, L. Qi, and L. Guan, "Deep local feature descriptor learning with dual hard batch construction," IEEE Trans. Image Process., vol. 29, pp. 9572–9583, Oct. 2020.
[4] D. Quan, S. Wang, Y. Li, B. Yang, H. Ning, J. Chanussot, H. Biao, and L. Jiao, "Multi-relation attention network for image patch matching," IEEE Trans. Image Process., vol. 30, pp. 7127–7142, Aug. 2021.
[5] S. Liao and A. C. Chung, "Nonrigid brain MR image registration using uniform spherical region descriptor," IEEE Trans. Image Process., vol. 21, no. 1, pp. 157–169, Jun. 2011.
[6] N. Engel et al., "Deeplocalization: Landmark-based self-localization with deep neural networks," in Proc. IEEE Intell. Transp. Syst. Conf., 2019, pp. 926–933.
[7] Y. Wang, Y. Qiu, P. Cheng, and X. Duan, "Robust loop closure detection integrating visual-spatial-semantic information via topological graphs and CNN features," J. Remote Sensing, vol. 12, no. 23, p. 3890, Oct. 2020.
[8] Z. Zhu, T. Oskiper, S. Samarasekera, R. Kumar, and H. S. Sawhney, "Ten-fold improvement in visual odometry using landmark matching," in Proc. IEEE Int. Conf. Comput. Vision, 2007, pp. 1–8.
[9] J. Vincent, M. Labbé, J. S. Lauzon, F. Grondin, P. M. Comtois-Rivet, and F. Michaud, "Dynamic object tracking and masking for visual SLAM," in Proc. IEEE Int. Conf. Intell. Robots Syst., 2020, pp. 4974–4979.
[10] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, "Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free," in Proc. Robot.: Sci. Syst., 2017, pp. 5702–5708.
[11] S. Hausler, S. Garg, M. Xu, M. Milford, and T. Fischer, "Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2021, pp. 14141–14152.
[12] F. Xue, X. Wu, S. Cai, and J. Wang, "Learning multi-view camera relocalization with graph neural networks," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2020, pp. 11372–11381.
[13] J. Zhang and S. Singh, "Visual-lidar odometry and mapping: Low-drift, robust, and fast," in Proc. IEEE Int. Conf. Robot. Automat., 2015, pp. 2174–2181.
[14] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Jan. 2004.
[15] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," J. Comput. Vision Image Understanding, vol. 110, no. 3, pp. 346–359, Jun. 2008.
[16] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2008, pp. 1–8.
[17] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vision, 2011, pp. 2564–2571.
[18] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Proc. IEEE Int. Conf. Comput. Vision, 2011, pp. 2548–2555.
[19] A. Alahi, R. Ortiz, and P. Vandergheynst, "Freak: Fast retina keypoint," in Proc. IEEE Int. Conf. Comput. Vision, 2012, pp. 510–517.
[20] Y. Tian, B. Fan, and F. Wu, "L2-net: Deep learning of discriminative patch descriptor in euclidean space," in Proc. IEEE Int. Conf. Comput. Vision, 2017, pp. 661–669.
[21] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proc. IEEE Int. Conf. Comput. Vision, 2015, pp. 4353–4361.
[22] V. Kumar BG, G. Carneiro, and I. Reid, "Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions," in Proc. IEEE Int. Conf. Comput. Vision, 2016, pp. 5385–5394.
[23] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, "Learning local feature descriptors with triplets and shallow convolutional neural networks," in Proc. British Mach. Vision Conf., 2016, pp. 1–11.
[24] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg, "MatchNet: Unifying feature and metric learning for patch-based matching," in Proc. IEEE Int. Conf. Comput. Vision, 2015, pp. 3279–3286.
[25] A. Subramaniam, P. Balasubramanian, and A. Mittal, "NCC-net: Normalized cross correlation based deep matcher with robustness to illumination variations," in Proc. IEEE Winter Conf. Appl. Comput. Vision, 2018, pp. 1944–1953.
[26] D. Quan, X. Liang, S. Wang, S. Wei, Y. Li, H. Ning, and L. Jiao, "AFD-Net: Aggregated feature difference learning for cross-spectral image patch matching," in Proc. IEEE Int. Conf. Comput. Vision, 2019, pp. 3017–3026.
[27] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas, "Working hard to know your neighbor's margins: Local descriptor learning loss," in Proc. Conf. Neural Inform. Process. Syst., 2017, pp. 1–10.
[28] S. Wang, Y. Li, X. Liang, D. Quan, B. Yang, S. Wei, and L. Jiao, "Better and faster: Exponential loss for image patch matching," in Proc. IEEE Int. Conf. Comput. Vision, 2019, pp. 4812–4821.
[29] T. Ng, V. Balntas, Y. Tian, and K. Mikolajczyk, "SOLAR: Second-order loss and attention for image retrieval," in Proc. Eur. Conf. Comput. Vision,
[41] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 4–24, Mar. 2020.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Int. Conf. Comput. Vision, 2012, pp. 3354–3361.
[43] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, "The Oxford Radar RobotCar Dataset: A radar extension to the Oxford RobotCar Dataset," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 6433–6438.
[44] X. Zhang, F. X. Yu, S. Kumar, and S. F. Chang, "Learning spread-out local feature descriptors," in Proc. IEEE Int. Conf. Comput. Vision, 2017, pp. 4595–4603.
[45] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, "SOSNet: Second order similarity regularization for local descriptor learning," in Proc. IEEE Int. Conf. Comput. Vision, 2019, pp. 11016–11025.
[46] H. Pan, Y. Chen, Z. He, F. Meng, and N. Fan, "TCDesc: Learning topology consistent descriptors for image matching," IEEE Trans. Circuits Syst. Video Technol., vol. 521, pp. 436–444, Aug. 2021.
[47] I. Melekhov, J. Kannala, and E. Rahtu, "Siamese network features for image matching," in Proc. IEEE Int. Conf. Pattern Recognit., 2016, pp. 378–383.
[48] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "LIFT: Learned invariant feature transform," in Proc. Eur. Conf. Comput. Vision, 2016, pp. 467–483.
[49] D. DeTone, T. Malisiewicz, and A. Rabinovich, "Superpoint: Self-supervised interest point detection and description," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit. Workshops, 2018, pp. 224–236.
[50] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, "LoFTR: Detector-free local feature matching with transformers," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2021, pp. 8922–8931.
[51] J. Brogan, A. Bharati, D. Moreira, A. Rocha, K. W. Bowyer, P. J. Flynn, and W. J. Scheirer, "Fast local spatial verification for feature-agnostic large-scale image retrieval," IEEE Trans. Image Process., vol. 30, pp. 6892–6905, 2021.
[52] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[53] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2007, pp. 1–8.
2020, pp. 253–270.
[54] Y. Avrithis and G. Tolias, “Hough pyramid matching: Speeded-up
[30] Y. Miao, Z. Lin, X. Ma, G. Ding, and J. Han, “Learning transformation-
geometry re-ranking for large scale image retrieval,” Int. J. Comput.
invariant local descriptors with low-coupling binary codes,” IEEE Trans.
Vision, vol. 107, pp. 1–19, 2014.
Image Process., vol. 30, pp. 7554–7566, Aug. 2021.
[55] X. Li, M. Larson, and A. Hanjalic, “Pairwise geometric matching for
[31] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and
large-scale object retrieval,” in Proc. IEEE Int. Conf. Comput. Vision
T. Sattler, “D2-net: A trainable CNN for joint description and detection of
Pattern Recognit., 2015, pp. 5153–5161.
local features,” in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit.,
2019, pp. 8092–8101. [56] Y. Wang, R. Zhao, L. Liang, X. Zheng, Y. Cen, and S. Kan, “Block-
based image matching for image retrieval,” J. Vis. Commun. Image
[32] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang,
Representation, vol. 74, p. 102998, 2021.
and L. Quan, “ASLFeat: Learning local features of accurate shape and
localization,” in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., [57] B. Jiang, P. Sun, and B. Luo, “GLMNet: Graph learning-matching
2020, pp. 6589–6598. convolutional networks for feature matching,” Pattern Recognit., vol.
[33] P. E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperGlue: 121, p. 108167, 2022.
Learning feature matching with graph neural networks,” in Proc. IEEE [58] H. Liu, T. Wang, Y. Li, C. Lang, Y. Jin, and H. Ling, “Joint graph
Int. Conf. Comput. Vision Pattern Recognit., 2020, pp. 4938–4947. learning and matching for semantic feature correspondence,” Pattern
[34] M. Amiri and H. R. Rabiee, “RASIM: A novel rotation and scale invariant Recognit., vol. 134, p. 109059, 2023.
matching of local image interest points,” IEEE Trans. Image Process., [59] S. Winder, G. Hua, and M. Brown, “Picking the best daisy,” in Proc.
vol. 20, no. 12, pp. 3580–3591, May 2011. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2009, pp. 178–185.
[35] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. [60] H. Aanæs, A. L. Dahl, and K. Steenstrup Pedersen, “Interesting interest
Hjelm, “Deep graph infomax,” in Proc. Int. Conf. Learn. Representations, points,” Int. J. Comput. Vision, vol. 97, no. 1, pp. 18–35, Jun. 2012.
2018, pp. 1–13. [61] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time
[36] F. Y. Sun, J. Hoffman, V. Verma, and J. Tang, “InfoGraph: Unsupervised object detection with region proposal networks,” IEEE Trans. Pattern
and semi-supervised graph-level representation learning via mutual Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2016.
information maximization,” in Proc. Int. Conf. Learn. Representations, [62] H. Zhou, D. Greenwood, and S. Taylor, “Self-supervised monocular depth
2020, pp. 1–13. estimation with internal feature fusion,” arXiv preprint arXiv:2110.09482,
[37] J. L. Schonberger and J. M. Frahm, “Structure-from-motion revisited,” 2021.
in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2016, pp. [63] X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and Y. Yuan,
4104–4113. “HR-Depth: High resolution self-supervised monocular depth estimation,”
[38] W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen, arXiv preprint arXiv:2012.07356, 2020.
“Learning to recover 3D scene shape from a single image,” in Proc. IEEE [64] J. Yan, H. Zhao, P. Bu, and Y. Jin, “Channel-wise attention-based network
Int. Conf. Comput. Vision Pattern Recognit., 2021, pp. 204–213. for self-supervised monocular depth estimation,” in Proc. Int. Conf. 3D
[39] H. Farid and E. P. Simoncelli, “A differential optical range camera,” in Vision, 2021, pp. 464–473.
Proc. Annu. Meeting Optical Soc. Amer., 1996, pp. 1–10. [65] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and IEEE Int. Conf. Comput. Vision, 2017, pp. 2961–2969.
Y. Bengio, “Graph attention networks,” in Proc. Int. Conf. Learn. [66] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
Representations, 2018, pp. 1–12. recognition,” in Proc. IEEE Int. Conf. Comput. Vision, 2016, pp. 770–778.
16
SUPPLEMENTARY MATERIAL
A. Landmark Datasets for Image Patch Matching
In this section, we present two landmark patch-matching datasets,5 named the Landmark KITTI Dataset and the Landmark Oxford Dataset, derived from the street-scene KITTI dataset and the Oxford Radar RobotCar Dataset, respectively. We first briefly introduce the two original public datasets, both of which contain image frames and LiDAR scans captured from onboard cameras and Velodyne LiDAR sensors. The KITTI dataset is a public dataset6 with multi-sensor data for autonomous driving. It contains street-scene image frames and their corresponding LiDAR point clouds, captured in Karlsruhe, Germany, using the Point Grey Flea 2 (FL2-14S3C-C) camera and the Velodyne HDL-64E laser scanner, respectively. The frame resolution is 1241 × 376 pixels. The Oxford Radar RobotCar dataset7 contains image frames and LiDAR scans captured on the streets of Oxford, UK, by the Point Grey Grasshopper2 (GS2-FW-14S5C-C) camera and the Velodyne HDL-32E laser scanner, respectively. The resolution of each frame in this dataset is 1280 × 960 pixels.

5 https://github.com/AI-IT-AVs/Landmark_patch_datasets
6 http://www.cvlibs.net/datasets/kitti/
7 http://ori.ox.ac.uk/datasets/radar-robotcar-dataset

Fig. 13. Several examples of ground truth landmark bounding box labels based on semantic segmentation masks in the KITTI dataset. Left: semantic segmentation images with bounding box labels. Right: real images with bounding box labels.
We extract the landmark object patches from the full-sized image frames of the two original street-scene datasets using an object detection neural network. In the literature on landmark-based applications, Edge Boxes has been used to detect bounding boxes around patches containing many internal contours relative to the number of contours exiting the box, which indicates the presence of an object in the enclosed patch, and DeepLabV3+ has been used to extract significant landmark regions. However, these patch extraction and landmark detection approaches are not stable: dynamic objects are not removed reliably and many noisy regions remain. By contrast, in our datasets we use Faster R-CNN as a stable landmark object detector to locate regions of interest for static roadside objects, including traffic lights, traffic signs, poles, and facade windows. To train the detector, we manually labeled these objects in frames from the street-scene KITTI dataset and the Oxford Radar RobotCar dataset. For Faster R-CNN, we choose ResNet50 with a Feature Pyramid Network (FPN) as the backbone, pretrained on the ImageNet dataset. During training, we use the Adam optimizer with a learning rate of 0.0002 and a weight decay of 0.0001 to train the detector for 50 epochs. The training batch size is set to 2, and random horizontal flipping is used for data augmentation.

Fig. 14. Several examples of the ground truth landmark bounding box labels for the Oxford Radar RobotCar dataset.
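For concreteness, the following is a minimal sketch of a detector setup matching the description above. It assumes the torchvision (>= 0.13) implementation of Faster R-CNN and a hypothetical dataset object that yields (image, target) pairs in the standard torchvision detection format; it is not the released training code.

import torch
from torch.utils.data import DataLoader
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import fasterrcnn_resnet50_fpn

NUM_CLASSES = 5  # background + traffic light, traffic sign, pole, facade window

# Faster R-CNN with a ResNet50-FPN backbone pretrained on ImageNet, as described above;
# the detection head is trained from the manually labeled frames.
model = fasterrcnn_resnet50_fpn(
    weights=None,
    weights_backbone=ResNet50_Weights.IMAGENET1K_V1,
    num_classes=NUM_CLASSES,
)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-4)

def train_detector(model, dataset, device, epochs=50):
    # Batch size 2; random horizontal flipping is assumed to be applied
    # inside the dataset's transform pipeline.
    loader = DataLoader(dataset, batch_size=2, shuffle=True,
                        collate_fn=lambda batch: tuple(zip(*batch)))
    model.to(device).train()
    for _ in range(epochs):
        for images, targets in loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)  # RPN and ROI-head losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()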
We next introduce our landmark patch-matching datasets. For both the Landmark KITTI dataset and the Landmark Oxford dataset, the full-sized image frames are captured by stereo cameras, and we only use the left frames to extract landmark patches. The details, such as the landmark object bounding box labels and the patch-matching ground truth, are described separately for each dataset as follows.

Landmark KITTI Dataset. The segmentation labels are semantic segmentation masks. To perform landmark object detection, we first convert the semantic segmentation labels to object bounding box labels. We use skimage.measure.label to find connected regions for the pixel classes of interest, including traffic lights, traffic signs, and poles. See Fig. 13 for an example. In some rare cases, multiple poles overlap and the connected-region algorithm outputs an inaccurate bounding box; we manually exclude these overlapped objects from the generated bounding box labels. As mentioned above, Faster R-CNN trained on these labels is used to produce object detection results for all the other, unlabeled frames in the dataset.
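As an illustration, the following sketch shows one way to perform this mask-to-box conversion with scikit-image. The class IDs and the minimum-area threshold are placeholders for illustration, not the values used to build our datasets.

import numpy as np
from skimage import measure

# Hypothetical label IDs for the landmark classes (placeholders).
LANDMARK_CLASSES = {17: "pole", 19: "traffic light", 20: "traffic sign"}

def masks_to_boxes(seg_mask, min_area=50):
    """Convert an (H, W) integer semantic mask into per-class bounding boxes."""
    boxes = []
    for class_id, name in LANDMARK_CLASSES.items():
        labeled = measure.label(seg_mask == class_id, connectivity=2)  # connected regions of this class
        for region in measure.regionprops(labeled):
            if region.area < min_area:  # discard tiny, noisy regions
                continue
            y_min, x_min, y_max, x_max = region.bbox  # (min_row, min_col, max_row, max_col)
            boxes.append((name, (x_min, y_min, x_max, y_max)))
    return boxes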
We project the surrounding LiDAR points onto the image frame plane using the intrinsic camera matrix and the extrinsic camera-LiDAR calibration.
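A minimal sketch of this projection step, assuming a 3x3 intrinsic matrix K and a 4x4 homogeneous extrinsic transform T_cam_lidar from the LiDAR frame to the camera frame (our notation, not the exact format of the datasets' calibration files):

import numpy as np

def project_lidar_to_image(points_lidar, K, T_cam_lidar, image_size):
    """points_lidar: (N, 3) LiDAR points; returns (M, 2) pixel coordinates inside the image."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])  # homogeneous coordinates, (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]          # transform into the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                # keep points in front of the camera
    pix = (K @ pts_cam.T).T
    pix = pix[:, :2] / pix[:, 2:3]                      # perspective division
    w, h = image_size                                   # e.g., (1241, 376) for KITTI frames
    inside = (pix[:, 0] >= 0) & (pix[:, 0] < w) & (pix[:, 1] >= 0) & (pix[:, 1] < h)
    return pix[inside]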
TABLE XII
NEURAL NETWORK MODELS AND THE PARAMETERS IN THE IMAGE PATCH MATCHING FRAMEWORK.