
MegaLoc: One Retrieval to Place Them All

Gabriele Berton          Carlo Masone
Polytechnic of Turin     Polytechnic of Turin
[email protected]

arXiv:2502.17237v1 [cs.CV] 24 Feb 2025

Abstract

Retrieving images from the same location as a given query is an important component of multiple computer vision tasks, like Visual Place Recognition, Landmark Retrieval, Visual Localization, 3D reconstruction, and SLAM. However, existing solutions are built to work specifically for one of these tasks, and are known to fail when the requirements slightly change or when they meet out-of-distribution data. In this paper we combine a variety of existing methods, training techniques, and datasets to train a retrieval model, called MegaLoc, that is performant on multiple tasks. We find that MegaLoc (1) achieves state of the art on a large number of Visual Place Recognition datasets, (2) obtains impressive results on common Landmark Retrieval datasets, and (3) sets a new state of the art for Visual Localization on the LaMAR datasets, where we only changed the retrieval method in the existing localization pipeline. The code for MegaLoc is available at https://github.com/gmberton/MegaLoc

Figure 1. Qualitative examples of predictions by MegaLoc. Each pair of images represents a query and its top-1 prediction from the SF-XL dataset, searched across the 2.8M-image database spanning 150 km² of San Francisco. Predictions in green are correct, red are wrong.

1. Introduction
This paper tackles the task of retrieving images from a large database that represent the same place as a given query image. But what does it mean for two images to be "from the same place"? Depending on who you ask, you'll get different answers:
1. Landmark Retrieval (LR) folks will tell you that two photos are from the same place if they depict the same landmark, regardless of how close to each other the two photos were taken [40];
2. Visual Place Recognition (VPR) people set a camera pose distance of 25 meters to define if two images are positives (i.e. from the same place) [4];
3. Visual Localization (VL) / 3D Vision researchers will tell you that two images need to have their poses as close as possible to be considered the same place.
Even though image retrieval is a core component in all three tasks, their different definitions and requirements have inevitably led to the development of ad-hoc image retrieval solutions for each of them. As these three tasks continued to diverge, over the years papers have avoided showing results of their methods on more than one of these tasks: VPR papers don't show results on LR, and LR papers don't show results on VPR. In the meantime, 3D vision pipelines like COLMAP [30], Hierarchical Localization [28] and GLOMAP [22] keep using outdated retrieval methods, like RootSIFT with bag-of-words [3, 10, 32] and NetVLAD [4]. In this paper we aim to put an end to this, by training a single model that achieves SOTA (or almost) on all of these tasks, showcasing robustness across diverse domains. To train this model we do not propose any "technical novelty"; instead, we use the lessons learned from all three tasks, putting together a combination of good samplers, datasets, and general training techniques.

"Why does it matter?", you may ask. Imagine you are doing 3D reconstruction, where image retrieval is a fundamental component, on a collection of diverse scenes (e.g. to create datasets like MegaDepth [18], MegaScenes [37], or for the evergreen Image Matching Challenge [6]). In some cases there would be small scenes (e.g. the reconstruction of a fountain), requiring a retrieval model that is able to retrieve nearby images (a few meters away), which is something VPR models excel at, but LR models underperform on (see [8] Tab. 14). In other cases however, the scene might be large (e.g. a big landmark like a church), with images hundreds of meters away: while LR models are designed for this, VPR models achieve poor results in these situations (see Sec. 3). Given these considerations, we note how neither VPR nor LR provide models for the diverse cases of 3D reconstruction, creating a gap in the literature that is filled by MegaLoc.
As another example where a model like MegaLoc is necessary, one can think of Visual Place Recognition (which is also the first step for Visual Localization), where models are evaluated using a 25 meters threshold (and queries in popular datasets always have at least one positive within 25 meters). However, in the real world the nearest image to a given query might be 100 meters away, and while ideally we would still want to retrieve it, a VPR model is unlikely to work in such a case, as it has been trained to ignore anything further away from the camera.
In this paper we demonstrate that, by leveraging a diverse set of data sources and best practices from LR, VPR and VL, we obtain a single image retrieval model that works well across all these tasks. Our model is called MegaLoc and it is released at https://github.com/gmberton/MegaLoc

2. Method

The core idea of this paper is to fuse data from multiple datasets and train a single model. We use five datasets containing both outdoor and indoor images and catering to different image localization tasks: GSV-Cities [1], Mapillary Street-Level Sequences (MSLS) [39], MegaScenes [37], ScanNet [13] and San Francisco eXtra Large (SF-XL) [7]. At each training iteration, we extract six sub-batches of data, one for each dataset (except SF-XL, from which two sub-batches are sampled) and use a multi-similarity loss [38] computed over each sub-batch. Each sub-batch is made of 128 images, i.e. quadruplets of 4 images from each of 32 different places/classes. Given that these datasets have diverse formats, they require different sampling techniques. In the following paragraphs we explain how data is sampled from each dataset; a minimal sketch of how each sub-batch feeds the loss is shown right below.
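As a concrete illustration, here is a minimal sketch of how a single 32-places × 4-images sub-batch could be scored, assuming the MultiSimilarityLoss implementation from the pytorch-metric-learning package (the paper cites the loss [38] but not a specific implementation; the function and variable names below are ours, not the release code):

```python
# Sketch (not the authors' code): scoring one sub-batch of 128 images
# (32 places x 4 images) with a multi-similarity loss [38].
import torch
from pytorch_metric_learning.losses import MultiSimilarityLoss

loss_fn = MultiSimilarityLoss()

def sub_batch_loss(model: torch.nn.Module,
                   images: torch.Tensor,       # (128, 3, 224, 224)
                   place_ids: torch.Tensor):   # (128,), 32 unique class IDs
    descriptors = model(images)                # (128, descriptor_dim)
    # Images sharing a place_id are pulled together, others pushed apart.
    return loss_fn(descriptors, place_ids)
```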
San Francisco eXtra Large (SF-XL) is a dataset of 41M images with GPS and orientation from 12 different years, densely covering the entire city of San Francisco across time. To select ideal quadruplets for training, we use the sampling technique presented in EigenPlaces [9]. This method ensures that each class contains images that represent a given place from diverse perspectives, while ensuring that no visual overlap exists between two different places. EigenPlaces provides two sub-batches, one made of frontal-facing images (i.e. with the camera facing straight along the street) and one of lateral-facing images.

Google Street View Cities (GSV-Cities) is a dataset of 530k images split into 62k places/classes from 40 cities, where each class contains at least 4 images with the same orientation and is at least 100 meters from any other class. Given that GSV-Cities is already split into non-overlapping classes, it is not strictly necessary to apply a particular sampling technique. We therefore directly feed the GSV-Cities dataset to the multi-similarity loss, as in the original GSV-Cities paper [1].

Mapillary Street-Level Sequences (MSLS) is a dataset of 1.6M images split in contiguous sequences, across 30 different cities over 9 years. To ideally sample data from the MSLS dataset, we use the mining technique described in the CliqueMining paper [33]. This method ensures that the places selected for each batch depict visually similar (but geographically different) places (i.e. hard negatives), so that the loss can be as high as possible and effectively teach the model to disambiguate between similar-looking places.

MegaScenes is a collection of 100k 3D structure-from-motion reconstructions, composed of 2M images from Wikimedia Commons. Simply using each reconstruction as a class, and sampling random images from such a class, could lead to images that do not have any visual overlap; e.g. two images could show opposite facades of a building, therefore having no visual overlap while belonging to the same 3D reconstruction. Therefore we make sure that when we sample a set of four images from a given reconstruction, each of these four images has visual overlap with each of the others (we define visual overlap as having at least 1% of 3D points in common in the 3D reconstruction); a sketch of this check is shown below.
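A covisibility check along these lines could look like the following sketch, assuming each image has been mapped to the set of 3D point IDs it observes in the reconstruction (e.g. parsed from a COLMAP model). The paper only states the 1% threshold; measuring it against the smaller image's point set is our assumption:

```python
# Sketch of the 1%-covisibility rule (our reading of the text, not the
# authors' code). Each element is the set of 3D point IDs seen by one image.
from itertools import combinations

def has_visual_overlap(points_a: set, points_b: set, min_ratio: float = 0.01) -> bool:
    smaller = min(len(points_a), len(points_b))
    return smaller > 0 and len(points_a & points_b) / smaller >= min_ratio

def is_valid_quadruplet(per_image_points: list) -> bool:
    # Every pair within the 4 sampled images must share >= 1% of 3D points.
    return all(has_visual_overlap(a, b)
               for a, b in combinations(per_image_points, 2))
```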
within a quadruplet has visual overlap (i.e. less than 10 me-
niques. In the following paragraphs we explain how data is
ters and 30° apart); simultaneously we ensure that no two
sampled from each dataset.
images from different quadruplets has visual overlap.
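The "less than 10 meters and 30° apart" rule can be sketched as follows, assuming camera-to-world poses given as a 3×3 rotation matrix and a 3D position (the paper gives only the thresholds; the exact formulation is our assumption):

```python
# Sketch (not the authors' code) of the pose-based visual-overlap test
# used on ScanNet: cameras must be < 10 m and < 30 degrees apart.
import numpy as np

def poses_overlap(R1: np.ndarray, t1: np.ndarray,
                  R2: np.ndarray, t2: np.ndarray,
                  max_dist_m: float = 10.0, max_angle_deg: float = 30.0) -> bool:
    if np.linalg.norm(t1 - t2) >= max_dist_m:
        return False
    # Angle of the relative rotation, from the trace of R1^T R2.
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0
    angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return angle_deg < max_angle_deg
```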

3. Experiments

3.1. Implementation details

Method             Desc.   Baidu [34]   Eynsham [8, 12]   MSLS val [39]   Pitts250k [4, 14]   Pitts30k [4, 14]   SF-XL v1 [7]   SF-XL v2 [7]   SF-XL night [5]   SF-XL occlusion [5]   Tokyo 24/7 [36]
                   Dim.    R1    R10    R1    R10         R1    R10       R1    R10           R1    R10          R1    R10      R1    R10      R1    R10         R1    R10             R1    R10
NetVLAD [4] 4096 69.0 95.0 77.7 90.5 54.5 70.4 85.9 95.0 85.0 94.4 40.1 57.7 76.9 91.1 6.7 14.2 9.2 22.4 69.8 82.9
AP-GeM [27] 2048 59.8 90.8 68.3 84.0 56.0 72.9 80.0 93.5 80.7 94.1 37.9 54.1 66.4 84.6 7.5 16.7 5.3 14.5 57.5 77.5
CosPlace [7] 2048 52.0 80.4 90.0 94.9 85.0 92.6 92.3 98.4 90.9 96.7 76.6 85.5 88.8 96.8 23.6 32.8 30.3 44.7 87.3 95.6
MixVPR [2] 4096 71.9 94.7 89.6 94.4 83.2 91.9 94.3 98.9 91.6 96.4 72.5 80.9 88.6 95.0 19.5 30.5 30.3 38.2 87.0 94.0
EigenPlaces [9] 2048 69.1 91.9 90.7 95.4 85.9 93.1 94.1 98.7 92.5 97.6 84.0 90.7 90.8 96.7 23.6 34.5 32.9 52.6 93.0 97.5
AnyLoc [17] 49152 75.6 95.2 85.0 94.1 58.7 74.5 89.4 98.0 86.3 96.7 - - - - - - - - 87.6 97.5
Salad [16] 8448 72.7 93.6 91.6 95.9 88.2 95.0 95.0 99.2 92.3 97.4 88.7 94.4 94.6 98.2 46.1 62.4 50.0 68.4 94.6 98.1
CricaVPR [20] 10752 65.6 93.2 88.0 94.3 76.7 87.2 92.6 98.3 90.0 96.7 62.6 78.9 86.3 96.0 25.8 40.6 27.6 47.4 82.9 93.7
CliqueMining [33] 8448 72.9 92.7 91.9 96.2 91.6 95.9 95.3 99.2 92.6 97.8 85.5 92.6 94.5 98.3 46.1 60.9 44.7 64.5 96.8 97.8
MegaLoc (Ours) 8448 87.7 98.0 92.6 96.8 91.0 95.8 96.4 99.3 94.1 98.2 95.3 98.0 94.8 98.5 52.8 73.8 51.3 75.0 96.5 99.4

Table 1. Recall@1 and Recall@10 on multiple VPR datasets. Best overall results on each dataset are in bold, second best results
underlined. Results marked with a “-” did not fit in 480GB of RAM (2.8M features of 49k dimensions require 560GB for a float32-based
kNN).

Method                       CAB (Phone)         HGE (Phone)         LIN (Phone)         CAB (HoloLens)      HGE (HoloLens)      LIN (HoloLens)
                             (1°,0.1m) (5°,1m)   (1°,0.1m) (5°,1m)   (1°,0.1m) (5°,1m)   (1°,0.1m) (5°,1m)   (1°,0.1m) (5°,1m)   (1°,0.1m) (5°,1m)
NetVLAD 43.4 54.0 54.8 80.0 74.4 87.8 63.1 81.4 57.9 71.6 76.1 83.0
AP-GeM 39.4 52.0 58.0 81.3 69.1 82.0 62.9 82.5 65.6 76.6 80.7 91.1
Fusion (NetVLAD+AP-GeM) 41.4 53.8 56.3 82.4 76.0 89.4 63.2 83.1 63.1 75.1 78.5 87.0
CosPlace 29.0 37.4 54.4 81.3 63.3 75.7 56.4 77.8 55.6 69.8 80.6 91.4
MixVPR 40.9 50.8 59.2 83.8 77.5 89.8 65.2 84.7 63.3 74.7 83.6 92.2
EigenPlaces 32.3 44.7 56.3 81.3 70.2 82.6 63.9 81.8 60.2 72.5 84.8 93.1
AnyLoc 48.0 59.8 58.8 83.0 77.2 92.4 69.7 88.5 70.1 81.0 81.4 90.4
Salad 44.2 55.6 65.3 92.2 81.7 94.0 71.5 90.7 75.3 85.2 91.3 99.4
CricaVPR 40.4 52.0 63.7 89.3 80.7 93.1 73.9 90.7 72.5 81.6 89.1 98.4
CliqueMining 44.2 55.6 66.0 91.4 80.5 93.1 74.2 90.9 77.3 86.3 92.0 98.8
MegaLoc (Ours) 47.0 60.4 67.2 92.9 83.3 94.9 77.4 93.4 72.9 83.5 92.2 99.0

Table 2. Results on LaMAR's datasets, computed on each of the three locations, for both types of queries (HoloLens and Phone), which include both indoor and outdoor imagery. For each location we report the recall at (1°, 10cm) and (5°, 1m), following the LaMAR paper [29].

During training, images are resized to 224×224, while for inference we resize them to 322×322, following [16]. We use RandAugment [11] for data augmentation, as in [1], and AdamW [19] as the optimizer. Training is performed for 40k iterations. The loss is simply computed as L = L1 + L2 + L3 + L4 + L5 + L6, where each Ln is the multi-similarity loss computed on one of the six sub-batches.
The architecture consists of a DINO-v2-base backbone [21] followed by a SALAD [16] aggregation layer, which has shown state-of-the-art performance over multiple VPR datasets [16, 33]. The SALAD layer is computed with 64 clusters, 256 channels per cluster, a global token of dimension 256, and an MLP dimension of 512. The SALAD layer is followed by a linear projection (from a dimension of 16640 down to 8448) and an L2 normalization.
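A rough sketch of this pipeline is below; the backbone is the official DINO-v2 torch.hub model, while `salad_layer` stands in for the open-source SALAD implementation [16] (not reproduced here, and the token handling it expects is simplified):

```python
# Sketch of the described architecture (illustrative, not the release code).
import torch
import torch.nn.functional as F

class MegaLocSketch(torch.nn.Module):
    def __init__(self, salad_layer: torch.nn.Module):
        super().__init__()
        # DINO-v2 ViT-B/14 backbone from the official hub entry point [21].
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        # SALAD aggregation [16]: 64 clusters x 256 channels + a 256-d global
        # token -> 64 * 256 + 256 = 16640 dimensions before projection.
        self.aggregator = salad_layer
        self.projection = torch.nn.Linear(16640, 8448)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone.forward_features(images)  # dict of DINO-v2 tokens
        desc = self.aggregator(feats)                   # (B, 16640)
        desc = self.projection(desc)                    # (B, 8448)
        return F.normalize(desc, p=2, dim=-1)           # final L2 normalization
```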

Memory-efficient GPU training is achieved using PyTorch [23], by ensuring that the computational graph for each loss stays in memory for as little time as possible. In practice (in the code), instead of accumulating the computational graph of each loss into a single giant graph, we compute each loss and perform the backward() operation independently: calling backward() in PyTorch not only computes the gradients (which are added to any existing gradients), but also frees the computational graph (hence freeing memory). The step() (and zero_grad()) methods are then called only once, after the six backward() calls. This simple technique reduces the VRAM requirement of training MegaLoc from (roughly) 300GB to 60GB.
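In code, the trick could look like the following sketch (illustrative names, not the release code): gradients from the six losses accumulate into the same .grad buffers, so a single step() is equivalent to optimizing L = L1 + ... + L6, while each graph is freed as soon as its backward() returns.

```python
# Sketch of the per-loss backward() trick described above.
import torch

def train_iteration(model, optimizer, loss_fn, sub_batches):
    """sub_batches: iterable of six (images, place_ids) pairs."""
    optimizer.zero_grad()
    for images, place_ids in sub_batches:
        loss = loss_fn(model(images), place_ids)
        # Accumulates gradients AND frees this loss's computational graph,
        # so at most one sub-batch graph is alive at any time.
        loss.backward()
    optimizer.step()  # single update with the summed gradients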

3.2. Results

We perform experiments on three different types of tasks:
• Visual Place Recognition, where the task is to retrieve images that are within 25 meters from the query (Sec. 3.2.1);
• Visual Localization, where retrieval is part of a bigger pipeline that aims at finding the precise pose of the query given a set of posed images (Sec. 3.2.2);
• Landmark Retrieval, i.e. retrieving images that depict the same landmark as the query (Sec. 3.2.3).

3.2.1. Visual Place Recognition

We run experiments on a comprehensive set of Visual Place Recognition datasets. These datasets contain a large variety of domains, including: outdoor, indoor, street-view, hand-held camera, car-mounted camera, night, occlusions, long-term changes, and grayscale. Results are shown in Tab. 1 (the evaluation protocol is sketched below). While other high-performing VPR models (like SALAD and CliqueMining) achieve very good results (i.e. comparable to MegaLoc) on most datasets, MegaLoc vastly outperforms every other model on Baidu, which is an indoor-only dataset.
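For reference, the Recall@K protocol of Tab. 1 can be sketched as follows, assuming L2-normalized descriptors and UTM coordinates for queries and database (our illustrative implementation of the common VPR protocol, not the authors' evaluation code; a per-query loop also avoids materializing the huge similarity matrix mentioned in the Tab. 1 caption):

```python
# Sketch of Recall@K with the standard 25 m positive threshold.
import numpy as np

def recall_at_k(db_desc, q_desc, db_utm, q_utm, k=10, threshold_m=25.0):
    hits = 0
    for desc, utm in zip(q_desc, q_utm):
        # For L2-normalized descriptors, dot product == cosine similarity.
        top_k = np.argsort(-(db_desc @ desc))[:k]
        dists = np.linalg.norm(db_utm[top_k] - utm, axis=1)
        hits += bool((dists <= threshold_m).any())
    return hits / len(q_desc)
```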
Figure 2. Failure cases, grouped in 4 categories. Each of the 4 columns represents a category of failure cases: for each category we show 5 examples, each made of 3 images, namely the query and its top-2 predictions with MegaLoc, which can be in red or green depending on whether the prediction is correct (i.e. within 25 meters). The 4 categories that we identified are (1) very difficult cases, which are unlikely to be solved any time soon; (2) difficult cases, which can probably be solved by slightly better models than the current ones or by simple post-processing; (3) incorrect GPS labels, which, surprisingly, exist also in Mapillary and Google StreetView data; (4) predictions just outside the 25m threshold, which despite being considered negatives in VPR, are actually useful predictions for real-world applications.

Method           R-Oxford             R-Paris
                 E     M     H        E     M     H
NetVLAD          24.1  16.1   4.7     61.2  46.3  22.0
AP-GeM           49.6  37.6  19.3     82.5  69.5  45.5
CosPlace         32.1  23.4  10.3     57.6  45.0  22.3
MixVPR           38.2  28.4  10.8     61.9  48.3  25.0
EigenPlaces      29.4  22.9  11.8     60.9  47.3  23.6
AnyLoc           64.2  45.5  18.9     82.8  68.5  48.8
Salad            55.2  42.3  21.4     76.6  66.2  44.8
CricaVPR         57.0  39.2  15.3     80.0  68.9  48.9
CliqueMining     52.2  41.0  22.1     71.8  60.5  41.2
MegaLoc (Ours)   91.0  79.0  62.1     95.3  89.6  77.1

Table 3. Results on Landmark Retrieval datasets, respectively Revisited Oxford 5k [24, 26] and Revisited Paris 6k [25, 26].
3.2.2. Visual Localization

Image retrieval is a core tool to solve 3D vision tasks, in pipelines like visual localization (e.g. Hierarchical Localization [28] and InLoc [35]) and 3D reconstruction (e.g. COLMAP [30, 31] and GLOMAP [22]). To understand if our method can help this use case, we compute results on the three datasets of LaMAR [29], which comprise various challenges, including plenty of visual aliasing from both indoor and outdoor imagery. To do this, we relied on the official LaMAR codebase¹ by simply replacing the retrieval method. Results are reported in Tab. 2.

3.2.3. Landmark Retrieval

For the task of Landmark Retrieval we compute results on the most used datasets in the literature, namely (the revisited versions of [26]) Oxford5k [24] and Paris6k [25]. To do this we relied on the official codebase for the datasets², by simply swapping the retrieval method. Results, reported in Tab. 3, show a large gap between MegaLoc and previous VPR models on this task, which can be explained by the fact that previous models were only optimized for the standard VPR metric of retrieving images within 25 meters of the query.

3.2.4. Failure Cases

We identified 4 main categories of "failure cases" that prevent the results from reaching 100% recall, and we present them in Fig. 2. We note however that, from a practical perspective, the only real failure cases are those depicted in the second category/column of Fig. 2; furthermore, in most similar cases SOTA models (i.e. not only MegaLoc, but also other recent ones) can actually retrieve precise predictions, meaning that these failure cases can likely be solved by simple post-processing techniques (e.g. re-ranking with image matchers, or majority voting).

¹ https://github.com/microsoft/lamar-benchmark
² https://github.com/filipradenovic/revisitop

Finally, another failure case that we noted is when database images do not properly cover the search area: this is very common in the Mapillary (MSLS) dataset, where database images only show one direction (e.g. photos along a road taken from north to south), while the queries are photos facing the other direction. We note however that in the real world this can be easily solved by collecting database images in multiple directions, which is also common in most test datasets, like Eynsham, Pitts30k, Tokyo 24/7 and SF-XL.

4. Conclusion and limitations

So, is image retrieval for localization solved? Well, almost. While some datasets still show some room for improvement, we note that this is often due to arguably unsolvable failure cases, wrong labels, and a very few cases that can be solved by better models. We emphasize however that this has been the case for some time, as previous DINO-v2-based models, like SALAD and CliqueMining, show very high results on classic VPR datasets. What is still missing from the literature is models like MegaLoc that achieve good results on a variety of diverse tasks and domains.

Should you always use MegaLoc? Well, almost, except for at least 3 use-cases. MegaLoc has shown great results on a variety of related tasks, and, unlike other VPR models, achieves good results on landmark retrieval, which makes it a great option also for retrieval in 3D reconstruction tasks, besides standard VPR and visual localization tasks. However, experiments show that MegaLoc is outperformed by CliqueMining on MSLS, which is a dataset made (almost entirely) of forward-facing images (i.e. photos where the camera faces the same direction as the street, instead of facing sideways towards the side of the street). Another use case where MegaLoc is likely to be suboptimal is in very unusual natural environments, like forests or caves, where instead AnyLoc has been shown to work well [17]. A third and final use case where other models might be preferred to MegaLoc is for embedded systems, where one might opt for more lightweight models, like the ResNet-18 [15] versions of CosPlace [7], which have 11M parameters instead of MegaLoc's 228M.
References

[1] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. Gsv-cities: Toward appropriate supervised visual place recognition. Neurocomputing, 513:194–203, 2022.
[2] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. Mixvpr: Feature mixing for visual place recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2998–3007, 2023.
[3] R. Arandjelović and Andrew Zisserman. Three things everyone should know to improve object retrieval. pages 2911–2918, 2012.
[4] Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1437–1451, 2018.
[5] Giovanni Barbarani, Mohamad Mostafa, Hajali Bayramov, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo. Are local features all you need for cross-domain visual place recognition? In CVPRW, pages 6155–6165, 2023.
[6] Fabio Bellavia, Jiri Matas, Dmytro Mishkin, Luca Morelli, Fabio Remondino, Weiwei Sun, Amy Tabb, Eduard Trulls, Kwang Moo Yi, Sohier Dane, and Ashley Chow. Image matching challenge 2024 - hexathlon. https://kaggle.com/competitions/image-matching-challenge-2024, 2024. Kaggle.
[7] Gabriele Berton, Carlo Masone, and Barbara Caputo. Rethinking visual geo-localization for large-scale applications. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4868–4878, 2022.
[8] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark, 2023.
[9] Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone. Eigenplaces: Training viewpoint robust models for visual place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11080–11090, 2023.
[10] Gabriela Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In European Conference on Computer Vision, 2004.
[11] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. Randaugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, pages 18613–18624. Curran Associates, Inc., 2020.
[12] M. Cummins and P. Newman. Highly scalable appearance-only slam - FAB-MAP 2.0. In Robotics: Science and Systems, 2009.
[13] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[14] Petr Gronát, Guillaume Obozinski, Josef Sivic, and Tomáš Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 907–914, 2013.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
[17] Nikhil Keetha, Avneesh Mishra, Jay Karhade, Krishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. Anyloc: Towards universal visual place recognition. arXiv, 2023.

[18] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[19] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[20] Feng Lu, Xiangyuan Lan, Lijun Zhang, Dongmei Jiang, Yaowei Wang, and Chun Yuan. Cricavpr: Cross-image correlation-aware representation learning for visual place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[21] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023.
[22] Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision (ECCV), 2024.
[23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[24] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2007.
[25] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[26] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
[27] Jérôme Revaud, Jon Almazán, R. S. Rezende, and César Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5106–5115, 2019.
[28] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019.
[29] Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L. Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, and Marc Pollefeys. LaMAR: Benchmarking Localization and Mapping for Augmented Reality. In ECCV, 2022.
[30] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[31] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[32] Johannes L. Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Computer Vision – ACCV 2016, pages 321–337, Cham, 2017. Springer International Publishing.
[33] Sergio Izquierdo and Javier Civera. Close, but not there: Boosting geographic distance sensitivity in visual place recognition. In European Conference on Computer Vision (ECCV), 2024.
[34] Xun Sun, Yuanfan Xie, Peiwen Luo, and Liang Wang. A dataset for benchmarking image-based localization. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5641–5649, 2017.
[35] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[36] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):257–271, 2018.
[37] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In ECCV, 2024.
[38] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R. Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5022–5030, 2019.
[39] Frederik Warburg, Søren Hauberg, Manuel López-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2623–2632, 2020.
[40] Tobias Weyand, A. Araújo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2 – a large-scale benchmark for instance-level recognition and retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2572–2581, 2020.
