UAV Geo-Localization for Navigation: A Survey
ABSTRACT During flight, Unmanned Aerial Vehicles (UAVs) usually exploit onboard sensors to determine their position. The most useful and widely used is the Global Positioning System (GPS) or, more generally, any Global Navigation Satellite System (GNSS). Modern GPS receivers provide the device's location within a few meters, especially under good weather and an open sky. However, the lack of these optimal conditions strongly impacts accuracy. Moreover, in restricted areas or war zones, several anti-drone techniques are applied to limit UAV capabilities. Without proper countermeasures, UAVs cannot continue their task and sometimes cannot even return, since they are not aware of their position. In recent years, plenty of techniques have been developed to provide UAVs with knowledge of their location that is not strictly tied to the availability of the GPS sensor. This research field is commonly called Geo-Localization and can be considered one of the hot topics of UAV research. Moreover, research is going further, trying to provide UAVs with fully autonomous navigation systems that do not rely on hijackable sensors. This survey aims to provide a quick guide to the newest and most promising methodologies for UAV Geo-Localization for Navigation tasks, showing their differences and the related application fields.
which are not designed for military use, have an average accuracy of a few meters [26]. Although adverse weather and obstructed views of the sky can reduce this accuracy, the displacement rarely exceeds 30 meters in urban environments with many tall buildings [26]. However, GPS sensors can be compromised during flights due to various internal and external factors. In restricted zones or conflict areas, anti-drone tactics like GPS hijacking and spoofing are applied to disrupt UAV operations. Consequently, UAVs may be unable to complete their missions or return to base due to a loss of positional information. In this context, significant efforts have been made in recent years to develop techniques that enable UAVs to determine their location without relying on GPS. This research field, which has drawn significant interest recently, is commonly called Geo-Localization.

The main purpose of this review is to explore the main solutions for Geo-Localization, and to introduce useful concepts to properly understand their usage in Navigation tasks. This literature review provides an overview of the most relevant literature strategies, describing the main challenges and some possible solutions. The survey continues as follows: Section II provides a common background on Geo-Localization methods and introduces some concepts of Navigation tasks, with a strong focus on onboard UAV autonomous systems. In addition, this section details these two classes of algorithms, presents the main differences between them, and shows the main application fields. Section III presents the available datasets, benchmarks, and metrics usually used to evaluate Geo-Localization approaches. Section IV reports some highly selected state-of-the-art UAV-based Geo-Localization strategies. Section V details the newest and more promising UAV-based Guidance/Navigation proposals. Section VI summarizes the presented approaches and depicts some comparative tables. Section VII draws some conclusions and illustrates future research directions.

II. AUTONOMOUS UAV GUIDANCE: GEO-LOCALIZATION AND NAVIGATION

A. GEO-LOCALIZATION
The Geo-Localization task aims to provide the UAV with its location without using a GNSS. As already mentioned in Section I, the use of the GPS sensor is the most reliable and easy solution to the problem, but in some specific contexts, such as war fields, in which GPS frequencies can be shut down, or in environments with tall buildings/trees, which can hinder the GPS positioning, its reliability can drastically fail. Most approaches exploit the RGB camera to overcome this limitation, and some also involve Inertial Measurement Unit (IMU) sensors. In general, this task requires acquiring prior knowledge of the area of interest, which is used as ground truth. These acquisitions must also include geo-reference data for each image to create a reliable ground truth. For this purpose, the most common source is satellite images. They have considerable advantages: they are usually freely accessible, cover large areas, are of high quality, can be obtained without a real UAV, which is not always possible (e.g., in a war field), and are always geo-referenced. However, they are from a vertical view perspective, which usually differs from the oblique view of UAVs. Given that, orthomosaic transformations or similar procedures are often required to find the best alignment between them. As an alternative, the ground truth acquisition(s) can derive from a (some) previous flight(s) of the same UAV or comparable device. This is often the best solution, thanks to similar acquisition conditions and parameters, but this is not always possible in adversarial contexts (e.g., the already mentioned fields of war). Regardless of the kind of acquisition for the ground truth, the UAV's previous knowledge must at least include information regarding the area between its starting location and the target. This knowledge also usually includes a possible wide offset on both sides of the chosen path to grant the localization, even if some problems let the UAV off course. The most common acquisition sources for the UAVs in the Geo-Localization task are RGB cameras, even though some approaches can also exploit LiDAR or IR sensors. The Geo-Localization task can be formalized as an automatic understanding of the UAV location only using the ground truth and one or more images acquired in a given instant during the flight. This task could be graphically represented as in Figure 1.

FIGURE 1. Graphical representation of the Geo-Localization task. A generic UAV flies toward a target location and loses the GPS signal. By looking around, it finds a registered Point-of-Interest (PoI) that allows for understanding the actual position and correcting the flight plan.

B. NAVIGATION (GUIDANCE)
The Navigation task (also known as the Guidance task) includes all the techniques that support or allow for an autonomous flight of a UAV in the case of GPS absence or denial. More specifically, this task involves all the methods to recognize whether the UAV is off course or on the right path. Differently from the Geo-Localization task described in Section II-A, the Navigation task also includes the procedures allowing a UAV to maintain the route to reach the destination. Such procedures are usually assisted by Inertial Measurement Unit (IMU) sensors, which typically consist of accelerometers, gyroscopes, and magnetometers, providing real-time data such as UAV orientation, velocity, and acceleration. In this regard, two factors are worth noticing. The former is that a Navigation system can use all the information about the entire flight, all the data from other sensors (including the IMU), and all the past visited locations during the current flight. The latter is that a Navigation system performs the Geo-Localization task several times during the flight (according to the camera characteristics, the UAV speed, and the required accuracy). In this context, it is worth pointing out that more accurate localizations are easier to achieve due to the possible use of the previous flight information and the IMU data, even if the Navigation task could appear more complex than Geo-Localization. These two pieces of information are helpful because they drastically reduce the Search Area (see Figure 1) since it is possible to estimate the UAV position more effectively. In other words, Navigation can be thought of as a sequence of Geo-Localization tasks. The system uses the actual location of the UAV and its past Geo-Localizations to determine whether it is following the intended path. If the UAV
deviates from its intended course, the system takes corrective action to put it back on track. Figure 2 visually represents the Guidance/Navigation task.

FIGURE 2. Graphical representation of the Guidance/Navigation task. A generic UAV flies toward a target location. It looks around at fixed or variable time intervals to find at least one registered PoI. According to the view angle of the PoI, the navigation algorithm can understand whether the UAV is on route or if it is slightly off route but not enough to require intervention, but also when it is highly off route and intervene to correct the flight plan.

In the literature, approaches dealing with this task sometimes overlap with complete navigation system proposals, which are also influenced by other factors, especially the embedded equipment, the UAV dimensions, and other specific requirements. Moreover, strategies dealing with Geo-Localization can also be applied and included in a Navigation system. Such techniques can be facilitated by the presence of the previous parameters obtained along the current flight. For this reason, only a bunch of proposals facing this task have been reported in the corresponding Section V, to provide a general idea behind the strategies in this context. Here, Autonomous Navigation and Path Optimization are interconnected and critical components for achieving effective autonomous flight operations, especially in environments where GPS is unavailable or unreliable.

Autonomous Navigation (AN) [34] extends the concept of Navigation by incorporating the ability of the UAV to make independent decisions regarding its flight path without human input. AN systems use a combination of sensors, data, and algorithms to perceive the environment, assess operational parameters, and execute decisions. This task is more complex as it must adapt to dynamic conditions and execute real-time problem-solving strategies to manage the flight path.

Path Optimization (PO) [2] is a specific sub-task of AN that focuses on finding the most efficient or effective route between two points. However, the choice is not limited to the shortest path but also considers other factors such as energy efficiency, obstacle avoidance, safety margins, and compliance with flight regulations. PO algorithms process data from the ongoing UAV route and environmental sensors to dynamically adjust the flight plan. By optimizing the path, PO also minimizes risks and resource usage, which is crucial in scenarios where endurance, battery life, or mission duration are critical.

To summarize, Navigation provides the basic ‘‘where am I and where am I going’’ framework. AN enhances this framework with sophisticated decision-making capabilities that allow the UAV to operate independently. PO is a sub-task of AN refining the UAV's trajectory for optimum efficiency and safety. Together, these elements enable a UAV to perform complex missions in challenging environments where traditional navigation aids such as GPS might not be available. This integration is vital for advancing UAV technology towards fully autonomous operations where human intervention is minimized, and operational efficiency and safety are maximized.

III. METRICS AND BENCHMARKS
Most proposals dealing with Geo-Localization use common benchmarks to test their performance, allowing for a fair comparison among the different strategies. For this reason, this section describes in detail the most relevant freely available datasets for this task. For works testing their procedure with homemade private datasets, they will be described in conjunction with the proposal itself. It is also worth pointing out that most of these datasets are synthetic, i.e., the images are not acquired with a real drone in the analyzed zone(s), but they are synthesized from a satellite view by changing the orientation to simulate a coherent drone view. However, even if it does not seem to fit a real-life scenario, it is possible to notice that such images are like those really acquired by a UAV in the same zone, as shown in Figure 3.
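As a rough illustration of how such a synthetic view can be produced, the sketch below warps a nadir satellite patch with a perspective homography to mimic an oblique, drone-like viewpoint. The tilt model, the parameter values, and the use of OpenCV are illustrative assumptions and do not reproduce the exact generation procedure of any benchmark described below.

```python
# Minimal sketch: simulate an oblique "drone-like" view from a nadir satellite patch.
# The foreshortening model and parameter values are illustrative assumptions only.
import cv2
import numpy as np

def synthesize_oblique_view(sat_patch, tilt_deg=30.0, out_size=(512, 512)):
    h, w = sat_patch.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Shrink the far edge of the patch to approximate a camera tilted by tilt_deg.
    shrink = 0.5 * np.tan(np.radians(tilt_deg))
    dx = shrink * w / 2.0
    dst = np.float32([[dx, 0], [w - dx, 0], [w, h], [0, h]])
    H = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(sat_patch, H, (w, h))
    return cv2.resize(warped, out_size)

# Example (hypothetical file name): oblique = synthesize_oblique_view(cv2.imread("satellite_patch.png"))
```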
Another consideration is that most benchmark datasets deal with the Satellite vs. UAV scenario and vice versa, i.e., in a cross-view context. This is coherent with the usual usage of Geo-Localization procedures. However, it is also limiting since different modality scenarios (i.e., IR vs. RGB, LiDAR vs. RGB, and many others) are also worth considering, mainly due to the spread and the reduction in the cost of these kinds of acquisition sensors. To the best of our knowledge, only the GRAL dataset can be exploited to test strategies in this context.

A. METRICS
This section briefly describes the evaluation metrics presented in the state-of-the-art proposals. These metrics are well-known for evaluating effectiveness in numerous application areas [15]. Also, for the Geo-Localization task, they have been accepted as a standard by the scientific community [46]. However, they have undergone slight modifications, as explained in the following.

a: RECALL@K (R@K)
Let |D| be the dataset size, 1 ≤ K ≤ |D|, and L the ordered-by-score list of the retrieved images provided by the system. The R@K provides the percentage of tests in which the system is able to retrieve at least one correct match within the first K elements in L. Literature experiments usually set K to 1, 5, or 10.

b: RECALL@K% (R@K%)
Let |D| be the dataset size, 1 ≤ K ≤ |D|, and L the ordered-by-score list of the retrieved images provided by the system. The R@K% provides the percentage of tests in which the system is able to retrieve at least one correct match within the first K% elements in L. For example, if the system contains 1500 images used as previous knowledge and K = 1, the R@K% is the percentage of tests in which the system is able to retrieve at least one correct match within the first 15 elements in L. In the literature experiments, K is usually set to 1, and this metric is sometimes referred to as R@topK%.

c: AVERAGE PRECISION (AP)
It calculates the average Precision values across different instances. Precision is determined by a formula that measures the proportion of correctly identified positive instances, i.e., True Positives (TP), against all instances classified as positive, i.e., TPs and False Positives (FP), as for:

P = TP / (TP + FP)

d: PRECISION@K
It indicates the Precision achieved by a system considering a TP any time that at least one correct match is provided among the first K results.

e: ROOT MEAN SQUARE ERROR (RMSE)
It measures the differences between values predicted by a model or an estimator and the observed values. In the Geo-Localization context, it is used to evaluate, on average, how far the estimated locations are from the corresponding geo-referenced ground truth. It is usually expressed in meters.
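As a practical reference, the sketch below computes the metrics defined above from ranked retrieval lists. The data layout (per-query lists of gallery indices sorted by score and per-query sets of correct indices) is an assumption made only for illustration.

```python
# Minimal sketch of the evaluation metrics described above.
# `ranked`: per query, gallery indices sorted by descending score;
# `relevant`: per query, the set of correct gallery indices (assumed layout).
import numpy as np

def recall_at_k(ranked, relevant, k):
    hits = [len(set(r[:k]) & rel) > 0 for r, rel in zip(ranked, relevant)]
    return 100.0 * np.mean(hits)

def recall_at_k_percent(ranked, relevant, k_percent=1.0):
    # e.g., a 1500-image gallery with K = 1 means looking at the first 15 elements.
    k = max(1, int(len(ranked[0]) * k_percent / 100.0))
    return recall_at_k(ranked, relevant, k)

def average_precision(ranked_list, relevant_set):
    hits, precisions = 0, []
    for i, idx in enumerate(ranked_list, start=1):
        if idx in relevant_set:
            hits += 1
            precisions.append(hits / i)   # P = TP / (TP + FP) at this cut-off
    return float(np.mean(precisions)) if precisions else 0.0

def rmse_meters(pred_xy, gt_xy):
    # Root mean square error between estimated and ground-truth positions (meters).
    d = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=1)
    return float(np.sqrt(np.mean(d ** 2)))
```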
FIGURE 3. Comparison between images captured by real drones and synthesized from satellites. Such an example is taken
from the University-1652 dataset [49], but is common for most of the literature benchmarks.
To evaluate the performance on this dataset, the authors provide a protocol consisting of a training set containing 701 buildings of 33 Universities and a test set including 701 buildings of the remaining 39 Universities. The results are provided in terms of AP, R@1, R@5, R@10, and R@1%.

2) CVUSA DATASET
The CVUSA dataset [45] consists of 1588655 geo-tagged pairs of ground-level and aerial images. Ground-level geo-tagged pictures were gathered via Google Street View and Flickr.1 Google Street View images have been taken in randomly selected areas within the United States. For each location, the authors collect a panoramic image and two perspective images from viewpoints separated by 180° along the roadway. For Flickr, the researchers created a 100 × 100 grid over the United States and downloaded up to 150 photographs from each grid cell (from 2012 onwards, sorted by the Flickr ‘‘interesting’’ score). This binning phase ensures a more uniform sampling distribution because Flickr photographs are overrepresented in metropolitan areas. Indoor images have been filtered out using the procedure proposed in [44] to keep only outdoor ones. As a result, 551851 Flickr and 1036804 Street View images were collected this way. For each ground-level image's location, the authors synthesized an 800 × 800 aerial image centered on it from Bing Maps at various spatial scales (zoom scales 14, 16, and 18). After accounting for overlap, this approach yields 1588655 geo-tagged image-matched pairs and 879318 distinct aerial image locations. Figure 4 shows some examples of matched aerial and ground-level photos from the dataset.
One of the main advantages of this cross-view dataset is the highly differentiated locations across the United States, allowing for more generalized feature learning.

1 https://www.flickr.com/
2 https://github.com/m-hamza-mughal/aerial-template-matching-dataset

FIGURE 4. Examples of matched ground-level and aerial images from the CVUSA dataset [45].

3) CVACT DATASET
The CVACT dataset [24] is a densely sampled cross-view image Geo-Localization dataset that provides 35532 ground/satellite image pairs covering Canberra, Australia. The street-view panoramas were collected from Google Street View over 300 square miles (about 483 km²) at zoom level 2. The image resolution of the panoramas is 1664 × 832 with a 180-degree vertical Field of View (FoV). For each panorama, the matchable satellite image at the GPS position was obtained from Google Maps at a zoom factor of 20. The image resolution of satellite images is 1200 × 1200 after removing the Google watermark, whereas the ground resolution for satellite images is 0.12 meters per pixel. Some image pair examples are provided in Figure 5.
To evaluate the performance, the dataset provides a validation set with 8884 image pairs named CVACT_val and a testing set with 92802 image pairs named CVACT_test. For the former, each query image only has one matching image in the gallery, while, for the latter, a query image may correspond to several true matched images in the set. Moreover, results are presented at R@1, R@5, R@10, and R@1%. Compared to other datasets, the main advantage of CVACT is that each image is provided with accurate GPS tags. Hence, metric location accuracy can be evaluated more straightforwardly.

4) VO&HAYS DATASET
The Vo&Hays dataset [41] contains a set of street-view panorama and overhead images from Google Maps collected in the USA. The data acquisition begins by randomly selecting street-view panorama images from Google Maps. Then, each panorama is cropped several times to reach a fixed size. The relevant overhead image at the finest scale is later fetched from Google Maps. This produces aligned pairs of street-view and overhead images, including geo-tags and depth estimates. The procedure has been repeated in 11 cities and produced more than one million pairs of images.
In the literature, whenever this dataset is used, 900k cross-view image pairs from 8 cities are chosen to train the network, while the remaining 3 cities (around 70k images per city) are used as 3 sets for testing. To measure the ranking performance on this dataset, R@1% has been chosen as the main metric.

5) UTIAS DATASET
The UTIAS dataset [29] consists of six traversals along a fixed path of 1132 m over areas with roads and buildings, as well as large areas of grass and trees. Each traversal contains 8992 images that capture the specific lighting conditions at different times of day: sunrise, morning, noon, afternoon, evening, and sunset. The dataset includes all the UAV-collected images, the UAV pose, and the corresponding satellite image. Image registration's longitude, latitude, and heading were estimated by localizing images captured in Google Earth's satellite images. The UAV data has been acquired using a DJI Matrice 600 Pro multirotor with a 3-axis DJI Ronin-MX gimbal. Stereo images are produced at 10 FPS via a StereoLabs ZED camera. The vehicle poses for ground truth are provided via the RTK-GPS system and IMU.
Figure 7 shows an example of the different lighting conditions. Since the UAV flies at a 40 m altitude with a steady heading, the camera is directed at the nadir. There is an unknown offset between the RTK-GPS and the Google Earth frames. So, 10% of the successful image registrations are used to align them. These registrations are then omitted from all error calculations.
The test protocol suggests evaluating the results in terms of the RMSE metric over the different lighting conditions.

6) AERIAL TEMPLATE MATCHING DATASET
The Aerial Template Matching dataset [28] consists of three orthomosaics generated by photos collected by UAV, and it is freely downloadable on GitHub.2 The dataset contains data from 3 different areas with 3 different terrains, as shown in Figure 8. The first area, NUST Islamabad (Figure 8a), contains 1200 images and presents complex terrain patterns with
buildings and water bodies but without the density of an urban area. The second, DHA Rawalpindi (Figure 8b), comprises 480 images and presents a sparsely populated residential area with water bodies and greenery. The third, instead, Gujar Khan district (Figure 8c), is a densely populated urban area and contains 372 images. The images were collected by a real UAV, a DJI Phantom 4 Pro. The acquisitions have been performed during three different periods of the day to maximize variance in illumination conditions. Exploiting overlapping regions, the images corresponding to each area were stitched to form the three orthomosaics. To automatically localize the points, a maximum of 16 point-to-point correspondences are linked between each image and orthomosaic. Since the images had GPS coordinates, every pixel of the corresponding orthomosaic has the correct geo-tags.
Even if the dataset only contains data from three locations, the main strength is that the data comes from a real UAV. The authors recommend evaluating the dataset with matching accuracy, whose calculation is based on the proportion of correctly aligned image points, validated through a 90% overlap between predicted and labeled bounding boxes. They also suggest assessing GPS localization errors by measuring the distance between the predicted and the actual GPS positions.

FIGURE 5. Ground-to-Aerial image pairs in CVACT [24].

7) NJ DATASET
The NJ dataset [16] is a collection of images taken for the GPS-Denied UAV Localization task. The data is gathered from the imagery available on the United States Geological Survey Earth Explorer. The authors gather data in New Jersey, USA, from a geographical area of 5.9 × 7.5 km. Figure 9 shows some representative images from the dataset. The researchers chose the location since it presents a mix of urban, suburban, and rural content, capturing low and high textures. The imagery is gathered during 2006, 2008, 2010, 2013, 2015, and 2017, across spring, summer, and fall. There are ten large images, each of 7582 × 5946 pixels at a resolution of 1 meter per pixel. The authors suggest evaluating the dataset using several metrics, including Average Localization Error, which measures the distance error between the UAV's estimated position and its true position; Corner Error, which assesses the alignment accuracy of learned features by calculating the percent of image width error; and 2D Euclidean Error and Altitude Error, which quantify horizontal and vertical distance errors, respectively.

FIGURE 6. An example of images from [41]. Miami city panorama images (left). The corresponding produced street-view and overhead pairs (right).

FIGURE 7. An example image from each of the 6 lighting conditions in [29] and a corresponding image rendered from Google Earth is shown. The shadows in the Google Earth images closely resemble those in the morning lighting condition. The shadows in the afternoon and evening appear on the opposite side of objects compared to Google Earth ones.

8) GRAL DATASET
The GRAL dataset [27] contains over 550000 location-coupled pairs of ground RGB and depth images collected from aerial LIDAR point clouds. Although the primary purpose of this dataset is to evaluate cross-modal localization, it also allows for evaluating matching under challenging cross-view settings. Figure 10 shows some reference pictures in the dataset.
The data has been collected near Princeton (New Jersey, USA) in an area of 143 km² that exhibits various urban, suburban, and rural topographical features. The dataset contains multiple sceneries, including forests, mountains, open fields, highways, downtown areas, buildings, and roads. The data were collected in two phases to ensure that each ground RGB image is paired with a single distinct depth image from the aerial LIDAR. First, geolocalized ground RGB images of the selected area were created by densely sampling 60000 GPS locations from Google Street View. It is worth pointing out that Google Street View only allows for capturing street images. As a consequence, RGB images are unavailable for many places in the selected area. Each RGB image is 640 × 480,
with a horizontal field of view of 60° and a slope of 0°. In the second phase, a LIDAR scan of the site from USGS is used to create a Digital Elevation Model (DEM). From the DEM, location-linked LIDAR depth images are collected for each street view image. All LIDAR depth images contain RGB imagery from 1.7 m above the ground, and the final data collection includes 12 headings (from 0° to 360° at 30° intervals) for each location. A Digital Surface Model (DSM) is used to correct elevations and increase the quality. Depth images with no height correction, no corresponding RGB image, or with more than 60% black pixels were removed. Therefore, some spatial misalignment between RGB and LIDAR depth images is still possible due to inadequate instrument accuracy (GPS and IMU) and calibration issues.
To evaluate the results on this dataset, the authors suggest using 20% of the area as validation images, 10% as test images, and the rest for training. In total, the collected dataset contains 557627 site-coupled pairs with 417998 for training, 89787 for validation, and 49842 for testing.

IV. UAV GEO-LOCALIZATION STRATEGIES
This section presents some relevant literature proposals dealing with Geo-Localization.

A. GEOGRAPHIC SEMANTIC NETWORK FOR CROSS-VIEW IMAGE GEO-LOCALIZATION
The proposal in [52], an updated version of the proposal discussed in [37], presents a novel end-to-end network architecture named GeoNet to tackle the cross-view image Geo-Localization problem.
The proposed GeoNet architecture, shown in Figure 11, consists of a two-branch Siamese network taking a pair of cross-view images as input. Each network branch entails a ResNetX module and a GeoCaps module. The ResNetX module ensures stable gradient propagation in deep Convolutional Neural Networks (CNNs) and learns robust intermediate feature maps, which are then input to the GeoCaps module. This latter module consists of two layers of capsules: the PrimaryCaps and GeoCaps, in which the images are decomposed into vectors encapsulating information such as the existence probability of a specific scene and spatial hierarchy details such as color and position. The PrimaryCaps layer stacks multiple conventional convolutional layers into capsules and transmits the resulting output features to the GeoCaps layers through a dynamic routing algorithm.
The work proposes two versions of GeoNet. Specifically, GeoNet-I contains two network branches with different model weights, and GeoNet-II contains two capsule branches with the same model weights. These two networks have been evaluated with several public cross-view datasets, namely CVUSA, CVACT, and Vo&Hays, respectively described in Section III-B2, Section III-B3, and Section III-B4. GeoNet-II achieved better performance, with a R@1%, R@1, R@5, and R@10 respectively of 98.7%, 58.9%, 81.8%, and 88.3% on CVUSA. On CVACT and Vo&Hays, instead, GeoNet-II reached a R@1% respectively of 95.8% and 76.8%.
FIGURE 8. Geo-tagged orthomosaics of three different areas in [28]. (a) A geographical patch from NUST, Islamabad, covering about
0.52km2 area. (b) The area in DHA, Rawalpindi, consisting of sparsely populated terrain covering up to 0.64km2 area. (c) The densely
populated urban area of Gujar Khan District in Pakistan, which has an area of 0.66km2 .
FIGURE 9. Some examples from NJ Dataset [16]. The source image patches (top) and template images (bottom) are taken from separate large
orthographic satellite images. The dataset includes many images from both urban and low-texture rural areas.
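The GeoNet architecture described above, like the CVM-Net variants presented next, follows the same two-branch (Siamese-style) pattern: each view is embedded by its own branch and matching is performed in a joint descriptor space. The skeleton below is a hedged illustration of that shared pattern in PyTorch; the backbone layers, descriptor size, and pooling are placeholders rather than the exact components of any cited network.

```python
# Generic two-branch (Siamese-style) cross-view embedding skeleton.
# Backbone, descriptor size, and pooling are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(                  # stand-in for VGG16/ResNet features
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                    # stand-in for NetVLAD-style aggregation
        )
        self.fc = nn.Linear(128, dim)

    def forward(self, x):
        f = self.backbone(x).flatten(1)
        return F.normalize(self.fc(f), dim=1)           # unit-length global descriptor

class CrossViewNet(nn.Module):
    def __init__(self, dim=512, share_weights=False):
        super().__init__()
        self.ground = Branch(dim)
        self.aerial = self.ground if share_weights else Branch(dim)

    def forward(self, ground_img, aerial_img):
        return self.ground(ground_img), self.aerial(aerial_img)
```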
B. CROSS-VIEW MATCHING NETWORK FOR IMAGE-BASED GROUND-TO-AERIAL GEO-LOCALIZATION
The proposal in [20] tackles the Geo-Localization problem by introducing two variants of a Siamese-based network structure named CVM-Net (-I and -II), which extract local features from cross-view images to create global descriptors that are invariant to significant changes in viewpoint.
The proposed CVM-Net architectures, shown in Figure 12, consist of two network branches of the same architecture that are designed to receive ground-level and satellite images separately. In both architectures, each branch deals with local feature extraction and global descriptor generation parts. The local features are extracted from an input image using a VGG16 model [36] and then fed into a NetVLAD [4] layer to generate global descriptors by aggregating local features to their respective cluster centroids. After training, each centroid generated from the satellite view is linked to the unique centroid of the ground view for cross-view matching. As mentioned, the proposal offers two versions of CVM-Net: CVM-Net-I and CVM-Net-II. The former employs an independent NetVLAD layer for each network branch to generate the respective global descriptors of a satellite and ground image. In the latter, the extracted local features go through two Fully Connected (FC) layers, where the first has independent weights and the second has shared weights for both network branches. Then, these transformed features go through NetVLAD layers with shared weights to obtain the descriptors. Finally, the authors employ a weighted soft-margin ranking loss as the objective function to train both networks.
The authors evaluated both networks on the CVUSA dataset, described in Section III-B2, and attained R@1% scores of 91.4% and 87.2%, respectively.

FIGURE 10. Example pairs of ground RGB images and aerial LIDAR depth images from the GRAL Dataset [27]. RGB images are collected from Google Street View, and depth images are collected by rendering aerial LIDAR point clouds from USGS.

C. HYBRID PERSPECTIVE MAPPING: ALIGN METHOD FOR CROSS-VIEW IMAGE-BASED GEO-LOCALIZATION
The proposal in [42] presents a new method to address the problem of cross-view image-based Geo-Localization by developing a hybrid perspective mapping method that aligns a ground-level image with an aerial image by considering the projection relationship between them.
In the proposed approach, perspective mapping and polar transform are applied respectively to a ground-level panoramic image's covisibility and non-covisibility areas to obtain a
bird's eye view that matches a satellite image. In this context, the covisibility area in an image represents structures, such as streets and pavements, that appear concurrently at ground and satellite views. In contrast, the non-covisibility area contains vertical structures, such as building roofs, whose exteriors can be seen only in one of the cross-view images. The polar transform allows modifying the distance of pixels in the non-covisibility area to the bird's-eye image center. Hence, learning semantic relations is facilitated as the resemblance of the vertical structures in the bird's-eye image is maintained.
The proposal exploits a Siamese network model with a structure similar to CVM-Net [20], as shown in Figure 13. VGG-16 [36] extracts local features from an image, which are aggregated through a NetVLAD layer [4] to generate a global descriptor. The global descriptors of the cross-view image pairs then pass through a FC layer to be mapped in the same feature space. A weighted soft-margin triplet loss is utilized to compare the cross-view image pairs in this space.
The model has been evaluated on CVUSA and CVACT, described in Section III-B2 and Section III-B3. It achieved a R@1, R@5, R@10, and R@1%, respectively, of 34.87%, 55.81%, 67.29%, and 89.76% on CVUSA, and 2.06%, 5.31%, 8.49%, and 33.43% on CVACT.

D. SMDT: CROSS-VIEW GEO-LOCALIZATION WITH IMAGE ALIGNMENT AND TRANSFORMER
The proposal in [39] introduces a cross-view matching method for Geo-Localization composed of image alignment and a Transformer.
The SMDT framework comprises four modules: Semantic segmentation, Mixed perspective-polar mapping, Dual Conditional Generative Adversarial Network (CGAN), and Transformer. The segmentation technique, first proposed in [51], splits ground images' content into covisibility and non-covisibility areas based on classes (such as sky, tree, building, road, sidewalk, and car) and generates class-specific masks. Since aerial images do not have skies and cars for matching, the masks obtained are used to create augmented samples that do not consider the masks of these two classes. In the second module, a VGG-16 [36] model, consisting of 13 convolutional layers and a NetVLAD [4] with 64 clusters, is utilized to implement polar and perspective mapping on aerial images. The third module introduces a dual CGAN structure based on the proposal in [21]. Based on the segmented images, it synthesizes aerial images with a ground view style. The network architecture of the dual CGAN is shown in Figure 14. Finally, the fourth module adopts the
FIGURE 13. The Siamese architecture proposed in [42]. After the mapping
with the hybrid perspective method, the ground-level image is given as
input for the network. VGG16 extracts local features, and NetVLAD
aggregates global features. The two branches do not share any weight.
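Both [20] and [42] train such Siamese branches with a weighted soft-margin ranking/triplet loss that pushes a matched cross-view pair closer than a mismatched one; a common formulation is L = log(1 + exp(α·(d_pos − d_neg))). The sketch below assumes squared Euclidean distances on L2-normalized descriptors and a scaling factor α, which are typical choices rather than the exact hyper-parameters of the cited works.

```python
# Hedged sketch of a weighted soft-margin triplet (ranking) loss:
# L = log(1 + exp(alpha * (d_pos - d_neg))), averaged over the batch.
import torch
import torch.nn.functional as F

def weighted_soft_margin_triplet(anchor, positive, negative, alpha=10.0):
    # anchor/positive/negative: (B, D) L2-normalized descriptors.
    d_pos = ((anchor - positive) ** 2).sum(dim=1)   # distance to the matching view
    d_neg = ((anchor - negative) ** 2).sum(dim=1)   # distance to a non-matching view
    # softplus(x) == log(1 + exp(x)), numerically stable form of the soft margin.
    return F.softplus(alpha * (d_pos - d_neg)).mean()

# Example with the two-branch skeleton sketched earlier (names are illustrative):
# g, s_pos = model(ground_batch, matching_sat_batch)
# _, s_neg = model(ground_batch, random_sat_batch)
# loss = weighted_soft_margin_triplet(g, s_pos, s_neg)
```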
CVUSA and CVACT, where SAFA increases the R@1 from 89.84% to 90.16% on the former and from 81.03% to 82.40% on the latter. LPN also increases its performance, moving from 85.79% to 90.16% on CVUSA and from 79.99% to 82.02% on CVACT.

F. UNIVERSITY-1652: A MULTI-VIEW MULTI-SOURCE BENCHMARK FOR DRONE-BASED GEO-LOCALIZATION
The proposal in [49] presents the University-1652 Dataset, introduced in Section III-B1. It also proposes a method for the Geo-Localization task as the first benchmark for this dataset.
The proposed method utilizes a two-branch CNN architecture, as shown in Figure 17, to learn the relationships between different views and minimize the differences. Each location is treated as a separate class for the classifier since the dataset provides multiple images for each of them. The idea is to encourage the network to create a shared feature space by sharing images from various sources. To this aim, the chosen networks are two ResNet models [18] pre-trained on ImageNet. However, the classification layer is replaced with a 512-dimension FC layer followed by a classification layer after the pooling layer. The network exploits Instance Loss [50], initially developed for image-language bi-directional retrieval, to train the baseline model. It is defined as follows:

p_s = softmax(W_share × F_s(x_s)),  L_s = −log(p_s(c))
p_d = softmax(W_share × F_d(x_d)),  L_d = −log(p_d(c))

where x_s, x_d are two images of the location c, respectively from the satellite view and the drone view, W_share is the weight of the last classification layer, and p(c) is the predicted probability of the correct class c. Unlike the conventional classification loss, the shared weight W_share provides a soft constraint on the high-level features. After optimizing the models, the different feature spaces are aligned with the classification space. The cosine distance is used to determine the similarity between a query image and the candidate images in the gallery.
The system is tested to assess the localization quality for both the Satellite → UAV and UAV → Satellite scenarios. In a single-query setting, the model achieves an R@1, R@10, and AP of 74.47%, 83.88%, and 59.45% for the Satellite → UAV scenario, and 58.23%, 84.52%, and 62.91%, respectively, for the UAV → Satellite scenario.
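A hedged sketch of the instance-loss idea described above: satellite-view and drone-view features come from separate backbones but are classified by a single shared weight matrix W_share, and retrieval is performed with cosine similarity at test time. The feature dimension and the number of location classes (701, as in the University-1652 training split) are used only for illustration.

```python
# Sketch of the shared-classifier instance loss: L_s = -log softmax(W_share @ F_s(x_s))[c],
# L_d = -log softmax(W_share @ F_d(x_d))[c]; cosine distance is used for retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceLossHead(nn.Module):
    def __init__(self, feat_dim=512, num_locations=701):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_locations, bias=False)  # W_share

    def forward(self, f_sat, f_drone, location_ids):
        loss_s = F.cross_entropy(self.classifier(f_sat), location_ids)    # L_s
        loss_d = F.cross_entropy(self.classifier(f_drone), location_ids)  # L_d
        return loss_s + loss_d

def rank_gallery(query_feat, gallery_feats):
    # Cosine similarity between one query descriptor (D,) and all gallery descriptors (N, D).
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    return torch.argsort(sims, descending=True)
```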
FIGURE 20. Overview of the architecture proposed in [53]. It comprises various components, including Layer Normalization (LN), Multi-head Self-Attention modules with regular windowing configurations (W-MSA), Multi-head Self-Attention modules with shifted windowing configurations (SW-MSA), and Multi-Layer Perceptrons (MLP).
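As a reference for the components named in Figure 20, the sketch below shows a generic pre-norm Transformer block (LayerNorm, multi-head self-attention, MLP, residual connections). The window partitioning and shifting used by W-MSA/SW-MSA are deliberately omitted, so this is a simplified illustration rather than the exact module of [53]; the dimension and head count are illustrative values.

```python
# Generic pre-norm Transformer block: x + MSA(LN(x)), then x + MLP(LN(x)).
# Window partitioning/shifting (W-MSA / SW-MSA) is intentionally omitted here.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=96, num_heads=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```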
The proposed approach has been tested on the University-1652 dataset, described in III-B1, both in Satellite → UAV and UAV → Satellite contexts. In a single-query setting, the system achieves an R@1 and an AP of 89.73% and 84.94% for the Satellite → UAV scenario and 85.50% and 87.53% for the UAV → Satellite scenario. It also compares Triplet Loss with other loss functions to demonstrate the increase in terms of performance.

FIGURE 21. The framework of the proposed FSRA in [14]. The heatmap segmentation module (light green) rearranges and evenly distributes the heatmap data based on their distribution to segment distinct content features. The heatmap alignment branch (light blue) extracts feature vectors from each segmented region and conducts classification supervision for each vector. The Triplet Loss is utilized in each branch to reduce the distance between similar feature content and enable end-to-end learning. Moreover, the system also incorporates a global branch (light purple) based on the transformer.

FIGURE 22. The transformer-based strong baseline framework presented in [14]. The [cls_token] output marked with ∗ is utilized as the global feature f. The Classifier Layer consists of a linear layer, ReLU activation function, batch normalization, and dropout. The ID Loss refers to the CrossEntropy loss, which does not incorporate label-smoothing.

J. CROSS-VIEW IMAGE MATCHING FOR GEO-LOCALIZATION IN URBAN ENVIRONMENTS
The methodology outlined in [40] uses deep learning techniques to introduce a cross-view image-matching framework tailored for Geo-Localization purposes. Its objective is to autonomously identify, characterize, and align semantic content within cross-view images.
According to the pipeline depicted in Figure 23, rather than relying on the matching of local features, the study employs a cross-view matching strategy centered around buildings, which present a higher semantic significance and robustness to variations in viewpoints. Initially, the presence of buildings in both query and reference images is determined utilizing the Faster R-CNN [31] algorithm. Subsequently, a Siamese network [13] is employed to acquire deep feature representations to discern between matched and unmatched pairs of buildings within cross-view images. The idea consists of creating a mapping capable of associating buildings from distinct perspectives into a feature space where matched pairs are closer while unmatched pairs are distant. During the training phase for feature representation, the goal is to minimize the Euclidean distance between matched pairs in the feature space, ideally approaching zero, while maximizing the distance between unmatched pairs. During the testing phase, k nearest neighbors are determined from the reference images based on Euclidean matching scores for each identified building in the query image. Subsequently, an undirected, edge-weighted graph G = (V, E) is built without self-loops, encompassing all selected reference buildings and their nearest neighbors for each query building, forming clusters as depicted in Figure 24. Each chosen reference building is denoted as a node, with edges connecting pairs of nodes not within the same cluster. Simultaneously, weights are assigned to each edge, reflecting the similarity between linked node pairs. Each edge weight a_ij, which measures the similarity between reference buildings i and j, is represented as:

a_ij = e^(−d_ij² / (2σ²)) + α(s_i + s_j)

where d_ij² is the squared distance between the GPS locations of i and j in Cartesian coordinates, and s_i is the similarity between the query building and reference building i based on their building matching score. Geo-Localization aims to choose a maximum of one reference building from each cluster to maximize the overall weight. Dominant sets are employed to address this challenge. For a non-empty subset S ⊆ V, i ∈ S, and j ∉ S, the total weight of S is expressed as:

W(S) = Σ_{i ∈ S} w_S(i)

where w_S(i) is a weight defined recursively and assigned to each node i ∈ S. Then, the replicator dynamics algorithm is used to select a Dominant Set, which is used as the basis for deriving the final Geo-Localization. The latter is obtained by computing the mean GPS location from the reference buildings that have been selected within the Dominant Set.
The strategy has been tested with the dataset introduced in [47], comprising four pairs of street view and bird's eye view images per GPS location in the vicinity of downtown Pittsburgh, Orlando, and a section of Manhattan, sourced from the Google Street View dataset. Annotations were applied to identify the corresponding buildings, facilitating the training of a deep network for building matching purposes. Precision-recall curves are depicted for test image pairs to evaluate building matching efficacy. The fine-tuned model exhibits an average precision of 32%, a notable improvement over the 11% achieved by the pre-trained model.

K. SSA-NET: SPATIAL SCALE ATTENTION NETWORK FOR IMAGE-BASED GEO-LOCALIZATION
The work in [48] proposes an approach for Geo-Localization utilizing spatial layouts. This solution is versatile and able to deal with both cross-view scenarios, such as UAV and Satellite
views, and cross-modality scenarios, e.g., images from an RGB camera and a LiDAR sensor.
To achieve this goal, the strategy introduces a novel deep network, depicted in Figure 25, designed to encapsulate the spatial configuration of scenes within the feature representation. The network architecture is built upon a dual-branch Siamese framework aimed at acquiring a unified feature representation from pairs of images. Unlike the conventional Siamese network, its branches do not share weights due to the disparate nature of the input data sources. Each branch of the model consists of a VGG-16 CNN, with the classifier layers removed. The output of the final layer is then passed through the innovative Spatial-Scale Attention (SSA) module. This module is specifically engineered to integrate spatial layout information into the feature representation, enhancing image-matching capabilities. In detail, the module captures the relative positional dynamics among significant object features by autonomously pinpointing prominent correspondences while minimizing the impact of irrelevant ones. This is achieved by relying only on a self-attentive mechanism. Initially, the module reduces feature dimensionality through 1 × 1 convolutions, followed by a max-pooling operator to identify the most pertinent features. Subsequently, it utilizes a multi-scale spatial layout importance generator to establish a position embedding map, thus ensuring that object features across various scales receive tailored levels of attention. Given
FIGURE 24. An example of Geo-Localization using the dominant set proposed in [40].
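Connecting Figure 24 to the formulation of Section IV-J, the sketch below builds the edge weights a_ij = exp(−d_ij²/(2σ²)) + α(s_i + s_j) among candidate reference buildings and then keeps at most one candidate per query cluster. A simple greedy per-cluster selection stands in for the replicator-dynamics extraction of the dominant set used in [40], and the σ and α values are illustrative assumptions.

```python
# Edge weights between candidate reference buildings (Section IV-J):
# a_ij = exp(-d_ij^2 / (2*sigma^2)) + alpha * (s_i + s_j).
# A greedy pick per cluster stands in for the replicator-dynamics dominant set.
import numpy as np

def edge_weights(xy, sim, sigma=50.0, alpha=0.5):
    # xy: (N, 2) GPS positions in Cartesian meters; sim: (N,) building-matching scores.
    d2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1)
    a = np.exp(-d2 / (2.0 * sigma ** 2)) + alpha * (sim[:, None] + sim[None, :])
    np.fill_diagonal(a, 0.0)                      # no self-loops
    return a

def greedy_geolocalization(xy, sim, cluster_ids, sigma=50.0, alpha=0.5):
    # Simplification: same-cluster pairs are not masked out of the weight sums.
    a = edge_weights(xy, sim, sigma, alpha)
    chosen = []
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        # keep the member contributing the largest total edge weight (at most one per cluster)
        chosen.append(members[np.argmax(a[members].sum(axis=1))])
    return xy[chosen].mean(axis=0)                # mean GPS location of selected buildings
```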
FIGURE 27. Graphical Abstract of the proposal in [38]. The first two columns represent the first
module, while the third column is the second one.
performance. Figure 28 shows the overall architecture of the proposed strategy.
The proposed methodology has been tested utilizing the University-1652 dataset, described in Section III-B1. Evaluation occurred in both single-query and multi-query setups. In the former, the system processed one satellite image alongside one drone view, while in the latter, it handled a single satellite image along with multiple drone views captured from varying heights and angles. The obtained results validate the efficacy of localization in both Satellite → UAV and UAV → Satellite scenarios. In the single-query mode, the system achieved R@1, R@5, R@10, and AP scores of 83.27%, 90.32%, 95.52%, and 87.32%, respectively, in the first scenario, and 91.78%, 93.35%, 96.45%, and 82.18% in the second. In the multi-query mode, the system obtained R@1 and AP scores of 93.73% and 88.49% in the first scenario and 91.63% and 90.84% in the second. Additionally, the study presents intermediate results, demonstrating the effectiveness of each methodology component and showcasing the incremental improvements they achieved. Furthermore, it highlights the significant impact of increasing the number of input images on system performance across all scenarios. However, it is worth noting that the magnitude of improvement diminishes gradually after reaching a certain threshold.

M. ASSISTING UAV LOCALIZATION VIA DEEP CONTEXTUAL IMAGE MATCHING
The research presented in [28] focuses on aligning a pre-stored orthomosaic map with the front-view orientation of a UAV. It explores the possibility of utilizing onboard cameras and pre-stored geo-referenced imagery.

FIGURE 29. Overview of the system proposed in [28]. The template image IT and the orthomosaic IM serve as inputs for convolutional feature extractor processes. These processes generate feature maps M and T, which are then used to calculate the correlation tensor using Soft Mutual Nearest Neighbor Filtering. Subsequently, the correlation tensor is processed by a 4D convolutional network, and probabilistic constraints are applied over the processed correlation matrix to calculate points of correspondence feature matches. Finally, through the final FCN, these points predict the point-to-point correspondence between IT and IM, projecting the template over the orthomosaic.

The proposed strategy involves an end-to-end trainable architecture conducting feature learning and template localization simultaneously. This is achieved by imposing probabilistic constraints on densely correlated feature maps of different dimensions. The proposed aerial image localization network is designed to learn feature points with a neighborhood consensus, enabling the refinement of matches between template images and the pre-stored orthomosaic. The initial step utilizes a ResNet-101 [18] to extract convolutional features. These feature maps contain both local and global
information, which are used to construct a correlation matrix holding the feature matches for each extracted feature point. Subsequently, this correlation matrix is fed into a trainable network to learn how to establish more reliable correspondences. Moreover, probabilistic constraints are incorporated into these established correspondences to associate each feature point in the source image with the corresponding points in the orthomosaic. The same process is applied to the orthomosaic feature points. As depicted in Figure 30, for performing the clustering using Highest Correlation, a soft-argmax layer is applied to the generated probability maps to extract the best match indices. These indices are then passed through a FCN, acting as a regressor, to estimate the point-to-point correspondences between the source and target images. Since all components of the pipeline are differentiable, the network can be trained end-to-end. Figure 29 shows the overall architecture.
The system has been evaluated using the Aerial Template Matching Dataset, described in Section III-B6, achieving a keypoint matching accuracy of 92.66% and an average error in positioning the drone coordinates of 3.594 m².

V. UAV NAVIGATION STRATEGIES
This section presents some examples of strategies dealing with the Navigation task. Since a complete review of this task is out of the main scope of this survey, only a bunch of proposals are described in the following.

A. BRM LOCALIZATION: UAV LOCALIZATION IN GNSS-DENIED ENVIRONMENTS
The guidance strategy proposed in [12] enables UAVs to navigate along a predetermined path without relying on GNSS assistance by only requiring the initial position, an IMU sensor, and an RGB camera.
The paper introduces a new Building Ratio Map (BRM) localization method that compares UAV images with an existing numerical map. The approach entails an offline and an online stage. In the former, the numerical map is created
TABLE 1. Summary table for the proposals tested on the University-1652 benchmark [49]. In red, the best results for each category of test.
TABLE 2. Summary table for the proposals tested on the CVACT benchmark [24]. In red, the best results for each category of test.
TABLE 3. Summary table for the proposals tested on the CVUSA benchmark [45]. In red, the best results for each category of test.
benchmark dataset used in the experimental part to maintain a fair comparison. Table 1, Table 2, and Table 3 summarize the main characteristics of approaches dealing with University-1652, CVACT, and CVUSA, respectively. Table 4 includes all the proposals that are either tested on homemade/non-public datasets (and consequently not fairly comparable with other works) or those tested on a public benchmark but with no other competitor described in this survey. Since these datasets present similar characteristics, some proposals test their approaches using more than one of them, possibly demonstrating their robustness. All those works are included in all the corresponding tables.

TABLE 4. Summary table for the proposals' results that are not tested on common datasets. They are not directly comparable since they are tested on different benchmarks. The table also includes the two Navigation approaches described in Section V.

As reported in Table 1, several kinds of architectures were tested on the University-1652 benchmark, including CNN, Siamese CNN, CGAN, and Transformer. At the state-of-the-art, the most promising strategy is the one proposed in [38], which proposes an architecture made of PPT, CGAN, and an LPN based on ResNet50 and outperforms the other competitors in 5 out of the 8 metrics. Other promising results have been achieved by [14], which, even if it exploits one of the older versions of the Vision Transformer networks, is still able to achieve competitive results. As will also be pointed out in the following discussion, Vision Transformer-based networks are nowadays replacing the other networks as the most promising for several vision-related tasks. In the case of the CVACT benchmark, the Siamese-based architectures seem to be the most used for the Geo-Localization task. As reported in Table 2, the best results have been achieved by [39], which faces the problem with a Dual CGAN combined with VGG16, NetVLAD, and a Transformer architecture for classification. This confirms the high accuracy and robustness of this kind of network. As shown in Table 3, analogous results have been reported for CVUSA, in which the same architecture beat all the other competitors. In this case, it provides a R@1 of 95%, which can be considered a solid result for the task, especially considering that in a Navigation task, some other useful information can also be exploited to improve the Geo-Localization. In general, according to the literature, it is possible to notice the good potential of both CGAN-based and Transformer-based strategies for the Geo-Localization task. This demonstrates the adaptability of these kinds of networks also in this research field. An honorable mention is due to the proposal in [23], which has been tested over all of these three main benchmarks, even with average results, showing some kind of flexibility. This is especially true considering that the proposal was meant to provide a support strategy for two well-known state-of-the-art architectures. According to the results of the works that tested their strategies on the CVUSA benchmark, it looks like a much easier testbed than CVACT and University-1652. In fact, the performance in terms of R@1 presents a difference of about 10 percentage points.
In the case of the proposals summarized in Table 4, most of them used Siamese-based architectures, which confirms that this kind of architecture is promising in this field, especially if combined with Transformers. A general consideration of the proposals presented in this survey is the network resolution, which for most of the works is 256 × 256.
For the Navigation task, the state-of-the-art results are really promising, especially considering the 96–100% registration success (i.e., correct identification of the landmark) and an RMSE lower than 2.5 meters, which in most of the scenarios is comparable to, if not better than, the standard precision of a GPS sensor. However, it is also worth pointing out that the approaches retrieved from the literature rarely try to test their strategy in a mixed scenario, in which the training is performed over a dataset and the test over another. To the best of our knowledge, only one proposal faced this context [48], with really poor results. For this reason, it is unclear if this comes from the poor generalizability of the proposed architecture or if it is intrinsically related to the challenge of the task. Unfortunately, no other state-of-the-art works confirm this aspect.

VII. CONCLUSION
The Geo-Localization task is still a hot topic in Computer Vision, especially thanks to the recent spread of UAVs and the consequent price reduction. It can be used as a spot operation, to check the position of a device, especially in GPS-denied areas (e.g., fields of war), or as a part of a Navigation system to establish the correctness of the route. As for many different fields in Computer Vision, new technologies and advanced deep network techniques provided a huge boost in accuracy and speed without an increase in required resources. The latter is especially relevant to creating embedded navigation systems that can operate directly on the UAVs. This helps especially in contexts where the communication between the drone and
the base can be hijacked by hostile forces. Moreover, the [9] D. Avola, L. Cinque, A. Fagioli, G. Foresti, and A. Mecca, ‘‘Ultrasound
benchmarks available for Geo-Localization and Navigation medical imaging techniques: A survey,’’ ACM Comput. Surv., vol. 54, no. 3,
pp. 1–38, Apr. 2021.
tasks provide realistic contexts. In fact, the knowledge base [10] M. Bianchi and T. D. Barfoot, ‘‘UAV localization using autoencoded satellite
is collected by satellites, so it is safe and does not require images,’’ IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 1761–1768, Apr. 2021.
previous missions, and the tests are performed on UAV- [11] S. A. Carneiro, G. P. da Silva, S. J. F. Guimaraes, and H. Pedrini, ‘‘Fight
collected samples, which represent the ideal operative context. detection in video sequences based on multi-stream convolutional neural
networks,’’ in Proc. 32nd SIBGRAPI Conf. Graph., Patterns Images
This review explores the literature contributing to the Geo-Localization task, detailing the main challenges. It presents the key concepts needed to understand their application in Navigation tasks, specifically for UAV autonomous systems. It begins by providing a common background on Geo-Localization methods and Navigation, highlighting their differences and application fields. It continues by reviewing the available datasets, benchmarks, and metrics commonly used to evaluate Geo-Localization and Navigation approaches. Then, it presents a selection of state-of-the-art UAV-based Geo-Localization strategies, showcasing the latest advancements and methodologies. Additionally, it describes some of the newest and most promising UAV-based Navigation and Guidance proposals. Finally, to provide a clear comparison among the approaches, it summarizes them in comparative tables, allowing for an easy assessment of their relative strengths and weaknesses. This comparison may serve as a valuable resource for researchers interested in the field.

According to the findings of this study, CGAN- and Transformer-based techniques stood out in their overall performance. These results follow the current trend in Deep Learning, where these and related methods appear to be among the most promising approaches across different application areas. However, optimization strategies can be expected soon to allow the deployment of these architectures on board.
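As a purely illustrative example of such an optimization step (not a method proposed by the surveyed works), the sketch below applies PyTorch post-training dynamic quantization to the linear layers of a placeholder embedding network, shrinking the weights to 8-bit integers for CPU-bound onboard inference.

    import torch

    # Placeholder for a trained geo-localization encoder.
    model = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(3 * 64 * 64, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 256),
    )
    model.eval()

    # Post-training dynamic quantization: Linear weights become int8, activations stay float.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    x = torch.rand(1, 3, 64, 64)
    with torch.no_grad():
        emb = quantized(x)        # same interface as the original model, smaller footprint
    print(emb.shape)

More aggressive strategies, such as pruning, knowledge distillation, or binary networks, pursue the same goal of fitting the model within the memory and power budget of the UAV.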
REFERENCES
[1] S. Ahn, H. Kang, and J. Lee, "Aerial-satellite image matching framework for UAV absolute visual localization using contrastive learning," in Proc. 21st Int. Conf. Control, Autom. Syst. (ICCAS), Oct. 2021, pp. 143–146.
[2] A. A. Saadi, A. Soukane, Y. Meraihi, A. B. Gabis, S. Mirjalili, and A. Ramdane-Cherif, "UAV path planning using optimization approaches: A survey," Arch. Comput. Methods Eng., vol. 29, no. 6, pp. 4233–4284, Oct. 2022.
[3] S. Antonelli, D. Avola, L. Cinque, D. Crisostomi, G. L. Foresti, F. Galasso, M. R. Marini, A. Mecca, and D. Pannone, "Few-shot object detection: A survey," ACM Comput. Surv., vol. 54, no. 11s, pp. 1–37, Sep. 2022.
[4] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5297–5307.
[5] D. Avola, L. Cinque, G. L. Foresti, N. Martinel, D. Pannone, and C. Piciarelli, Low-Level Feature Detectors and Descriptors for Smart Image and Video Analysis: A Comparative Study. Berlin, Germany: Springer, 2018, pp. 7–29.
[6] D. Avola, I. Cannistraci, M. Cascio, L. Cinque, A. Diko, A. Fagioli, G. L. Foresti, R. Lanzino, M. Mancini, A. Mecca, and D. Pannone, "A novel GAN-based anomaly detection and localization method for aerial video surveillance at low altitude," Remote Sens., vol. 14, no. 16, p. 4110, Aug. 2022.
[7] D. Avola, L. Cinque, M. De Marsico, A. Fagioli, G. L. Foresti, M. Mancini, and A. Mecca, "Signal enhancement and efficient DTW-based comparison for wearable gait recognition," Comput. Secur., vol. 137, Feb. 2024, Art. no. 103643.
[8] D. Avola, L. Cinque, A. Di Mambro, A. Diko, A. Fagioli, G. L. Foresti, M. R. Marini, A. Mecca, and D. Pannone, "Low-altitude aerial video surveillance via one-class SVM anomaly detection from textural features in UAV images," Information, vol. 13, no. 1, p. 2, Dec. 2021.
[9] D. Avola, L. Cinque, A. Fagioli, G. Foresti, and A. Mecca, "Ultrasound medical imaging techniques: A survey," ACM Comput. Surv., vol. 54, no. 3, pp. 1–38, Apr. 2021.
[10] M. Bianchi and T. D. Barfoot, "UAV localization using autoencoded satellite images," IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 1761–1768, Apr. 2021.
[11] S. A. Carneiro, G. P. da Silva, S. J. F. Guimaraes, and H. Pedrini, "Fight detection in video sequences based on multi-stream convolutional neural networks," in Proc. 32nd SIBGRAPI Conf. Graph., Patterns Images (SIBGRAPI), Oct. 2019, pp. 8–15.
[12] J. Choi and H. Myung, "BRM localization: UAV localization in GNSS-denied environments based on matching of numerical map and UAV images," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 4537–4544.
[13] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2005, pp. 539–546.
[14] M. Dai, J. Hu, J. Zhuang, and E. Zheng, "A transformer-based feature segmentation and region alignment method for UAV-view geo-localization," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4376–4389, Jul. 2022.
[15] J. Dessain, "Machine learning models predicting returns: Why most popular performance metrics are misleading and proposal for an efficient metric," Exp. Syst. Appl., vol. 199, Aug. 2022, Art. no. 116970.
[16] H. Goforth and S. Lucey, "GPS-denied UAV localization using pre-existing satellite imagery," in Proc. Int. Conf. Robot. Autom. (ICRA), May 2019, pp. 2974–2980.
[17] K. Hartmann and K. Giles, "UAV exploitation: A new domain for cyber power," in Proc. 8th Int. Conf. Cyber Conflict (CyCon), May 2016, pp. 205–221.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[19] X. Hou, L. Shen, K. Sun, and G. Qiu, "Deep feature consistent variational autoencoder," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 1133–1141.
[20] S. Hu, M. Feng, R. M. H. Nguyen, and G. H. Lee, "CVM-net: Cross-view matching network for image-based ground-to-aerial geo-localization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7258–7267.
[21] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1125–1134.
[22] A. Kolesnikov, A. Dosovitskiy, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer, M. Dehghani, N. Houlsby, S. Gelly, T. Unterthiner, and X. Zhai, "An image is worth 16×16 words: Transformers for image recognition at scale," 2021, arXiv:2010.11929.
[23] J. Lin, Z. Zheng, Z. Zhong, Z. Luo, S. Li, Y. Yang, and N. Sebe, "Joint representation learning and keypoint detection for cross-view geo-localization," IEEE Trans. Image Process., vol. 31, pp. 3780–3792, 2022.
[24] L. Liu and H. Li, "Lending orientation to neural networks for cross-view geo-localization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5617–5626.
[25] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[26] K. Merry and P. Bettinger, "Smartphone GPS accuracy study in an urban environment," PLoS ONE, vol. 14, no. 7, Jul. 2019, Art. no. e0219890.
[27] N. C. Mithun, K. Sikka, H.-P. Chiu, S. Samarasekera, and R. Kumar, "RGB2LiDAR: Towards solving large-scale cross-modal visual localization," in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 934–954.
[28] M. H. Mughal, M. J. Khokhar, and M. Shahzad, "Assisting UAV localization via deep contextual image matching," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 2445–2457, 2021.
[29] B. Patel, T. D. Barfoot, and A. P. Schoellig, "Visual localization with Google Earth images for robust global pose estimation of UAVs," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 6491–6497.
[30] K. Regmi and M. Shah, "Bridging the domain gap for ground-to-aerial image matching," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 470–479.
[31] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[32] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention—MICCAI, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds., Cham, Switzerland: Springer, 2015, pp. 234–241.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[34] J. Sandino, F. Vanegas, F. Gonzalez, and F. Maire, "Autonomous UAV navigation for active perception of targets in uncertain and cluttered environments," in Proc. IEEE Aerosp. Conf., Mar. 2020, pp. 1–12.
[35] Y. Shi, L. Liu, X. Yu, and H. Li, "Spatial-aware feature aggregation for image based cross-view geo-localization," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1–11.
[36] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[37] B. Sun, C. Chen, Y. Zhu, and J. Jiang, "GEOCAPSNET: Ground to aerial view image geo-localization using capsule network," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2019, pp. 742–747.
[38] X. Tian, J. Shao, D. Ouyang, and H. T. Shen, "UAV-satellite view synthesis for cross-view geo-localization," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4804–4815, Jul. 2022.
[39] X. Tian, J. Shao, D. Ouyang, A. Zhu, and F. Chen, "SMDT: Cross-view geo-localization with image alignment and transformer," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2022, pp. 1–6.
[40] Y. Tian, C. Chen, and M. Shah, "Cross-view image matching for geo-localization in urban environments," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1998–2006.
[41] N. N. Vo and J. Hays, "Localizing and orienting street views using overhead imagery," in Computer Vision—ECCV 2016 (Lecture Notes in Computer Science), vol. 9905, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., Cham, Switzerland: Springer, 2016, pp. 494–509, doi: 10.1007/978-3-319-46448-0_30.
[42] J. Wang, Y. Yang, M. Pan, M. Zhang, M. Zhu, and M. Fu, "Hybrid perspective mapping: Align method for cross-view image-based geo-localization," in Proc. IEEE Int. Intell. Transp. Syst. Conf. (ITSC), Sep. 2021, pp. 3040–3046.
[43] T. Wang, Z. Zheng, C. Yan, J. Zhang, Y. Sun, B. Zheng, and Y. Yang, "Each part matters: Local patterns facilitate cross-view geo-localization," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 2, pp. 867–879, Feb. 2022.
[44] S. Workman and N. Jacobs, "On the location dependence of convolutional neural network features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2015, pp. 70–78.
[45] S. Workman, R. Souvenir, and N. Jacobs, "Wide-area image geolocalization with aerial reference imagery," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3961–3969.
[46] Q. Ye, J. Luo, and Y. Lin, "A coarse-to-fine visual geo-localization method for GNSS-denied UAV with oblique-view imagery," ISPRS J. Photogramm. Remote Sens., vol. 212, pp. 306–322, Jun. 2024.
[47] A. R. Zamir and M. Shah, "Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1546–1558, Aug. 2014.
[48] X. Zhang, X. Meng, H. Yin, Y. Wang, Y. Yue, Y. Xing, and Y. Zhang, "SSA-net: Spatial scale attention network for image-based geo-localization," IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[49] Z. Zheng, Y. Wei, and Y. Yang, "University-1652: A multi-view multi-source benchmark for drone-based geo-localization," in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 1395–1403.
[50] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, M. Xu, and Y.-D. Shen, "Dual-path convolutional image-text embeddings with instance loss," ACM Trans. Multimedia Comput., Commun., Appl., vol. 16, no. 2, pp. 1–23, May 2020.
[51] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5122–5130.
[52] Y. Zhu, B. Sun, X. Lu, and S. Jia, "Geographic semantic network for cross-view image geo-localization," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 4704315.
[53] J. Zhuang, X. Chen, M. Dai, W. Lan, Y. Cai, and E. Zheng, "A semantic guidance and transformer-based matching method for UAVs and satellite images for UAV geo-localization," IEEE Access, vol. 10, pp. 34277–34287, 2022.

DANILO AVOLA (Member, IEEE) received the M.Sc. degree in computer science from the Sapienza University of Rome, Italy, in 2002, and the Ph.D. degree in molecular and ultrastructural imaging from the University of L'Aquila, Italy, in 2014. Since 2024, he has been an Associate Professor with the Department of Computer Science, Sapienza University of Rome. He co-founded and leads the Prometheus Laboratory and is a co-founder of 4AI, a university startup focused on pioneering new methodologies in artificial intelligence. Previously, he was an Assistant Professor and the Research and Development Scientific Director of the Computer Vision Laboratory (VisionLab), Sapienza University. As a Principal Investigator, he directs several strategic research initiatives with Sapienza, including Wi-Fi Sensing for Person Re-Identification and Human Synthesis, Emotion Transference in Humanoids via EEG, UAV Navigation by View, and LieToMe Systems. His research interests include artificial intelligence (including machine learning and deep learning), computer vision, Wi-Fi sensing, EEG signal analysis, human–computer interaction, human-behavior recognition, human-action recognition, biometric analysis, bioinformatics, optimized neural architectures, deception detection, VR/AR systems, drones, and robotics. He is an active member of several professional organizations, including IAPR, CVPL, ACM, AIxIA, and EurAI.

LUIGI CINQUE (Senior Member, IEEE) received the M.Sc. degree in physics from the University of Napoli, Italy, in 1983. From 1984 to 1990, he was with the Laboratory of Artificial Intelligence (Alenia S.p.A), working on the development of expert systems and knowledge-based vision systems. He is a Full Professor of computer science with the Sapienza University of Rome, Italy. Some of the techniques he has proposed have found applications in the field of video-based surveillance systems, autonomous vehicles, road traffic control, human behavior understanding, and visual inspection. He is the author of more than 200 papers in national and international journals and conference proceedings. His first scientific interests cover image processing, object recognition, and image analysis, with a particular emphasis on content-based retrieval in visual digital archives and advanced man-machine interaction assisted by computer vision. Currently, his main interests include distributed systems for the analysis and interpretation of video sequences and target tracking. He is a member of ACM, IAPR, and CVPL. He served on scientific committees of international conferences (e.g., CVPR, ICME, and ICPR) and symposia. He serves as a Reviewer for many international journals (e.g., IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE TRANSACTIONS ON MEDICAL IMAGING, and Image and Vision Computing).

EMAD EMAM received the M.Sc. degree in control engineering from the Sapienza University of Rome, Italy, where he is currently pursuing the Ph.D. degree in computer science. He is a Senior Research Engineer with the Prometheus Laboratory. His research interests include machine learning, deep learning, computer vision, simultaneous localization and mapping (SLAM) with UAVs, human-action recognition, optimized neural architectures, and robotics.
FEDERICO FONTANA (Student Member, IEEE) received the bachelor's and master's degrees in computer science from the Sapienza University of Rome, Italy, where he is currently pursuing the Ph.D. degree with the Department of Computer Science. His research interests include efficient deep learning, computer vision, binary neural networks, and pruning.

MARCO RAOUL MARINI (Member, IEEE) received the combined B.Sc. and M.Sc. (cum laude) degrees in computer science from the Sapienza University of Rome, Italy, in 2015, and the Ph.D. degree in computer science from the Computer Vision Laboratory, Department of Computer Science, Sapienza University of Rome, in 2019. He is a member of the Computer Vision Laboratory, Department of Computer Science, Sapienza University of Rome. His research interests include human behavior analysis, virtual and augmented reality, multimodal interaction, natural interaction, machine learning, and deep learning. All these interests are focused on the development of systems applied to the problem of behavior understanding. The main application area is eXtended Reality (XR). He recently focused on interactions, locomotion, and human-centered analysis in VR and brain–computer interface (BCI) integration. He is a member of CVPL.
Open Access funding provided by 'Università degli Studi di Roma "La Sapienza"' within the CRUI CARE Agreement.