
Received 9 August 2024, accepted 29 August 2024, date of publication 5 September 2024, date of current version 16 September 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3455096

UAV Geo-Localization for Navigation: A Survey


DANILO AVOLA 1 , (Member, IEEE), LUIGI CINQUE 1 , (Senior Member, IEEE), EMAD EMAM 1 ,
FEDERICO FONTANA1 , (Student Member, IEEE), GIAN LUCA FORESTI 2 , (Senior Member, IEEE),
MARCO RAOUL MARINI 1 , (Member, IEEE), ALESSIO MECCA 1 , (Member, IEEE),
AND DANIELE PANNONE 1 , (Member, IEEE)
1 Department of Computer Science, Sapienza University of Rome, 00198 Rome, Italy
2 Department of Mathematics, Computer Science and Physics, University of Udine, 33013 Udine, Italy
Corresponding author: Alessio Mecca ([email protected])
This work was supported in part by the ‘‘Smart unmannEd AeRial vehiCles for Human likE monitoRing (SEARCHER)’’ Project of the
Italian Ministry of Defence within the Piano Nazionale della Ricerca Militare (PNRM) 2020 Program under Grant PNRM a2020.231; in part
by the Brain–Computer Interface (BCI) Based System for Transferring Human Emotions Inside Unmanned Aerial Vehicles (UAVs) Sapienza
University Research Projects under Grant RM1221816C1CF63B; in part by the ‘‘EYE-FI.AI: going bEYond computEr vision paradigm using
wi-FI signals in AI systems’’ Project of the Italian Ministry of Universities and Research (MUR) within the Progetti di Rilevante Interesse
Nazionale (PRIN) 2022 Program (CUP: B53D23012950001) under Grant 2022AL45R2; in part by the Made in Italy–Circular and
Sustainable (MICS) Extended Partnership; and in part by the Next-Generation EU (Italian Piano Nazionale di Ripresa e Resilienza
(PNRR)–M4 C2, Invest 1.3–D.D. 1551.11-10-2022, PE00000004) under Grant CUP MICS B53C22004130001.

ABSTRACT During flight, Unmanned Aerial Vehicles (UAVs) usually exploit internal sensors to determine
their position. The most widely used is the Global Positioning System (GPS) or, more generally, any
Global Navigation Satellite System (GNSS). Modern GPS receivers locate the device within a few meters,
especially in scenarios with good weather and an open sky. However, accuracy degrades considerably when
these optimal conditions are missing. Moreover, in restricted areas or fields of war, several anti-drone
techniques are applied to limit UAV capabilities. Without proper counter solutions, UAVs cannot continue
their task and sometimes are not even able to come back, since they are not aware of their position.
In recent years, plenty of techniques have been developed to provide UAVs with a knowledge of their
location that is not strictly tied to the availability of the GPS sensor. This research field is commonly
called Geo-Localization and can be considered one of the hot topics of UAV research. Moreover, research is
going further, trying to provide UAVs with fully autonomous navigation systems that do not use hijackable
sensors. This survey aims to provide a quick guide to the newest and most promising methodologies for UAV
Geo-Localization for Navigation tasks, showing their differences and the related application fields.

INDEX TERMS UAV, geo-localization, navigation, guidance.

(The associate editor coordinating the review of this manuscript and approving it for publication was Guillermo Valencia-Palomo.)

I. INTRODUCTION

During the last years, Computer Vision (CV) drastically improved its accuracy due to the advances in processing techniques, especially thanks to the novel deep learning approaches. Nowadays, it is possible to accomplish tasks that were not feasible in the past. CV-based approaches can be applied to a plethora of fields, such as medical imaging [9], object detection and recognition [5], [8], even with only a few or no samples available [3], surveillance [6], [11], biometrics [7], and others. Some of these classical research fields received a boost in terms of applicability thanks to the recent spread of Unmanned Aerial Vehicles (UAVs) and the consequent reduced costs. This is especially true for military applications. In this context, the use of UAVs drastically changed the war's offensive and defensive strategies [17], raising new challenges for both tasks.

During flights, UAVs usually rely on internal sensors to determine their spatial location. The most commonly used sensor is the Global Positioning System (GPS), or more generally, any Global Navigation Satellite System (GNSS). Modern GPS devices are able to accurately pinpoint locations within a few meters, especially with favorable weather conditions and clear skies. Even smartphone GPS systems,

which are not designed for military use, have an average accuracy of a few meters [26]. Although adverse weather and obstructed views of the sky can reduce this accuracy, the displacement rarely exceeds 30 meters in urban environments with many tall buildings [26]. However, GPS sensors can be compromised during flights due to various internal and external factors. In restricted zones or conflict areas, anti-drone tactics like GPS hijacking and spoofing are applied to disrupt UAV operations. Consequently, UAVs may be unable to complete their missions or return to base due to a loss of positional information. In this context, significant efforts have been made in recent years to develop techniques that enable UAVs to determine their location without relying on GPS. This research field, which has drawn significant interest recently, is commonly called Geo-Localization.

The main purpose of this review is to explore the main solutions for Geo-Localization and to introduce useful concepts to properly understand their usage in Navigation tasks. This literature review provides an overview of the most relevant literature strategies, describing the main challenges and some possible solutions. The survey continues as follows: Section II provides a common background on Geo-Localization methods and introduces some concepts of Navigation tasks, with a strong focus on onboard UAV autonomous systems. In addition, this section details these two classes of algorithms, presents the main differences between them, and shows the main application fields. Section III presents the available datasets, benchmarks, and metrics usually used to evaluate Geo-Localization approaches. Section IV reports some highly selected state-of-the-art UAV-based Geo-Localization strategies. Section V details the newest and most promising UAV-based Guidance/Navigation proposals. Section VI summarizes the presented approaches and depicts some comparative tables. Section VII draws some conclusions and illustrates future research directions.

II. AUTONOMOUS UAV GUIDANCE: GEO-LOCALIZATION AND NAVIGATION

A. GEO-LOCALIZATION
The Geo-Localization task aims to provide the UAV with its location without using a GNSS. As already mentioned in Section I, the use of the GPS sensor is the most reliable and easy solution to the problem, but in some specific contexts, such as war fields, in which GPS frequencies can be shut down, or in environments with tall buildings/trees, which can hinder the GPS positioning, its reliability can drastically fail. Most approaches exploit the RGB camera to overcome this limitation, and some also involve Inertial Measurement Unit (IMU) sensors. In general, this task requires acquiring prior knowledge of the area of interest, which is used as ground truth. These acquisitions must also include geo-reference data for each image to create a reliable ground truth. For this purpose, the most common source is satellite images. They have considerable advantages: they are usually freely accessible, cover large areas, are of high quality, can be obtained without a real UAV (which is not always available, e.g., in a war field), and are always geo-referenced. However, they are taken from a vertical view perspective, which usually differs from the oblique view of UAVs. Given that, orthomosaic transformations or similar procedures are often required to find the best alignment between them. As an alternative, the ground truth acquisitions can derive from one or more previous flights of the same UAV or of a comparable device. This is often the best solution, thanks to similar acquisition conditions and parameters, but it is not always possible in adversarial contexts (e.g., the already mentioned fields of war). Regardless of the kind of acquisition used for the ground truth, the UAV's previous knowledge must at least include information regarding the area between its starting location and the target. This knowledge also usually includes a possibly wide offset on both sides of the chosen path to guarantee the localization even if some problem pushes the UAV off course. The most common acquisition sources for the UAVs in the Geo-Localization task are RGB cameras, even though some approaches can also exploit LiDAR or IR sensors. The Geo-Localization task can be formalized as an automatic understanding of the UAV location using only the ground truth and one or more images acquired in a given instant during the flight. This task is graphically represented in Figure 1.

B. NAVIGATION (GUIDANCE)
The Navigation task (also known as the Guidance task) includes all the techniques that support or allow for an autonomous flight of a UAV in the case of GPS absence or denial. More specifically, this task involves all the methods to recognize whether the UAV is off course or on the right path. Differently from the Geo-Localization task described in Section II-A, the Navigation task also includes the procedures allowing a UAV to maintain the route to reach the destination. Such procedures are usually assisted by Inertial Measurement Unit (IMU) sensors, which typically consist of accelerometers, gyroscopes, and magnetometers, providing real-time data such as UAV orientation, velocity, and acceleration. In this regard, two factors are worth noticing. The former is that a Navigation system can use all the information about the entire flight, all the data from other sensors (including the IMU), and all the past visited locations during the current flight. The latter is that a Navigation system performs the Geo-Localization task several times during the flight (according to the camera characteristics, the UAV speed, and the required accuracy). In this context, it is worth pointing out that more accurate localizations are easier to achieve due to the possible use of the previous flight information and the IMU data, even if the Navigation task could appear more complex than Geo-Localization. These two pieces of information are helpful because they drastically reduce the Search Area (see Figure 1), since it is possible to estimate the UAV position more effectively. In other words, Navigation can be thought of as a sequence of Geo-Localization tasks. The system uses the actual location of the UAV and its past Geo-Localizations to determine whether it is following the intended path. If the UAV deviates from its intended course, the system takes corrective action to put it back on track.


FIGURE 1. Graphical representation of the Geo-Localization task. A generic UAV flies toward a target location and loses the GPS signal.
By looking around, it finds a registered Point-of-Interest (PoI) that allows for understanding the actual position and correcting the flight plan.
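To make the interplay between Geo-Localization and Navigation concrete, the following minimal Python sketch (our illustration, not code from any surveyed work) implements the loop of Figure 2: an embedding-based retrieval stand-in for the Geo-Localization step, a cross-track error check against the planned leg, and a three-way decision. The nearest-neighbour stand-in, the local metric coordinates, and the 15 m / 50 m tolerances are all assumptions chosen for the example.

```python
import math

def geo_localize(query_embedding, reference_db):
    """Stand-in for any Geo-Localization method from Section IV: retrieve the
    geo-referenced reference whose embedding is closest to the query and
    return its (east, north) position in a local metric frame (meters)."""
    best = min(reference_db, key=lambda ref: math.dist(ref["embedding"], query_embedding))
    return best["position"]

def cross_track_error(position, leg_start, leg_end):
    """Distance in meters between `position` and the planned leg, using a
    flat local frame for brevity."""
    ax, ay = leg_start; bx, by = leg_end; px, py = position
    dx, dy = bx - ax, by - ay
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy or 1.0)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def guidance_step(query_embedding, reference_db, leg, on_route_tol=15.0, off_route_tol=50.0):
    """One iteration of the loop sketched in Figure 2."""
    position = geo_localize(query_embedding, reference_db)
    error = cross_track_error(position, *leg)
    if error <= on_route_tol:
        return "on route", position
    if error <= off_route_tol:
        return "slightly off route", position
    return "correct flight plan", position

# Toy usage with two registered PoIs and one planned leg.
db = [{"embedding": (0.1, 0.9), "position": (0.0, 0.0)},
      {"embedding": (0.8, 0.2), "position": (500.0, 120.0)}]
print(guidance_step((0.75, 0.25), db, ((0.0, 0.0), (500.0, 0.0))))
```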

Figure 2 visually represents the Guidance/Navigation task.

In the literature, approaches dealing with this task sometimes overlap with complete navigation system proposals, which are also influenced by other factors, especially the embedded equipment, the UAV dimensions, and other specific requirements. Moreover, strategies dealing with Geo-Localization can also be applied and included in a Navigation system. Such techniques can be facilitated by the presence of the previous parameters obtained along the current flight. For this reason, only a handful of proposals facing this task have been reported in Section V, to provide a general idea of the strategies in this context. Here, Autonomous Navigation and Path Optimization are interconnected and critical components for achieving effective autonomous flight operations, especially in environments where GPS is unavailable or unreliable.

Autonomous Navigation (AN) [34] extends the concept of Navigation by incorporating the ability of the UAV to make independent decisions regarding its flight path without human input. AN systems use a combination of sensors, data, and algorithms to perceive the environment, assess operational parameters, and execute decisions. This task is more complex, as it must adapt to dynamic conditions and execute real-time problem-solving strategies to manage the flight path.

Path Optimization (PO) [2] is a specific sub-task of AN that focuses on finding the most efficient or effective route between two points. However, the choice is not limited to the shortest path but also considers other factors such as energy efficiency, obstacle avoidance, safety margins, and compliance with flight regulations. PO algorithms process data from the ongoing UAV route and environmental sensors to dynamically adjust the flight plan. By optimizing the path, PO also minimizes risks and resource usage, which is crucial in scenarios where endurance, battery life, or mission duration are critical.

To summarize, Navigation provides the basic "where am I and where am I going" framework. AN enhances this framework with sophisticated decision-making capabilities that allow the UAV to operate independently. PO is a sub-task of AN refining the UAV's trajectory for optimum efficiency and safety. Together, these elements enable a UAV to perform complex missions in challenging environments where traditional navigation aids such as GPS might not be available. This integration is vital for advancing UAV technology towards fully autonomous operations where human intervention is minimized, and operational efficiency and safety are maximized.
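As a toy illustration of the multi-factor route choice described above for PO (not an algorithm prescribed by the survey), the sketch below runs a plain Dijkstra search over a small waypoint graph whose edge cost mixes distance, energy, and a risk term; the weights and the numbers in the example graph are invented for the demonstration.

```python
import heapq

def edge_cost(edge, w_dist=1.0, w_energy=0.5, w_risk=2.0):
    """Illustrative multi-factor cost: PO does not only minimize distance but
    also, e.g., energy usage and proximity to obstacles or no-fly zones."""
    return w_dist * edge["distance_m"] + w_energy * edge["energy_wh"] + w_risk * edge["risk"]

def optimize_path(graph, start, goal):
    """Plain Dijkstra over a waypoint graph; graph[u] maps a neighbour v to an
    edge-attribute dict. Returns the lowest-cost waypoint sequence."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, edge in graph.get(node, {}).items():
            if nxt not in visited:
                heapq.heappush(frontier, (cost + edge_cost(edge), nxt, path + [nxt]))
    return float("inf"), []

# Toy waypoint graph with made-up numbers.
graph = {
    "A": {"B": {"distance_m": 900, "energy_wh": 12, "risk": 0.1},
          "C": {"distance_m": 600, "energy_wh": 9,  "risk": 0.8}},
    "B": {"D": {"distance_m": 700, "energy_wh": 10, "risk": 0.1}},
    "C": {"D": {"distance_m": 650, "energy_wh": 9,  "risk": 0.9}},
}
print(optimize_path(graph, "A", "D"))
```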


FIGURE 2. Graphical representation of the Guidance/Navigation task. A generic UAV flies toward a target location. It looks around at fixed or
variable time intervals to find at least one registered PoI. According to the view angle of the PoI, the navigation algorithm can understand
whether the UAV is on route, slightly off route but not enough to require intervention, or so far off route that it must intervene and correct
the flight plan.

III. METRICS AND BENCHMARKS
Most proposals dealing with Geo-Localization use common benchmarks to test their performance, allowing for a fair comparison among the different strategies. For this reason, this section describes in detail the most relevant freely available datasets for this task. Works testing their procedure on homemade private datasets will be described in conjunction with the proposal itself. It is also worth pointing out that most of these datasets are synthetic, i.e., the images are not acquired with a real drone in the analyzed zone(s), but they are synthesized from a satellite view by changing the orientation to simulate a coherent drone view. However, even if this does not seem to fit a real-life scenario, such images are very similar to those really acquired by a UAV in the same zone, as shown in Figure 3.

Another consideration is that most benchmark datasets deal with the Satellite vs. UAV scenario and vice versa, i.e., a cross-view context. This is coherent with the usual usage of Geo-Localization procedures. However, it is also limiting, since different modality scenarios (e.g., IR vs. RGB, LiDAR vs. RGB, and many others) are also worth considering, mainly due to the spread and the reduction in the cost of these kinds of acquisition sensors. To the best of our knowledge, only the GRAL dataset can be exploited to test strategies in this context.

A. METRICS
This section briefly describes the evaluation metrics presented in the state-of-the-art proposals. These metrics are well known for evaluating effectiveness in numerous application areas [15]. Also for the Geo-Localization task, they have been accepted as a standard by the scientific community [46], although they have undergone slight modifications, as explained in the following.

a: RECALL@K (R@K)
Let |D| be the dataset size, 1 ≤ K ≤ |D|, and L the ordered-by-score list of the retrieved images provided by the system. The R@K provides the percentage of tests in which the system is able to retrieve at least one correct match within the first K elements in L. Literature experiments usually set K to 1, 5, or 10.

b: RECALL@K% (R@K%)
Let |D| be the dataset size, 1 ≤ K ≤ |D|, and L the ordered-by-score list of the retrieved images provided by the system. The R@K% provides the percentage of tests in which the system is able to retrieve at least one correct match within the first K% of elements in L. For example, if the system contains 1500 images used as previous knowledge and K = 1, the R@K% is the percentage of tests in which the system is able to retrieve at least one correct match within the first 15 elements in L. In the literature experiments, K is usually set to 1, and this metric is sometimes referred to as R@topK%.

c: AVERAGE PRECISION (AP)
It calculates the average Precision value across different instances. Precision measures the proportion of correctly identified positive instances, i.e., True Positives (TP), against all instances classified as positive, i.e., TPs and False Positives (FP):

P = TP / (TP + FP)

d: PRECISION@K
It indicates the Precision achieved by a system considering a TP any time that at least one correct match is provided among the first K results.

e: ROOT MEAN SQUARE ERROR (RMSE)
It measures the differences between the values predicted by a model or an estimator and the observed values. In the Geo-Localization context, it is used to evaluate how far, on average, the estimated locations are from the corresponding geo-referenced ground truth. It is usually expressed in terms of meters, and it is given by the following formula:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)² )
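A minimal NumPy sketch of how these metrics are typically computed for retrieval-based Geo-Localization is given below. The per-paper definitions can differ slightly, as noted above, so the retrieval-style AP, the planar RMSE variant, and the toy gallery are illustrative assumptions rather than the exact protocol of any specific benchmark.

```python
import numpy as np

def recall_at_k(ranked_ids, true_ids, k):
    """R@K: fraction of queries with at least one correct match in the top-K
    of the ordered-by-score list L. ranked_ids has shape (num_queries, |D|)."""
    hits = [len(set(row[:k]) & truth) > 0 for row, truth in zip(ranked_ids, true_ids)]
    return float(np.mean(hits))

def recall_at_k_percent(ranked_ids, true_ids, k_percent=1.0):
    """R@K%: as above, but K is a percentage of the gallery size
    (e.g., 1% of a 1500-image gallery gives the top 15)."""
    k = max(1, int(round(ranked_ids.shape[1] * k_percent / 100.0)))
    return recall_at_k(ranked_ids, true_ids, k)

def average_precision(ranked_ids, true_ids):
    """Retrieval-style AP, averaged over queries."""
    ap = []
    for row, truth in zip(ranked_ids, true_ids):
        hits, precisions = 0, []
        for rank, idx in enumerate(row, start=1):
            if idx in truth:
                hits += 1
                precisions.append(hits / rank)
        ap.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(ap))

def rmse_meters(predicted_xy, ground_truth_xy):
    """RMSE between estimated and geo-referenced locations, planar variant in meters."""
    err = np.asarray(predicted_xy, float) - np.asarray(ground_truth_xy, float)
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

# Toy usage: 2 queries, gallery of 4 images, ground-truth gallery ids per query.
ranked = np.array([[2, 0, 1, 3], [3, 1, 0, 2]])
truth = [{2}, {0}]
print(recall_at_k(ranked, truth, 1), recall_at_k_percent(ranked, truth, 25.0), average_precision(ranked, truth))
```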


FIGURE 3. Comparison between images captured by real drones and synthesized from satellites. Such an example is taken
from the University-1652 dataset [49], but is common for most of the literature benchmarks.

B. DATASETS
This section details the most common benchmarks used to test approaches dealing with the Geo-Localization task. It is worth pointing out that most of these datasets only entail RGB data, with the exception of the GRAL dataset, which also includes LiDAR acquisitions. Moreover, all of them include acquisitions from satellites and the corresponding acquisitions captured by UAVs. In general, including different acquisition sensors could be beneficial in some specific contexts. For example, an infrared camera can be used in low-light contexts (e.g., nocturnal operations) but is not indicated in the presence of strong light sources (e.g., a sunny day). A LiDAR, instead, can be useful to create distance maps that in some contexts provide a more accurate reconstruction of the scene, but it usually has a significant weight and high cost. In conclusion, considering the limited hardware capabilities of UAVs and also the limitation due to the maximum payload, it is not always possible or worthwhile to add extra sensors.

1) UNIVERSITY-1652 DATASET
The University-1652 dataset [49] contains data from 1652 university buildings worldwide collected in three different ways, i.e., synthetic drones, satellites, and ground cameras. It was initially designed for two tasks: drone-view target localization and drone navigation. However, it can also be used for Geo-Localization since it can be considered a submodule of the drone navigation task. In this context, the ground camera images are usually ignored. In more detail, the images from each building are provided with related metadata, i.e., names and university affiliation (taken from Wikipedia) and geo-reference (taken from Google Maps). The satellite view is obtained by synthesizing images from Google Maps coordinates, while UAV-view images are synthesized at the exact coordinates using the 3D models provided by Google Earth. Ground-view images are captured using Google Maps' street view. This dataset presents two main advantages. First, it contains data from three different views with geo-referenced information from 1652 buildings. Second, it provides different views of the same building with various distances and orientations, with an average of 71 images per location. As for the cons, the dataset only deals with buildings, limiting the localization study. Moreover, the images are not taken with real UAVs, even if the differences are small, as shown in Figure 3.

To evaluate the performance on this dataset, the authors provide a protocol consisting of a training set containing 701 buildings of 33 Universities and a test set including 701 buildings of the remaining 39 Universities. The results are provided in terms of AP, R@1, R@5, R@10, and R@1%.

2) CVUSA DATASET
The CVUSA dataset [45] consists of 1588655 geo-tagged pairs of ground-level and aerial images. Ground-level geo-tagged pictures were gathered via Google Street View and Flickr.1 Google Street View images have been taken in randomly selected areas within the United States. For each location, the authors collect a panoramic image and two perspective images from viewpoints separated by 180° along the roadway. For Flickr, the researchers created a 100 × 100 grid over the United States and downloaded up to 150 photographs from each grid cell (from 2012 onwards, sorted by the Flickr "interesting" score). This binning phase ensures a more uniform sampling distribution because Flickr photographs are overrepresented in metropolitan areas. Indoor images have been filtered out using the procedure proposed in [44] to keep only outdoor ones. As a result, 551851 Flickr and 1036804 Street View images were collected this way. For each ground-level image's location, the authors synthesized an 800 × 800 aerial image centered on it from Bing Maps at various spatial scales (zoom scales 14, 16, and 18). After accounting for overlap, this approach yields 1588655 geo-tagged image-matched pairs and 879318 distinct aerial image locations. Figure 4 shows some examples of matched aerial and ground-level photos from the dataset.
One of the main advantages of this cross-view dataset is the highly differentiated locations across the United States, allowing for more generalized feature learning.

3) CVACT DATASET
The CVACT dataset [24] is a densely sampled cross-view image Geo-Localization dataset that provides 35532 ground/satellite image pairs covering Canberra, Australia. The street-view panoramas were collected from Google Street View over 300 square miles (about 483 km²) at zoom level 2. The image resolution of the panoramas is 1664 × 832, with a 180-degree vertical Field of View (FoV). For each panorama, the matchable satellite image at the GPS position was obtained from Google Maps at a zoom factor of 20. The image resolution of satellite images is 1200 × 1200 after removing the Google watermark, whereas the ground resolution for satellite images is 0.12 meters per pixel. Some image pair examples are provided in Figure 5.
To evaluate the performance, the dataset provides a validation set with 8884 image pairs named CVACT_val and a testing set with 92802 image pairs named CVACT_test. For the former, each query image only has one matching image in the gallery, while, for the latter, a query image may correspond to several true matched images in the set. Moreover, results are presented at R@1, R@5, R@10, and R@1%. Compared to other datasets, the main advantage of CVACT is that each image is provided with accurate GPS tags. Hence, metric location accuracy can be evaluated more straightforwardly.

4) VO&HAYS DATASET
The Vo&Hays dataset [41] contains a set of street-view panorama and overhead images from Google Maps collected in the USA. The data acquisition begins by randomly selecting street-view panorama images from Google Maps. Then, each panorama is cropped several times to reach a fixed size. The relevant overhead image at the finest scale is later fetched from Google Maps. This produces aligned pairs of street-view and overhead images, including geo-tags and depth estimates. The procedure has been repeated in 11 cities and produced more than one million pairs of images.
In the literature, whenever this dataset is used, 900k cross-view image pairs from 8 cities are chosen to train the network, while the remaining 3 cities (around 70k images per city) are used as 3 sets for testing. To measure the ranking performance on this dataset, R@1% has been chosen as the main metric.

5) UTIAS DATASET
The UTIAS dataset [29] consists of six traversals along a fixed path of 1132 m over areas with roads and buildings, as well as large areas of grass and trees. Each traversal contains 8992 images that capture the specific lighting conditions at different times of day: sunrise, morning, noon, afternoon, evening, and sunset. The dataset includes all the UAV-collected images, the UAV pose, and the corresponding satellite image. The longitude, latitude, and heading used for image registration were estimated by localizing the captured images in Google Earth's satellite images. The UAV data has been acquired using a DJI Matrice 600 Pro multirotor with a 3-axis DJI Ronin-MX gimbal. Stereo images are produced at 10 FPS via a StereoLabs ZED camera. The vehicle poses for ground truth are provided via the RTK-GPS system and IMU.
Figure 7 shows an example of the different lighting conditions. Since the UAV flies at a 40 m altitude with a steady heading, the camera is directed at the nadir. There is an unknown offset between the RTK-GPS and the Google Earth frames, so 10% of the successful image registrations are used to align them. These registrations are then omitted from all error calculations.
The test protocol suggests evaluating the results in terms of the RMSE metric over the different lighting conditions.

6) AERIAL TEMPLATE MATCHING DATASET
The Aerial Template Matching dataset [28] consists of three orthomosaics generated from photos collected by a UAV, and it is freely downloadable on GitHub.2 The dataset contains data from 3 different areas with 3 different terrains, as shown in Figure 8. The first area, NUST Islamabad (Figure 8a), contains 1200 images and presents complex terrain patterns with buildings and water bodies but without the density of an urban area.

1 https://www.flickr.com/
2 https://github.com/m-hamza-mughal/aerial-template-matching-dataset


FIGURE 4. Examples of matched ground-level and aerial from the CVUSA dataset [45].

FIGURE 5. Ground-to-Aerial image pairs in CVACT [24].

The second area, DHA Rawalpindi (Figure 8b), comprises 480 images and presents a sparsely populated residential area with water bodies and greenery. The third, Gujar Khan district (Figure 8c), is a densely populated urban area and contains 372 images. The images were collected by a real UAV, a DJI Phantom 4 Pro. The acquisitions have been performed during three different periods of the day to maximize variance in illumination conditions. Exploiting overlapping regions, the images corresponding to each area were stitched to form the three orthomosaics. To automatically localize the points, a maximum of 16 point-to-point correspondences are linked between each image and orthomosaic. Since the images had GPS coordinates, every pixel of the corresponding orthomosaic has the correct geo-tags.
Even if the dataset only contains data from three locations, its main strength is that the data comes from a real UAV. The authors recommend evaluating the dataset with matching accuracy, which is computed as the proportion of correctly aligned image points, validated through a 90% overlap between predicted and labeled bounding boxes. They also suggest assessing GPS localization errors by measuring the deviation of the localized region's center pixel from the actual GPS coordinates, capturing both average and maximum errors to highlight positioning precision. For intersensorial registration, the evaluation focuses on the ability to accurately match UAV images to satellite-generated orthomosaics.

7) NJ DATASET
The NJ dataset [16] is a collection of images taken for the GPS-Denied UAV Localization task. The data is gathered from the imagery available on the United States Geological Survey Earth Explorer. The authors gather data in New Jersey, USA, from a geographical area of 5.9 × 7.5 km. Figure 9 shows some representative images from the dataset. The researchers chose the location since it presents a mix of urban, suburban, and rural content, capturing low and high textures. The imagery was gathered during 2006, 2008, 2010, 2013, 2015, and 2017, across spring, summer, and fall. There are ten large images, each of 7582 × 5946 pixels at a resolution of 1 meter per pixel. The authors suggest evaluating the dataset using several metrics, including Average Localization Error, which measures the distance error between the UAV's estimated position and its true position; Corner Error, which assesses the alignment accuracy of learned features by calculating the percent of image width error; and 2D Euclidean Error and Altitude Error, which quantify horizontal and vertical distance errors, respectively.

8) GRAL DATASET
The GRAL dataset [27] contains over 550000 location-coupled pairs of ground RGB and depth images collected from aerial LIDAR point clouds. Although the primary purpose of this dataset is to evaluate cross-modal localization, it also allows for evaluating matching under challenging cross-view settings. Figure 10 shows some reference pictures in the dataset.
The data has been collected near Princeton (New Jersey, USA) in an area of 143 km² that exhibits various urban, suburban, and rural topographical features. The dataset contains multiple sceneries, including forests, mountains, open fields, highways, downtown areas, buildings, and roads. The data were collected in two phases to ensure that each ground RGB image is paired with a single distinct depth image from the aerial LIDAR. First, geolocalized ground RGB images of the selected area were created by densely sampling 60000 GPS locations from Google Street View. It is worth pointing out that Google Street View only allows for capturing street images. As a consequence, RGB images are unavailable for many places in the selected area. Each RGB image is 640 × 480, with a horizontal field of view of 60° and a slope of 0°.


FIGURE 6. An example of images from [41]: Miami city panorama images (left) and the corresponding produced street-view and overhead pairs
(right).

FIGURE 7. An example image from each of the 6 lighting conditions in [29] and a corresponding image rendered from Google Earth is shown.
The shadows in the Google Earth images closely resemble those in the morning lighting condition. The shadows in the afternoon and evening
appear on the opposite side of objects compared to Google Earth ones.

In the second phase, a LIDAR scan of the site from USGS is used to create a Digital Elevation Model (DEM). From the DEM, location-linked LIDAR depth images are collected for each street-view image. All LIDAR depth images contain RGB imagery from 1.7 m above the ground, and the final data collection includes 12 headings (from 0° to 360° at 30° intervals) for each location. A Digital Surface Model (DSM) is used to correct elevations and increase the quality. Depth images with no height correction, no corresponding RGB image, or more than 60% black pixels were removed. Nevertheless, some spatial misalignment between RGB and LIDAR depth images is still possible due to inadequate instrument accuracy (GPS and IMU) and calibration issues.
To evaluate the results on this dataset, the authors suggest using 20% of the area as validation images, 10% as test images, and the rest for training. In total, the collected dataset contains 557627 site-coupled pairs, with 417998 for training, 89787 for validation, and 49842 for testing.

IV. UAV GEO-LOCALIZATION STRATEGIES
This section presents some relevant literature proposals dealing with Geo-Localization.

A. GEOGRAPHIC SEMANTIC NETWORK FOR CROSS-VIEW IMAGE GEO-LOCALIZATION
The proposal in [52], an updated version of the proposal discussed in [37], presents a novel end-to-end network architecture named GeoNet to tackle the cross-view image Geo-Localization problem.
The proposed GeoNet architecture, shown in Figure 11, consists of a two-branch Siamese network taking a pair of cross-view images as input. Each network branch entails a ResNetX module and a GeoCaps module. The ResNetX module ensures stable gradient propagation in deep Convolutional Neural Networks (CNNs) and learns robust intermediate feature maps, which are then input to the GeoCaps module. This latter module consists of two layers of capsules, the PrimaryCaps and GeoCaps, in which the images are decomposed into vectors encapsulating information such as the existence probability of a specific scene and spatial hierarchy details such as color and position. The PrimaryCaps layer stacks multiple conventional convolutional layers into capsules and transmits the resulting output features to the GeoCaps layer through a dynamic routing algorithm.
The work proposes two versions of GeoNet. Specifically, GeoNet-I contains two network branches with different model weights, and GeoNet-II contains two capsule branches with the same model weights. These two networks have been evaluated on several public cross-view datasets, namely CVUSA, CVACT, and Vo&Hays, respectively described in Section III-B2, Section III-B3, and Section III-B4. GeoNet-II achieved better performance, with a R@1%, R@1, R@5, and R@10 of 98.7%, 58.9%, 81.8%, and 88.3%, respectively, on CVUSA. On CVACT and Vo&Hays, instead, GeoNet-II reached a R@1% of 95.8% and 76.8%, respectively.
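Most methods in this section share the same retrieval skeleton: two view-specific encoders map images into a common embedding space, and the geo-referenced gallery is ranked by similarity to the query. The following PyTorch sketch shows only that skeleton; the tiny convolutional encoder, the 128-dimensional embeddings, and the cosine ranking are our illustrative choices and do not reproduce GeoNet's ResNetX/GeoCaps modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedder(nn.Module):
    """Generic two-branch ('Siamese') embedder: one encoder per view. Real
    systems plug in ResNet/VGG/Transformer backbones and richer aggregation
    (capsules, NetVLAD, attention); this is only the skeleton."""
    def __init__(self, dim=128):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.uav_branch = encoder()        # oblique UAV / ground view
        self.sat_branch = encoder()        # vertical satellite view

    def forward(self, uav_img, sat_img):
        u = F.normalize(self.uav_branch(uav_img), dim=1)
        s = F.normalize(self.sat_branch(sat_img), dim=1)
        return u, s

@torch.no_grad()
def rank_gallery(model, query_uav, gallery_sat):
    """Retrieval step shared by most Section IV methods: embed query and
    geo-referenced gallery, then sort the gallery by cosine similarity."""
    q, g = model(query_uav, gallery_sat)
    similarity = q @ g.t()                 # (num_queries, gallery_size)
    return similarity.argsort(dim=1, descending=True)

model = TwoBranchEmbedder()
order = rank_gallery(model, torch.randn(2, 3, 64, 64), torch.randn(5, 3, 64, 64))
print(order.shape)  # torch.Size([2, 5])
```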


FIGURE 8. Geo-tagged orthomosaics of three different areas in [28]. (a) A geographical patch from NUST, Islamabad, covering about
0.52km2 area. (b) The area in DHA, Rawalpindi, consisting of sparsely populated terrain covering up to 0.64km2 area. (c) The densely
populated urban area of Gujar Khan District in Pakistan, which has an area of 0.66km2 .

FIGURE 9. Some examples from NJ Dataset [16]. The source image patches (top) and template images (bottom) are taken from separate large
orthographic satellite images. The dataset includes many images from both urban and low-texture rural areas.

B. CROSS-VIEW MATCHING NETWORK FOR IMAGE-BASED GROUND-TO-AERIAL GEO-LOCALIZATION
The proposal in [20] tackles the Geo-Localization problem by introducing two variants of a Siamese-based network structure named CVM-Net (-I and -II), which extract local features from cross-view images to create global descriptors that are invariant to significant changes in viewpoint.
The proposed CVM-Net architectures, shown in Figure 12, consist of two network branches of the same architecture that are designed to receive ground-level and satellite images separately. In both architectures, each branch deals with local feature extraction and global descriptor generation. The local features are extracted from an input image using a VGG16 model [36] and then fed into a NetVLAD [4] layer to generate global descriptors by aggregating local features to their respective cluster centroids. After training, each centroid generated from the satellite view is linked to the unique centroid of the ground view for cross-view matching. As mentioned, the proposal offers two versions of CVM-Net: CVM-Net-I and CVM-Net-II. The former employs an independent NetVLAD layer for each network branch to generate the respective global descriptors of a satellite and ground image. In the latter, the extracted local features go through two Fully Connected (FC) layers, where the first has independent weights and the second has shared weights for both network branches. Then, these transformed features go through NetVLAD layers with shared weights to obtain the descriptors. Finally, the authors employ a weighted soft-margin ranking loss as the objective function to train both networks.
The authors evaluated both networks on the CVUSA dataset, described in Section III-B2, and attained R@1% scores of 91.4% and 87.2%, respectively.

C. HYBRID PERSPECTIVE MAPPING: ALIGN METHOD FOR CROSS-VIEW IMAGE-BASED GEO-LOCALIZATION
The proposal in [42] presents a new method to address the problem of cross-view image-based Geo-Localization by developing a hybrid perspective mapping algorithm that aligns a ground-level image with an aerial image by considering the projection relationship between them.
In the proposed approach, perspective mapping and polar transform are applied respectively to a ground-level panoramic image's covisibility and non-covisibility areas to obtain a bird's-eye view that matches a satellite image.
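A simplified NumPy sketch of the polar transform underlying this kind of cross-view alignment is shown below; the output size, the nearest-neighbour sampling, and the exact orientation convention are our assumptions, so it should be read as an illustration of the idea rather than the mapping used in [42].

```python
import numpy as np

def polar_transform(aerial, out_h=128, out_w=512):
    """Resample a square overhead image around its centre so that rows
    correspond to radial distance and columns to azimuth, giving a layout
    comparable to a ground-level panorama. Nearest-neighbour sampling keeps
    the sketch short; output size and orientation are illustrative choices."""
    h, w = aerial.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cy, cx)
    rows = np.arange(out_h).reshape(-1, 1)          # 0 (far) .. out_h-1 (near centre)
    cols = np.arange(out_w).reshape(1, -1)
    r = radius * (out_h - rows) / out_h             # radial distance
    theta = 2.0 * np.pi * cols / out_w              # azimuth angle
    src_y = np.clip(np.round(cy - r * np.cos(theta)).astype(int), 0, h - 1)
    src_x = np.clip(np.round(cx + r * np.sin(theta)).astype(int), 0, w - 1)
    return aerial[src_y, src_x]

panorama_like = polar_transform(np.random.rand(256, 256, 3))
print(panorama_like.shape)  # (128, 512, 3)
```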


FIGURE 10. Example pairs of ground RGB images and aerial LIDAR depth images from the GRAL Dataset [27]. RGB images are collected from
Google Street View, and depth images are collected by rendering aerial LIDAR point clouds from USGS.

FIGURE 11. The GeoNet architecture proposed in [52].

In this context, the covisibility area in an image represents structures, such as streets and pavements, that appear concurrently in ground and satellite views. In contrast, the non-covisibility area contains vertical structures, such as building roofs, whose exteriors can be seen only in one of the cross-view images. The polar transform allows modifying the distance of pixels in the non-covisibility area to the bird's-eye image center. Hence, learning semantic relations is facilitated, as the resemblance of the vertical structures in the bird's-eye image is maintained.
The proposal exploits a Siamese network model with a structure similar to CVM-Net [20], as shown in Figure 13. VGG-16 [36] extracts local features from an image, which are aggregated through a NetVLAD layer [4] to generate a global descriptor. The global descriptors of the cross-view image pairs then pass through a FC layer to be mapped into the same feature space. A weighted soft-margin triplet loss is utilized to compare the cross-view image pairs in this space.
The model has been evaluated on CVUSA and CVACT, described in Section III-B2 and Section III-B3. It achieved a R@1, R@5, R@10, and R@1% of 34.87%, 55.81%, 67.29%, and 89.76%, respectively, on CVUSA, and of 2.06%, 5.31%, 8.49%, and 33.43% on CVACT.

D. SMDT: CROSS-VIEW GEO-LOCALIZATION WITH IMAGE ALIGNMENT AND TRANSFORMER
The proposal in [39] introduces a cross-view matching method for Geo-Localization comprised of image alignment and a Transformer.
The SMDT framework comprises four modules: Semantic segmentation, Mixed perspective-polar mapping, Dual Conditional Generative Adversarial Network (CGAN), and Transformer. The segmentation technique, first proposed in [51], splits the content of ground images into covisibility and non-covisibility areas based on classes (such as sky, tree, building, road, sidewalk, and car) and generates class-specific masks. Since aerial images do not contain skies and cars for matching, the obtained masks are used to create augmented samples that do not consider the masks of these two classes. In the second module, a VGG-16 [36] model, consisting of 13 convolutional layers and a NetVLAD [4] with 64 clusters, is utilized to implement polar and perspective mapping on aerial images. The third module introduces a dual CGAN structure based on the proposal in [21]. Based on the segmented images, it synthesizes aerial images with a ground-view style. The network architecture of the dual CGAN is shown in Figure 14. Finally, the fourth module adopts the Visual Transformer network (ViT) proposed in [22], as shown in Figure 15, to minimize the uncertainties caused by geometry misalignments.
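Both CVM-Net [20] (Section IV-B) and the hybrid mapping approach of [42] (Section IV-C) train their Siamese branches with a weighted soft-margin ranking/triplet loss. A minimal PyTorch sketch is given below, assuming the commonly used log(1 + exp(alpha * (d_pos - d_neg))) formulation with squared Euclidean distances between L2-normalized embeddings; alpha = 10 is an illustrative value, not necessarily the one used by the surveyed works.

```python
import torch
import torch.nn.functional as F

def weighted_soft_margin_triplet(anchor, positive, negative, alpha=10.0):
    """Weighted soft-margin ranking/triplet loss in its commonly used form
    log(1 + exp(alpha * (d_pos - d_neg))); alpha softly scales the margin.
    Squared Euclidean distances between L2-normalised embeddings are assumed."""
    anchor, positive, negative = (F.normalize(x, dim=1) for x in (anchor, positive, negative))
    d_pos = ((anchor - positive) ** 2).sum(dim=1)
    d_neg = ((anchor - negative) ** 2).sum(dim=1)
    return torch.log1p(torch.exp(alpha * (d_pos - d_neg))).mean()

loss = weighted_soft_margin_triplet(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
print(float(loss))
```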


FIGURE 12. Overview of CVM-Nets proposed in [20].

FIGURE 13. The Siamese architecture proposed in [42]. After the mapping
with the hybrid perspective method, the ground-level image is given as
input for the network. VGG16 extracts local features, and NetVLAD
aggregates global features. The two branches do not share any weight.

FIGURE 14. Network architecture of the dual CGAN proposed in [39].

The authors evaluated the network on CVUSA and CVACT, described in Section III-B2 and Section III-B3. The proposed network achieved a R@1, R@5, R@10, and R@1% of 95.06%, 98.97%, 99.25%, and 99.87%, respectively, on CVUSA, and of 85.52%, 94.97%, 96.28%, and 98.96% on CVACT.

E. JOINT REPRESENTATION LEARNING AND KEYPOINT DETECTION FOR CROSS-VIEW GEO-LOCALIZATION
The proposal in [23] introduces RK-Net to learn Representation and detect Keypoints in a single network to tackle the cross-view Geo-Localization problem without requiring extra annotations.
The proposed RK-Net framework, shown in Figure 16, introduces a novel Unit Subtraction Attention Module (USAM). The main function of the USAM is to extract the keypoints from the feature map by Unit Subtraction Convolution (USC). USC replaces the element-wise matrix multiplication in traditional convolution conducted on feature maps with subtraction. The backbone of the proposed network includes a ResNet-50 [18] with weights pre-trained on ImageNet [33], which contains 5 stages. As shown in Figure 16, the USAM is inserted behind stage 1 and stage 2. In addition, the original classifier of ResNet-50 is replaced by a FC layer, a batch normalization layer (BN), and a classification layer (CL). Finally, Instance Loss [49] is utilized to train the model.
The model has been evaluated on University-1652, CVUSA, and CVACT, respectively described in Section III-B1, Section III-B2, and Section III-B3. In this context, USAM was added to other state-of-the-art methods, specifically Local Pattern Network (LPN) [43] and SAFA [35], achieving improved results. On University-1652, by using the USAM, the R@1 of SAFA jumps from 68.27% to 70.89%, and the one of LPN from 74.18% to 77.07%. The AP increases from 72.06% to 74.56% for the former method and from 77.39% to 80.09% for the latter. Similar results have been achieved on CVUSA and CVACT, where SAFA increases the R@1 from 89.84% to 90.16% on the former and from 81.03% to 82.40% on the latter. LPN also increases its performance, moving from 85.79% to 90.16% on CVUSA and from 79.99% to 82.02% on CVACT.


FIGURE 15. Structure of the transformer framework adopted in [39].

FIGURE 16. Overview of the proposed RK-Net framework in [23].

F. UNIVERSITY-1652: A MULTI-VIEW MULTI-SOURCE BENCHMARK FOR DRONE-BASED GEO-LOCALIZATION
The proposal in [49] presents the University-1652 dataset, introduced in Section III-B1. It also proposes a method for the Geo-Localization task as the first benchmark for this dataset.
The proposed method utilizes a two-branch CNN architecture, as shown in Figure 17, to learn the relationships between different views and minimize the differences. Each location is treated as a separate class for the classifier, since the dataset provides multiple images for each of them. The idea is to encourage the network to create a shared feature space by sharing images from various sources. To this aim, the chosen networks are two ResNet models [18] pre-trained on ImageNet. However, the classification layer is replaced with a 512-dimension FC layer followed by a classification layer after the pooling layer. The network exploits Instance Loss [50], initially developed for image-language bi-directional retrieval, to train the baseline model. It is defined as follows:

p_s = softmax(W_share × F_s(x_s))
L_s = −log(p_s(c))
p_d = softmax(W_share × F_d(x_d))
L_d = −log(p_d(c))

where x_s and x_d are two images of the location c, respectively from the satellite view and the drone view, W_share is the weight of the last classification layer, and p(c) is the predicted probability of the correct class c. Unlike the conventional classification loss, the shared weight W_share provides a soft constraint on the high-level features. After optimizing the models, the different feature spaces are aligned with the classification space. The cosine distance is used to determine the similarity between a query image and the candidate images in the gallery.
The system is tested to assess the localization quality for both the Satellite → UAV and UAV → Satellite scenarios. In a single-query setting, the model achieves an R@1, R@10, and AP of 74.47%, 83.88%, and 59.45% for the Satellite → UAV scenario, and of 58.23%, 84.52%, and 62.91%, respectively, for the UAV → Satellite scenario.
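A minimal PyTorch sketch of the shared-classifier instance loss defined by the equations above is shown below. The 2048-dimensional backbone outputs, the 512-dimensional FC layer, and the 701 classes mirror the description in the text, but the code is our illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedClassifierInstanceLoss(nn.Module):
    """Each branch has its own feature extractor (F_s, F_d), while the last
    classification layer W_share is shared, softly aligning the two feature
    spaces. Feature size and number of location classes are illustrative."""
    def __init__(self, feat_dim=512, num_locations=701):
        super().__init__()
        self.f_sat = nn.Linear(2048, feat_dim)     # stand-ins for the two ResNet branches
        self.f_drone = nn.Linear(2048, feat_dim)
        self.w_share = nn.Linear(feat_dim, num_locations, bias=False)

    def forward(self, x_sat, x_drone, labels):
        p_s = self.w_share(self.f_sat(x_sat))      # logits; softmax is inside cross_entropy
        p_d = self.w_share(self.f_drone(x_drone))
        loss_s = F.cross_entropy(p_s, labels)      # L_s = -log p_s(c)
        loss_d = F.cross_entropy(p_d, labels)      # L_d = -log p_d(c)
        return loss_s + loss_d

criterion = SharedClassifierInstanceLoss()
loss = criterion(torch.randn(8, 2048), torch.randn(8, 2048), torch.randint(0, 701, (8,)))
print(float(loss))
```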


FIGURE 17. Overview of the approach proposed in "University-1652: A Multi-view Multi-source Benchmark".

G. AERIAL-SATELLITE IMAGE MATCHING FRAMEWORK FOR UAV ABSOLUTE VISUAL LOCALIZATION USING CONTRASTIVE LEARNING
The proposal in [1] presents an aerial-satellite imagery framework for UAV visual Geo-Localization.
The strategy proposes a CNN-based Siamese Neural Network (SNN) with a contrastive learning approach. The framework comprises three modules: an image retrieval module, an image matching module, and a feature extractor incorporating an SNN built on top of a CNN. By utilizing aerial and satellite image patches of the surrounding area, the framework can determine the global coordinates of the UAV. Figure 18 shows an overview of the proposed system.
The design incorporates a contrastive learning approach to establish a dependable visual representation for two primary objectives: image matching and image similarity measurement for retrieval. The SNN is trained using a triplet loss strategy. This loss function aims to decrease the distance between positive samples (P) and anchor samples (A) while simultaneously increasing the distance between negative samples (N) and anchor samples. It can be mathematically formulated as follows:

L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0)

where α is a margin between positive and negative pairs, set to 1, and f is an embedding. Image pairs with a mutual area are selected as positive samples to train the network with the triplet loss. Additionally, two images with mutual areas are chosen, and the image with the smaller mutual area is identified as negative. The CNN and image retrieval module train low-level features and image similarity metrics through this process. The FC layers are then trained for image matching by a supervised learning technique that predicts the center pixel in each patch while keeping other weights frozen. Finally, the image retrieval module aims to predict the candidate patch with the most mutual area, and the image matching module identifies the center pixel from satellite images between the aerial and candidate patches.
The system is tested on a private dataset created using Google Earth, which includes 1300 images with a resolution of 720 × 720 and covers a square area of 200 m × 200 m. The dataset is augmented through color jittering (HSL) to simulate various visual representations influenced by external conditions, such as weather, light conditions, and seasonal changes. The presented results evaluate the accuracy of the UAV → Satellite Geo-Localization and image alignment in a single-query setting, indicating an RMSE of 36.4 m and a Precision@3 of 67.00%.

H. A SEMANTIC GUIDANCE AND TRANSFORMER-BASED MATCHING METHOD FOR UAVS AND SATELLITE IMAGES FOR UAV GEO-LOCALIZATION
The proposal in [53] outlines a Geo-Localization approach that relies on cross-view image matching. The method matches vertically captured satellite images with geo-referenced information against front-facing perspective photos.
The proposed approach employs two Transformers with shared weights as its backbone and a new Semantic Guidance Module (SGM), depicted in Figure 19. The Swin-Tiny [25], consisting of stacked self-attention layers, serves as the backbone network for drone and satellite imagery. The base network is enhanced by adding the SGM module, which divides the backbone's outputs into two types of features, as depicted in Figure 20. The classification process begins with the average pooling outputs from satellite and drone views before being combined with a final FC layer. After the backbone, the SGM module operates as follows. Let the input feature map of the SGM be denoted as M_i^j, with a size of 64 × 768. SGM sums M_i^j along the channel direction, which is expressed as:

M_i = Σ_{j=0}^{768} M_i^j,   i ∈ [0, 63]

After the above operations, the size of M_i becomes 64 × 1. A normalization operation is then performed on M_i as follows:

M_i = (M_i − Minimum(M_i)) / (Maximum(M_i) − Minimum(M_i))

The Maximum and Minimum operations obtain the maximum and minimum values in M_i, respectively. Based on these results, SGM computes the gradient between adjacent positions and divides the feature map into various regions. The entire process can be expressed as follows:

i_position = argmax((M_{i+1} − M_i) / M_i)

Then, the feature map is partitioned into the foreground (architecture) and the background (environment), serving as the foundation for extracting contextual information.
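The three SGM equations above can be sketched as follows for a single 64 × 768 token map; how the resulting split index is turned into foreground and background regions is not fully specified in the text, so the final thresholding step is our own interpretation.

```python
import torch

def sgm_split(feature_map):
    """Channel-wise sum, min-max normalisation, and the arg-max of the relative
    change between adjacent positions, following the equations in the text.
    The last step (turning the index into a foreground mask) is a simplification."""
    m = feature_map.sum(dim=1)                                # M_i, shape (64,)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)            # min-max normalisation
    gradient = (m[1:] - m[:-1]) / (m[:-1] + 1e-8)             # (M_{i+1} - M_i) / M_i
    i_position = int(torch.argmax(gradient))                  # position of the largest jump
    foreground = m >= m[i_position + 1]                       # simplified region split
    return i_position, foreground

i_pos, mask = sgm_split(torch.rand(64, 768))
print(i_pos, int(mask.sum()), "tokens marked as foreground")
```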


FIGURE 18. The overview of the approach proposed in [1].

FIGURE 19. Overview of the SGM module proposed in [53]. Numbers 1 and 2 represent the two types of extracted features.

The proposed approach was evaluated using the University-1652 dataset, described in Section III-B1. In single-query mode, the system achieved R@1 and AP scores of 88.16% and 81.81% for the Satellite → UAV scenario, and of 82.14% and 84.72% for the UAV → Satellite scenario. In multi-query mode, the system achieved R@1 and AP scores of 89.23% and 90.95% in the UAV → Satellite scenario. Additionally, the paper proposed a larger and slower model based on Swin-large. In single-query mode, this model achieved scores of 90.16% and 85.28% for the Satellite → UAV scenario, and of 85.44% and 87.60% for the UAV → Satellite scenario.

I. A TRANSFORMER-BASED FEATURE SEGMENTATION AND REGION ALIGNMENT METHOD FOR UAV-VIEW GEO-LOCALIZATION
The proposal in [14] presents a novel approach for Geo-Localization, called Feature Segmentation and Region Alignment (FSRA), based on Vision Transformers.
The proposed strategy consists of feature extraction and supervised classification learning. The feature extraction part is based on the Vision Transformer architecture and is shown in Figure 22. It takes an image x ∈ R^(H×W×C) as input, where W, H, and C are its width, height, and channels. In the first step, x is split into N fixed-size patches {x_p^i | i = 1, 2, ..., N} and flattened. An extra learnable embedding token x_cls is merged into the spatial information to extract robust features. The output [cls_token] represents a global feature representation f. Next, position information is inserted into each patch through a learnable position embedding. This creates an input sequence expressed as:

Z_0 = [x_cls; F(x_p^1); F(x_p^2); ...; F(x_p^N)] + P

where Z_0 indicates the input sequence embeddings, F is a linear projection that maps the patches to D dimensions, and P ∈ R^((N+1)×D) is the position embedding. By incorporating an attention mechanism, each Transformer layer can analyze the global context, which surpasses the restriction of the receptive field of the CNN. This mechanism eliminates the need for down-sampling operations. The work introduces the novel FSRA, which aims to segment at the patch level and align features at the region level, even in scenarios where images have position deviations or scale changes. The FSRA comprises two main components: the Heatmap Segmentation Module (HSM) and the Heatmap Alignment Branch (HAB). The HSM splits the feature map into blocks based on heat distribution, assigning them numbers from 1 to n to achieve instance segmentation at the patch level. The HAB uses the HSM's output to cut out parts corresponding to different views, calculate the loss, and allow the network to learn heat distribution rules. As shown in Figure 21, which provides an overview of the proposed architecture, a Triplet Loss is used for images between different viewpoints to narrow the perspective gap.
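A minimal PyTorch sketch of the patch-embedding step expressed by the Z_0 equation is given below; the 224 × 224 input, 16 × 16 patches, and D = 768 follow common ViT defaults and are illustrative rather than the exact FSRA configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Builds Z_0 = [x_cls; F(x_p^1); ...; F(x_p^N)] + P: split the image into
    fixed-size patches, flatten them, project them to D dimensions with a
    linear map F, prepend the learnable [cls] token, and add learnable
    position embeddings P."""
    def __init__(self, img_size=224, patch=16, channels=3, dim=768):
        super().__init__()
        self.patch = patch
        num_patches = (img_size // patch) ** 2
        self.proj = nn.Linear(patch * patch * channels, dim)                  # F
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # x_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # P

    def forward(self, x):                                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.patch
        patches = x.unfold(2, p, p).unfold(3, p, p)                           # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.proj(patches)                                           # (B, N, D)
        cls = self.cls_token.expand(b, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed               # Z_0: (B, N+1, D)

z0 = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(z0.shape)  # torch.Size([2, 197, 768])
```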


FIGURE 20. Overview of the architecture proposed in [53]. It comprises various components, including Layer Normalization
(LN), Multi-head Self-Attention modules with regular (W-MSA), Multi-head Self-Attention modules with Shifted Windowing
configurations (SW-MSA), and Multi-Layer Perceptron (MLP).

The proposed approach has been tested on the University-1652 dataset, described in Section III-B1, in both the Satellite → UAV and UAV → Satellite contexts. In a single-query setting, the system achieves an R@1 and an AP of 89.73% and 84.94% for the Satellite → UAV scenario, and of 85.50% and 87.53% for the UAV → Satellite scenario. It also compares the Triplet Loss with other loss functions to demonstrate the resulting performance gain.

J. CROSS-VIEW IMAGE MATCHING FOR GEO-LOCALIZATION IN URBAN ENVIRONMENTS
The methodology outlined in [40] uses deep learning techniques to introduce a cross-view image-matching framework tailored for Geo-Localization purposes. Its objective is to autonomously identify, characterize, and align semantic content within cross-view images.
According to the pipeline depicted in Figure 23, rather than relying on the matching of local features, the study employs a cross-view matching strategy centered around buildings, which present a higher semantic significance and robustness to variations in viewpoint. Initially, the presence of buildings in both query and reference images is determined utilizing the Faster R-CNN [31] algorithm. Subsequently, a Siamese network [13] is employed to acquire deep feature representations to discern between matched and unmatched pairs of buildings within cross-view images. The idea consists of creating a mapping capable of associating buildings from distinct perspectives into a feature space where matched pairs are closer while unmatched pairs are distant. During the training phase for feature representation, the goal is to minimize the Euclidean distance between matched pairs in the feature space, ideally approaching zero, while maximizing the distance between unmatched pairs. During the testing phase, the k nearest neighbors are determined from the reference images based on Euclidean matching scores for each identified building in the query image. Subsequently, an undirected, edge-weighted graph G = (V, E) is built without self-loops, encompassing all selected reference buildings and their nearest neighbors for each query building, forming clusters as depicted in Figure 24. Each chosen reference building is denoted as a node, with edges connecting pairs of nodes not within the same cluster. Simultaneously, weights are assigned to each edge, reflecting the similarity between linked node pairs. Each edge weight a_ij, which measures the similarity between reference buildings i and j, is represented as:

a_ij = e^(−d_ij² / (2σ²)) + α(s_i + s_j)

where d_ij² is the squared distance between the GPS locations of i and j in Cartesian coordinates, and s_i is the similarity between the query building and reference building i based on their building matching score. Geo-Localization aims to choose a maximum of one reference building from each cluster to maximize the overall weight. Dominant sets are employed to address this challenge. For a non-empty subset S ⊆ V, i ∈ S, and j ∉ S, the total weight of S is expressed as:

W(S) = Σ_{i∈S} W_S(i)

where W_S(i) is a weight defined recursively and assigned to each node i ∈ S. Then, the replicator dynamics algorithm is used to select a Dominant Set, which is used as the basis for deriving the final Geo-Localization. The latter is obtained by computing the mean GPS location of the reference buildings that have been selected within the Dominant Set.
The strategy has been tested with the dataset introduced in [47], comprising four pairs of street-view and bird's-eye-view images per GPS location in the vicinity of downtown Pittsburgh, Orlando, and a section of Manhattan, sourced from the Google Street View dataset. Annotations were applied to identify the corresponding buildings, facilitating the training of a deep network for building matching purposes. Precision-recall curves are depicted for test image pairs to evaluate building matching efficacy. The fine-tuned model exhibits an average precision of 32%, a notable improvement over the 11% achieved by the pre-trained model.

K. SSA-NET: SPATIAL SCALE ATTENTION NETWORK FOR IMAGE-BASED GEO-LOCALIZATION
The work in [48] proposes an approach for Geo-Localization utilizing spatial layouts. This solution is versatile and able to deal with both cross-view scenarios, such as UAV and Satellite


FIGURE 21. The framework of proposed FSRA in [14]. The heatmap segmentation module (light green) rearranges and
evenly distributes the heatmap data based on their distribution to segment distinct content features. The heatmap
alignment branch (light blue) extracts feature vectors from each segmented region and conducts classification
supervision for each vector. The Triplet Loss is utilized in each branch to reduce the distance between similar feature
content and enable end-to-end learning. Moreover, the system also incorporates a global branch (light purple) based
on the transformer.

FIGURE 22. The transformer-based strong baseline framework presented in [14]. The [cls_token] output marked with ∗ is
utilized as the global feature f . The Classifier Layer consists of a linear layer, ReLU activation function, batch
normalization, and dropout. The ID Loss refers to the CrossEntropy loss, which does not incorporate label-smoothing.
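The classifier head and ID Loss named in the caption of Figure 22 can be written down in a few lines. The sketch below only mirrors the listed components (linear layer, ReLU, batch normalization, dropout, and a cross-entropy ID Loss without label smoothing); the feature dimension, bottleneck size, and number of classes are illustrative assumptions, not values taken from [14].

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Sketch of the classifier layer in Figure 22: linear layer, ReLU,
    batch normalization, and dropout, followed by a class projection."""

    def __init__(self, in_dim=768, bottleneck=512, num_classes=701, p_drop=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(bottleneck),
            nn.Dropout(p_drop),
        )
        self.classifier = nn.Linear(bottleneck, num_classes)

    def forward(self, global_feature):          # f = the [cls_token] output
        return self.classifier(self.block(global_feature))

id_loss = nn.CrossEntropyLoss()                 # ID Loss, no label smoothing
logits = ClassifierHead()(torch.randn(8, 768))
loss = id_loss(logits, torch.randint(0, 701, (8,)))
```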

This solution is versatile and able to deal with both cross-view scenarios, such as UAV and Satellite views, and cross-modality scenarios, e.g., images from an RGB camera and a LiDAR sensor.
To achieve this goal, the strategy introduces a novel deep network, depicted in Figure 25, designed to encapsulate the spatial configuration of scenes within the feature representation. The network architecture is built upon a dual-branch Siamese framework aimed at acquiring a unified feature representation from pairs of images. Unlike the conventional Siamese network, its branches do not share weights due to the disparate nature of the input data sources. Each branch of the model consists of a VGG-16 CNN, with the classifier layers removed. The output of the final layer is then passed through the innovative Spatial-Scale Attention (SSA) module. This module is specifically engineered to integrate spatial layout information into the feature representation, enhancing image-matching capabilities. In detail, the module captures the relative positional dynamics among significant object features by autonomously pinpointing prominent correspondences while minimizing the impact of irrelevant ones. This is achieved by relying only on a self-attentive mechanism. Initially, the module reduces feature dimensionality through 1×1 convolutions, followed by a max-pooling operator to identify the most pertinent features. Subsequently, it utilizes a multi-scale spatial layout importance generator to establish a position embedding map, thus ensuring that object features across various scales receive tailored levels of attention.


FIGURE 23. The pipeline proposed in [40].

FIGURE 24. An example of Geo-Localization using the dominant set proposed in [40].
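To complement Figure 24, the following is a minimal NumPy sketch of the graph-based selection used in Section J: pairwise affinities built from GPS distances and matching scores, followed by replicator dynamics to extract a dominant set. The values of σ, α, and the support threshold are illustrative assumptions, and the function names are not taken from [40].

```python
import numpy as np

def edge_weights(dists, sims, sigma=50.0, alpha=0.5):
    """Affinities a_ij = exp(-d_ij^2 / (2 sigma^2)) + alpha * (s_i + s_j);
    dists holds pairwise GPS distances between reference buildings, sims the
    query-to-reference matching scores."""
    a = np.exp(-(dists ** 2) / (2.0 * sigma ** 2)) + alpha * (sims[:, None] + sims[None, :])
    np.fill_diagonal(a, 0.0)                    # graph without self-loops
    return a

def dominant_set(a, iters=1000, tol=1e-8):
    """Replicator dynamics x <- x * (A x) / (x^T A x); the support of the
    converged x approximates a dominant set of the affinity graph."""
    n = a.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        ax = a @ x
        new_x = x * ax / max(x @ ax, 1e-12)
        if np.abs(new_x - x).sum() < tol:
            x = new_x
            break
        x = new_x
    return np.where(x > 1.0 / (10 * n))[0]      # nodes with non-negligible support

dists = np.random.rand(6, 6) * 100.0
dists = (dists + dists.T) / 2.0                 # symmetric toy distances
selected = dominant_set(edge_weights(dists, np.random.rand(6)))
```

The mean GPS location of the selected nodes would then play the role of the final Geo-Localization estimate described above.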

Given an input feature map F ∈ R^(H×W×C), the spatial embedding map F′ is computed as F′ = M(F) ⊗ F, where:

M(F) = δ(f^(1×1)(f^(3×3)(S(F)), f^(5×5)(S(F)), f^(7×7)(S(F))))
S(F) = f_max^c(f^(1×1)(F))

with f_max^c representing the max pooling operator, f^(n×n) an n × n convolution, δ the sigmoid activation function, and ⊗ the element-wise multiplication. Figure 26 depicts the SSA module. Finally, a Weighted Soft-Margin Triplet Loss is used as the objective function, as for other Geo-Localization proposals [20], [30].
The model has been tested on three public benchmarks: CVUSA, CVACT, and GRAL, respectively described in Section III-B2, Section III-B3, and Section III-B8. Evaluation metrics include R@1%, R@1, R@5, and R@10. Across CVUSA, the model achieved remarkable results of 99.71%, 91.52%, 97.69%, and 98.57%, respectively. Similarly, on CVACT, it achieved 98.36%, 84.23%, 94.33%, and 95.95%. For the GRAL dataset, the strategy achieves 40.5%, 63.8%, and 71.5%, respectively, for R@1, R@5, and R@10. These findings confirm the considerable challenge of the cross-modality task compared to its cross-view counterpart. Additionally, the authors conducted an analysis of the positive impact of the SSA module, revealing a significant performance enhancement compared to the baseline network version, particularly evident in the R@1 metric, which increased from 65.74% to 91.52%.

FIGURE 25. The Siamese architecture proposed in [48].

L. UAV-SATELLITE VIEW SYNTHESIS FOR CROSS-VIEW GEO-LOCALIZATION
The proposal in [38] introduces an end-to-end Geo-Localization methodology employing cross-view image matching. Within this framework, vertically captured satellite images with geo-referencing information are matched with images captured from a frontal perspective, e.g., the UAVs' view.
The proposed methodology comprises two distinct modules, visually depicted in Figure 27, to achieve this aim. The first module focuses on cross-view synthesis and comprises two components. In the former, the oblique view captured by the UAV is processed by a perspective projection transformation (PPT) to be converted into a vertical view. This transformation aligns the UAV image with the real satellite image, establishing geometric spatial correspondence between the vertical and oblique perspectives. However, although the result of this process is acceptable, the PPT method does not consider the scene content, leading to noticeable appearance distortions in the transformed image. The second component addresses these issues and enhances image quality by using a CGAN. This network includes a Generator G, which takes the inverted PPT UAV image as input and produces a realistic vertical-view UAV image. Additionally, a Discriminator D evaluates the authenticity of generated and actual satellite images, discerning between true and false representations. The Generator adopts a U-Net architecture [32], while the Discriminator follows a PatchGAN classifier [21]. The second module of the architecture, namely the Geo-Localization module, integrates a modified version of the LPN [43], featuring a ResNet50 as its backbone. The LPN employs a square-ring partition strategy to capture complementary spatial features based on distance from the image center, thus enhancing cross-view Geo-Localization performance.
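The square-ring partition used by the LPN can be illustrated with a short sketch: the feature map is divided into concentric square rings around the center and each ring is pooled into its own descriptor. This is only a hedged approximation of the idea, with the number of rings and the pooling choice as assumptions rather than values from [38] or [43].

```python
import torch

def square_ring_pool(feat, num_rings=4):
    """Sketch of a square-ring partition in the spirit of the LPN: the feature
    map is split into concentric square rings around the image center and each
    ring is average-pooled into a separate descriptor."""
    b, c, h, w = feat.shape
    ys = torch.arange(h).view(-1, 1).expand(h, w)
    xs = torch.arange(w).view(1, -1).expand(h, w)
    # Chebyshev distance from the center, normalized to [0, 1): square rings.
    d = torch.maximum((ys - (h - 1) / 2).abs() / (h / 2),
                      (xs - (w - 1) / 2).abs() / (w / 2))
    ring_id = torch.clamp((d * num_rings).long(), max=num_rings - 1)
    parts = []
    for r in range(num_rings):
        mask = (ring_id == r).float().view(1, 1, h, w)
        pooled = (feat * mask).sum(dim=(2, 3)) / mask.sum().clamp(min=1.0)
        parts.append(pooled)                    # (b, c) descriptor per ring
    return torch.stack(parts, dim=1)            # (b, num_rings, c)

parts = square_ring_pool(torch.randn(2, 512, 16, 16))
```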


FIGURE 26. Spatial-Scale Attention (SSA) module proposed in [48].
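A minimal PyTorch sketch of the SSA computation reported above is given below. The channel sizes, the pooling stride, and the final resize back to the input resolution (needed so that the element-wise product with F is well defined) are implementation assumptions, not details taken from [48].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSAModule(nn.Module):
    """Sketch of the spatial-scale attention: S(F) = max-pool(1x1 conv(F)),
    M(F) = sigmoid(1x1 conv over multi-scale convolutions of S(F)),
    and F' = M(F) * F."""

    def __init__(self, channels=512, reduced=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)
        self.conv3 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(reduced, reduced, kernel_size=5, padding=2)
        self.conv7 = nn.Conv2d(reduced, reduced, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(3 * reduced, 1, kernel_size=1)

    def forward(self, feat):                                    # feat: (B, C, H, W)
        s = F.max_pool2d(self.reduce(feat), kernel_size=2)      # S(F)
        multi = torch.cat([self.conv3(s), self.conv5(s), self.conv7(s)], dim=1)
        m = torch.sigmoid(self.fuse(multi))                     # M(F)
        m = F.interpolate(m, size=feat.shape[2:], mode="bilinear",
                          align_corners=False)                  # back to input size
        return m * feat                                         # F' = M(F) (x) F

out = SSAModule()(torch.randn(2, 512, 32, 32))
```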

FIGURE 27. Graphical Abstract of the proposal in [38]. The first two columns represent the first
module, while the third column is the second one.

Figure 28 shows the overall architecture of the proposed strategy.
The proposed methodology has been tested utilizing the University-1652 dataset, described in Section III-B1. Evaluation occurred in both single-query and multi-query setups. In the former, the system processed one satellite image alongside one drone view, while in the latter, it handled a single satellite image along with multiple drone views captured from varying heights and angles. The obtained results validate the efficacy of localization in both Satellite → UAV and UAV → Satellite scenarios. In the single-query mode, the system achieved R@1, R@5, R@10, and AP scores of 83.27%, 90.32%, 95.52%, and 87.32%, respectively, in the first scenario, and 91.78%, 93.35%, 96.45%, and 82.18% in the second. In the multi-query mode, the system obtained R@1 and AP scores of 93.73% and 88.49% in the first scenario and 91.63% and 90.84% in the second. Additionally, the study presents intermediate results, demonstrating the effectiveness of each methodology component and showcasing the incremental improvements they achieved. Furthermore, it highlights the significant impact of increasing the number of input images on system performance across all scenarios. However, it is worth noting that the magnitude of improvement diminishes gradually after reaching a certain threshold.

M. ASSISTING UAV LOCALIZATION VIA DEEP CONTEXTUAL IMAGE MATCHING
The research presented in [28] focuses on aligning a pre-stored orthomosaic map with the front-view orientation of a UAV. It explores the possibility of utilizing onboard cameras and pre-stored geo-referenced imagery.
The proposed strategy involves an end-to-end trainable architecture conducting feature learning and template localization simultaneously. This is achieved by imposing probabilistic constraints on densely correlated feature maps of different dimensions. The proposed aerial image localization network is designed to learn feature points with a neighborhood consensus, enabling the refinement of matches between template images and the pre-stored orthomosaic. The initial step utilizes a ResNet-101 [18] to extract convolutional features.


FIGURE 28. Overall system architecture proposed in [38].

FIGURE 29. Overview of the system proposed in [28]. The template image IT and the orthomosaic IM serve as inputs for convolutional
feature extractor processes. These processes generate feature maps M and T, which are then used to calculate the correlation tensor using
Soft Mutual Nearest Neighbor Filtering. Subsequently, the correlation tensor is processed by a 4D convolutional network, and probabilistic
constraints are applied over the processed correlation matrix to calculate points of correspondence feature matches. Finally, through the
final FCN, these points predict the point-to-point correspondence between IT and IM , projecting the template over the orthomosaic.
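The dense correlation and its mutual-consistency re-weighting summarized in the caption of Figure 29 can be sketched as follows. This is a hedged illustration of the general mechanism only: the 4D convolutional refinement and the probabilistic constraints of [28] are omitted, and the function name and feature sizes are assumptions.

```python
import torch

def correlation_with_soft_mutual_nn(feat_t, feat_m):
    """Correlate every template feature with every map feature, then re-weight
    each score by its ratio to the row and column maxima (soft mutual
    nearest-neighbour filtering)."""
    b, c, ht, wt = feat_t.shape
    _, _, hm, wm = feat_m.shape
    t = feat_t.flatten(2)                            # (B, C, ht*wt)
    m = feat_m.flatten(2)                            # (B, C, hm*wm)
    corr = torch.einsum("bci,bcj->bij", t, m).clamp(min=0)   # (B, ht*wt, hm*wm)
    r_t = corr / corr.max(dim=2, keepdim=True).values.clamp(min=1e-8)
    r_m = corr / corr.max(dim=1, keepdim=True).values.clamp(min=1e-8)
    filtered = corr * r_t * r_m                      # mutual-consistency weighting
    return filtered.view(b, ht, wt, hm, wm)          # 4D correlation tensor

corr = correlation_with_soft_mutual_nn(torch.randn(1, 256, 16, 16),
                                        torch.randn(1, 256, 32, 32))
```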

These feature maps contain both local and global information, which are used to construct a correlation matrix holding the feature matches for each extracted feature point. Subsequently, this correlation matrix is fed into a trainable network to learn how to establish more reliable correspondences. Moreover, probabilistic constraints are incorporated into these established correspondences to associate each feature point in the source image with the corresponding points in the orthomosaic. The same process is applied to the orthomosaic feature points. As depicted in Figure 30, for performing the clustering using Highest Correlation, a soft-argmax layer is applied to the generated probability maps to extract the best match indices. These indices are then passed through an FCN, acting as a regressor, to estimate the point-to-point correspondences between the source and target images. Since all components of the pipeline are differentiable, the network can be trained end-to-end. Figure 29 shows the overall architecture.
The system has been evaluated using the Aerial Template Matching Dataset, described in Section III-B6, achieving a keypoint matching accuracy of 92.66% and an average error in positioning the drone coordinates of 3.594 m.

V. UAV NAVIGATION STRATEGIES
This section presents some examples of strategies dealing with the Navigation task. Since a complete review of this task is out of the main scope of this survey, only a selection of proposals is described in the following.

A. BRM LOCALIZATION: UAV LOCALIZATION IN GNSS-DENIED ENVIRONMENTS
The guidance strategy proposed in [12] enables UAVs to navigate along a predetermined path without relying on GNSS assistance by only requiring the initial position, an IMU sensor, and an RGB camera.
The paper introduces a new Building Ratio Map (BRM) localization method that compares UAV images with an existing numerical map. The approach entails an offline and an online stage.


FIGURE 31. As proposed in [12], the Building Ratio Feature is a technique


that involves segmenting the building area from the UAV image,
represented by the black area. The features are extracted by computing the
building area ratio in circular regions, each covering the area up to its
center.
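To make the Building Ratio Feature of Figure 31 concrete, the following is a minimal NumPy sketch that computes the fraction of building pixels inside n concentric circular regions of a segmented UAV image. The radius schedule is a simplification of the formulas given in the text below, and the mask, image size, and n are illustrative assumptions.

```python
import numpy as np

def building_ratio_features(building_mask, n=5):
    """Given a binary building segmentation of a UAV image, compute the
    building-area ratio inside n concentric circular regions centred on the
    image (rotation invariant by construction)."""
    h, w = building_mask.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    r_max = min(h, w) / 2.0
    feats = []
    for k in range(1, n + 1):
        # Radii shrink with k, mirroring r_k = 2h * (n + 1 - k) / n up to scale.
        radius = r_max * (n + 1 - k) / n
        region = dist <= radius
        feats.append(building_mask[region].mean())   # building-area ratio f_i^k
    return np.asarray(feats)

brf = building_ratio_features(np.random.rand(256, 256) > 0.7, n=5)
```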

FIGURE 30. The proposed clustering method based on Highest Correlation,


outlined in [28], involves the following steps. Once the feature maps γ M
and γ T are computed, they are arranged in descending order based on the
correlation values of feature points indexed in the last dimension of the
tensor. Following this sorting process, the top k points are chosen, altering
the shape of the tensor. The darker circles indicate points with the highest
correlation values, while lighter circles depict decreased correlation.
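The soft-argmax layer mentioned in Figure 30 converts a score map into differentiable coordinates by taking the softmax-weighted average of the pixel grid. The short sketch below shows the standard form of this operation; the temperature parameter is an assumption, and the code is not taken from [28].

```python
import torch

def soft_argmax_2d(prob_map, temperature=1.0):
    """Differentiable (x, y) location of the peak of a (B, H, W) score map."""
    b, h, w = prob_map.shape
    weights = torch.softmax(prob_map.view(b, -1) / temperature, dim=1).view(b, h, w)
    ys = torch.arange(h, dtype=prob_map.dtype).view(1, h, 1)
    xs = torch.arange(w, dtype=prob_map.dtype).view(1, 1, w)
    y = (weights * ys).sum(dim=(1, 2))
    x = (weights * xs).sum(dim=(1, 2))
    return torch.stack([x, y], dim=1)                # (B, 2) expected match location

coords = soft_argmax_2d(torch.randn(4, 64, 64))
```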

In the former, the numerical map is created from satellite images and is utilized as the ground truth for the localization task. This map contains the correct values for the Building Ratio Features (BRFs). They are computed by dividing an image I_i into n circular regions, denoted as S_i^k (k = 1, ..., n), which represent concentric areas centered on I_i with a radius of r_k = 2h · (n + 1 − k)/n. The area of the buildings within each S_i^k is defined as B_i^k, and it is used to calculate the BRF as f_i^k = B_i^k / s_i^k. Therefore, each image I_i has n BRFs that are rotation invariant, enabling estimation of the position candidate group without knowledge of the position and orientation. Figure 31 provides an example of the procedure. During the online stage, the UAV's Flight Control Unit (FCU) provides IMU data and RGB images. The system segments the UAV image, extracts the BRFs, and matches them with the ground truth numerical map. Meanwhile, the relative odometry is computed with knowledge of the initial position, IMU data, and RGB images. The procedure starts by utilizing the pre-processed numerical map to identify location candidates that meet the necessary conditions. The UAV's global position is estimated after the candidate convergence. Once the matching provides the actual location, the estimated odometry is refined using the data from the ground truth. This process is repeated regularly to maintain alignment with the desired path. The proposed system can estimate location without precise knowledge of the initial coordinates and can restart localization even if the position is lost. However, it has only been tested in areas with buildings, and their presence appears necessary for the proposed strategy. The proposed workflow is illustrated in Figure 32.

FIGURE 32. The Navigation workflow proposed in [12]. The algorithm utilizes the flight controller unit, camera supply, IMU, and image data to compute local odometry. The UAV images are pre-processed using a pre-trained network to extract building images. Then, the building ratio matching algorithm is executed to estimate the global position of the UAV.

The proposed system has been evaluated in South Korea over a 1075 m path. The satellite data used in the experiments is publicly available and downloadable from KakaoMap and the National Geographic Information Institute of Korea. However, the UAV data collected with a DJI Matrice 100 and used in the evaluation are not publicly available, hindering a possible comparison. The system achieved an RMSE of 7.53 m and 12.01 m, respectively, in scenarios with and without knowing the initial location.

B. UAV LOCALIZATION USING AUTOENCODED SATELLITE IMAGES
The approach outlined in [10] presents a rapid and reliable method for UAV Navigation that exploits subsequent Geo-Localizations using satellite images obtained from Google Earth. The strategy relies on an autoencoder capable of compressing the satellite images into a low-dimensional vector representation, thus expediting the UAV computation process.

FIGURE 33. Graphical Abstract of the proposal in [10]. The procedure


involves six steps: the first two are conducted offline, while the remaining
four are applied directly on the UAV.
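The runtime matching summarized in Figure 33, and detailed in the following paragraphs, reduces to an inner product between the encoded live image and the stacked encodings of the reference images. A minimal NumPy sketch is given below; array shapes and variable names are illustrative assumptions, and the covariance-based outlier rejection discussed later is not reproduced here.

```python
import numpy as np

def localization_weights(encoded_refs, encoded_live):
    """Inner-product comparison used at runtime: one weight per reference
    image. encoded_refs stacks one encoding per row, so this corresponds to
    w = Y_ge^T y when the references are stored as columns of Y_ge."""
    return encoded_refs @ encoded_live

encoded_refs = np.random.randn(42, 1000)      # encoded Google Earth renders
encoded_live = np.random.randn(1000)          # encoded live UAV image
w = localization_weights(encoded_refs, encoded_live)
best_reference = int(np.argmax(w))            # closest mapped location along the path
```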

The graphical overview of the concept is illustrated in Figure 33. The strategy comprises six steps. The first two steps, which represent the most computationally intensive part of the process and are only performed once, are conducted offline. Instead, the remaining four are directly executed on the UAV. In the first step, images of the analyzed path are generated from Google Earth at intervals of 0.5 meters, while also considering a lateral offset of 5 meters on both sides. This sums up to 42 images per meter of the path. In the second step, an autoencoder is trained for each path using these rendered images, following a strategy based on the proposal in [19]. The autoencoder consists of 6 layers, as depicted in Figure 34. The first 5 layers perform 2D convolutions with a stride of two, followed by a batch normalization layer. The sixth layer is linear and maps the output of the final convolutional layer to a bottleneck vector of dimension 1000, which was empirically chosen as the best compromise to achieve the desired accuracy. Following training, the encoded and reduced data are loaded onto the UAV. During the UAV flight, live images are acquired and encoded similarly to the training images. However, in this case, a decoder is also employed. The structure of the decoder mirrors the one of the encoder, as shown in Figure 34. For recognition purposes, live images are compared to a subset of the encoded reference images, and the weights are computed using a basic inner-product kernel as follows:

w = Y_ge^T y

where Y_ge is a matrix containing the stacked Google Earth reference images, and y represents the live image passed through the trained autoencoder. These weights are then used to compute the covariance, also using the original weight values. The strategy, based on the covariance computed on weights, also enables the rejection of outliers, which are represented by single narrow peaks located away from the edges of the area covered by the reference images.

FIGURE 34. The symmetrical encoder/decoder network structure proposed in [10].

The procedure was evaluated using the UTIAS Dataset, described in Section III-B5. The results are presented in terms of Registration Success (RS), which indicates the UAV's ability to follow the desired path. In this context, the strategy achieved an RS ranging from 96% to 100% depending on the light conditions. The computation time per frame was also recorded at 0.11 seconds, enabling real-time operations. In cases where RS was achieved, the authors also provided the RMSE, representing the distance from the corresponding mapped point. This value was consistently equal to or less than 2.17 meters.

VI. SUMMARY TABLE AND DISCUSSION
This section summarizes the results achieved by all the proposals presented in this survey. According to the literature, the most used datasets are University-1652, CVUSA, and CVACT. The works have been divided according to the benchmark dataset used in the experimental part to maintain a fair comparison.

TABLE 1. Summary table for the proposals tested on the University-1652 benchmark [49]. In red, the best results for each category of test.

TABLE 2. Summary table for the proposals tested on the CVACT benchmark [24]. In red, the best results for each category of test.

TABLE 3. Summary table for the proposals tested on the CVUSA benchmark [45]. In red, the best results for each category of test.
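The comparisons in Tables 1 to 3 are expressed through R@K and AP. For completeness, a minimal sketch of how such retrieval metrics can be computed for a single query is given below; the function names and the toy ranking are illustrative only.

```python
import numpy as np

def recall_at_k(ranked_labels, true_label, k):
    """R@K: 1 if the correct gallery item appears in the top-k results."""
    return float(true_label in ranked_labels[:k])

def average_precision(ranked_labels, true_label):
    """AP for a single query, averaging precision at every relevant hit."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == true_label:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

ranking = ["b", "a", "a", "c"]          # gallery labels sorted by similarity
print(recall_at_k(ranking, "a", 1), average_precision(ranking, "a"))
```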

Table 1, Table 2, and Table 3 summarize the main characteristics of the approaches dealing with University-1652, CVACT, and CVUSA, respectively. Table 4 includes all the proposals that are either tested on homemade/non-public datasets (and consequently not fairly comparable with other works) or tested on a public benchmark but with no other competitor described in this survey. Since these datasets present similar characteristics, some proposals test their approaches on more than one of them, possibly demonstrating their robustness. All such works are included in all the corresponding tables.
As reported in Table 1, several kinds of architectures were tested on the University-1652 benchmark, including CNN, Siamese CNN, CGAN, and Transformer.


TABLE 4. Summary table for the proposals’ results that are not tested on common datasets. They are not directly comparable since they are tested on
different benchmarks. The table also includes the two Navigation approaches described in Section V.

At the state of the art, the most promising strategy is the one proposed in [38], which presents an architecture made of PPT, CGAN, and an LPN based on ResNet50 and outperforms the other competitors in 5 out of the 8 metrics. Other promising results have been achieved by [14], which, even if it exploits one of the older versions of the Vision Transformer networks, is still able to achieve competitive results. As will also be pointed out in the following discussion, Vision Transformer-based networks are nowadays replacing the other networks as the most promising for several vision-related tasks. In the case of the CVACT benchmark, the Siamese-based architectures seem to be the most used for the Geo-Localization task. As reported in Table 2, the best results have been achieved by [39], which faces the problem with a Dual CGAN combined with VGG16, NetVLAD, and a Transformer architecture for classification. This confirms the high accuracy and robustness of this kind of network. As shown in Table 3, analogous results have been reported for CVUSA, in which the same architecture beat all the other competitors. In this case, it provides an R@1 of 95%, which can be considered a solid result for the task, especially considering that in a Navigation task, some other useful information can also be exploited to improve the Geo-Localization. In general, according to the literature, it is possible to notice the good potential of both CGAN-based and Transformer-based strategies for the Geo-Localization task. This demonstrates the adaptability of these kinds of networks also in this research field. An honorable mention is due to the proposal in [23], which has been tested over all of these three main benchmarks, even with average results, showing some kind of flexibility. This is especially true considering that the proposal was meant to provide a support strategy for two well-known state-of-the-art architectures. According to the results of the works that tested their strategies on the CVUSA benchmark, it looks like a much easier testbed than CVACT and University-1652. In fact, the performance in terms of R@1 presents a difference of about 10 percentage points. In the case of the proposals summarized in Table 4, most of them used Siamese-based architectures, which confirms that this kind of architecture is promising in this field, especially if combined with Transformers. A general consideration of the proposals presented in this survey is the network resolution, which for most of the works is 256 × 256.
For the Navigation task, the state-of-the-art results are really promising, especially considering the 96−100% registration success (i.e., correct identification of the landmark) and an RMSE lower than 2.5 meters, which in most of the scenarios is comparable to, if not better than, the standard precision of a GPS sensor. However, it is also worth pointing out that the approaches retrieved from the literature rarely try to test their strategy in a mixed scenario, in which the training is performed over a dataset and the test over another. To the best of our knowledge, only one proposal faced this context [48], with really poor results. For this reason, it is unclear whether this comes from the poor generalizability of the proposed architecture or whether it is intrinsically related to the challenge of the task. Unfortunately, no other state-of-the-art works confirm this aspect.

VII. CONCLUSION
The Geo-Localization task is still a hot topic in Computer Vision, especially thanks to the recent spread of UAVs and the consequent price reduction. It can be used as a spot operation, to check the position of a device, especially in GPS-denied areas (e.g., fields of war), or as a part of a Navigation system to establish the correctness of the route. As for many different fields in Computer Vision, new technologies and advanced deep network techniques provided a huge boost in accuracy and speed without an increase in required resources. The latter is especially relevant to creating embedded navigation systems that can operate directly on the UAVs. This helps especially in contexts where the communication between the drone and the base can be hijacked by hostile forces.


Moreover, the benchmarks available for Geo-Localization and Navigation tasks provide realistic contexts. In fact, the knowledge base is collected by satellites, so it is safe and does not require previous missions, and the tests are performed on UAV-collected samples, which represent the ideal operative context.
This review explores the literature contributing to the Geo-Localization task, detailing the main challenges. It presents the key concepts needed to understand their application in Navigation tasks, specifically for UAV autonomous systems. It begins by providing a common background on Geo-Localization methods and Navigation, highlighting their differences and application fields. It continues reviewing the available datasets, benchmarks, and metrics commonly used to evaluate Geo-Localization and Navigation approaches. Then, it presents a selection of state-of-the-art UAV-based Geo-Localization strategies, showcasing the latest advancements and methodologies. Additionally, it describes some of the newest and most promising UAV-based Navigation and Guidance proposals. Finally, to provide a clear comparison among the approaches, it summarizes them into comparative tables, allowing for an easy assessment of their relative strengths and weaknesses. This comparison may serve as a valuable resource for researchers interested in the field. According to the findings of this study, CGAN- and Transformer-based techniques stood out in their overall performance. These results follow the current trend of DL, where the listed and other affine methods seem to be some of the most promising approaches in different application areas. However, we can expect the introduction of optimization strategies to allow the deployment of these architectures on board soon.

REFERENCES
[1] S. Ahn, H. Kang, and J. Lee, ''Aerial-satellite image matching framework for UAV absolute visual localization using contrastive learning,'' in Proc. 21st Int. Conf. Control, Autom. Syst. (ICCAS), Oct. 2021, pp. 143–146.
[2] A. A. Saadi, A. Soukane, Y. Meraihi, A. B. Gabis, S. Mirjalili, and A. Ramdane-Cherif, ''UAV path planning using optimization approaches: A survey,'' Arch. Comput. Methods Eng., vol. 29, no. 6, pp. 4233–4284, Oct. 2022.
[3] S. Antonelli, D. Avola, L. Cinque, D. Crisostomi, G. L. Foresti, F. Galasso, M. R. Marini, A. Mecca, and D. Pannone, ''Few-shot object detection: A survey,'' ACM Comput. Surv., vol. 54, no. 11s, pp. 1–37, Sep. 2022.
[4] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, ''NetVLAD: CNN architecture for weakly supervised place recognition,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5297–5307.
[5] D. Avola, L. Cinque, G. L. Foresti, N. Martinel, D. Pannone, and C. Piciarelli, Low-Level Feature Detectors and Descriptors for Smart Image and Video Analysis: A Comparative Study. Berlin, Germany: Springer, 2018, pp. 7–29.
[6] D. Avola, I. Cannistraci, M. Cascio, L. Cinque, A. Diko, A. Fagioli, G. L. Foresti, R. Lanzino, M. Mancini, A. Mecca, and D. Pannone, ''A novel GAN-based anomaly detection and localization method for aerial video surveillance at low altitude,'' Remote Sens., vol. 14, no. 16, p. 4110, Aug. 2022.
[7] D. Avola, L. Cinque, M. De Marsico, A. Fagioli, G. L. Foresti, M. Mancini, and A. Mecca, ''Signal enhancement and efficient DTW-based comparison for wearable gait recognition,'' Comput. Secur., vol. 137, Feb. 2024, Art. no. 103643.
[8] D. Avola, L. Cinque, A. Di Mambro, A. Diko, A. Fagioli, G. L. Foresti, M. R. Marini, A. Mecca, and D. Pannone, ''Low-altitude aerial video surveillance via one-class SVM anomaly detection from textural features in UAV images,'' Information, vol. 13, no. 1, p. 2, Dec. 2021.
[9] D. Avola, L. Cinque, A. Fagioli, G. Foresti, and A. Mecca, ''Ultrasound medical imaging techniques: A survey,'' ACM Comput. Surv., vol. 54, no. 3, pp. 1–38, Apr. 2021.
[10] M. Bianchi and T. D. Barfoot, ''UAV localization using autoencoded satellite images,'' IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 1761–1768, Apr. 2021.
[11] S. A. Carneiro, G. P. da Silva, S. J. F. Guimaraes, and H. Pedrini, ''Fight detection in video sequences based on multi-stream convolutional neural networks,'' in Proc. 32nd SIBGRAPI Conf. Graph., Patterns Images (SIBGRAPI), Oct. 2019, pp. 8–15.
[12] J. Choi and H. Myung, ''BRM localization: UAV localization in GNSS-denied environments based on matching of numerical map and UAV images,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 4537–4544.
[13] S. Chopra, R. Hadsell, and Y. LeCun, ''Learning a similarity metric discriminatively, with application to face verification,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2005, pp. 539–546.
[14] M. Dai, J. Hu, J. Zhuang, and E. Zheng, ''A transformer-based feature segmentation and region alignment method for UAV-view geo-localization,'' IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4376–4389, Jul. 2022.
[15] J. Dessain, ''Machine learning models predicting returns: Why most popular performance metrics are misleading and proposal for an efficient metric,'' Exp. Syst. Appl., vol. 199, Aug. 2022, Art. no. 116970.
[16] H. Goforth and S. Lucey, ''GPS-denied UAV localization using pre-existing satellite imagery,'' in Proc. Int. Conf. Robot. Autom. (ICRA), May 2019, pp. 2974–2980.
[17] K. Hartmann and K. Giles, ''UAV exploitation: A new domain for cyber power,'' in Proc. 8th Int. Conf. Cyber Conflict (CyCon), May 2016, pp. 205–221.
[18] K. He, X. Zhang, S. Ren, and J. Sun, ''Deep residual learning for image recognition,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[19] X. Hou, L. Shen, K. Sun, and G. Qiu, ''Deep feature consistent variational autoencoder,'' in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 1133–1141.
[20] S. Hu, M. Feng, R. M. H. Nguyen, and G. H. Lee, ''CVM-net: Cross-view matching network for image-based ground-to-aerial geo-localization,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7258–7267.
[21] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ''Image-to-image translation with conditional adversarial networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1125–1134.
[22] A. Kolesnikov, A. Dosovitskiy, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer, M. Dehghani, N. Houlsby, S. Gelly, T. Unterthiner, and X. Zhai, ''An image is worth 16×16 words: Transformers for image recognition at scale,'' 2021, arXiv:2010.11929.
[23] J. Lin, Z. Zheng, Z. Zhong, Z. Luo, S. Li, Y. Yang, and N. Sebe, ''Joint representation learning and keypoint detection for cross-view geo-localization,'' IEEE Trans. Image Process., vol. 31, pp. 3780–3792, 2022.
[24] L. Liu and H. Li, ''Lending orientation to neural networks for cross-view geo-localization,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5617–5626.
[25] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, ''Swin transformer: Hierarchical vision transformer using shifted windows,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[26] K. Merry and P. Bettinger, ''Smartphone GPS accuracy study in an urban environment,'' PLoS ONE, vol. 14, no. 7, Jul. 2019, Art. no. e0219890.
[27] N. C. Mithun, K. Sikka, H.-P. Chiu, S. Samarasekera, and R. Kumar, ''RGB2LiDAR: Towards solving large-scale cross-modal visual localization,'' in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 934–954.
[28] M. H. Mughal, M. J. Khokhar, and M. Shahzad, ''Assisting UAV localization via deep contextual image matching,'' IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 2445–2457, 2021.
[29] B. Patel, T. D. Barfoot, and A. P. Schoellig, ''Visual localization with Google Earth images for robust global pose estimation of UAVs,'' in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 6491–6497.
[30] K. Regmi and M. Shah, ''Bridging the domain gap for ground-to-aerial image matching,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 470–479.


[31] S. Ren, K. He, R. Girshick, and J. Sun, ''Faster R-CNN: Towards real-time object detection with region proposal networks,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[32] O. Ronneberger, P. Fischer, and T. Brox, ''U-Net: Convolutional networks for biomedical image segmentation,'' in Medical Image Computing and Computer-Assisted Intervention—MICCAI, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds., Cham, Switzerland: Springer, 2015, pp. 234–241.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ''ImageNet large scale visual recognition challenge,'' Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[34] J. Sandino, F. Vanegas, F. Gonzalez, and F. Maire, ''Autonomous UAV navigation for active perception of targets in uncertain and cluttered environments,'' in Proc. IEEE Aerosp. Conf., Mar. 2020, pp. 1–12.
[35] Y. Shi, L. Liu, X. Yu, and H. Li, ''Spatial-aware feature aggregation for image based cross-view geo-localization,'' in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1–11.
[36] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' 2014, arXiv:1409.1556.
[37] B. Sun, C. Chen, Y. Zhu, and J. Jiang, ''GEOCAPSNET: Ground to aerial view image geo-localization using capsule network,'' in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2019, pp. 742–747.
[38] X. Tian, J. Shao, D. Ouyang, and H. T. Shen, ''UAV-satellite view synthesis for cross-view geo-localization,'' IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4804–4815, Jul. 2022.
[39] X. Tian, J. Shao, D. Ouyang, A. Zhu, and F. Chen, ''SMDT: Cross-view geo-localization with image alignment and transformer,'' in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2022, pp. 1–6.
[40] Y. Tian, C. Chen, and M. Shah, ''Cross-view image matching for geo-localization in urban environments,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1998–2006.
[41] N. N. Vo and J. Hays, ''Localizing and orienting street views using overhead imagery,'' in Computer Vision—ECCV 2016 (Lecture Notes in Computer Science), vol. 9905, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., Cham, Switzerland: Springer, 2016, pp. 494–509, doi: 10.1007/978-3-319-46448-0_30.
[42] J. Wang, Y. Yang, M. Pan, M. Zhang, M. Zhu, and M. Fu, ''Hybrid perspective mapping: Align method for cross-view image-based geo-localization,'' in Proc. IEEE Int. Intell. Transp. Syst. Conf. (ITSC), Sep. 2021, pp. 3040–3046.
[43] T. Wang, Z. Zheng, C. Yan, J. Zhang, Y. Sun, B. Zheng, and Y. Yang, ''Each part matters: Local patterns facilitate cross-view geo-localization,'' IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 2, pp. 867–879, Feb. 2022.
[44] S. Workman and N. Jacobs, ''On the location dependence of convolutional neural network features,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2015, pp. 70–78.
[45] S. Workman, R. Souvenir, and N. Jacobs, ''Wide-area image geolocalization with aerial reference imagery,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3961–3969.
[46] Q. Ye, J. Luo, and Y. Lin, ''A coarse-to-fine visual geo-localization method for GNSS-denied UAV with oblique-view imagery,'' ISPRS J. Photogramm. Remote Sens., vol. 212, pp. 306–322, Jun. 2024.
[47] A. R. Zamir and M. Shah, ''Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1546–1558, Aug. 2014.
[48] X. Zhang, X. Meng, H. Yin, Y. Wang, Y. Yue, Y. Xing, and Y. Zhang, ''SSA-net: Spatial scale attention network for image-based geo-localization,'' IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[49] Z. Zheng, Y. Wei, and Y. Yang, ''University-1652: A multi-view multi-source benchmark for drone-based geo-localization,'' in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 1395–1403.
[50] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, M. Xu, and Y.-D. Shen, ''Dual-path convolutional image-text embeddings with instance loss,'' ACM Trans. Multimedia Comput., Commun., Appl., vol. 16, no. 2, pp. 1–23, May 2020.
[51] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, ''Scene parsing through ADE20K dataset,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5122–5130.
[52] Y. Zhu, B. Sun, X. Lu, and S. Jia, ''Geographic semantic network for cross-view image geo-localization,'' IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 4704315.
[53] J. Zhuang, X. Chen, M. Dai, W. Lan, Y. Cai, and E. Zheng, ''A semantic guidance and transformer-based matching method for UAVs and satellite images for UAV geo-localization,'' IEEE Access, vol. 10, pp. 34277–34287, 2022.

DANILO AVOLA (Member, IEEE) received the M.Sc. degree in computer science from the Sapienza University of Rome, Italy, in 2002, and the Ph.D. degree in molecular and ultrastructural imaging from the University of L'Aquila, Italy, in 2014. Since 2024, he has been an Associate Professor with the Department of Computer Science, Sapienza University of Rome. He co-founded and leads the Prometheus Laboratory and is a co-founder of 4AI, a university startup focused on pioneering new methodologies in artificial intelligence. Previously, he was an Assistant Professor and the Research and Development Scientific Director of the Computer Vision Laboratory (VisionLab), Sapienza University. As a Principal Investigator, he directs several strategic research initiatives with Sapienza, including Wi-Fi Sensing for Person Re-Identification and Human Synthesis, Emotion Transference in Humanoids via EEG, UAV Navigation by View, and LieToMe Systems. His research interests include artificial intelligence (including machine learning and deep learning), computer vision, Wi-Fi sensing, EEG signal analysis, human–computer interaction, human-behavior recognition, human-action recognition, biometric analysis, bioinformatics, optimized neural architectures, deception detection, VR/AR systems, drones, and robotics. He is an active member of several professional organizations, including IAPR, CVPL, ACM, AIxIA, and EurAI.

LUIGI CINQUE (Senior Member, IEEE) received the M.Sc. degree in physics from the University of Napoli, Italy, in 1983. From 1984 to 1990, he was with the Laboratory of Artificial Intelligence (Alenia S.p.A.), working on the development of expert systems and knowledge-based vision systems. He is a Full Professor of computer science with the Sapienza University of Rome, Italy. Some of the techniques he has proposed have found applications in the fields of video-based surveillance systems, autonomous vehicles, road traffic control, human behavior understanding, and visual inspection. He is the author of more than 200 papers in national and international journals and conference proceedings. His first scientific interests covered image processing, object recognition, and image analysis, with a particular emphasis on content-based retrieval in visual digital archives and advanced man-machine interaction assisted by computer vision. Currently, his main interests include distributed systems for the analysis and interpretation of video sequences and target tracking. He is a member of ACM, IAPR, and CVPL. He served on scientific committees of international conferences (e.g., CVPR, ICME, and ICPR) and symposia. He serves as a reviewer for many international journals (e.g., IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON CIRCUIT AND SYSTEMS, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE TRANSACTIONS ON MEDICAL IMAGING, and Image and Vision Computing).

EMAD EMAM received the M.Sc. degree in control engineering from the Sapienza University of Rome, Italy, where he is currently pursuing the Ph.D. degree in computer science. He is a Senior Research Engineer with the Prometheus Laboratory. His research interests include machine learning, deep learning, computer vision, simultaneous localization and mapping (SLAM) with UAVs, human-action recognition, optimized neural architectures, and robotics.


FEDERICO FONTANA (Student Member, IEEE) received the bachelor's and master's degrees in computer science from the Sapienza University of Rome, Italy, where he is currently pursuing the Ph.D. degree with the Department of Computer Science. His research interests include efficient deep learning, computer vision, binary neural networks, and pruning.

GIAN LUCA FORESTI (Senior Member, IEEE) received the Laurea degree (cum laude) in electronic engineering and the Ph.D. degree in computer science from the University of Genoa, Genoa, Italy, in 1990 and 1994, respectively. He has been a Visiting Professor of artificial vision with the University of Klagenfurt, Klagenfurt, Austria, since 2006. He is currently a Full Professor of computer science with the University of Udine and the Deputy Director of the Department of Mathematics, Computer Science and Physics. He is the author of more than 300 papers published in international journals and international conferences, and he has been a co-editor of several books in the field of multimedia and video surveillance. His main interests include computer vision and image processing, multisensor data and information fusion, pattern recognition, and neural networks. He is a member of IAPR and CVPL. In 2002, he received the Best IEEE Vehicular Electronics Paper Award. From 2000 to 2009, he was the Appointed Italian Member of the NATO RTO Information System Technology Panel. He was the Finance Chair of the 11th IEEE Conference on Image Processing (ICIP05) and the General Chair of the 16th International Conference on Image Analysis and Processing (ICIAP11) and the Eighth IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS11). He has been a Guest Editor of the PROCEEDINGS OF THE IEEE Special Issue on Video Communications, Processing and Understanding for Third Generation Surveillance Systems.

MARCO RAOUL MARINI (Member, IEEE) received the combined B.Sc. and M.Sc. (cum laude) degrees in computer science from the Sapienza University of Rome, Italy, in 2015, and the Ph.D. degree in computer science from the Computer Vision Laboratory, Department of Computer Science, Sapienza University of Rome, in 2019. He is a member of the Computer Vision Laboratory, Department of Computer Science, Sapienza University of Rome. His research interests include human behavior analysis, virtual and augmented reality, multimodal interaction, natural interaction, machine learning, and deep learning. All these interests are focused on the development of systems applied to the problem of behavior understanding, with eXtended Reality (XR) as the main application area. He recently focused on interactions, locomotion, and human-centered analysis in VR and brain–computer interface (BCI) integration. He is a member of CVPL.

ALESSIO MECCA (Member, IEEE) received the M.Sc. and Ph.D. degrees in computer science from the Sapienza University of Rome, in 2015 and 2018, respectively. He was a member of the Biometric Laboratory, from 2015 to 2019, and of the Computer Vision Laboratory (VisionLab), Department of Computer Science, from 2019 to 2024. In 2024, he became the Manager of the Information Technology and Data Analyst Area, Unitelma Sapienza. He continues research as a collaborator with VisionLab. His research interests include biometrics, with a focus on gait recognition, signal analysis and processing, machine and deep learning, event recognition, person re-identification, UAV geo-localization and navigation strategies, and human–computer interaction.

DANIELE PANNONE (Member, IEEE) received the M.Sc. and Ph.D. degrees in computer science from the Sapienza University of Rome, in 2015 and 2018, respectively. He was a member of the Computer Vision Laboratory (VisionLab), Department of Computer Science, from 2015 to 2024. Since 2020, he has held the position of Assistant Professor with the Department of Computer Science. In 2024, he co-founded the Prometheus Laboratory and 4AI, a Sapienza University startup focused on pioneering new methodologies in artificial intelligence. His research interests include machine and deep learning, event recognition, object detection, person re-identification, signal analysis and processing, bioinformatics, human–computer interaction, AR/VR, and robotics. He is a member of several professional organizations, including IAPR, CVPL, ACM, AIxIA, and EurAI.

Open Access funding provided by 'Università degli Studi di Roma ''La Sapienza''' within the CRUI CARE Agreement.
