
sensors

Article
From Pixels to Precision: A Survey of Monocular Visual
Odometry in Digital Twin Applications †
Arman Neyestani, Francesco Picariello , Imran Ahmed , Pasquale Daponte and Luca De Vito *

Department of Engineering, University of Sannio, 82100 Benevento, Italy; [email protected] (A.N.); [email protected] (F.P.); [email protected] (I.A.); [email protected] (P.D.)
* Correspondence: [email protected]
† This paper is an extended version of our paper published in 2023 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv), Milano, Italy, 29–31 May 2023.

Abstract: This survey provides a comprehensive overview of traditional techniques and deep learning-based methodologies for monocular visual odometry (VO), with a focus on displacement measurement applications. This paper outlines the fundamental concepts and general procedure for VO implementation, including feature detection, tracking, motion estimation, triangulation, and trajectory estimation. It also explores the research challenges inherent in VO implementation, including scale estimation and ground plane considerations. The scientific literature is rife with diverse methodologies aiming to overcome these challenges, particularly the problem of accurate scale estimation. This issue has typically been addressed by relying on knowledge of the camera's height above the ground plane and evaluating feature movements on that plane. Alternatively, some approaches have utilized additional tools, such as LiDAR or depth sensors. The survey concludes with a discussion of future research challenges and opportunities in the field of monocular visual odometry.

Keywords: monocular; localization; feature based; odometry; survey; machine learning; deep learning; measurement

Citation: Neyestani, A.; Picariello, F.; Ahmed, I.; Daponte, P.; De Vito, L. From Pixels to Precision: A Survey of Monocular Visual Odometry in Digital Twin Applications. Sensors 2024, 24, 1274. https://doi.org/10.3390/s24041274
Received: 24 January 2024; Revised: 13 February 2024; Accepted: 13 February 2024; Published: 17 February 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Exploring unknown environments is a complex challenge that has engaged researchers across various fields. The intricacies of navigating uncharted territories require the integration of multiple approaches and the development of sophisticated methodologies. Among these, the accurate measurement of camera movements to update the digital twin model of structures plays a significant role, as briefly explained in Section 1.1. Modern navigation systems are often multi-modal, merging information collected from various methods to achieve enhanced precision. Within this complex interplay, Visual Simultaneous Localization and Mapping (VSLAM) has emerged as a vital tool in computer vision, robotics, and augmented reality.

VSLAM represents an innovative approach to navigation, addressing the inherent drift problem through the intelligent combination of camera information with an environment map. This map, updated incrementally as an agent such as a robot moves through the environment, facilitates accurate, real-time estimation of the surroundings. The significance of this technology is further underscored by its reliance on the accuracy of geometrical measurements, which are pivotal to the localization system. This mechanism aids in the consistent updating of models, often evaluated through the periodic acquisition of camera images, identification of model elements, and assessment of changes over time.

The intricate design of modern navigation systems is underscored by their reliance on the integration of various methods, a process akin to data fusion. This integration involves merging information from different sources to achieve greater accuracy. The role

Sensors 2024, 24, 1274. https://doi.org/10.3390/s24041274 https://www.mdpi.com/journal/sensors


of Visual Simultaneous Localization and Mapping (VSLAM) is particularly significant in this framework. Employed in fields like computer vision, robotics, and augmented reality, VSLAM goes beyond merely combining camera visuals with environmental layouts. Its true value emerges in the continuous refinement and updating of data, enabling robots or agents to adeptly navigate the ever-changing and unpredictable terrains of unfamiliar settings.
VSLAM's capabilities are broadened through the use of one or more video cameras to reconstruct a 3D map of an often unknown environment [1] and to gauge the egomotion, defined as the 3D shifting within space, of the camera itself [2]. The video cameras used in VSLAM systems are essential for applications like markerless augmented reality and autonomous robotic navigation. Compared with general SLAM using sensors such as Light Detection and Ranging (LiDAR), VSLAM's reliance on video cameras brings added advantages [3]. Video cameras are often smaller and less expensive, and they carry rich visual information, making them suitable for platforms with limited payloads and lower costs than LiDAR or an RGB-D camera [4,5].
Visual odometry (VO) and VSLAM are two closely related techniques used to determine a robot's or machine's location and orientation through the analysis of corresponding camera images. Both techniques can utilize a monocular camera, but they have distinct characteristics and objectives [6–8].
VO is a technique primarily focused on the real-time tracking of a camera's trajectory, offering local or relative estimates of position and orientation. This process is part of a broader category known as relative visual localization (RVL). RVL encompasses methods like VO, which estimate the motion of robots (both rotation and translation) by localizing them within an environment. This localization is achieved by analyzing the differences between sequential frames captured by the camera. One of the key techniques used in VO is windowed optimization, a process that refines the local estimate of the camera trajectory by considering a certain number of previous frames, or 'window' of frames. This approach helps to improve the accuracy of pose predictions derived from the analysis of image sequences [9,10].
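The intuition behind windowed optimization can be illustrated with a deliberately simplified sketch: treating poses as 2D position vectors and relative motions as noisy translation measurements, estimating the newest pose against a window of recent frames (rather than only the last one) averages out measurement noise. The function name and the linear toy model below are illustrative, not drawn from any surveyed system.

```python
import numpy as np

def windowed_estimate(window_poses, rel_measurements):
    """Estimate the newest camera position from noisy relative-translation
    measurements to each pose in the sliding window; in this linear toy
    problem the least-squares solution is the mean of per-frame predictions."""
    preds = [p + m for p, m in zip(window_poses, rel_measurements)]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(0)
true_new = np.array([5.0, 1.0])                      # ground-truth new position
window = [np.array([float(i), 0.0]) for i in range(4)]  # last 4 window poses
# Each window frame yields a noisy measurement of the motion toward the new frame.
meas = [true_new - p + rng.normal(0.0, 0.05, 2) for p in window]
est_windowed = windowed_estimate(window, meas)
est_single = window[-1] + meas[-1]                   # frame-to-frame only
```

Using the whole window suppresses the per-measurement noise that a purely frame-to-frame estimate inherits directly.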
On the other hand, VSLAM delivers a global and consistent estimate of a device's path, a process often referred to as absolute visual localization (AVL). AVL provides the pose of a vehicle, often represented by a six-degrees-of-freedom (DoF) pose vector (x, y, z, φ, θ, ψ) [6]. VSLAM can reduce drift through techniques like bundle adjustment and loop closure detection [11]. The key difference is that VO performs relative positioning without an understanding of the larger environment, while VSLAM involves both mapping the environment and locating the device within that map.
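The six-DoF pose vector (x, y, z, φ, θ, ψ) corresponds to a rigid-body transform. A minimal sketch of that mapping follows, assuming the common Z-Y-X (yaw-pitch-roll) Euler convention; note that VSLAM systems differ in their angle conventions, so this is one illustrative choice.

```python
import numpy as np

def pose_to_matrix(x, y, z, phi, theta, psi):
    """Convert a 6-DoF pose vector (x, y, z, roll phi, pitch theta, yaw psi)
    into a 4x4 homogeneous transform using the Z-Y-X Euler convention."""
    cph, sph = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cps, sps = np.cos(psi), np.sin(psi)
    Rx = np.array([[1, 0, 0], [0, cph, -sph], [0, sph, cph]])   # roll
    Ry = np.array([[cth, 0, sth], [0, 1, 0], [-sth, 0, cth]])   # pitch
    Rz = np.array([[cps, -sps, 0], [sps, cps, 0], [0, 0, 1]])   # yaw
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T
```

For example, a pure 90° yaw at position (1, 2, 3) maps the body-frame point (1, 0, 0) to (1, 3, 3) in the world frame.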
Loop closure is a sub-algorithm of SLAM that identifies previously visited locations
and uses them to correct the accumulated errors in the robot’s pose estimation [12]. The
main goal in loop closure is to detect when the robot is observing a previously explored
scene so that additional constraints can be added to the map [13]. This is crucial in ensuring
the consistency of the map and the accuracy of the robot’s location. The similarities between
VO and VSLAM persist until a loop is closed, after which their functions diverge [2,14,15].
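A toy illustration of the loop-closure detection step, assuming each frame is summarized by a single global descriptor vector; in practice systems use bag-of-visual-words histograms or learned embeddings, and the threshold, gap, and names here are illustrative.

```python
import numpy as np

def detect_loop(descriptors, current, min_gap=3, threshold=0.95):
    """Report a loop closure when the current frame's global descriptor has
    high cosine similarity to a frame at least `min_gap` frames in the past."""
    d = descriptors[current]
    best_sim, best_idx = -1.0, None
    for i in range(current - min_gap + 1):          # only sufficiently old frames
        c = descriptors[i]
        sim = float(d @ c / (np.linalg.norm(d) * np.linalg.norm(c)))
        if sim > best_sim:
            best_sim, best_idx = sim, i
    return (best_idx if best_sim >= threshold else None), best_sim

rng = np.random.default_rng(1)
desc = [rng.normal(size=8) for _ in range(5)]       # five distinct places
desc.append(desc[0] + 0.01 * rng.normal(size=8))    # frame 5 revisits frame 0
match, similarity = detect_loop(np.array(desc), current=5)
```

When a match is returned, a SLAM back end would add a pose-graph constraint between the two frames and redistribute the accumulated drift.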
Furthermore, VSLAM's capacity to continuously update the initial map of the environment based on sensor measurements contributes to its adaptability, enabling it to reflect changes such as new objects or variations in lighting conditions. This makes VSLAM a more comprehensive solution for mapping and localization tasks in dynamic environments. Monocular VO represents an essential component in the fields of robotic navigation and computer vision, enabling the real-time estimation of a camera's trajectory within an environment. Through the meticulous tracking of visual features in consecutive camera frames, VO generates insights into the camera's motion, a task that has both theoretical and practical significance [15–17].
From a practical standpoint, VO has a wide range of applications. It is used in mobile
robots, self-driving cars, unmanned aerial vehicles, and other autonomous systems to
provide robust navigation and obstacle avoidance capabilities [18,19].

1.1. Visual Odometry for Digital Twin


The accurate measurement of camera movements is crucial for updating the digital twin model of structures. This process involves the use of VO and other techniques to capture and analyze camera images, which are then used to update the digital twin model (Figure 1). One method employed to achieve this involves multi-camera systems. Research examining Blender's application in designing camera-based measurement systems revealed that it allows flexible and rapid modeling of camera positions for motion tracking, which helps determine their optimal placement. This approach significantly cuts down setup times in practical scenarios. The methodology focuses on building an entire virtual camera, encompassing everything from the original camera sensor to the radiometric characteristics of an actual camera [20]. The study develops virtual representations of multi-camera measurement systems using Blender and investigates whether these virtual cameras can perceive and measure objects as effectively as real cameras under similar conditions. Blender, an open-source software package for three-dimensional animation, also serves as a simulation tool in metrology: it allows the creation of numerical models instrumental in the design and enhancement of camera-based measurement systems.
In a separate study, the Digital Twin Tracking Dataset (DTTD) was introduced for
Extended-Range Object Tracking. This dataset, comprising scenes captured by a single
RGB-D camera tracked by a motion capture system, is tailored to pose estimation challenges
in digital twin applications [21].
Regarding geometric change detection in digital twins, an object's pose is estimated from its image and 3D shape data; accurate pose estimation is crucial for this task [22].
Likewise, for the digital twin modeling of composite structures, the Azure Kinect camera is
utilized to capture both depth and texture information [23]. Drone inspection imagery is
instrumental in forming operational digital twins for large structures, enabling the creation
and updating of digital twin models based on high-quality drone-captured images [24].
In summary, the precise measurement of camera movements is key in updating digital
twin models of structures. Techniques like monocular VO, multi-camera measurement,
and drone imagery contribute significantly to producing detailed and accurate digital
twin models.

Figure 1. The diagram depicts how camera movements are used to update the digital twin representation. The starting position v(ti) of the present path segment is depicted by a blue arrow at time ti. The position for each snapshot is calculated using the parameters θt. Subsequently, the map points are reprojected onto every snapshot, and the reprojection discrepancy r(Φ(i), θt, t) is minimized to ascertain the accurate path [25].
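The reprojection-minimization step in the caption can be sketched in a stripped-down form. Assuming a unit-focal pinhole camera, known map-point depths, and an unknown in-plane camera offset t, the residual r = obs − (P_xy − t)/P_z is linear in t and can be solved in closed form; the names and the simplified model are illustrative, not the paper's actual formulation.

```python
import numpy as np

# Map points (camera-frame, depths Z > 0) and their observed normalized pixels.
P = np.array([[0.0, 0.0, 4.0], [1.0, -1.0, 5.0], [-2.0, 1.0, 8.0], [3.0, 2.0, 10.0]])
t_true = np.array([0.3, -0.2])            # unknown in-plane camera offset
obs = (P[:, :2] - t_true) / P[:, 2:3]     # simulated pixel observations

def solve_translation(P, obs):
    """Least-squares minimization of the reprojection residual
    r = obs - (P_xy - t) / P_z over the in-plane translation t;
    the problem is linear in t for this simplified model."""
    n = len(P)
    A = np.zeros((2 * n, 2))
    b = np.zeros(2 * n)
    for i in range(n):
        A[2 * i, 0] = 1.0 / P[i, 2]       # x-residual depends on t_x / Z
        A[2 * i + 1, 1] = 1.0 / P[i, 2]   # y-residual depends on t_y / Z
        b[2 * i:2 * i + 2] = P[i, :2] / P[i, 2] - obs[i]
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t

t_est = solve_translation(P, obs)
```

Real systems minimize the same kind of residual over full 6-DoF poses with nonlinear solvers (e.g., Gauss-Newton), but the objective has this structure.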

This paper aims to provide a comprehensive examination of the state of the art in various approaches to VO, with an emphasis on recent developments in the use of monocular cameras. By presenting a thorough analysis, it contributes to a broader understanding of this complex and rapidly evolving field. The remaining sections are structured as follows: Section 2 outlines the fundamental concepts and general procedure for VO implementation, while Section 3 explores the research challenges inherent in VO implementation. Section 6 discusses the positioning uncertainty assessment provided by monocular visual odometry. The subsequent sections offer overviews of traditional methods and machine learning-based approaches, culminating in a discussion of future research challenges and opportunities. The articulation of these elements provides a solid foundation for scholars and practitioners interested in navigating the rich and multifaceted landscape of VO and VSLAM technologies.

2. Basics of Monocular Visual Odometry


From a theoretical perspective, VO is a complex problem that involves the intersection of multiple disciplines, including computer vision, robotics, and mathematics. It requires the development and application of algorithms that can accurately track visual features and estimate camera motion from a sequence of images [18]. This involves dealing with challenges such as scale ambiguity in monocular systems, where the trajectory of a monocular camera can only be recovered up to an unknown scale factor [19]. Theoretical advancements in VO can contribute to a deeper understanding of these challenges and the development of more effective solutions.
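The scale ambiguity can be verified numerically: scaling the whole scene and the camera translation by the same factor leaves every projected image point unchanged, so no sequence of monocular images can reveal the absolute scale. A minimal sketch with unit focal length, no rotation, and illustrative values:

```python
import numpy as np

def project(points, t):
    """Pinhole projection (unit focal length) of 3D points seen from a
    camera translated by t; rotation omitted for simplicity."""
    q = points - t
    return q[:, :2] / q[:, 2:3]

pts = np.array([[0.0, 0.0, 5.0], [1.0, 1.0, 6.0], [-1.0, 2.0, 8.0]])
t = np.array([0.2, 0.1, 0.0])
s = 3.7                          # any positive scale factor
img_a = project(pts, t)          # original scene and motion
img_b = project(s * pts, s * t)  # everything scaled by s: identical images
```

Because img_a and img_b coincide exactly, external information (camera height, a depth sensor, an IMU) is required to fix the metric scale.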
The foundational algorithm of VO, commencing after compensating for camera distortion based on parameters estimated during a calibration phase, can be conceptually divided into several sequential steps, each of which contributes to the overarching objective of motion and trajectory estimation:
1. Feature detection: In the initial phase of VO, the focus is on identifying and capturing
key visual features from the first camera frame, which are essential for tracking
movements across frames. This process, fundamental for the accurate monitoring
of camera movement, traditionally relies on algorithms like Harris, SIFT, ORB, and
BRISK to pinpoint precise and durable features, such as corners or edges. However,
it is crucial to expand beyond these to include line and planar features, which have
proven to be invaluable in enhancing the robustness and completeness of feature
detection and matching in monocular VO systems. These additions are essential for
capturing the full complexity and variety of real-world environments [26–29].
2. Feature tracking: Following feature detection, the VO algorithm focuses on tracking these identified features across consecutive frames. This tracking establishes correspondences between features in successive frames, creating a continuity that facilitates motion analysis. Techniques such as KLT (Kanade–Lucas–Tomasi) tracking or optical flow have proven effective in this context, enabling accurate alignment and correspondence mapping [30].
3. Motion estimation: With the correspondences between features in consecutive frames
established, the next task is to estimate the camera’s motion. This process involves
mathematical techniques, such as determining the essential matrix or, if needed, the
fundamental matrix. These methods leverage the correspondences to ascertain the
relative motion between frames, providing a snapshot of how the camera’s position
changes over time [31].
4. Triangulation: Based on the estimated camera motion, the algorithm then moves to
determine the 3D positions of the tracked features by triangulation. This technique
involves estimating the spatial location of a point by measuring angles from two or
more distinct viewpoints. The result is a three-dimensional mapping of features that
adds depth and context to the analysis [32].
5. Trajectory estimation: The final step in the basic VO algorithm involves synthesizing the previously gathered information to estimate the camera's overall trajectory within the environment and map the surroundings. This composite task draws upon both the estimated camera motion from step 3 and the 3D positioning of the tracked features from step 4. Together, these elements coalesce into a coherent picture of the camera's path, contributing to a broader understanding of the spatial context [33].
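Step 4 above can be made concrete with a linear (DLT) triangulation sketch: given two camera projection matrices and one feature correspondence, the 3D point is recovered as the null vector of a small homogeneous system. This is a textbook method, not the specific implementation of any surveyed system; intrinsics are taken as the identity for brevity.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views: each
    observation contributes two rows of A, and the homogeneous 3D point
    is the right singular vector for the smallest singular value."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # de-homogenize

# Two camera poses: the second camera is shifted 1 unit along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, -0.2, 4.0])
x1 = X_true[:2] / X_true[2]                            # observation in view 1
x2 = (X_true[:2] + np.array([-1.0, 0.0])) / X_true[2]  # observation in view 2
X_est = triangulate(P1, P2, x1, x2)
```

In a full pipeline, the triangulated landmarks from step 4 and the relative poses from step 3 are chained (and typically refined in a windowed optimization) to produce the trajectory of step 5.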
In summary, the basic algorithm for VO is a multi-step process that artfully combines feature detection, tracking, motion estimation, triangulation, and trajectory estimation to provide a nuanced understanding of camera motion within an unknown environment. By progressing through these distinct yet interrelated phases, VO offers a versatile and valuable tool in the quest to navigate and interpret complex spatial environments. Its contributions extend across various domains, and its underlying methodologies continue to stimulate research and innovation in both theoretical and applied contexts.

3. Research Challenges in Monocular Visual Odometry


Monocular visual odometry (VO) represents a sophisticated domain characterized
by exceptional achievements and compelling intricacy. The sources of uncertainty can
significantly affect the accuracy and reliability of positioning and navigation solutions
provided by VO systems. The advancements achieved in this field have substantially
contributed to the evolution of robotics, augmented reality, and navigation systems, yet
substantial challenges persist. These obstacles highlight the complex constitution of VO
and propel ongoing scholarly inquiry and innovation in the discipline.
• Feature Detection and Tracking: The efficacy of monocular VO hinges on the precise
detection and tracking of image features, which are critical measurements in the VO
process. Uncertainties in these measurements arise under conditions of low-texture
or nondescript environments, which can be exacerbated by inadequate lighting and
complex motion dynamics, challenging the robustness of feature-matching algorithms
and leading to measurement inaccuracies [34].
• Motion Estimation: Robust motion estimation is central to VO, with its accuracy
contingent upon the reliability of feature correspondence measurements. Uncertainty
in these measurements can occur due to outliers from incorrect feature matching
and drift resulting from cumulative errors in successive estimations, significantly
complicating the attainment of precise motion measurements [35].
• Non-static Scenes: The premise of VO algorithms typically involves the assumption
of static scenes, thereby simplifying the measurement process. However, uncertainty
is introduced in dynamic environments where moving objects induce variances in
the measurements, necessitating advanced methods to discern and correctly interpret
camera motion amidst these uncertainties.
• Camera Calibration: The accurate calibration of camera parameters is foundational for obtaining precise VO measurements. Uncertainties in calibration, due to factors such as environmental temperature changes, lighting conditions, lens distortion, or mechanical misalignment, can significantly distort measurement accuracy, impacting the reliability of subsequent VO estimations [36].
• Scaling Challenges: In VO, the lack of an absolute reference frame introduces uncertainty in scale measurements, a pivotal component for establishing the camera's absolute trajectory. Inaccuracies in these scale measurements can arise from ambiguous geometries, limited visual cues, and the monocular nature of the data, which may lead to scale drift and incorrect trajectory computations [37].
• Ground Plane Considerations: The ground plane is often used as a reference in VO
measurements for scale estimation. However, uncertainties in these measurements can
be attributed to ambiguous ground features, variable lighting conditions that affect
feature visibility, and scaling complexities relative to object heights, challenging the
accuracy of VO scale measurements [38].
• Perspective Projection: The perspective projection in monocular VO introduces inherent uncertainties due to the transformation of 3D scenes into 2D images, leading to challenges such as depth information loss and scale ambiguity. This projection results in the foreshortening and distortion of objects, complicating the estimation of relative distances and sizes. Additionally, the overlapping of features in the 2D plane can cause occlusions, disrupting the feature tracking crucial for motion estimation. The projection of 3D points onto a 2D plane also introduces feature perspective errors, especially when features are distant from the camera center or when the camera is close to the scene.
• Timestamp Synchronization Uncertainty: This type of uncertainty arises when there
are discrepancies in the timing of the data capture and processing among different
components of a system, such as cameras, inertial measurement units (IMUs), and
LiDAR scanners. In systems that rely on precise timing for data integration and
analysis, such as visual–inertial navigation systems, this uncertainty can significantly
impact accuracy [9].
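The depth-dependent shrinkage behind the perspective-projection challenges (foreshortening) follows directly from the pinhole model: the projected size of an object falls off as 1/depth, which is exactly why relative sizes alone are ambiguous cues. A small illustrative check, assuming unit focal length:

```python
import numpy as np

def apparent_length(p_top, p_bottom):
    """Projected (image-plane) length of a segment under a unit-focal
    pinhole camera; apparent size shrinks as 1/depth."""
    proj = lambda p: p[:2] / p[2]
    return float(np.linalg.norm(proj(p_top) - proj(p_bottom)))

# The same 2 m tall object observed at 5 m and at 10 m from the camera.
near = apparent_length(np.array([0.0, 2.0, 5.0]), np.array([0.0, 0.0, 5.0]))
far = apparent_length(np.array([0.0, 2.0, 10.0]), np.array([0.0, 0.0, 10.0]))
```

Doubling the distance halves the apparent size, so an image alone cannot distinguish a small near object from a large distant one.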
In summary, the field of monocular VO offers a rich landscape of technological possibilities, bounded by multifaceted challenges that span detection, estimation, scaling, real-time processing, and more. In another aspect, noise sensitivity refers to the impact of image noise on the performance of VO algorithms, which can degrade the accuracy of feature extraction and matching, ultimately affecting the estimated camera trajectory [39]. An uncertainty assessment is essential for evaluating the reliability of the estimated camera trajectory in VO. While traditional VO approaches often provide an analytical formula for uncertainty, this remains an open challenge for machine learning-based VO methods [40].
Data synchronization is another important aspect in monocular VO, especially when
integrating data from multiple sensors, such as cameras and inertial measurement units
(IMUs) [41]. Proper synchronization ensures that the data from different sensors are
accurately aligned in time, allowing for more precise and reliable trajectory estimation. In
some cases, hardware synchronization is used to align the data from different sensors to a
common clock, ensuring accurate data fusion and improved VO performance [41].
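When hardware synchronization is unavailable, a common software fallback is to resample the faster sensor onto the slower sensor's timestamps. A minimal sketch with illustrative rates and a toy gyroscope signal; real VIO pipelines additionally estimate the constant time offset between the two clocks.

```python
import numpy as np

def align_imu_to_camera(cam_t, imu_t, imu_vals):
    """Resample IMU readings onto camera timestamps by linear
    interpolation, a simple software alternative to hardware sync."""
    return np.interp(cam_t, imu_t, imu_vals)

imu_t = np.arange(0.0, 1.01, 0.01)     # 100 Hz IMU clock
imu_gyro_z = 2.0 * imu_t               # toy yaw-rate signal (rad/s)
cam_t = np.arange(0.005, 1.0, 0.05)    # 20 Hz camera, offset by 5 ms
gyro_at_frames = align_imu_to_camera(cam_t, imu_t, imu_gyro_z)
```

After resampling, every camera frame has a matched inertial reading, so the fusion step never pairs measurements taken at different instants.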
Achieving real-time performance is imperative for VO applications, yet it poses a challenge due to the computational intensity required for processing measurements. Uncertainty in real-time performance metrics can stem from variable environmental conditions that affect the speed and accuracy of feature detection and matching computations. For example, consider a self-driving car using VO for navigation: real-time performance is crucial because the car needs to make immediate decisions based on its surroundings, yet this is challenging due to the heavy computational load required to process the camera's measurements quickly and accurately.
These challenges not only define the current state of VO but also delineate the paths
for future research and exploration. By grappling with these complexities, the scientific
community continues to pave the way for more nuanced and powerful applications of
VO, extending its reach and impact across various domains. A summary of the various
approaches and their implications can be found in Table 1, offering a succinct overview of
the literature’s breadth and depth.

Table 1. A summary of the mentioned odometry techniques.

| Reference | Sensor Type | Method | Environmental Structure | Open Source | Key Points |
|---|---|---|---|---|---|
| [14] | LiDAR | Bundle Adjustment | Outdoor | Yes | Using LiDAR for camera feature tracks and keyframe-based motion estimation. Labeling is used for outlier rejection and landmark weighting. |
| [38] | Monocular | Ground Plane-Based Deep Learning | Outdoor | No | A ground plane and camera height-based divide-and-conquer method. A scale correction strategy reduces scale drift in VO. |
| [42] | LiDAR | Feature Extraction | Outdoor | No | A VO algorithm using a standard front end with camera tracking relative to triangulated landmarks; optimizing the camera poses and landmark map with range-sensor depth information resolves monocular scale ambiguity and drift. |
| [43] | Monocular | Feature Extraction | Indoor | No | A VO system utilizing a downward-facing camera, feature extraction, velocity-aware masking, and nonconvex optimization, enhanced with LED illumination and a ToF sensor, for improved accuracy and efficiency in mobile robot navigation. |
| [44] | LiDAR | Feature Extraction | Outdoor | Yes | LVI-SAM achieves real-time state estimation and map building with high accuracy and robustness. |
| [45] | LiDAR | Feature Extraction | Outdoor–Indoor | No | A multi-sensor odometry system for mobile platforms that integrates visual, LiDAR, and inertial data. Real time with fixed-lag smoothing. |
| [46] | LiDAR | Feature Extraction | Outdoor | No | A method combining LiDAR depth with monocular visual odometry, using photometric error minimization and point-line feature refinement, alongside LiDAR-based segmentation for improved pose estimation and drift reduction. |
| [47] | Monocular | Feature Extraction | Outdoor | Yes | The main innovation is a visual–inertial SLAM system that uses MAP estimation even during IMU initialization. |
| [48] | Monocular | Feature Extraction | Outdoor | No | The authors developed a lightweight scale recovery framework using an accurate ground plane estimate. The framework includes ground point extraction and aggregation algorithms for selecting high-quality ground points. |
| [49] | Monocular | Feature Extraction | Indoor | No | This paper presents VO using points and lines. Direct methods choose pixels with enough gradient to minimize photometric errors. |
| [50] | Monocular | Deep Learning Based | Outdoor | No | Combines unsupervised deep learning and scale recovery; trained with stereo image pairs but tested with monocular images. |
| [3] | Monocular | Deep Learning Based | Outdoor–Indoor | No | A self-supervised monocular depth estimation network for stereo videos, which aligns training image pairs with predictive brightness transformation parameters. |
| [51] | Monocular | Deep Learning Based | Outdoor | No | A VO system called DL Hybrid, which uses DL networks in image processing and geometric localization theory based on hybrid pose estimation methods. |
| [52] | Monocular | Deep Learning Based | Outdoor | No | A decoupled cascade structure and residual-based posture refinement in an unsupervised VO framework that estimates 3D camera positions by decoupling rotation, translation, and scale. |
| [9] | Monocular | Deep Learning Based | Outdoor | No | A supervised network with a feature encoder and pose regressor that takes multiple successive grayscale image stacks for training and enforces composite pose constraints. |
| [53] | Monocular | Deep Learning Based | Outdoor | Yes | A neural architecture that performs VO, object detection, and instance segmentation in a single thread (SimVODIS). |
| [54] | Monocular | Deep Learning Based | Outdoor | Yes | SelfVIO, a self-supervised deep learning-based VO and depth map recovery method using adversarial training and self-adaptive visual sensor fusion. |

4. Traditional Approaches
The scientific literature is rife with diverse methodologies aiming to overcome the
challenges outlined in the preceding section, particularly focusing on the problem of
accurate scale estimation. This issue has typically been addressed through the reliance on
knowledge regarding the height of the camera from the ground plane and the evaluation of
feature movements on that plane. Alternatively, some approaches have utilized additional
tools, such as LiDAR or depth sensors.
Within the domain of autonomous driving, precise vehicle motion estimation is a
crucial concern. Various powerful algorithms have been devised to address this need,
although most commonly, they depend on binocular imagery or LiDAR measurements.

In the following paragraphs, an overview of some prominent works associated with the
scaling challenge is provided, highlighting different strategies and technologies.
Tian et al. [55] made a significant contribution by developing a lightweight scale recovery framework for VO. This framework hinged on a ground plane estimate that excelled in both accuracy and robustness. By employing a meticulous ground point extraction technique, the framework ensured precision in the ground plane estimate. The selected points were then aggregated through a local sliding window and an innovative ground point aggregation algorithm. To translate the aggregated data into the correct scale, a Random Sample Consensus (RANSAC)-based optimizer was employed. This optimizer solved a least-squares problem, fine-tuning parameters to derive the correct scale, thus marrying optimization techniques with spatial analysis. The parameters for this fine-tuning are likely chosen based on experimental results to achieve the best performance.
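The RANSAC-plus-ground-plane idea can be sketched generically: fit a plane to candidate ground points despite outliers, read off the camera's height above it in the (arbitrary) VO scale, and fix the scale from the known physical mounting height. This is a hedged, synthetic illustration of the general recipe, not Tian et al.'s actual implementation.

```python
import numpy as np

def ransac_plane_height(points, iters=200, tol=0.02, rng=None):
    """RANSAC plane fit: repeatedly fit a plane to 3 random points, keep
    the plane with most inliers, and return the camera's distance to it
    (camera at the origin)."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers, best_height = 0, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n @ sample[0]                   # plane: n . x + d = 0
        inliers = int((np.abs(points @ n + d) < tol).sum())
        if inliers > best_inliers:
            best_inliers, best_height = inliers, abs(d)  # |d| = camera height
    return best_height

rng = np.random.default_rng(42)
# Unscaled VO ground points on the plane y = -0.8, plus off-plane outliers.
ground = np.column_stack([rng.uniform(-2, 2, 60),
                          np.full(60, -0.8) + rng.normal(0, 0.003, 60),
                          rng.uniform(2, 10, 60)])
outliers = rng.uniform(-2, 2, (8, 3))
h_est = ransac_plane_height(np.vstack([ground, outliers]))
scale = 1.6 / h_est   # known 1.6 m mounting height fixes the metric scale
```

Multiplying all VO translations by `scale` then yields a metric trajectory; the outliers are rejected because planes fit through them gather few inliers.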
H. Lee et al. [43] presented a VO system using a downward-facing camera. This system,
designed for mobile robots, integrates feature extraction, a novel velocity-aware masking
algorithm, and a nonconvex optimization problem to enhance pose estimation accuracy. It
employs cost-effective components, including an LED for illumination and a ToF sensor,
to improve feature tracking on various surfaces. The methodology combines efficient
feature selection with global optimization for motion estimation, demonstrating improved
accuracy and computational efficiency over the existing methods. The authors claimed the
experimental results validated its performance in diverse environments, showcasing its
potential for robust mobile robot navigation.
B. Fang et al. [46] proposed a method for enhancing monocular visual odometry through the integration of LiDAR depth information, aiming to overcome inaccuracies in feature-depth associations. The methodology involves a two-stage process: initial pose estimation through photometric error minimization, followed by pose refinement using point-line features for more accurate estimation. It employs ground and plane point segmentation from LiDAR data, optimizes frame-to-frame matching based on these features, and incorporates multi-frame optimization to reduce drift and enhance accuracy. According to the authors, the approach demonstrates improved pose estimation accuracy and robustness across diverse datasets, indicating its effectiveness in real-world scenarios.
Chiodini et al. [42] expanded the improvement on scale estimation by demonstrating a
flexible sensor fusion strategy. By merging data from a variety of depth sensors, including
Time-of-Flight (ToF) cameras and 2D and 3D LiDARs, the authors crafted a method that
broke free from the constraints of sensor-specific algorithms that pervade much of the
literature. This universal applicability is particularly significant for mobile systems without
specific sensors. The proposed approach optimized camera poses and landmark maps
using depth information, clearing up the scale ambiguity and drift that can be encountered
in monocular perception.
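A minimal sketch of this kind of scale recovery, assuming synthetic depths and a simple median-of-ratios estimator rather than the paper's full pose-and-landmark optimization:

```python
import numpy as np

rng = np.random.default_rng(0)
true_scale = 2.5

metric = rng.uniform(1.0, 10.0, size=40)        # range-sensor depths (m)
mono = metric / true_scale                      # VO depths, unknown scale
mono = mono + rng.normal(0.0, 0.01, size=40)    # triangulation noise

# median of per-point ratios: robust to a few bad depth associations
scale = np.median(metric / mono)
```

Multiplying the monocular trajectory and landmarks by `scale` restores metric units regardless of which depth sensor supplied the readings, which is the sensor-agnostic property the authors emphasize.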
LiDAR–monocular visual odometry (LIMO) was presented by Graeter et al. [14]. This
novel algorithm capitalizes on the integration of data from a monocular camera and a LiDAR
sensor to gauge vehicle motion. By leveraging LiDAR data to estimate the motion scale and
provide additional depth information, LIMO enhances both the accuracy and robustness of
VO. Real-world datasets were utilized to evaluate the proposed algorithm, and it exhibited
marked improvements over other state-of-the-art methods. The potential applications of
LIMO in fields like autonomous driving and robotics underscore the relevance and impact
of this research.
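The depth-association step that LIMO-style pipelines rely on, projecting LiDAR returns into the image and borrowing the nearest return's depth for a tracked feature, can be sketched as follows; the intrinsics, point cloud, and nearest-neighbour rule are illustrative assumptions:

```python
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])           # hypothetical camera intrinsics

def project(points_cam):
    """Pinhole projection of Nx3 camera-frame points to pixel coordinates."""
    uv = (K @ points_cam.T).T
    return uv[:, :2] / uv[:, 2:3], points_cam[:, 2]

# a few LiDAR returns already transformed into the camera frame (m)
lidar_cam = np.array([[0.5, 0.2, 5.0],
                      [-1.0, 0.1, 8.0],
                      [0.0, 0.0, 3.0]])
pix, depth = project(lidar_cam)

feature = np.array([390.0, 268.0])        # a tracked image feature
d2 = np.sum((pix - feature) ** 2, axis=1)
feat_depth = depth[np.argmin(d2)]         # depth borrowed from nearest return
```

Real systems refine this with local plane fits around the feature instead of a single nearest neighbour, but the projection-and-association pattern is the same.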
To mitigate the influence of outliers on feature detection and matching and en-
hance motion estimation, other researchers introduced data fusion with inertial mea-
surements. This visual–inertial odometry (VIO) integrated system is exemplified in works
like Shan et al. [44], which brought together LiDAR, visual, and inertial measurements
in a tightly coupled LiDAR–visual–inertial (LVI) odometry system. This holistic fusion,
achieved through a novel smoothing and mapping algorithm, elevates the system’s accuracy
and robustness. The proposal also introduced an innovative technique for estimating
extrinsic calibration parameters, further optimizing performance for applications like
autonomous driving and robotics.
Wisth et al. [45] and ORB-SLAM3 [47] further illustrated the technological advances
in multi-sensor odometry systems and real-time operation in various environments. The
use of factor graphs, dense mapping systems, and various sensors such as IMUs, visual
sensors, and LiDAR highlights the multifaceted approaches to challenges in motion and
depth estimation.
Fan et al. [56] introduced a monocular dense mapping system for visual–
inertial odometry, optimizing IMU preintegration and applying a nonlinear optimization-
based approach to improve trajectory estimation (Figure 2) and 3D reconstruction under
challenging conditions. By marginalizing frames within a sliding window, it manages
the computational complexity and combines an IMU and visual data to enhance the
depth estimation and map reconstruction accuracy. The authors claimed the method
outperforms vision-only approaches, particularly in environments with dynamic objects or
weak textures, and demonstrates superior performance in comparison to existing odometry
systems through evaluations of public datasets.

Figure 2. This figure demonstrates the accuracy and effectiveness of the proposed nonlinear optimization-
based monocular dense mapping system of VIO [56].
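The IMU preintegration idea, accumulating relative rotation, velocity, and position increments between keyframes so the back-end never re-integrates raw samples, can be illustrated with a planar toy. The sample values and the bias-free, gravity-free model are simplifying assumptions:

```python
import numpy as np

def preintegrate(acc, gyro, dt):
    """Accumulate relative increments (dR, dv, dp) from raw IMU samples
    between two keyframes, independent of the absolute start state.
    2-D toy: gyro is a yaw rate (rad/s), acc is body-frame (ax, ay)."""
    dR = np.eye(2)
    dv = np.zeros(2)
    dp = np.zeros(2)
    for a, w in zip(acc, gyro):
        dp = dp + dv * dt + 0.5 * (dR @ a) * dt**2
        dv = dv + (dR @ a) * dt
        c, s = np.cos(w * dt), np.sin(w * dt)
        dR = dR @ np.array([[c, -s], [s, c]])   # integrate rotation last
    return dR, dv, dp

# straight-line constant acceleration: ax = 1 m/s^2 for 1 s at 100 Hz
acc = [np.array([1.0, 0.0])] * 100
gyro = [0.0] * 100
dR, dv, dp = preintegrate(acc, gyro, dt=0.01)
# dv approaches (1, 0) m/s and dp approaches (0.5, 0) m, as kinematics predicts
```

Because these increments depend only on the raw measurements, a sliding-window optimizer can relinearize poses freely while reusing them, which is what keeps the computational cost bounded.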

Two additional pioneering works are by Huang et al. [48], who introduced optimization-
based online initialization and spatial–temporal calibration for VIO, and Zhou et al. [49],
who introduced ‘DPLVO: Direct Point-Line Monocular Visual Odometry’. The former focuses
on an intricate calibration process that aligns and interpolates camera and IMU measure-
ment data without geographical or temporal information. In contrast, the latter presents
an innovative technique that leverages point and line features directly, without needing a
feature descriptor, to achieve better accuracy and efficiency.
Collectively, these studies represent a robust and multifaceted exploration of tradi-
tional approaches in the realms of motion estimation, depth estimation, and scale recovery
within visual odometry (VO). The methodologies vary widely, each bringing unique contri-
butions to scientific discourse and providing promising avenues for ongoing research and
development. Their collective focus on enhancing precision, robustness, and computational
efficiency underscores the central challenges of the field and the diverse means by which
these can be overcome.
5. Machine Learning-Based Approaches


Machine learning-based approaches to VO are redefining the field with innovative
techniques that harness the power of neural networks. Generally, methods in this section
can be classified into two distinct categories: full deep learning approaches that utilize
neural networks almost exclusively, and semi-deep learning approaches that combine deep
learning with more traditional computer vision techniques.

5.1. Full Deep Learning Approaches


Full deep learning approaches leverage the complexity and flexibility of neural net-
works to solve challenging VO tasks.
Yang et al. [57] pioneered a method called D3VO. This deep learning-based approach
for VO estimates both camera motion and the 3D structure of the environment using
just a single camera input. Comprising three specialized deep neural networks, D3VO
handles depth prediction, pose estimation, and uncertainty estimation. D3VO’s method
of uncertainty estimation involves predicting a posterior probability distribution for each
pixel, which helps in adaptively weighting the residuals in the presence of challenging
conditions, like non-Lambertian surfaces or moving objects. Despite its performance edge
over existing VO methods in various benchmarks, D3VO faces significant challenges, such
as the need for extensive labeled training data, complexities in securing accurate depth
labels, and struggles with low-texture or featureless environments.
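The uncertainty-weighted residual mechanism can be illustrated with the per-pixel Gaussian negative log-likelihood commonly used for this purpose; this is a generic sketch, not D3VO's exact loss:

```python
import numpy as np

def nll_loss(residual, log_sigma2):
    """Per-pixel Gaussian negative log-likelihood: the predicted variance
    down-weights the squared residual at the cost of a log penalty."""
    return residual**2 / np.exp(log_sigma2) + log_sigma2

# a pixel on a moving object yields a large photometric residual;
# predicting high uncertainty there reduces its pull on the pose estimate
r_outlier = 3.0
loss_unit = nll_loss(r_outlier, 0.0)           # fixed sigma^2 = 1
loss_adapt = nll_loss(r_outlier, np.log(9.0))  # learned sigma^2 = r^2
```

The loss is minimized when the predicted variance matches the squared residual, so the network learns to flag exactly the non-Lambertian or dynamic pixels whose residuals should count less.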
Ban et al. [51] contributed a unique perspective by integrating both the depth and
optical flow in a deep learning-based method for VO (Figure 3). This intricate algorithm first
extracts image features, which are then processed through a neural network to estimate the
depth and optical flow. The combination of these elements enables the accurate computation
of motion. However, a major drawback is the substantial requirement for training data,
which is pivotal for effectively training the neural network.

Figure 3. Images in the left column, arranged vertically, are as follows: (a) optical flow map in forward
order, (b) optical flow map in reverse order, (c) points of instantaneous optical flow superimposed
on the original image, (d) map showing monocular depth, (e) map illustrating the matching of key
points in a pair of images, and (f) map depicting the reconstructed trajectory, where the estimated
path is indicated by a blue line [51].
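How depth and flow jointly constrain motion can be seen in a deliberately simplified case: for a purely lateral translation of a calibrated camera viewing static points, each feature's horizontal flow is approximately −f·tx/Z, so the translation follows by least squares. The focal length, depths, and noise below are invented for illustration and stand in for the network outputs:

```python
import numpy as np

f = 700.0                                   # assumed focal length (px)
tx_true = 0.3                               # lateral camera translation (m)

Z = np.array([4.0, 7.0, 12.0, 25.0])        # per-feature depths (m)
flow_u = -f * tx_true / Z                   # horizontal flow of static points
flow_u = flow_u + np.array([1e-3, -2e-3, 0.0, 1e-3])   # tracking noise (px)

# least-squares translation from flow and depth (design vector a = -f / Z)
a = -f / Z
tx = float(a @ flow_u) / float(a @ a)
```

The general six-degree-of-freedom case replaces the scalar with the full flow Jacobian, but the principle, that metric depth turns pixel flow into metric motion, is unchanged.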
In a novel approach, Kim et al. [53] designed a method to perform simultaneous VO,
object detection, and instance segmentation. By employing a deep neural network, the
method not only estimates the camera pose but also detects objects within the scene, all in
real time. While promising, this approach also faces its own set of challenges, particularly
the extensive need for training data and potential difficulties with occlusions and clutter.
A notable trend in this category involves self-supervised learning as a solution to
the data scarcity problem. Many supervised methods for VIO and depth map estimation
necessitate large labeled datasets. To mitigate this issue, the authors in [54] proposed a
self-supervised method that leverages scene consistency in shape and lighting. Utilizing
a deep neural network, this method estimates parameters such as camera pose, velocity,
and depth without labeled data (Figure 4). Still, challenges persist, such as the accuracy of
inertial measurements affected by noise and the depth estimation accuracy hampered by
occlusions and reflective surfaces.

Figure 4. Sample trajectories comparing the unsupervised learning approach SelfVIO with monocular
OKVIS, VINS, and the ground truth in meter scale using EuRoC dataset MH-03 and MH-05 sequences
in [54].

5.2. Semi-Deep Learning Approaches


Semi-deep learning approaches blend the power of deep learning with traditional tech-
niques, leading to methods that are sometimes more adaptable to real-world constraints.
Zhou et al. [38] addressed the unique challenge of absolute scale estimation in VO
using ground plane-based features. By identifying the ground plane and extracting its
features, they calculated the distance to the camera, assuming certain constants such as flat
ground and the known camera height. Using a convolutional neural network (CNN), the
method estimates the scale factor, offering potential applications in autonomous driving
and robotics.
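The geometric idea underneath the CNN component is that a known metric camera height fixes the scale: the ratio between the mounted height and the height recovered in the up-to-scale reconstruction is the scale factor. A minimal sketch assuming an axis-aligned ground plane and an invented scale:

```python
import numpy as np

known_height = 1.65        # metres: assumed fixed camera mounting height
s_true = 0.4               # unknown scale monocular VO happens to recover

# metric ground points (camera at origin; y = -h on the flat ground plane)
rng = np.random.default_rng(1)
ground = rng.uniform(-5.0, 5.0, size=(50, 3))
ground[:, 1] = -known_height

pts_vo = ground * s_true   # what an up-to-scale reconstruction delivers

# camera height in VO units: distance to the (here axis-aligned) plane
h_vo = np.mean(np.abs(pts_vo[:, 1]))
scale = known_height / h_vo       # metric scale factor for the trajectory
```

In practice, the plane must first be fitted (e.g., by RANSAC) and is rarely axis-aligned, which is where learned ground segmentation earns its keep.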
Lin et al. [52] provided an unsupervised method for VO that ingeniously decouples
camera pose estimation into separate rotation and translation components. After the initial
feature extraction and essential matrix calculation, a deep learning-based network handles
the distinct estimation of rotation and translation. While groundbreaking, this approach is
not immune to challenges, including motion blur and changes in the lighting conditions.
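The classical decoupling that such hybrids build on recovers rotation and translation candidates from the essential matrix by SVD. A self-contained sketch with a synthetic essential matrix and no cheirality check:

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def decompose_essential(E):
    """Return the two rotation candidates and the translation direction
    from E = U diag(1,1,0) V^T; in practice a cheirality test on
    triangulated points selects among the four (R, +-t) combinations."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    return [U @ W @ Vt, U @ W.T @ Vt], U[:, 2]

# synthetic ground truth: a small yaw plus a forward/lateral translation
th = 0.1
R_true = np.array([[np.cos(th), 0.0, np.sin(th)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(th), 0.0, np.cos(th)]])
t_true = np.array([1.0, 0.0, 0.2])
t_true = t_true / np.linalg.norm(t_true)   # scale is unobservable, so normalize

Rs, t = decompose_essential(skew(t_true) @ R_true)
```

Note that the translation is recovered only up to scale, which is exactly the ambiguity the semi-deep methods above use learning to resolve.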
Adding to the repertoire of semi-deep learning approaches, Ref. [9] introduced the
Windowed Pose Optimization Network (WPO-Net) for VO estimation. In this method,
features are extracted from input images, followed by relative pose computation, with a
WPO-Net optimizing the pose over a sliding window. Though promising, the computa-
tional complexity of the WPO-Net stands as a substantial hurdle, potentially impeding
real-time applications.
In summary, machine learning-based approaches are forging new pathways in VO,
where full deep learning methods are stretching the capacities of neural networks, and semi-
deep learning methods are merging traditional techniques with contemporary progressions.
A salient distinction emerges in the realm of the uncertainty assessment: traditional ap-
proaches often allow for an analytical derivation of uncertainty, providing clear metrics
for measurement confidence. In contrast, deep learning methods grapple with this as an
open problem, with the quantification of uncertainty remaining an elusive goal in neural
network-based predictions. The pursuit of uncertainty estimation in deep learning remains
a vital research area, as it is critical for the reliability and safety of VO systems in practical
applications. The ongoing refinement of these methods underscores a vibrant field ripe
with opportunities for innovation, notwithstanding the substantial hurdles that persist.

6. Uncertainty of Positioning Provided by Monocular Visual Odometry


In Section 3, the uncertainty in monocular VO and its various sources were discussed.
Aksoy and Alatan [58] addressed this by proposing an inertially aided visual odometry
system that operates without the need for heuristics or parameter tuning. This system, lever-
aging inertial measurements for motion prediction and the EPnP algorithm for pose com-
putation, minimizes assumptions and computes uncertainties for all estimated variables.
They demonstrated high performance in their system, without relying on data-dependent
tuning. Building on the theme of measurement precision, Ross et al. [59] delved into the
intricacies of covariance estimation in a feature-based stereo visual odometry algorithm.
Their approach involved learning odometry errors through Gaussian process regression
(GPR), which facilitated the assessment of positioning errors alongside the monitoring
of VO confidence metrics, offering insights into the uncertainty of VO position estimates.
Gakne and O’Keefe [60] tackled the scale factor issue in a monocular VO using a 3D city
model. They proposed a method dealing with the camera height variation to improve the
accuracy of the scale factor estimation. They found that their method provided an accurate
solution but up to a scale only. Choi et al. [61] proposed a robust monocular VO method
for road vehicles using uncertain perspective projection. They modeled the uncertainty
associated with the inverse perspective projection of image features and used a parameter
space voting scheme to find a consensus on the vehicle state among tracked features. They
found that their method was suitable for any standard camera that views part of the road
surface in front of or behind the vehicle.
While the methods proposed in these studies differ, they all aim to improve the
accuracy of monocular VO by addressing the issue of scale uncertainty. The results of
these studies show that it is possible to estimate the uncertainty of positioning provided by
monocular VO and improve its accuracy. However, more research is needed to develop
robust and reliable methods that can be used in different applications.
The uncertainty model for monocular VO can be mathematically formulated as follows.
Let Xt represent the estimated pose of the vehicle at time t and Zt denote the visual
measurements obtained from the monocular camera. The uncertainty associated with the
visual measurements can be represented by the covariance matrix Rt . Additionally, the
uncertainty on the vehicle motion can be captured by the covariance matrix Qt . The relative
vehicle motion can be estimated by considering the uncertainty on the backprojection of
the ground plane features and the uncertainty on the vehicle motion, as proposed by Van
Hamme et al. [62]. This can be mathematically expressed as:

Xt = f(Xt−1, Zt, Rt, Qt)

where f represents the function that estimates the pose of the vehicle at time t based on
the previous pose, visual measurements, and associated uncertainties. The uncertainty
model integrates the uncertainty of visual measurements and the uncertainty of vehicle
motion to provide a more accurate assessment of the positioning in monocular VO. The
uncertainty on the backprojection of ground plane features and the uncertainty on the
vehicle motion are crucial factors in accurately estimating the relative vehicle motion.
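Under a linearised f, this model propagates covariance in the familiar predict-and-update pattern: the motion noise Qt inflates the pose covariance at every step, and each visual measurement with covariance Rt contracts it again. A scalar-per-axis sketch with invented Qt and Rt values:

```python
import numpy as np

F = np.eye(2)                       # motion Jacobian (toy: identity)
H = np.eye(2)                       # measurement Jacobian
Q = np.diag([0.04, 0.04])           # motion noise covariance Q_t
R = np.diag([0.09, 0.09])           # visual measurement covariance R_t

P = np.diag([0.01, 0.01])           # initial pose covariance
for _ in range(50):
    P = F @ P @ F.T + Q             # predict: uncertainty grows each step
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    P = (np.eye(2) - K @ H) @ P     # update: measurement bounds the growth

# the covariance settles at a steady state instead of growing without bound
```

Without the update step, P grows by Q every iteration, which is the covariance-level picture of the unbounded drift of dead-reckoned monocular VO.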
The Hough-like parameter space vote is employed to extract motion parameters from
the uncertainty models, contributing to the robustness and reliability of the proposed
method in [61]. Despite the advancements and insights provided by the existing research,
a notable gap in the literature is the lack of a comprehensive sensitivity analysis regarding
the various sources of uncertainty in monocular VO. The current models and studies
often overlook the full spectrum of factors that contribute to uncertainty, ranging from
atmospheric conditions to sensor noise. This limitation highlights the need for a more
holistic approach to uncertainty modeling in monocular VO. A complete model would not
only account for the direct uncertainties in visual measurements and vehicle motion but also
extend to encompass external factors, like atmospheric disturbances, lighting variations,
and intrinsic sensor inaccuracies. Such a model would enable a deeper understanding of
how these diverse factors interact and influence the overall uncertainty in VO systems,
paving the way for the development of more sophisticated and resilient techniques that
can adapt to a wider range of environmental conditions and application scenarios.

7. Discussion
The implementation and performance of various machine learning-based methods for
VO have led to interesting observations and challenges, particularly concerning feature
extraction, noise sensitivity, depth estimation, and data synchronization.
The difficulty in feature extraction at high speeds is highlighted in several works [3,48,51].
This challenge is exacerbated by factors such as the optical flow on the road and increased
motion blur when the vehicle moves fast. Such conditions make feature tracking an arduous
task, allowing for only a limited number of valid depth estimates. Some methods have
attempted to stabilize results by tuning the feature matcher for specific scenarios, like
highways. Still, this often leads to complications in urban settings, where feature matches
might become erratic.
Standstill detection, an essential aspect of VO, is another area fraught with difficulty.
When the vehicle speed is low, errors can occur if the standstill detection is not well
calibrated. The nature of the driving environment, such as open spaces where only the
road is considered suitable for depth estimation, adds further complexity to the problem.
The reliance on homography decomposition, as seen in [38], has been found to be
highly sensitive to noise. This sensitivity arises from the noisy feature matches obtained
from low-textured road surfaces and the multitude of parameters derived from the homog-
raphy matrix. The task of recovering both camera movement and ground plane geometry
is a significant challenge that can affect numerical stability. Moreover, any method relying
on the ground plane assumption is vulnerable to failure if the ground plane is obscured or
deviates from the assumed model. This reveals the intrinsic limitation of such methods in
varying environmental conditions.
A remarkable development in this field is ORB-SLAM3 [47], which has established
itself as a versatile system capable of visual–inertial and multimap SLAM using various
camera models. Unlike conventional VO systems, ORB-SLAM3’s ability to utilize all
previous information from widely separated or prior mapping sessions has enhanced
accuracy, showcasing a significant advancement in the field.
Deep learning-based approaches to VO, such as those using CNNs and RNNs, have
treated VO and depth recovery predominantly as supervised learning problems [3,50,52].
While these methods excel in camera motion estimation and optical flow calculations, they
are constrained by the challenge of obtaining ground truth data across diverse scenes. Such
data are often hard to acquire or expensive, limiting the scalability of these approaches.
The issue of timestamp synchronization also emerges as a critical concern, as high-
lighted in [9]. Delays in timestamping due to factors like data transfer, sensor latency, and
operating system overhead can lead to discrepancies in visual–inertial measurements.
Even with hardware time synchronization, issues like clock skew can cause mismatches
between camera and IMU timestamps [63]. Moreover, synchronization challenges extend
to systems using LiDAR scanners, where the alignment with corresponding camera images
must be precise. Any deviation in this synchronization can lead to erroneous depth data
and subsequent prediction artifacts.
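A common mitigation, once a constant camera–IMU offset has been estimated, is to resample the inertial stream at the corrected frame time; below is a minimal sketch with a synthetic yaw-rate signal and an assumed offset value:

```python
import numpy as np

# IMU stream on its own clock (200 Hz) and a camera frame whose
# timestamp lags by a constant offset (e.g., transfer latency)
imu_t = np.arange(0.0, 1.0, 0.005)
imu_gyro = np.sin(2 * np.pi * imu_t)     # synthetic yaw-rate signal (rad/s)

cam_stamp = 0.500
offset = 0.012                           # assumed camera-IMU time offset (s)

# align: sample the IMU signal at the corrected camera time
gyro_at_frame = np.interp(cam_stamp + offset, imu_t, imu_gyro)
naive = np.interp(cam_stamp, imu_t, imu_gyro)   # ignores the offset
```

Even a 12 ms offset produces a clearly different angular rate at the frame time, which is why online temporal calibration, as in Huang et al. [48], matters for tightly coupled fusion.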
In summary, the machine learning-based approaches to VO chart an intriguing course
of breakthroughs and obstacles. Notable progress in employing deep learning and the
advent of sophisticated systems such as ORB-SLAM3 mark the current era. Nevertheless,
the domain wrestles with intricate issues concerning feature extraction, noise sensitivity,
data synchronization, and the procurement of reliable ground truth data. Central to these
challenges is the assessment of uncertainty: traditional VO methods could offer probabilis-
tic insights into measurement accuracy, but the integration of uncertainty quantification
within deep learning remains a nascent and critical area of research. In traditional ap-
proaches, the provided uncertainty models primarily consider sensor noise, neglecting
other significant sources of uncertainty. These overlooked elements include factors such
as lighting conditions and environmental parameters, which also play a crucial role in
the overall accuracy and reliability of the system. A more profound understanding and
effective management of uncertainty could significantly enhance the reliability and appli-
cability of VO technologies, highlighting an essential frontier for ongoing investigative
efforts. As such, there is a pressing impetus for continuous research and development to
refine the robustness of VO systems and their adaptability to the unpredictable dynamics
of real-world environments.
Future research in the field of VO and machine learning is set to tackle key challenges,
such as improving feature extraction under difficult conditions, enhancing noise and
uncertainty management, developing versatile depth estimation methods, and achieving
precise data synchronization. There is a notable demand for novel feature extraction
algorithms that perform well in varied environments, alongside more sophisticated models
for noise filtering and uncertainty handling. Addressing depth estimation limitations and
refining synchronization techniques for integrating multiple sensor inputs are also critical.
Importantly, incorporating uncertainty quantification directly into deep learning models
for VO could significantly boost system reliability and utility across different applications.
These research directions promise to elevate the efficacy and adaptability of VO systems,
making them more suited for the complexities of real-world deployment.

8. Conclusions
In conclusion, this paper has provided an overview of traditional techniques and deep
learning-based methodologies for monocular VO, with an emphasis on displacement mea-
surement applications. It has detailed the fundamental concepts and general procedures
for VO implementation and highlighted the research challenges inherent in VO, including
scale estimation and ground plane considerations. This paper has shed light on a range
of methodologies, underscoring the diversity of approaches aimed at overcoming these
challenges. A focus has been placed on the assessment of uncertainty in VO, acknowledging
the need for further research to develop robust and reliable methods that can be used in
different applications. This paper concludes by emphasizing the importance of continued
research and innovation in this field, particularly in the realm of uncertainty assessment,
to enhance the reliability and applicability of VO technologies. Such advancements have
the potential to contribute significantly to a wide range of applications in robotics, aug-
mented reality, and navigation systems, paving the way for more nuanced and powerful
applications of VO across various domains.

Author Contributions: A.N.: Played a central role in the conception, design, execution, and coordina-
tion of this research project. Led the data collection, analysis, interpretation of findings, drafting of
this manuscript, and critical revision of its content. F.P.: Contributed to the writing of the original
draft, validation, writing—review and editing, and supervision. I.A.: Participated in aspects of
this project relevant to his expertise, contributing to the overall research and development of this
study. P.D.: Provided supervision and insight into this study’s direction and contributed to the
overall intellectual framework of this research. L.D.V.: Offered senior expertise in supervising this
project, contributing to its conceptualization and ensuring this study adhered to high academic
standards. Each author’s unique contribution was essential to this project’s success, complementing
and ensuring the comprehensive coverage and depth of this study. All authors have read and agreed
to the published version of the manuscript.
Funding: This research was partially funded by the NATO Science for Peace and Security Programme,
under the Multi-Year Project G5924, titled “Inspection and security by Robots interacting with
Infrastructure digital twinS (IRIS)”.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110.
[CrossRef]
2. Zou, D.; Tan, P.; Yu, W. Collaborative visual SLAM for multiple agents: A brief survey. Virtual Real. Intell. Hardw. 2019, 1, 461–482.
[CrossRef]
3. Yang, G.; Wang, Y.; Zhi, J.; Liu, W.; Shao, Y.; Peng, P. A Review of Visual Odometry in SLAM Techniques. In Proceedings of the
2020 International Conference on Artificial Intelligence and Electromechanical Automation (AIEA), Tianjin, China, 26–28 June
2020; pp. 332–336.
4. Razali, M.R.; Athif, A.; Faudzi, M.; Shamsudin, A.U. Visual Simultaneous Localization and Mapping: A review. PERINTIS
eJournal 2022, 12, 23–34.
5. Agostinho, L.R.; Ricardo, N.M.; Pereira, M.I.; Hiolle, A.; Pinto, A.M. A Practical Survey on Visual Odometry for Autonomous
Driving in Challenging Scenarios and Conditions. IEEE Access 2022, 10, 72182–72205. [CrossRef]
6. Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [CrossRef]
7. Ma, L.; Meng, D.; Zhao, S.; An, B. Visual localization with a monocular camera for unmanned aerial vehicle based on landmark
detection and tracking using YOLOv5 and DeepSORT. Int. J. Adv. Robot. Syst. 2023, 20. https://. [CrossRef]
8. Yousif, K.; Bab-Hadiashar, A.; Hoseinnezhad, R. An overview to visual odometry and visual SLAM: Applications to mobile
robotics. Intell. Ind. Syst. 2015, 1, 289–311. [CrossRef]
9. Gadipudi, N.; Elamvazuthi, I.; Lu, C.K.; Paramasivam, S.; Su, S.; Yogamani, S. WPO-Net: Windowed Pose Optimization Network
for Monocular Visual Odometry Estimation. Sensors 2021, 21, 8155. [CrossRef]
10. Xu, Z. Stereo Visual Odometry with Windowed Bundle Adjustment; University of California: Los Angeles, CA, USA, 2015.
11. Tsintotas, K.A.; Bampis, L.; Gasteratos, A. The revisiting problem in simultaneous localization and mapping: A survey on visual
loop closure detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19929–19953. [CrossRef]
12. Civera, J.; Davison, A.J.; Montiel, J.M.M. Inverse Depth Parametrization for Monocular SLAM. IEEE Trans. Robot. 2008,
24, 932–945. [CrossRef]
13. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans.
Robot. 2017, 33, 1255–1262. [CrossRef]
14. Graeter, J.; Wilczynski, A.; Lauer, M. LIMO: Lidar-Monocular Visual Odometry. In Proceedings of the 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7872–7879.
15. Scaramuzza, D.; Fraundorfer, F. Visual Odometry [Tutorial]. IEEE Robot. Autom. Mag. 2011, 18, 80–92. [CrossRef]
16. Fraundorfer, F.; Scaramuzza, D. Visual Odometry: Part II: Matching, Robustness, Optimization, and Applications. IEEE Robot.
Autom. Mag. 2012, 19, 78–90. [CrossRef]
17. Basiri, A.; Mariani, V.; Glielmo, L. Enhanced V-SLAM combining SVO and ORB-SLAM2, with reduced computational complexity,
to improve autonomous indoor mini-drone navigation under varying conditions. In Proceedings of the IECON 2022—48th
Annual Conference of the IEEE Industrial Electronics Society, Brussels, Belgium, 17–20 October 2022; pp. 1–7. [CrossRef]
18. He, M.; Zhu, C.; Huang, Q.; Ren, B.; Liu, J. A review of monocular visual odometry. Vis. Comput. 2020, 36, 1053–1065. [CrossRef]
19. Aqel, M.O.; Marhaban, M.H.; Saripan, M.I.; Ismail, N.B. Review of visual odometry: Types, approaches, challenges, and
applications. SpringerPlus 2016, 5, 1897. [CrossRef]
20. Pottier, C.; Petzing, J.; Eghtedari, F.; Lohse, N.; Kinnell, P. Developing digital twins of multi-camera metrology systems in Blender.
Meas. Sci. Technol. 2023, 34, 075001. [CrossRef]
21. Feng, W.; Zhao, S.Z.; Pan, C.; Chang, A.; Chen, Y.; Wang, Z.; Yang, A.Y. Digital Twin Tracking Dataset (DTTD): A New RGB+
Depth 3D Dataset for Longer-Range Object Tracking Applications. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3288–3297.
22. Sundby, T.; Graham, J.M.; Rasheed, A.; Tabib, M.; San, O. Geometric Change Detection in Digital Twins. Digital 2021, 1, 111–129.
[CrossRef]
23. Döbrich, O.; Brauner, C. Machine vision system for digital twin modeling of composite structures. Front. Mater. 2023, 10, 1154655.
[CrossRef]
24. Benzon, H.H.; Chen, X.; Belcher, L.; Castro, O.; Branner, K.; Smit, J. An Operational Image-Based Digital Twin for Large-Scale
Structures. Appl. Sci. 2022, 12, 3216. [CrossRef]
25. Wang, X.; Xue, F.; Yan, Z.; Dong, W.; Wang, Q.; Zha, H. Continuous-time stereo visual odometry based on dynamics model. In
Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg,
Germany, 2018; pp. 388–403.
26. Yang, Q.; Qiu, C.; Wu, L.; Chen, J. Image Matching Algorithm Based on Improved FAST and RANSAC. In Proceedings of the
2021 IEEE International Conference on Mechatronics and Automation (ICMA), Takamatsu, Japan, 8–11 August 2021; pp. 142–147.
[CrossRef]
27. Lam, S.K.; Jiang, G.; Wu, M.; Cao, B. Area-Time Efficient Streaming Architecture for FAST and BRIEF Detector. IEEE Trans.
Circuits Syst. II Express Briefs 2019, 66, 282–286. [CrossRef]
28. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011
International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [CrossRef]
29. Leutenegger, S.; Chli, M.; Siegwart, R.Y. BRISK: Binary Robust invariant scalable keypoints. In Proceedings of the 2011
International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2548–2555. [CrossRef]
30. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the
IJCAI’81, 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 2,
pp. 674–679.
31. Mohr, R.; Triggs, B. Projective Geometry for Image Analysis. In Proceedings of the XVIIIth International Symposium on
Photogrammetry & Remote Sensing (ISPRS ’96), Vienna, Austria, 9–19 July 1996; Tutorial given at International Symposium on
Photogrammetry & Remote Sensing.
32. Ma, Y.; Soatto, S.; Kosecká, J.; Sastry, S. An Invitation to 3-D Vision: From Images to Geometric Models; Interdisciplinary Applied
Mathematics; Springer: New York, NY, USA, 2012.
33. Lozano, R. Unmanned Aerial Vehicles: Embedded Control; ISTE, Wiley: Denver, CO, USA, 2013.
34. Abaspur Kazerouni, I.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022,
205, 117734. [CrossRef]
35. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE
International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22.
36. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of
Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [CrossRef]
37. Yang, K.; Fu, H.T.; Berg, A.C. Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature
Reconstruction. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3403–3412.
38. Zhou, D.; Dai, Y.; Li, H. Ground-Plane-Based Absolute Scale Estimation for Monocular Visual Odometry. IEEE Trans. Intell.
Transp. Syst. 2020, 21, 791–802. [CrossRef]
39. Cao, L.; Ling, J.; Xiao, X. Study on the influence of image noise on monocular feature-based visual slam based on ffdnet. Sensors
2020, 20, 4922. [CrossRef] [PubMed]
40. Qiu, X.; Zhang, H.; Fu, W.; Zhao, C.; Jin, Y. Monocular visual-inertial odometry with an unbiased linear system model and robust
feature tracking front-end. Sensors 2019, 19, 1941. [CrossRef] [PubMed]
41. Jinyu, L.; Bangbang, Y.; Danpeng, C.; Nan, W.; Guofeng, Z.; Hujun, B. Survey and evaluation of monocular visual-inertial SLAM
algorithms for augmented reality. Virtual Real. Intell. Hardw. 2019, 1, 386–410. [CrossRef]
42. Chiodini, S.; Giubilato, R.; Pertile, M.; Debei, S. Retrieving Scale on Monocular Visual Odometry Using Low-Resolution Range
Sensors. IEEE Trans. Instrum. Meas. 2020, 69, 5875. [CrossRef]
43. Lee, H.; Lee, H.; Kwak, I.; Sung, C.; Han, S. Effective Feature-Based Downward-Facing Monocular Visual Odometry. IEEE Trans.
Control. Syst. Technol. 2024, 32, 266–273. [CrossRef]
44. Shan, T.; Englot, B.; Ratti, C.; Rus, D. LVI-SAM: Tightly-coupled Lidar-Visual-Inertial Odometry via Smoothing and Mapping.
In Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021; Institute of
Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 5692–5698. [CrossRef]
45. Wisth, D.; Camurri, M.; Das, S.; Fallon, M. Unified Multi-Modal Landmark Tracking for Tightly Coupled Lidar-Visual-Inertial
Odometry. IEEE Robot. Autom. Lett. 2021, 6, 1004–1011. [CrossRef]
46. Fang, B.; Pan, Q.; Wang, H. Direct Monocular Visual Odometry Based on Lidar Vision Fusion. In Proceedings of the 2023 WRC
Symposium on Advanced Robotics and Automation (WRC SARA), Beijing, China, 19 August 2023; IEEE: Piscataway, NJ, USA,
2023; pp. 256–261.
47. Campos, C.; Elvira, R.; Rodríguez, J.J.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual,
Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [CrossRef]
48. Huang, W.; Wan, W.; Liu, H. Optimization-Based Online Initialization and Calibration of Monocular Visual-Inertial Odometry
Considering Spatial-Temporal Constraints. Sensors 2021, 21, 2673. [CrossRef]
Sensors 2024, 24, 1274 17 of 17
49. Zhou, L.; Wang, S.; Kaess, M. DPLVO: Direct Point-Line Monocular Visual Odometry. IEEE Robot. Autom. Lett. 2021, 6, 7113. [CrossRef]
50. Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning. In Proceedings
of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018.
51. Ban, X.; Wang, H.; Chen, T.; Wang, Y.; Xiao, Y. Monocular Visual Odometry Based on Depth and Optical Flow Using Deep
Learning. IEEE Trans. Instrum. Meas. 2021, 70, 2501619. [CrossRef]
52. Lin, L.; Wang, W.; Luo, W.; Song, L.; Zhou, W. Unsupervised monocular visual odometry with decoupled camera pose estimation.
Digit. Signal Process. Rev. J. 2021, 114. [CrossRef]
53. Kim, U.H.; Kim, S.H.; Kim, J.H. SimVODIS: Simultaneous Visual Odometry, Object Detection, and Instance Segmentation. IEEE
Trans. Pattern Anal. Mach. Intell. 2022, 44, 428–441. [CrossRef]
54. Almalioglu, Y.; Turan, M.; Saputra, M.R.U.; de Gusmão, P.P.; Markham, A.; Trigoni, N. SelfVIO: Self-supervised deep monocular
Visual–Inertial Odometry and depth estimation. Neural Netw. 2022, 150, 119–136. [CrossRef]
55. Tian, R.; Zhang, Y.; Zhu, D.; Liang, S.; Coleman, S.; Kerr, D. Accurate and Robust Scale Recovery for Monocular Visual Odometry
Based on Plane Geometry. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an,
China, 30 May–5 June 2021. [CrossRef]
56. Fan, C.; Hou, J.; Yu, L. A nonlinear optimization-based monocular dense mapping system of visual-inertial odometry. Meas. J.
Int. Meas. Confed. 2021, 180, 109533. [CrossRef]
57. Yang, N.; von Stumberg, L.; Wang, R.; Cremers, D. D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual
Odometry. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA,
USA, 13–19 June 2020; pp. 1278–1289. [CrossRef]
58. Aksoy, Y.; Alatan, A.A. Uncertainty modeling for efficient visual odometry via inertial sensors on mobile devices. In Proceedings
of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; IEEE: Piscataway, NJ,
USA, 2014; pp. 3397–3401.
59. Ross, D.; De Petrillo, M.; Strader, J.; Gross, J.N. Uncertainty estimation for stereo visual odometry. In Proceedings of the
34th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2021), Online, 20–24
September 2021; pp. 3263–3284.
60. Gakne, P.V.; O’Keefe, K. Tackling the scale factor issue in a monocular visual odometry using a 3D city model. In Proceedings of
the ITSNT 2018, International Technical Symposium on Navigation and Timing, Toulouse, France, 13–16 November 2018.
61. Van Hamme, D.; Goeman, W.; Veelaert, P.; Philips, W. Robust monocular visual odometry for road vehicles using uncertain
perspective projection. EURASIP J. Image Video Process. 2015, 2015, 10. [CrossRef]
62. Van Hamme, D.; Veelaert, P.; Philips, W. Robust visual odometry using uncertainty models. In Proceedings of the Inter-
national Conference on Advanced Concepts for Intelligent Vision Systems, Ghent, Belgium, 22–25 August 2011; Springer:
Berlin/Heidelberg, Germany, 2011; pp. 1–12.
63. Brzozowski, B.; Daponte, P.; De Vito, L.; Lamonaca, F.; Picariello, F.; Pompetti, M.; Tudosa, I.; Wojtowicz, K. A remote-controlled
platform for UAS testing. IEEE Aerosp. Electron. Syst. Mag. 2018, 33, 48–56. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.