Article
From Pixels to Precision: A Survey of Monocular Visual
Odometry in Digital Twin Applications †
Arman Neyestani, Francesco Picariello , Imran Ahmed , Pasquale Daponte and Luca De Vito *
Abstract: This survey provides a comprehensive overview of traditional techniques and deep learning-
based methodologies for monocular visual odometry (VO), with a focus on displacement measure-
ment applications. This paper outlines the fundamental concepts and general procedures for VO
implementation, including feature detection, tracking, motion estimation, triangulation, and trajec-
tory estimation. This paper also explores the research challenges inherent in VO implementation,
including scale estimation and ground plane considerations. The scientific literature is rife with
diverse methodologies aiming to overcome these challenges, particularly focusing on the problem of
accurate scale estimation. This issue has been typically addressed through the reliance on knowledge
regarding the height of the camera from the ground plane and the evaluation of feature movements
on that plane. Alternatively, some approaches have utilized additional tools, such as LiDAR or depth
sensors. This survey of approaches concludes with a discussion of future research challenges and
opportunities in the field of monocular visual odometry.
Keywords: monocular; localization; feature based; odometry; survey; machine learning; deep
learning; measurement
Figure 1. The diagram depicts the camera maneuvers used to update the digital twin representation. The starting position $v(t_i)$ of the present path segment is depicted by a blue arrow at the moment $t_i$. The position for each snapshot is calculated using the parameters $\theta_t$. Subsequently, the map points are reprojected onto every snapshot, and the reprojection discrepancy $r(\Phi(i), \theta_t, t)$ is minimized to recover the correct path [25].
This paper aims to provide a comprehensive examination of the state of the art in vari-
ous approaches to VO, with an emphasis on recent developments in the use of monocular
cameras. By presenting a thorough analysis, it contributes to a broader understanding of
this complex and rapidly evolving field. The remaining sections are structured as follows:
Section 2 outlines the fundamental concepts and general procedure for VO implemen-
tation, while Section 3 explores the research challenges inherent in VO implementation.
Sections 4 and 5 then offer overviews of traditional methods and machine learning-based approaches, respectively, and Section 6 discusses the positioning uncertainty assessment provided by monocular visual odometry, before a concluding discussion of future research challenges and opportunities. The articulation of these elements provides a solid foundation for scholars
and practitioners interested in navigating the rich and multifaceted landscape of VO and
VSLAM technologies.
The final step, (v) trajectory estimation, localizes the camera within the environment and maps the surroundings. This composite task draws upon both the
estimated camera motion from step (iii) and the 3D positioning of the tracked features
from step (iv). Together, these elements coalesce into a coherent picture of the camera’s
path, contributing to a broader understanding of the spatial context [33].
In summary, the basic algorithm for VO is a multi-step process that combines feature detection, tracking, motion estimation, triangulation, and trajectory estimation to provide a nuanced understanding of camera motion within an unknown environment. By progressing through
these distinct yet interrelated phases, VO offers a versatile and valuable tool in the quest
to navigate and interpret complex spatial environments. Its contributions extend across
various domains, and its underlying methodologies continue to stimulate research and
innovation in both theoretical and applied contexts.
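To make these phases concrete, the following minimal Python/OpenCV sketch (our illustration, not taken from any surveyed system) covers feature detection (i), matching (ii), motion estimation (iii), and trajectory accumulation (v); triangulation (iv) would use cv2.triangulatePoints and is omitted for brevity. The intrinsic matrix K is assumed known from calibration.

```python
# Minimal monocular VO sketch (illustrative only). Assumes a calibrated
# camera with a 3x3 intrinsic matrix K and grayscale frames.
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Estimate the up-to-scale relative camera motion between two frames."""
    orb = cv2.ORB_create(2000)                        # (i) feature detection
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)               # (ii) feature matching
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # (iii) motion estimation; RANSAC rejects outlier correspondences
    E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t   # t is a unit vector: the scale is unobservable (Section 3)

def accumulate(relative_poses):
    """(v) Trajectory estimation: chain relative poses into a camera path."""
    T = np.eye(4)
    path = [T[:3, 3].copy()]
    for R, t in relative_poses:
        Ti = np.eye(4)
        Ti[:3, :3], Ti[:3, 3] = R, t.ravel()
        T = T @ np.linalg.inv(Ti)   # convention-dependent; may be T @ Ti
        path.append(T[:3, 3].copy())
    return path
```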
Table 1. Cont.

| Reference | Sensor Type | Method | Environmental Structure | Open Source | Key Points |
|---|---|---|---|---|---|
| [14] | LiDAR | Bundle Adjustment | Outdoor | Yes | Using LiDAR for camera feature tracks and keyframe-based motion estimation. Labeling is used for outlier rejection and landmark weighting. |
| [38] | Monocular | Ground Plane-Based | Outdoor | No | A ground plane and camera height-based divide-and-conquer method. A scale correction strategy reduces scale drift in VO. |
| [42] | LiDAR | Feature Extraction | Outdoor | No | A VO algorithm using a standard front end with camera tracking relative to triangulated landmarks; optimizing the camera poses and landmark map with range sensor depth information resolves monocular scale ambiguity and drift. |
| [43] | Monocular | Feature Extraction | Indoor | No | A VO system utilizing a downward-facing camera, feature extraction, velocity-aware masking, and nonconvex optimization, enhanced with LED illumination and a ToF sensor, for improved accuracy and efficiency in mobile robot navigation. |
| [44] | LiDAR | Feature Extraction | Outdoor | Yes | LVI-SAM achieves real-time state estimation and map building with high accuracy and robustness. |
| [45] | LiDAR | Feature Extraction | Outdoor–Indoor | No | A multi-sensor odometry system for mobile platforms that integrates visual, LiDAR, and inertial data. Real time with fixed-lag smoothing. |
| [46] | LiDAR | Feature Extraction | Outdoor | No | A method combining LiDAR depth with monocular visual odometry, using photometric error minimization and point-line feature refinement, alongside LiDAR-based segmentation for improved pose estimation and drift reduction. |
| [47] | Monocular | Feature Extraction | Outdoor | Yes | The main innovation is a visual–inertial SLAM system that uses MAP estimation even during IMU initialization. |
| [48] | Monocular | Feature Extraction | Outdoor | No | The authors developed a lightweight scale recovery framework using an accurate ground plane estimate. The framework includes ground point extraction and aggregation algorithms for selecting high-quality ground points. |
| [49] | Monocular | Feature Extraction | Indoor | No | This paper presents VO using points and lines. Direct methods choose pixels with enough gradients to minimize photometric errors. |
| [50] | Monocular | Deep Learning Based | Outdoor | No | The approach of this paper combines unsupervised deep learning and scale recovery, which is trained with stereo image pairs but tested with monocular images. |
| [3] | Monocular | Deep Learning Based | Outdoor–Indoor | No | The authors proposed a self-supervised monocular depth estimation network for stereo videos, which aligns training image pairs with predictive brightness transformation parameters. |
| [51] | Monocular | Deep Learning Based | Outdoor | No | A VO system called DL Hybrid is proposed, which uses DL networks in image processing and geometric localization theory based on hybrid pose estimation methods. |
| [52] | Monocular | Deep Learning Based | Outdoor | No | The authors created a decoupled cascade structure and residual-based posture refinement in an unsupervised VO framework that estimates 3D camera positions by decoupling the rotation, translation, and scale. |
| [9] | Monocular | Deep Learning Based | Outdoor | No | The suggested network is built on supervised learning-based approaches with a feature encoder and pose regressor that takes multiple successive two grayscale picture stacks for training and enforces composite pose restrictions. |
| [53] | Monocular | Deep Learning Based | Outdoor | Yes | A neural architecture that performs VO, object detection, and instance segmentation in a single thread (SimVODIS). |
| [54] | Monocular | Deep Learning Based | Outdoor | Yes | The proposed method is called SelfVIO, a self-supervised deep learning-based VO and depth map recovery method using adversarial training and self-adaptive visual sensor fusion. |
4. Traditional Approaches
The scientific literature is rife with diverse methodologies aiming to overcome the
challenges outlined in the preceding section, particularly focusing on the problem of
accurate scale estimation. This issue has typically been addressed through the reliance on
knowledge regarding the height of the camera from the ground plane and the evaluation of
feature movements on that plane. Alternatively, some approaches have utilized additional
tools, such as LiDAR or depth sensors.
Within the domain of autonomous driving, precise vehicle motion estimation is a
crucial concern. Various powerful algorithms have been devised to address this need,
although most commonly, they depend on binocular imagery or LiDAR measurements.
In the following paragraphs, an overview of some prominent works associated with the
scaling challenge is provided, highlighting different strategies and technologies.
Tian et al. [55] made a significant contribution by developing a lightweight scale recov-
ery framework for VO. This framework hinged on a ground plane estimate that excelled
in both accuracy and robustness. By employing a meticulous ground point extraction
technique, the framework ensured precision in the ground plane estimate. Subsequently,
these carefully selected points were aggregated through a local sliding window and an
innovative ground point aggregation algorithm. To translate the aggregated data into the
correct scale, a Random Sample Consensus (RANSAC)-based optimizer was employed.
This optimizer solved a least-squares problem, fine-tuning parameters to derive the correct scale and illustrating how optimization techniques can be combined with spatial analysis. The parameters of this fine-tuning were likely chosen on the basis of experimental results to achieve the best performance.
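The geometric core of such ground-plane schemes can be sketched as follows. This is a simplified illustration of the general idea (RANSAC plane fitting plus a known camera height), not Tian et al.'s exact optimizer, and the iteration count and inlier tolerance are placeholder values.

```python
# Hedged sketch of ground-plane scale recovery: fit a plane to candidate
# ground points with RANSAC, then rescale the reconstruction so the plane
# lies at the known metric camera height.
import numpy as np

def fit_plane_ransac(pts, iters=200, tol=0.01, seed=0):
    """pts: Nx3 up-to-scale points presumed to lie on the ground plane."""
    rng = np.random.default_rng(seed)
    best_inliers, best = 0, None
    for _ in range(iters):
        p = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        if np.linalg.norm(n) < 1e-9:
            continue                         # degenerate (collinear) sample
        n /= np.linalg.norm(n)
        d = -n @ p[0]                        # plane equation: n·x + d = 0
        inliers = int((np.abs(pts @ n + d) < tol).sum())
        if inliers > best_inliers:
            best_inliers, best = inliers, (n, d)
    return best

def recover_scale(ground_pts, camera_height):
    """Return the factor that maps the up-to-scale VO units to meters."""
    n, d = fit_plane_ransac(ground_pts)
    estimated_height = abs(d)   # distance from camera origin to the plane
    return camera_height / estimated_height
```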
H. Lee et al. [43] presented a VO system using a downward-facing camera. This system,
designed for mobile robots, integrates feature extraction, a novel velocity-aware masking
algorithm, and a nonconvex optimization problem to enhance pose estimation accuracy. It
employs cost-effective components, including an LED for illumination and a ToF sensor,
to improve feature tracking on various surfaces. The methodology combines efficient
feature selection with global optimization for motion estimation, demonstrating improved
accuracy and computational efficiency over the existing methods. The authors claimed the
experimental results validated its performance in diverse environments, showcasing its
potential for robust mobile robot navigation.
B. Fang et al. [46] proposed a method for enhancing monocular visual odometry
through the integration of LiDAR depth information, aiming to overcome inaccuracies in
feature-depth associations. The methodology involves a two-stage process: initial pose
estimation through photometric error minimization and pose refinement using point-line
features with photometric error minimization for more accurate estimation. It employs ground and plane point segmentation from LiDAR data, optimizes frame-to-frame matching based on these features, and incorporates multi-frame optimization to reduce drift and enhance accuracy. According to the authors, the approach demonstrates improved pose
estimation accuracy and robustness across diverse datasets, indicating its effectiveness in
real-world scenarios.
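For reference, the photometric residual that such direct formulations minimize can be sketched as follows. This is a simplified, nearest-pixel version; actual systems such as [46] interpolate intensities, weight residuals robustly, and minimize over the pose parameters.

```python
# Sketch of a photometric residual: back-project reference pixels using
# their (e.g., LiDAR-derived) depths, transform them by the candidate pose
# (R, t), reproject into the current image, and compare intensities.
import numpy as np

def photometric_residuals(I_ref, I_cur, pts_ref, depths, K, R, t):
    """pts_ref: Nx2 pixel coords in I_ref; depths: N metric depths."""
    uv1 = np.hstack([pts_ref, np.ones((len(pts_ref), 1))])
    X = (np.linalg.inv(K) @ uv1.T).T * depths[:, None]   # back-projection
    Xc = (R @ X.T).T + t                                 # candidate pose
    uv = (K @ Xc.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)    # reprojection
    h, w = I_cur.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    ref_i = I_ref[pts_ref[ok, 1].astype(int), pts_ref[ok, 0].astype(int)]
    cur_i = I_cur[uv[ok, 1], uv[ok, 0]]
    return cur_i.astype(float) - ref_i.astype(float)     # residual vector
```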
Chiodini et al. [42] advanced scale estimation by demonstrating a
flexible sensor fusion strategy. By merging data from a variety of depth sensors, including
Time-of-Flight (ToF) cameras and 2D and 3D LiDARs, the authors crafted a method that
broke free from the constraints of sensor-specific algorithms that pervade much of the
literature. This universal applicability is particularly significant for mobile systems without
specific sensors. The proposed approach optimized camera poses and landmark maps
using depth information, clearing up the scale ambiguity and drift that can be encountered
in monocular perception.
LiDAR–monocular visual odometry (LIMO) was presented by Graeter et al. [14]. This novel algorithm capitalizes on the integration of data from a monocular camera and LiDAR sensor to gauge vehicle motion. By leveraging LiDAR data to estimate the motion scale and provide additional depth information, LIMO enhances both the accuracy and robustness of VO. Real-world datasets were utilized to evaluate the proposed algorithm, and it exhibited marked improvements over other state-of-the-art methods. The potential applications of LIMO in fields like autonomous driving and robotics underscore the relevance and impact of this research.
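The basic mechanism of attaching LiDAR depth to monocular features can be sketched as follows. This is a deliberately simplified illustration of the idea, whereas LIMO itself folds the depth information into keyframe bundle adjustment with outlier labeling.

```python
# Hedged sketch: project LiDAR points into the image, assign metric depth
# to nearby tracked features, and estimate the monocular scale factor.
import numpy as np

def feature_depths_from_lidar(lidar_pts_cam, K, features_uv, radius_px=2.0):
    """lidar_pts_cam: Nx3 LiDAR points already in the camera frame."""
    pts = lidar_pts_cam[lidar_pts_cam[:, 2] > 0]     # keep points in front
    proj = (K @ pts.T).T
    uv = proj[:, :2] / proj[:, 2:3]                  # pixel coordinates
    depths = np.full(len(features_uv), np.nan)
    for i, f in enumerate(features_uv):
        d2 = np.sum((uv - f) ** 2, axis=1)
        j = int(np.argmin(d2))
        if d2[j] < radius_px ** 2:
            depths[i] = pts[j, 2]                    # metric depth
    return depths

def scale_from_depths(metric_depth, triangulated_depth):
    """Robust ratio of metric to up-to-scale depths; median rejects outliers."""
    ok = ~np.isnan(metric_depth)
    return float(np.median(metric_depth[ok] / triangulated_depth[ok]))
```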
To mitigate the influence of outliers on feature detection and matching and en-
hance motion estimation, other researchers introduced data fusion with inertial mea-
surements. This visual–inertial odometry (VIO) integrated system is exemplified in works
like Shan et al. [44], which brought together LiDAR, visual, and inertial measurements
in a tightly coupled LiDAR–visual–inertial (LVI) odometry system. This holistic fusion,
achieved through a novel smoothing and mapping algorithm, elevates the system’s accu-
racy and robustness. The proposal also introduced an innovative technique for estimating
extrinsic calibration parameters, further optimizing performance for applications like
autonomous driving and robotics.
Wisth et al. [45] and ORB-SLAM3 [47] further illustrated the technological advances
in multi-sensor odometry systems and real-time operation in various environments. The
use of factor graphs, dense mapping systems, and various sensors such as IMUs, visual
sensors, and LiDAR highlights the multifaceted approaches to challenges in motion and
depth estimation.
Fan et al. [56] introduced a monocular dense mapping system for visual–
inertial odometry, optimizing IMU preintegration and applying a nonlinear optimization-
based approach to improve trajectory estimation (Figure 2) and 3D reconstruction under
challenging conditions. By marginalizing frames within a sliding window, it manages
the computational complexity and combines an IMU and visual data to enhance the
depth estimation and map reconstruction accuracy. The authors claimed the method
outperforms vision-only approaches, particularly in environments with dynamic objects or
weak textures, and demonstrates superior performance in comparison to existing odometry
systems through evaluations on public datasets.
Figure 2. This figure demonstrates the accuracy and effectiveness of the proposed nonlinear optimization-
based monocular dense mapping system of VIO [56].
Two additional pioneering works are by Huang et al. [48], who introduced optimization-based online initialization and spatial–temporal calibration for VIO, and Zhou et al. [49], who introduced DPLVO, a direct point-line monocular visual odometry method. The former focuses on an intricate calibration process that aligns and interpolates camera and IMU measurement data without prior spatial or temporal information. In contrast, the latter presents an innovative technique that leverages point and line features directly, without needing a feature descriptor, to achieve better accuracy and efficiency.
Collectively, these studies represent a robust and multifaceted exploration of tradi-
tional approaches in the realms of motion estimation, depth estimation, and scale recovery
within visual odometry (VO). The methodologies vary widely, each bringing unique contri-
butions to scientific discourse and providing promising avenues for ongoing research and
development. Their collective focus on enhancing precision, robustness, and computational
efficiency underscores the central challenges of the field and the diverse means by which
these can be overcome.
Figure 3. Images in the left column, arranged vertically, are as follows: (a) optical flow map in forward
order, (b) optical flow map in reverse order, (c) points of instantaneous optical flow superimposed
on the original image, (d) map showing monocular depth, (e) map illustrating the matching of key
points in a pair of images, and (f) map depicting the reconstructed trajectory, where the estimated
path is indicated by a blue line [51].
In a novel approach, Kim et al. [53] designed a method to perform simultaneous VO,
object detection, and instance segmentation. By employing a deep neural network, the
method not only estimates the camera pose but also detects objects within the scene, all in
real time. While promising, this approach also faces its own set of challenges, particularly
the extensive need for training data and potential difficulties with occlusions and clutter.
A notable trend in this category involves self-supervised learning as a solution to
the data scarcity problem. Many supervised methods for VIO and depth map estimation
necessitate large labeled datasets. To mitigate this issue, the authors in [54] proposed a
self-supervised method that leverages scene consistency in shape and lighting. Utilizing
a deep neural network, this method estimates parameters such as camera pose, velocity,
and depth without labeled data (Figure 4). Still, challenges persist, such as the accuracy of
inertial measurements affected by noise and the depth estimation accuracy hampered by
occlusions and reflective surfaces.
Figure 4. Sample trajectories comparing the unsupervised learning approach SelfVIO with monocular OKVIS, VINS, and the ground truth (in meters) on the EuRoC dataset MH-03 and MH-05 sequences [54].
$X_t = f(X_{t-1}, Z_t, R_t, Q_t)$
where f represents the function that estimates the pose of the vehicle at time t based on
the previous pose, visual measurements, and associated uncertainties. The uncertainty
model integrates the uncertainty of visual measurements and the uncertainty of vehicle
motion to provide a more accurate assessment of the positioning in monocular VO. The
uncertainty on the backprojection of ground plane features and the uncertainty on the
vehicle motion are crucial factors in accurately estimating the relative vehicle motion.
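As a concrete illustration of how such a model propagates uncertainty, the following first-order (EKF-style) sketch combines the motion covariance $Q_t$ and the measurement covariance $R_t$; it assumes a linearized measurement model and is a generic example, not the Hough-like voting formulation of [61].

```python
# Generic first-order uncertainty propagation for X_t = f(X_{t-1}, Z_t, R_t, Q_t):
# the motion model inflates the pose covariance, and a visual measurement
# (when available) contracts it again.
import numpy as np

def propagate(x, P, f, F, Q, z=None, H=None, R=None):
    """x: pose state, P: covariance, f: motion model, F: Jacobian of f at x."""
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q                 # motion uncertainty Q_t
    if z is None:
        return x_pred, P_pred
    S = H @ P_pred @ H.T + R                 # innovation covariance with R_t
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```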
The Hough-like parameter space vote is employed to extract motion parameters from
the uncertainty models, contributing to the robustness and reliability of the proposed
method in [61]. Despite the advancements and insights provided by the existing research,
a notable gap in the literature is the lack of a comprehensive sensitivity analysis regarding
the various sources of uncertainty in monocular VO. The current models and studies
often overlook the full spectrum of factors that contribute to uncertainty, ranging from
atmospheric conditions to sensor noise. This limitation highlights the need for a more
holistic approach to uncertainty modeling in monocular VO. A complete model would not
only account for the direct uncertainties in visual measurements and vehicle motion but also
extend to encompass external factors, like atmospheric disturbances, lighting variations,
and intrinsic sensor inaccuracies. Such a model would enable a deeper understanding of
how these diverse factors interact and influence the overall uncertainty in VO systems,
paving the way for the development of more sophisticated and resilient techniques that
can adapt to a wider range of environmental conditions and application scenarios.
7. Discussion
The implementation and performance of various machine learning-based methods for
VO have led to interesting observations and challenges, particularly concerning feature
extraction, noise sensitivity, depth estimation, and data synchronization.
The difficulty in feature extraction at high speeds is highlighted in several works [3,48,51].
This challenge is exacerbated by factors such as the optical flow on the road and increased
motion blur when the vehicle moves fast. Such conditions make feature tracking an arduous
task, allowing for only a limited number of valid depth estimates. Some methods have
attempted to stabilize results by tuning the feature matcher for specific scenarios, like
highways. Still, this often leads to complications in urban settings, where feature matches
might become erratic.
Standstill detection, an essential aspect of VO, is another area fraught with difficulty.
When the vehicle speed is low, errors can occur if the standstill detection is not well
calibrated. The nature of the driving environment, such as open spaces where only the
road is considered suitable for depth estimation, adds further complexity to the problem.
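One plausible software-side standstill test is sketched below; it is a hypothetical illustration (not taken from a specific surveyed paper) in which the vehicle is declared stationary when the median feature displacement between consecutive frames stays below a calibrated pixel threshold.

```python
# Hypothetical standstill detector based on tracked-feature displacement.
import cv2
import numpy as np

def is_standstill(prev_gray, gray, prev_pts, thresh_px=0.5):
    """prev_pts: Nx1x2 float32 array, e.g. from cv2.goodFeaturesToTrack."""
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    ok = status.ravel() == 1
    if not ok.any():
        return False                   # tracking failed; stay conservative
    disp = np.linalg.norm((nxt - prev_pts).reshape(-1, 2)[ok], axis=1)
    return float(np.median(disp)) < thresh_px
```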
The reliance on homography decomposition, as seen in [38], has been found to be
highly sensitive to noise. This sensitivity arises from the noisy feature matches obtained
from low-textured road surfaces and the multitude of parameters derived from the homog-
raphy matrix. The task of recovering both camera movement and ground plane geometry
is a significant challenge that can affect numerical stability. Moreover, any method relying
on the ground plane assumption is vulnerable to failure if the ground plane is obscured or
deviates from the assumed model. This reveals the intrinsic limitation of such methods in
varying environmental conditions.
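The source of this fragility is visible in the standard homography route, sketched below in simplified form (not the exact pipeline of [38]): the decomposition returns up to four candidate motions, and selecting the physically consistent one from noisy, low-texture road matches is where numerical stability suffers. The camera orientation assumed for filtering candidates is illustrative.

```python
# Sketch of ground-plane motion recovery via homography decomposition.
import cv2
import numpy as np

def ground_motion_candidates(pts1, pts2, K):
    """pts1, pts2: Nx2 matched ground-plane points in consecutive frames."""
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    n_sol, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
    # Up to four (R, t, n) candidates; keep those whose plane normal points
    # roughly upward in the camera frame (y-axis-down convention assumed).
    up = np.array([0.0, -1.0, 0.0])
    keep = [i for i in range(n_sol) if float(normals[i].ravel() @ up) > 0.7]
    return [(Rs[i], ts[i], normals[i]) for i in keep]
```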
A remarkable development in this field is ORB-SLAM3 [47], which has established
itself as a versatile system capable of visual–inertial and multimap SLAM using various
camera models. Unlike conventional VO systems, ORB-SLAM3’s ability to utilize all
previous information from widely separated or prior mapping sessions has enhanced
accuracy, showcasing a significant advancement in the field.
Deep learning-based approaches to VO, such as those using CNNs and RNNs, have
treated VO and depth recovery predominantly as supervised learning problems [3,50,52].
While these methods excel in camera motion estimation and optical flow calculations, they
are constrained by the challenge of obtaining ground truth data across diverse scenes. Such
data are often hard to acquire or expensive, limiting the scalability of these approaches.
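For orientation, a typical supervised pose-regression loss from this family can be written as follows. This is an illustrative sketch: the 6-DoF parameterization and loss weighting differ across [3,50,52], and the weight beta here is a placeholder.

```python
# Minimal supervised 6-DoF pose-regression loss (translation + rotation).
import torch
import torch.nn.functional as F

def pose_loss(pred, gt, beta=100.0):
    """pred, gt: (B, 6) tensors of [tx, ty, tz, rx, ry, rz] (axis-angle)."""
    t_err = F.mse_loss(pred[:, :3], gt[:, :3])
    r_err = F.mse_loss(pred[:, 3:], gt[:, 3:])
    return t_err + beta * r_err  # rotation errors are typically up-weighted
```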
The issue of timestamp synchronization also emerges as a critical concern, as high-
lighted in [9]. Delays in timestamping due to factors like data transfer, sensor latency, and
Operating System overhead can lead to discrepancies in visual–inertial measurements.
Even with hardware time synchronization, issues like clock skew can cause mismatches
between camera and IMU timestamps [63]. Moreover, synchronization challenges extend
to systems using LiDAR scanners, where the alignment with corresponding camera images
must be precise. Any deviation in this synchronization can lead to erroneous depth data
and subsequent prediction artifacts.
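Two common software-side mitigations can be sketched as follows (illustrative, not the approach of [9] or [63]): interpolating IMU samples to camera timestamps, and estimating a constant clock offset by correlating angular-rate magnitudes from the two sensor streams resampled onto a common grid.

```python
# Hedged sketch of timestamp alignment between camera and IMU streams.
import numpy as np

def imu_at_camera_times(imu_t, imu_vals, cam_t, offset=0.0):
    """Linearly interpolate each IMU channel to offset-corrected camera times."""
    return np.stack([np.interp(cam_t + offset, imu_t, imu_vals[:, k])
                     for k in range(imu_vals.shape[1])], axis=1)

def estimate_offset(cam_rate, imu_rate, dt, max_shift=50):
    """Clock offset maximizing correlation of angular-rate magnitudes.
    cam_rate, imu_rate: equal-length signals resampled at interval dt.
    np.roll wraps circularly, which is acceptable for this short sketch."""
    shifts = list(range(-max_shift, max_shift + 1))
    scores = [float(np.dot(np.roll(imu_rate, s), cam_rate)) for s in shifts]
    return dt * shifts[int(np.argmax(scores))]
```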
In summary, the machine learning-based approaches to VO chart an intriguing course
of breakthroughs and obstacles. Notable progress in employing deep learning and the
advent of sophisticated systems such as ORB-SLAM3 mark the current era. Nevertheless,
the domain wrestles with intricate issues concerning feature extraction, noise sensitivity,
data synchronization, and the procurement of reliable ground truth data. Central to these
challenges is the assessment of uncertainty: traditional VO methods could offer probabilis-
tic insights into measurement accuracy, but the integration of uncertainty quantification
within deep learning remains a nascent and critical area of research. In traditional ap-
proaches, the provided uncertainty models primarily consider sensor noise, neglecting
other significant sources of uncertainty. These overlooked elements include factors such
as lighting conditions and environmental parameters, which also play a crucial role in
the overall accuracy and reliability of the system. A more profound understanding and
effective management of uncertainty could significantly enhance the reliability and appli-
cability of VO technologies, highlighting an essential frontier for ongoing investigative
efforts. As such, there is a pressing impetus for continuous research and development to
refine the robustness of VO systems and their adaptability to the unpredictable dynamics
of real-world environments.
Future research in the field of VO and machine learning is set to tackle key challenges,
such as improving feature extraction under difficult conditions, enhancing noise and
uncertainty management, developing versatile depth estimation methods, and achieving
precise data synchronization. There is a notable demand for novel feature extraction
algorithms that perform well in varied environments, alongside more sophisticated models
for noise filtering and uncertainty handling. Addressing depth estimation limitations and
refining synchronization techniques for integrating multiple sensor inputs are also critical.
Importantly, incorporating uncertainty quantification directly into deep learning models
for VO could significantly boost system reliability and utility across different applications.
These research directions promise to elevate the efficacy and adaptability of VO systems,
making them more suited for the complexities of real-world deployment.
8. Conclusions
In conclusion, this paper has provided an overview of traditional techniques and deep
learning-based methodologies for monocular VO, with an emphasis on displacement mea-
surement applications. It has detailed the fundamental concepts and general procedures
for VO implementation and highlighted the research challenges inherent in VO, including
scale estimation and ground plane considerations. This paper has shed light on a range
of methodologies, underscoring the diversity of approaches aimed at overcoming these
challenges. A focus has been placed on the assessment of uncertainty in VO, acknowledging
the need for further research to develop robust and reliable methods that can be used in
different applications. This paper concludes by emphasizing the importance of continued
research and innovation in this field, particularly in the realm of uncertainty assessment,
to enhance the reliability and applicability of VO technologies. Such advancements have
the potential to contribute significantly to a wide range of applications in robotics, aug-
mented reality, and navigation systems, paving the way for more nuanced and powerful
applications of VO across various domains.
Author Contributions: A.N.: Played a central role in the conception, design, execution, and coordina-
tion of this research project. Led the data collection, analysis, interpretation of findings, drafting of
this manuscript, and critical revision of its content. F.P.: Contributed to the writing of the original
draft, validation, writing—review and editing, and supervision. I.A.: Participated in aspects of
this project relevant to his expertise, contributing to the overall research and development of this
study. P.D.: Provided supervision and insight into this study’s direction and contributed to the
overall intellectual framework of this research. L.D.V.: Offered senior expertise in supervising this
project, contributing to its conceptualization and ensuring this study adhered to high academic
standards. Each author’s unique contribution was essential to this project’s success, complementing
and ensuring the comprehensive coverage and depth of this study. All authors have read and agreed
to the published version of the manuscript.
Funding: This research was partially funded by the NATO Science for Peace and Security Programme,
under the Multi-Year Project G5924, titled “Inspection and security by Robots interacting with
Infrastructure digital twinS (IRIS)”.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110.
[CrossRef]
2. Zou, D.; Tan, P.; Yu, W. Collaborative visual SLAM for multiple agents: A brief survey. Virtual Real. Intell. Hardw. 2019, 1, 461–482.
[CrossRef]
3. Yang, G.; Wang, Y.; Zhi, J.; Liu, W.; Shao, Y.; Peng, P. A Review of Visual Odometry in SLAM Techniques. In Proceedings of the
2020 International Conference on Artificial Intelligence and Electromechanical Automation (AIEA), Tianjin, China, 26–28 June
2020; pp. 332–336.
4. Razali, M.R.; Athif, A.; Faudzi, M.; Shamsudin, A.U. Visual Simultaneous Localization and Mapping: A review. PERINTIS
eJournal 2022, 12, 23–34.
5. Agostinho, L.R.; Ricardo, N.M.; Pereira, M.I.; Hiolle, A.; Pinto, A.M. A Practical Survey on Visual Odometry for Autonomous
Driving in Challenging Scenarios and Conditions. IEEE Access 2022, 10, 72182–72205. [CrossRef]
6. Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [CrossRef]
7. Ma, L.; Meng, D.; Zhao, S.; An, B. Visual localization with a monocular camera for unmanned aerial vehicle based on landmark
detection and tracking using YOLOv5 and DeepSORT. Int. J. Adv. Robot. Syst. 2023, 20. [CrossRef]
8. Yousif, K.; Bab-Hadiashar, A.; Hoseinnezhad, R. An overview to visual odometry and visual SLAM: Applications to mobile
robotics. Intell. Ind. Syst. 2015, 1, 289–311. [CrossRef]
9. Gadipudi, N.; Elamvazuthi, I.; Lu, C.K.; Paramasivam, S.; Su, S.; Yogamani, S. WPO-Net: Windowed Pose Optimization Network
for Monocular Visual Odometry Estimation. Sensors 2021, 21, 8155. [CrossRef]
10. Xu, Z. Stereo Visual Odometry with Windowed Bundle Adjustment; University of California: Los Angeles, CA, USA, 2015.
11. Tsintotas, K.A.; Bampis, L.; Gasteratos, A. The revisiting problem in simultaneous localization and mapping: A survey on visual
loop closure detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19929–19953. [CrossRef]
12. Civera, J.; Davison, A.J.; Montiel, J.M.M. Inverse Depth Parametrization for Monocular SLAM. IEEE Trans. Robot. 2008,
24, 932–945. [CrossRef]
13. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans.
Robot. 2017, 33, 1255–1262. [CrossRef]
14. Graeter, J.; Wilczynski, A.; Lauer, M. LIMO: Lidar-Monocular Visual Odometry. In Proceedings of the 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7872–7879.
15. Scaramuzza, D.; Fraundorfer, F. Visual Odometry [Tutorial]. IEEE Robot. Autom. Mag. 2011, 18, 80–92. [CrossRef]
16. Fraundorfer, F.; Scaramuzza, D. Visual Odometry: Part II: Matching, Robustness, Optimization, and Applications. IEEE Robot.
Autom. Mag. 2012, 19, 78–90. [CrossRef]
17. Basiri, A.; Mariani, V.; Glielmo, L. Enhanced V-SLAM combining SVO and ORB-SLAM2, with reduced computational complexity,
to improve autonomous indoor mini-drone navigation under varying conditions. In Proceedings of the IECON 2022—48th
Annual Conference of the IEEE Industrial Electronics Society, Brussels, Belgium, 17–20 October 2022; pp. 1–7. [CrossRef]
18. He, M.; Zhu, C.; Huang, Q.; Ren, B.; Liu, J. A review of monocular visual odometry. Vis. Comput. 2020, 36, 1053–1065. [CrossRef]
19. Aqel, M.O.; Marhaban, M.H.; Saripan, M.I.; Ismail, N.B. Review of visual odometry: Types, approaches, challenges, and
applications. SpringerPlus 2016, 5, 1897. [CrossRef]
20. Pottier, C.; Petzing, J.; Eghtedari, F.; Lohse, N.; Kinnell, P. Developing digital twins of multi-camera metrology systems in Blender.
Meas. Sci. Technol. 2023, 34, 075001. [CrossRef]
21. Feng, W.; Zhao, S.Z.; Pan, C.; Chang, A.; Chen, Y.; Wang, Z.; Yang, A.Y. Digital Twin Tracking Dataset (DTTD): A New RGB+
Depth 3D Dataset for Longer-Range Object Tracking Applications. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3288–3297.
22. Sundby, T.; Graham, J.M.; Rasheed, A.; Tabib, M.; San, O. Geometric Change Detection in Digital Twins. Digital 2021, 1, 111–129.
[CrossRef]
23. Döbrich, O.; Brauner, C. Machine vision system for digital twin modeling of composite structures. Front. Mater. 2023, 10, 1154655.
[CrossRef]
24. Benzon, H.H.; Chen, X.; Belcher, L.; Castro, O.; Branner, K.; Smit, J. An Operational Image-Based Digital Twin for Large-Scale
Structures. Appl. Sci. 2022, 12, 3216. [CrossRef]
25. Wang, X.; Xue, F.; Yan, Z.; Dong, W.; Wang, Q.; Zha, H. Continuous-time stereo visual odometry based on dynamics model. In
Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg,
Germany, 2018; pp. 388–403.
26. Yang, Q.; Qiu, C.; Wu, L.; Chen, J. Image Matching Algorithm Based on Improved FAST and RANSAC. In Proceedings of the
2021 IEEE International Conference on Mechatronics and Automation (ICMA), Takamatsu, Japan, 8–11 August 2021; pp. 142–147.
[CrossRef]
27. Lam, S.K.; Jiang, G.; Wu, M.; Cao, B. Area-Time Efficient Streaming Architecture for FAST and BRIEF Detector. IEEE Trans.
Circuits Syst. II Express Briefs 2019, 66, 282–286. [CrossRef]
28. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011
International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [CrossRef]
29. Leutenegger, S.; Chli, M.; Siegwart, R.Y. BRISK: Binary Robust invariant scalable keypoints. In Proceedings of the 2011
International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2548–2555. [CrossRef]
30. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the
IJCAI’81, 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 2,
pp. 674–679.
31. Mohr, R.; Triggs, B. Projective Geometry for Image Analysis. In Proceedings of the XVIIIth International Symposium on
Photogrammetry & Remote Sensing (ISPRS ’96), Vienna, Austria, 9–19 July 1996; Tutorial given at International Symposium on
Photogrammetry & Remote Sensing.
32. Ma, Y.; Soatto, S.; Kosecká, J.; Sastry, S. An Invitation to 3-D Vision: From Images to Geometric Models; Interdisciplinary Applied
Mathematics; Springer: New York, NY, USA, 2012.
33. Lozano, R. Unmanned Aerial Vehicles: Embedded Control; ISTE, Wiley: Denver, CO, USA, 2013.
34. Abaspur Kazerouni, I.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022,
205, 117734. [CrossRef]
35. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE
International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22.
36. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of
Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [CrossRef]
37. Yang, K.; Fu, H.T.; Berg, A.C. Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature
Reconstruction. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3403–3412.
38. Zhou, D.; Dai, Y.; Li, H. Ground-Plane-Based Absolute Scale Estimation for Monocular Visual Odometry. IEEE Trans. Intell.
Transp. Syst. 2020, 21, 791–802. [CrossRef]
39. Cao, L.; Ling, J.; Xiao, X. Study on the influence of image noise on monocular feature-based visual slam based on ffdnet. Sensors
2020, 20, 4922. [CrossRef] [PubMed]
40. Qiu, X.; Zhang, H.; Fu, W.; Zhao, C.; Jin, Y. Monocular visual-inertial odometry with an unbiased linear system model and robust
feature tracking front-end. Sensors 2019, 19, 1941. [CrossRef] [PubMed]
41. Jinyu, L.; Bangbang, Y.; Danpeng, C.; Nan, W.; Guofeng, Z.; Hujun, B. Survey and evaluation of monocular visual-inertial SLAM
algorithms for augmented reality. Virtual Real. Intell. Hardw. 2019, 1, 386–410. [CrossRef]
42. Chiodini, S.; Giubilato, R.; Pertile, M.; Debei, S. Retrieving Scale on Monocular Visual Odometry Using Low-Resolution Range
Sensors. IEEE Trans. Instrum. Meas. 2020, 69, 5875. [CrossRef]
43. Lee, H.; Lee, H.; Kwak, I.; Sung, C.; Han, S. Effective Feature-Based Downward-Facing Monocular Visual Odometry. IEEE Trans.
Control. Syst. Technol. 2024, 32, 266–273. [CrossRef]
44. Shan, T.; Englot, B.; Ratti, C.; Daniela, R. LVI-SAM: Tightly-coupled Lidar-Visual-Inertial Odometry via Smoothing and Mapping.
In Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021; Institute of
Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 5692–5698. [CrossRef]
45. Wisth, D.; Camurri, M.; Das, S.; Fallon, M. Unified Multi-Modal Landmark Tracking for Tightly Coupled Lidar-Visual-Inertial
Odometry. IEEE Robot. Autom. Lett. 2021, 6, 1004–1011. [CrossRef]
46. Fang, B.; Pan, Q.; Wang, H. Direct Monocular Visual Odometry Based on Lidar Vision Fusion. In Proceedings of the 2023 WRC
Symposium on Advanced Robotics and Automation (WRC SARA), Beijing, China, 19 August 2023; IEEE: Piscataway, NJ, USA,
2023; pp. 256–261.
47. Campos, C.; Elvira, R.; Rodriguez, J.J.; Montiel, J.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual,
Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [CrossRef]
48. Huang, W.; Wan, W.; Liu, H. Optimization-Based Online Initialization and Calibration of Monocular Visual-Inertial Odometry
Considering Spatial-Temporal Constraints. Sensors 2021, 21, 2673. [CrossRef]
49. Zhou, L.; Wang, S.; Kaess, M. DPLVO: Direct Point-Line Monocular Visual Odometry. IEEE Robot. Autom. Lett. 2021, 6, 7113. [CrossRef]
50. Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning. In Proceedings
of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018.
51. Ban, X.; Wang, H.; Chen, T.; Wang, Y.; Xiao, Y. Monocular Visual Odometry Based on Depth and Optical Flow Using Deep
Learning. IEEE Trans. Instrum. Meas. 2021, 70, 2501619. [CrossRef]
52. Lin, L.; Wang, W.; Luo, W.; Song, L.; Zhou, W. Unsupervised monocular visual odometry with decoupled camera pose estimation.
Digit. Signal Process. Rev. J. 2021, 114. [CrossRef]
53. Kim, U.H.; Kim, S.H.; Kim, J.H. SimVODIS: Simultaneous Visual Odometry, Object Detection, and Instance Segmentation. IEEE
Trans. Pattern Anal. Mach. Intell. 2022, 44, 428–441. [CrossRef]
54. Almalioglu, Y.; Turan, M.; Saputra, M.R.U.; de Gusmão, P.P.; Markham, A.; Trigoni, N. SelfVIO: Self-supervised deep monocular
Visual–Inertial Odometry and depth estimation. Neural Netw. 2022, 150, 119–136. [CrossRef]
55. Tian, R.; Zhang, Y.; Zhu, D.; Liang, S.; Coleman, S.; Kerr, D. Accurate and Robust Scale Recovery for Monocular Visual Odometry
Based on Plane Geometry. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an,
China, 30 May–5 June 2021. [CrossRef]
56. Fan, C.; Hou, J.; Yu, L. A nonlinear optimization-based monocular dense mapping system of visual-inertial odometry. Meas. J.
Int. Meas. Confed. 2021, 180, 109533. [CrossRef]
57. Yang, N.; von Stumberg, L.; Wang, R.; Cremers, D. D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual
Odometry. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA,
USA, 13–19 June 2020; pp. 1278–1289. [CrossRef]
58. Aksoy, Y.; Alatan, A.A. Uncertainty modeling for efficient visual odometry via inertial sensors on mobile devices. In Proceedings
of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; IEEE: Piscataway, NJ,
USA, 2014; pp. 3397–3401.
59. Ross, D.; De Petrillo, M.; Strader, J.; Gross, J.N. Uncertainty estimation for stereo visual odometry. In Proceedings of the
34th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2021), Online, 20–24
September 2021; pp. 3263–3284.
60. Gakne, P.V.; O’Keefe, K. Tackling the scale factor issue in a monocular visual odometry using a 3D city model. In Proceedings of
the ITSNT 2018, International Technical Symposium on Navigation and Timing, Toulouse, France, 13–16 November 2018.
61. Hamme, D.V.; Goeman, W.; Veelaert, P.; Philips, W. Robust monocular visual odometry for road vehicles using uncertain
perspective projection. EURASIP J. Image Video Process. 2015, 2015, 10. [CrossRef]
62. Van Hamme, D.; Veelaert, P.; Philips, W. Robust visual odometry using uncertainty models. In Proceedings of the Inter-
national Conference on Advanced Concepts for Intelligent Vision Systems, Ghent, Belgium, 22–25 August 2011; Springer:
Berlin/Heidelberg, Germany, 2011; pp. 1–12.
63. Brzozowski, B.; Daponte, P.; De Vito, L.; Lamonaca, F.; Picariello, F.; Pompetti, M.; Tudosa, I.; Wojtowicz, K. A remote-controlled
platform for UAS testing. IEEE Aerosp. Electron. Syst. Mag. 2018, 33, 48–56. [CrossRef]