
Personal use of this material is permitted. Permission from the author(s) and/or copyright holder(s) must be obtained for all other uses. Please contact us and provide details if you believe this document breaches copyrights.

DeepUKF-VIN: Adaptively-tuned Deep Unscented Kalman Filter for 3D Visual-Inertial Navigation based on IMU-Vision-Net

Khashayar Ghanizadegan and Hashim A. Hashim

arXiv:2502.00575v1 [cs.RO] 1 Feb 2025

Abstract—This paper addresses the challenge of estimating the orientation, position, and velocity of a vehicle operating in three-dimensional (3D) space with six degrees of freedom (6-DoF). A Deep Learning-based Adaptation Mechanism (DLAM) is proposed to adaptively tune the noise covariance matrices of Kalman-type filters for the Visual-Inertial Navigation (VIN) problem, leveraging IMU-Vision-Net. Subsequently, an adaptively tuned Deep Learning Unscented Kalman Filter for 3D VIN (DeepUKF-VIN) is introduced to utilize the proposed DLAM, thereby robustly estimating key navigation components, including orientation, position, and linear velocity. The proposed DeepUKF-VIN integrates data from onboard sensors, specifically an inertial measurement unit (IMU) and visual feature points extracted from a camera, and is applicable for GPS-denied navigation. Its quaternion-based design effectively captures navigation non-linearities and avoids the singularities commonly encountered with Euler-angle-based filters. Implemented in discrete space, the DeepUKF-VIN facilitates practical filter deployment. The filter's performance is evaluated using real-world data collected from an IMU and a stereo camera at low sampling rates. The results demonstrate filter stability and rapid attenuation of estimation errors, highlighting its high estimation accuracy. Furthermore, comparative testing against the standard Unscented Kalman Filter (UKF) in two scenarios consistently shows superior performance across all navigation components, thereby validating the efficacy and robustness of the proposed DeepUKF-VIN.

Index Terms—Deep Learning, Unscented Kalman Filter, Adaptive tuning, Estimation, Navigation, Unmanned Aerial Vehicle, Sensor-fusion.

For video of navigation experiment visit: link

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), under the grants RGPIN-2022-04937 and DGECR-2022-00103. K. Ghanizadegan and H. A. Hashim are with the Department of Mechanical and Aerospace Engineering, Carleton University, Ottawa, Ontario, K1S-5B6, Canada (e-mail: [email protected]).

K. Ghanizadegan and H. A. Hashim, "DeepUKF-VIN: Adaptively-tuned Deep Unscented Kalman Filter for 3D Visual-Inertial Navigation based on IMU-Vision-Net," Expert Systems With Applications, vol. 271, p. 126656, 2025. doi: 10.1016/j.eswa.2025.126656

I. INTRODUCTION

A. Motivation

NAVIGATION is a fundamental component in the successful operation of a wide array of applications, spanning fields such as robotics, aerospace, and mobile technology. At its core, navigation involves estimating an object's position, orientation, and velocity, a task that becomes particularly critical and challenging in environments where Global Navigation Satellite Systems (GNSS), like GPS, BeiDou, and GLONASS, are unavailable (e.g., indoor environments) or unreliable (e.g., urban settings with obstructed satellite signals due to tall buildings) [1], [2]. Similar challenges are encountered in underwater navigation, where robots must operate in deep, GNSS-denied environments [3]. Unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs) have shown immense potential in various sectors. For example, UGVs and UAVs are increasingly used in care facilities to assist with monitoring and delivery tasks [4], in logistical services for autonomous package delivery [5], and in surveillance of hard-to-access locations [1]. These include monitoring forests for early fire detection [6], tracking icebergs in the Arctic [7], and conducting surveys in other remote areas. The effectiveness of these applications hinges on the precision and reliability of their navigation systems. In the realm of mobile technology, accurate navigation is essential for enhancing user experiences, particularly in smartphone applications that rely on real-time positional data, such as augmented reality (AR) platforms and wayfinding tools [8]. Similarly, in aerospace applications, obtaining precise positional and orientation data is vital for the accurate analysis and interpretation of observational information [9], [10].

B. Related Work

One of the primary approaches to addressing the challenge of navigation in GPS-denied environments involves utilizing ego-acceleration measurements from onboard accelerometers to estimate a vehicle's pose relative to its previous position. This technique, known as Dead Reckoning (DR), integrates acceleration data to derive positional information [1]. DR offers a straightforward, cost-effective solution, particularly with low-cost sensors, making it accessible for many applications [11]. To enhance the accuracy of Dead Reckoning, a gyroscope is often incorporated to measure the vehicle's angular velocity. This integration provides additional orientation data, improving the overall pose estimation. However, a significant drawback of this method is its susceptibility to cumulative errors or drift over time. Without supplementary sensors or correction mechanisms, these errors can accumulate rapidly, leading to inaccurate navigation results, especially during prolonged use. In controlled environments like harbors, warehouses, or other predefined spaces, ultra-wideband (UWB) technology can significantly enhance navigation accuracy. UWB systems measure distances between the vehicle and fixed reference points, known as anchors, providing highly accurate and robust localization data [12].

This approach is widely adopted in applications where precision is paramount, such as robotic operations within structured environments and object-tracking systems like Apple's AirTag [13]. When used alongside an Inertial Measurement Unit (IMU), accelerometer, and gyroscope within a DR framework, UWB can serve as an additional sensor to correct positional errors and mitigate drift [12]. However, this solution has limitations since UWB requires the installation of anchors in the environment, which confines its applicability to pre-configured spaces. As a result, it may not be suitable for dynamic or unstructured environments, reducing the system's flexibility and immediate usability out of the box [14]. Additionally, UWB is susceptible to high levels of noise, which can degrade estimation accuracy [12].

With the development of advanced point cloud registration algorithms such as Iterative Closest Point (ICP) [15] and Coherent Point Drift (CPD) [16], sensors capable of capturing two-dimensional (2D) points from three-dimensional (3D) space have emerged as promising candidates to complement IMUs without requiring prior environmental knowledge. Sound Navigation and Ranging (SONAR) is one such sensor, widely adopted in marine applications due to its effectiveness in underwater environments, where mechanical waves propagate efficiently [17]. Similarly, Light Detection and Ranging (LiDAR) employs electromagnetic waves instead of mechanical waves and has demonstrated utility in aerospace applications, where sound propagation is limited but light transmission is effective [18]. However, both SONAR and LiDAR exhibit significant limitations in complex indoor and outdoor environments, as they rely solely on structural properties and cannot capture texture or color information. In contrast, recent advancements in low-cost, high-resolution cameras designed for navigation applications, combined with robust fusion between the IMU and feature detection, have made cameras attractive complementary sensors [2], [19]. Popular feature detection and tracking algorithms, including the Scale-Invariant Feature Transform (SIFT) [20], Good Features to Track (GFTT) [21], and the Kanade-Lucas-Tomasi (KLT) algorithm, have facilitated the widespread adoption of cameras as correction sensors alongside IMUs [1], [22]–[24].

C. Persistent Challenges and Potentials

To integrate the aforementioned sensor data, Kalman-type filters are widely employed in navigation due to their stochastic framework and ability to handle noisy measurements [25]–[28]. The Kalman Filter (KF) provides a maximum likelihood estimate of the system's state vector based on available measurement data; however, it operates optimally only within linear systems. To overcome this limitation, the Extended Kalman Filter (EKF) was developed. The EKF linearizes the system around the current estimated state vector and applies the KF framework to this linearized model. Its intuitive structure, ease of implementation, and computational efficiency have established the EKF as a standard choice for navigation applications [16], [23], [29]. However, the EKF's performance degrades with increasing system nonlinearity. To address the EKF's limitations, the Unscented Kalman Filter (UKF) was introduced. The UKF effectively captures the propagation of mean and covariance through a nonlinear transformation up to the second order, offering improved accuracy while maintaining comparable computational complexity to the EKF [26], [30]. Nevertheless, Kalman-type filters rely on accurate modeling of system and measurement noise. While it is standard to assume these noise components are zero-mean, their covariance matrices serve as critical tuning parameters, and the performance of these filters is sensitive to inaccuracies in their specification [31]. In practice, determining the values of covariance matrices is challenging and typically involves an iterative process of trial and error, which can be both time-consuming and effort-intensive.

Deep learning techniques have shown significant promise in adaptively tuning the covariance matrices of Kalman-based filters, addressing a critical challenge in achieving accurate state estimation [32]–[35]. These methods offer an efficient alternative to traditional manual tuning, leveraging data-driven models to dynamically estimate noise parameters based on observed system behavior. For instance, Brossard et al. [32] utilized Convolutional Neural Networks (CNNs) to predict measurement noise parameters for the DR of ground vehicles using an Invariant EKF (IEKF). This approach improved noise estimation by learning from raw sensor data, enhancing overall navigation accuracy. Similarly, Or et al. [34] applied deep learning to model trajectory uncertainty by extracting features such as vehicle speed and path curvature, demonstrating the potential to enhance state predictions by accurately capturing the system's dynamic characteristics. Furthermore, Yan et al. [35] proposed a multi-level framework where the state vector estimates and covariance predictions from traditional filters serve as inputs to deep learning architectures. Deep learning can therefore be employed to iteratively refine the covariance estimates, improving robustness in complex scenarios and allowing Kalman-based filters to dynamically adjust to varying noise conditions, reducing dependency on intensive manual tuning and significantly improving performance in real-world applications.

D. Contributions

Motivated by the above discussion, the key contributions of this work are as follows: (1) The proposed approach employs singularity-free quaternion dynamics to represent ego orientation, ensuring robust handling of orientation estimation and avoiding singularities typically encountered with Euler-angle-based representations. (2) A quaternion-based, adaptively-tuned Deep Learning Unscented Kalman Filter for 3D Visual-Inertial Navigation (DeepUKF-VIN) based on the Deep Learning-based Adaptation Mechanism (DLAM) is formulated in discrete form. This approach accurately models the true navigation kinematics, simplifies the implementation process, and dynamically estimates the covariance matrices, thereby enhancing the overall performance of Kalman-type filters. (3) A novel deep learning-based adaptation mechanism is introduced to dynamically estimate the covariance matrices associated with the measurement noise vectors in the UKF. This adaptive approach enhances the filter's estimation performance by reducing dependency on manual tuning. (4) The proposed DeepUKF-VIN demonstrates superior performance compared to the standard UKF across various scenarios. DeepUKF-VIN effectiveness is validated using real-world data collected from low-cost sensors operating at low sampling rates. To the best of the authors' knowledge, no deep learning-enhanced Kalman-type filter based on inertial measurement and vision units has been proposed for VIN.

E. Structure

The structure of the paper is organized as follows: Section II introduces the preliminary concepts and mathematical foundations. Section III defines the nonlinear navigation kinematics problem. Section IV presents the quaternion-based UKF framework tailored for navigation kinematics. Section V provides a detailed description of the deep learning architecture for adaptive tuning. Section VI outlines the training process and implementation methodology of the proposed DeepUKF-VIN. Section VII evaluates the performance of the DeepUKF-VIN algorithm using a real-world dataset. Finally, Section VIII offers concluding remarks.

II. PRELIMINARIES

Notation: In this paper, the set of d1-by-d2 matrices of real numbers is denoted by Rd1×d2. A vector m ∈ Rd is said to lie on the d-dimensional sphere Sd−1 ⊂ Rd when its norm, denoted as ∥m∥ = √(m⊤m) ∈ R, is equal to one. The identity matrix of dimension d is denoted by Id ∈ Rd×d. The world frame {W} and the body frame {B} refer to the coordinate systems attached to the Earth and the vehicle, respectively. Table I lists a summary of the notations heavily used in this paper.

TABLE I: Nomenclature
{B} / {W} : Fixed body-frame / fixed world-frame
SO(3) : Special Orthogonal Group of order 3
S3 : Three-unit-sphere
qk, q̂k : True and estimated quaternion at step k
pk, p̂k : True and estimated position at step k
vk, v̂k : True and estimated linear velocity at step k
re,k, pe,k, ve,k : Attitude, position, and velocity estimation errors
ak, am,k : True and measured acceleration at step k
ωk, ωm,k : True and measured angular velocity at step k
ηω,k, ηa,k : Angular velocity and acceleration measurement noise
bω,k, ba,k : Angular velocity and acceleration measurement biases
Cη : Covariance matrix of the noise η
lb,k, lw,k : Landmark coordinates in {B} and {W}
xk, xak, uk : State, augmented state, and input vectors at the kth time step
ẑk, zk : Predicted and true measurement
{χi|j}ν, {χai|j}ν, {ζi|j}ν : Sigma points of the state, augmented state, and measurements

A. Preliminary

The matrix R ∈ R3×3 represents the vehicle's orientation, provided it belongs to the Special Orthogonal Group of order 3, denoted SO(3), which is defined by:

SO(3) := { R ∈ R3×3 | det(R) = +1, RR⊤ = I3 }

A quaternion vector q is defined in the scalar-first format as q = [qw, qx, qy, qz]⊤ = [qw, qv⊤]⊤ ∈ S3 with qv ∈ R3, qw ∈ R, and S3 := { q ∈ R4 | ∥q∥ = 1 }. To obtain the quaternion representation corresponding to a rotation matrix R with entries R(i,j), the mapping qR : SO(3) → S3 is defined as [36]:

qR(R) = [qw, qx, qy, qz]⊤ = [ (1/2)√(1 + R(1,1) + R(2,2) + R(3,3)), (1/(4qw))(R(3,2) − R(2,3)), (1/(4qw))(R(1,3) − R(3,1)), (1/(4qw))(R(2,1) − R(1,2)) ]⊤   (1)

The orientation resulting from two subsequent rotations q1 = [qw1, qv1⊤]⊤ ∈ S3 and q2 = [qw2, qv2⊤]⊤ ∈ S3 is defined through quaternion multiplication, denoted by the ⊗ operator [36]:

q3 = q1 ⊗ q2 = [ qw1 qw2 − qv1⊤ qv2, (qw1 qv2 + qw2 qv1 + [qv1]× qv2)⊤ ]⊤ ∈ S3   (2)

The identity orientation in terms of a unit quaternion is qI = [1, 0, 0, 0]⊤. For q = [qw, qv⊤]⊤ ∈ S3, the inverse of q is given by q−1 = [qw, −qv⊤]⊤ ∈ S3. It is worth noting that q ⊗ q−1 = qI. For m = [m1, m2, m3]⊤ ∈ R3, the skew-symmetric matrix [m]× is defined as:

[m]× = [0, −m3, m2; m3, 0, −m1; −m2, m1, 0] ∈ so(3)

The mapping from a quaternion q = [qw, qv⊤]⊤ ∈ S3 to a rotation matrix Rq ∈ SO(3) is defined by [36]:

Rq(q) = (qw² − ∥qv∥²)I3 + 2qv qv⊤ + 2qw[qv]×   (3)

The inverse of the skew-symmetric matrix function is given by:

vex([m]×) = m ∈ R3   (4)

Let Pa(·) : R3×3 → so(3) be the anti-symmetric projection operator, where

Pa(M) = (1/2)(M − M⊤) ∈ so(3), ∀M ∈ R3×3   (5)

The orientation of a rigid body can also be represented by a rotation angle θ ∈ R around a unit vector u ∈ S2 ⊂ R3 with S2 := { u ∈ R3 | ∥u∥ = 1 }. The angle-axis parametrization is obtained from the rotation matrix R ∈ SO(3), where [36]:

θR(R) = arccos((Tr(R) − 1)/2) ∈ R,   uR(R) = (1/sin(θR(R))) vex(Pa(R)) ∈ S2   (6)

with Tr(·) denoting the trace function.
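To make the preliminaries above concrete, the following is a minimal numerical sketch (our own, not the authors' implementation) of the quaternion maps in (1)-(3), using NumPy and the scalar-first convention q = [qw, qx, qy, qz] defined above; the helper names are ours.

import numpy as np

def skew(m):
    # [m]x, the skew-symmetric matrix of m in R3
    return np.array([[0.0, -m[2], m[1]],
                     [m[2], 0.0, -m[0]],
                     [-m[1], m[0], 0.0]])

def quat_mul(q1, q2):
    # Quaternion multiplication q1 (x) q2, Eq. (2)
    w1, v1 = q1[0], q1[1:]
    w2, v2 = q2[0], q2[1:]
    w = w1 * w2 - v1 @ v2
    v = w1 * v2 + w2 * v1 + skew(v1) @ v2
    return np.concatenate(([w], v))

def quat_to_rot(q):
    # Rq(q), Eq. (3)
    w, v = q[0], q[1:]
    return (w**2 - v @ v) * np.eye(3) + 2.0 * np.outer(v, v) + 2.0 * w * skew(v)

def rot_to_quat(R):
    # qR(R), Eq. (1); assumes 1 + Tr(R) > 0 (a branch-based variant is needed
    # near theta = pi for numerical robustness)
    w = 0.5 * np.sqrt(1.0 + R[0, 0] + R[1, 1] + R[2, 2])
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]]) / (4.0 * w)
    return np.concatenate(([w], v))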

Let the rotation vector r be described via the angle-axis parametrization as follows:

r = rθ,u(θ, u) = θu ∈ R3, ∀θ ∈ R, u ∈ S2   (7)

The rotation matrix associated with a rotation vector is given by [36]:

Rr(r) = exp([r]×) ∈ SO(3)   (8)

The mapping from the rotation vector representation to the quaternion representation is found by utilizing (1), (6), and (7) such that qr : R3 → S3:

qr(r) = qR(Rr(r)) ∈ S3   (9)

The rotation vector corresponding to a rotation represented by a quaternion is found in light of (3), (6), and (7) by rq : S3 → R3 such that

rq(q) = rθ,u(θR(Rq(q)), uR(Rq(q)))   (10)

To facilitate addition ⊞ and subtraction ⊟ between a rotation vector r ∈ R3 and a quaternion q ∈ S3, using the definitions in (4), (7), and (9), the following operations are defined:

q ⊞ r := qr(r) ⊗ q ∈ S3   (11)
q ⊟ r := qr(r)−1 ⊗ q ∈ S3   (12)

In light of (3), the subtraction of two quaternions q1, q2 ∈ S3 is given by:

q1 ⊟ q2 := rq(q1 ⊗ q2−1) ∈ R3   (13)

Consider a set of quaternions Q = {qi ∈ S3} and their corresponding weights W = {wi ∈ R}. To compute the weighted average of these quaternions, the matrix E is first constructed as:

E = Σi wi qi qi⊤ ∈ R4×4

Next, the quaternion weighted mean QWM(Q, W) is the eigenvector corresponding to the largest-magnitude eigenvalue of E such that:

QWM(Q, W) = EigVector(E)i ∈ S3,   i = argmax(|EigValue(E)i|)   (14)

A d-dimensional random variable (RV) h ∈ Rd drawn from a Gaussian distribution with mean h̄ ∈ Rd and covariance matrix Ch ∈ Rd×d is represented by h ∼ N(h̄, Ch). Note that the expected value of h, denoted by E(h), is equal to h̄. The Gaussian (Normal) probability density function of h is formulated below:

P(h) = N(h | h̄, Ch) = exp(−(1/2)(h − h̄)⊤ Ch−1 (h − h̄)) / √((2π)d det(Ch)) ∈ R

where P(h) is the probability density of h.

III. PROBLEM FORMULATION

In this section, the kinematic and measurement models are introduced. After defining the state vector, a state transition function is established to define the relation between the navigation state and the input data. Moreover, the interdependence between the state and measurement vectors is formulated, which is essential for the proposed DeepUKF-VIN performance.

A. Navigation Model in 3D

The true navigation kinematics of a vehicle travelling in 3D space are represented by [2], [19]:

q̇ = (1/2)Γ(ω)q ∈ S3,   ṗ = v ∈ R3,   v̇ = g + Rq(q)a ∈ R3   (15)

with

Γ(ω) = [0, −ω⊤; ω, −[ω]×] ∈ R4×4

where q describes the vehicle's orientation in terms of a quaternion, ω ∈ R3 and a ∈ R3 denote angular velocity and acceleration, respectively, while p ∈ R3 and v ∈ R3 refer to the vehicle's position and linear velocity, respectively, with q, ω, a ∈ {B} and p, v ∈ {W}. In light of [2], the kinematics in (15) are equivalent to:

[q̇; ṗ; v̇; 0] = Mc(q, ω, a) [q; p; v; 1],   Mc(q, ω, a) = [(1/2)Γ(ω), 0, 0, 0; 0, 0, I3, 0; 0, 0, 0, g + Rq(q)a; 0, 0, 0, 0]   (16)

Since the onboard data processor operates in discrete space and the sensor data are updated at discrete instances, the continuous kinematics in (16) need to be discretized. The true discrete values at the kth time-step of q ∈ S3, ω ∈ R3, a ∈ R3, p ∈ R3, and v ∈ R3 are denoted by qk ∈ S3, ωk ∈ R3, ak ∈ R3, pk ∈ R3, and vk ∈ R3, respectively. The equivalent discretized kinematics of the expression in (16) is [2]:

[qk; pk; vk; 1] = exp(Mck−1 dT) [qk−1; pk−1; vk−1; 1]   (17)

where Mck−1 = Mc(qk−1, ωk−1, ak−1) and dT denotes the sample time.

B. Measurement Model and Setup

The IMU measurements at time step k (angular velocity ωm,k ∈ R3 and acceleration am,k ∈ R3) and the related biases in the readings (bω,k and ba,k) are as follows [27], [28]:

IMU: ωm,k = ωk + bω,k + ηω,k,   am,k = ak + ba,k + ηa,k
Bias: bω,k = bω,k−1 + ηbω,k,   ba,k = ba,k−1 + ηba,k   (18)

where ηω,k and ηa,k refer to the gyroscope and accelerometer additive zero-mean white noise, respectively, while the zero-mean white noise terms ηbω,k and ηba,k ∈ R3 correspond to bω,k and ba,k ∈ R3. In other words,

ηω,k ∼ N(03, Cηω,k),   ηa,k ∼ N(03, Cηa,k),   ηbω,k ∼ N(03, Cηbω,k),   ηba,k ∼ N(03, Cηba,k)   (19)

Note that if the noise vectors in (19) are assumed to be uncorrelated, their covariance matrices will be diagonal with positive entries, such that [27], [28]:

Cηω,k = diag(c²ηω,k),   Cηa,k = diag(c²ηa,k),   Cηbω,k = diag(c²ηbω,k),   Cηba,k = diag(c²ηba,k)   (20)

where cηω,k, cηa,k, cηbω,k, and cηba,k ∈ R3 represent the square roots of the diagonal elements of their respective covariance matrices. Let us define the state vector xk ∈ Rdx and the augmented state vector xak ∈ Rda such that

xk = [qk⊤, pk⊤, vk⊤, bω,k⊤, ba,k⊤]⊤ ∈ Rdx,   xak = [xk⊤, ηx,k⊤]⊤ ∈ Rda   (21)

with dx = 16 and da = 22 representing the dimensions of the state vector and the augmented state vector, respectively, and ηx,k representing the augmented noise vector, such that

ηx,k = [ηω,k⊤, ηa,k⊤]⊤ ∈ Rdηx   (22)

where dηx = 6 is the dimension of the augmented noise. Consider formulating the additive noise vector such that:

ηw,k = [0⊤10, ηbω,k⊤, ηba,k⊤]⊤ ∈ Rdx   (23)

Then the expression in (17), using (18), (21), (22), and (23), can be written in the form of a state transition function f : Rda → Rdx such that

xk = f(xak−1, uk−1) + ηw,k−1   (24)

with the input vector defined as uk−1 = [ωm,k−1⊤, am,k−1⊤]⊤ ∈ Rdu and du = 6 being the dimension of the input vector. Let the landmark coordinates in {W} be represented as {lw,k,i ∈ R3}i, where these coordinates are either known from prior information or obtained from a series of stereo camera measurements. Similarly, let the corresponding coordinates measured by the latest stereo camera data in {B} be denoted as {lb,k,i ∈ R3}i, where i = {1, . . . , dl,k} represents the index of each measured landmark, and dl,k ∈ R denotes the number of landmarks at each measurement step. Note that dl,k is not constant and may change at each step. We then construct the concatenated vectors lw,k ∈ Rdz,k and lb,k ∈ Rdz,k such that:

lw,k = [lw,k,1⊤, . . . , lw,k,dl,k⊤]⊤ ∈ Rdz,k,   lb,k = [lb,k,1⊤, . . . , lb,k,dl,k⊤]⊤ ∈ Rdz,k

where dz,k = 3dl,k represents the dimension of the measurement vector zk = lb,k. The ith measurement function hi : Rdx × R3 → R3 is defined as:

hi(xk, lw,i) = Rq(qk)⊤ (lw,i − pk) ∈ R3   (25)

where qk, pk ⊂ xk (see (21)). The measurement function h : Rdx × Rdz,k → Rdz,k is given by:

h(xk, lw) = [hi(xk, lw,1)⊤, . . . , hi(xk, lw,dl,k)⊤]⊤ ∈ Rdz,k   (26)

The measurement function in (26) is used to find the measurement vector zk at each time step k such that

zk = h(xk, lw) + ηl,k ∈ Rdz,k   (27)

where ηl,k ∼ N(0dz,k, Cηl,k) is the measurement additive white noise. The covariance matrix Cηl,k ∈ Rdz,k×dz,k is defined by:

Cηl,k = c²ηl,k Idz,k   (28)

where cηl,k ∈ R is a scalar. This definition is particularly useful since dz,k may vary at each time step k.
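The following is a hedged sketch (ours, under the scalar-first quaternion convention stated earlier) of the landmark measurement model in (25)-(27): each world-frame landmark is expressed in {B} through Rq(qk)⊤(lw,i − pk) and the results are stacked into zk. quat_to_rot is the helper for Rq(q) from the earlier sketch.

import numpy as np

def h_single(q, p, l_w):
    # hi(xk, lw,i) in Eq. (25)
    return quat_to_rot(q).T @ (l_w - p)

def h_stacked(q, p, landmarks_w):
    # h(xk, lw) in Eq. (26); landmarks_w has shape (dl, 3), the output has length 3*dl
    return np.concatenate([h_single(q, p, l) for l in landmarks_w])

def simulate_measurement(q, p, landmarks_w, c_eta_l, rng=None):
    # zk in Eq. (27) with the isotropic noise covariance of Eq. (28)
    rng = rng or np.random.default_rng()
    z = h_stacked(q, p, landmarks_w)
    return z + c_eta_l * rng.standard_normal(z.shape)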

IV. QUATERNION-BASED UKF-VIN

This section provides a detailed description of the quaternion-based Unscented Kalman Filter for 3D Visual-Inertial Navigation (UKF-VIN) design, which will be subsequently tightly coupled with the proposed Deep Learning-based Adaptation Mechanism (DLAM) for adaptive tuning of the UKF-VIN covariance matrices. The proposed approach builds upon the standard UKF [26], incorporating specific modifications to address challenges inherent in navigation tasks. These adaptations ensure the UKF operates effectively within the quaternion space S3, preserving the physical validity of the orientation estimation. Furthermore, the design accommodates the intermittent nature of vision data, which is not available at every time step, while consistently integrating IMU data.

A. Initialization

The filter is initialized with the initial state vector estimate x̂0|0 ∈ Rdx and its associated covariance estimate P0|0 ∈ R(dx−1)×(dx−1), which represents the confidence in the initial state estimate. The reduced dimensionality of the covariance matrix arises from the fact that the quaternion in the state vector has three degrees of freedom, despite having four components [23].

B. Aggregate Predict

At each time step k where image data is available, the current state vector is predicted using the last db input vectors uk−1−db:k−1 ∈ Rdb×du, along with the previous state vector estimate x̂k−1−db|k−1 ∈ Rdx and the covariance matrix Pk−1−db|k−1 ∈ R(dx−1)×(dx−1). Here db represents the number of measurements received from the IMU between the current and the last instance when image data was available. For each j = {k−1−db, . . . , k−1}, the following steps are executed sequentially.

1) Augmentation: The augmented state vector x̂aj|j ∈ Rda and the augmented covariance matrix Paj|j ∈ R(da−1)×(da−1) are constructed as follows:

x̂aj|j = [x̂j|j⊤, 0⊤dηx]⊤ ∈ Rda   (29)
Paj|j = diag(Pj|j, Cηx,k−1) ∈ R(da−1)×(da−1)   (30)

where Cηx,k = diag(Cηω,k, Cηa,k) is the covariance matrix of ηx,k as defined in (22).

2) Sigma Points Construction: Using the augmented estimate of the state vector and its covariance, while accounting for the reduced dimensionality of the quaternions and applying the unscented transform [26], the sigma points representing the prior distribution are computed as follows:

χaj|j,0 = x̂aj|j ∈ Rda
χaj|j,ν = x̂aj|j ⊞ δx̂aj,ν ∈ Rda,   ν = {1, . . . , da−1}
χaj|j,ν+(da−1) = x̂aj|j ⊟ δx̂aj,ν ∈ Rda,   ν = {1, . . . , da−1}   (31)

where δx̂aj,ν = [√((da−1+λ) Paj|j)]ν ∈ Rda−1, with the subscript ν representing the νth column. The operators ⊞ and ⊟ in (31) are defined in accordance with (11) and (12), such that

x̂aj|j ⊞ δx̂aj,ν = [x̂aj|j,q ⊞ δx̂aj,ν,r; x̂aj|j,− + δx̂aj,ν,−] ∈ Rda   (32)
x̂aj|j ⊟ δx̂aj,ν = [x̂aj|j,q ⊟ δx̂aj,ν,r; x̂aj|j,− − δx̂aj,ν,−] ∈ Rda   (33)

where x̂aj|j,q ∈ S3 and x̂aj|j,− ∈ Rda−4 represent the quaternion and non-quaternion components of x̂aj|j ∈ Rda, respectively, and δx̂aj,ν,r ∈ R3 and δx̂aj,ν,− ∈ Rda−4 denote the rotation-vector and non-rotation-vector components of δx̂aj,ν ∈ Rda−1, respectively. Note that λ ∈ R is a tuning parameter that controls the spread of the sigma points.

3) Sigma Points Propagation: In this step, the sigma points defined in (31) are propagated through the state transition function (24) to obtain the predicted sigma points {χj+1|j,ν}ν such that

χj+1|j,ν = f(χaj|j,ν, uj) ∈ Rdx,   ν = {0, . . . , 2(da−1)}   (34)

4) Calculate the Predicted Mean and Covariance: The weighted mean x̂j+1|j ∈ Rdx and covariance Pj+1|j ∈ R(dx−1)×(dx−1) of the predicted sigma points {χj+1|j,ν}ν are determined in this step in accordance with (14). These quantities are calculated as follows:

x̂j+1|j = [QWM({χj+1|j,ν,q}ν, {wmν}ν); Σν wmν χj+1|j,ν,−] ∈ Rdx   (35)
Pj+1|j = Σν wcν (χj+1|j,ν ⊟ x̂j+1|j)(χj+1|j,ν ⊟ x̂j+1|j)⊤ + Cηw,k−1 ∈ R(dx−1)×(dx−1)   (36)

with χj+1|j,ν,q ∈ S3 and χj+1|j,ν,− ∈ Rdx−4 representing the quaternion and non-quaternion components of χj+1|j,ν ∈ Rdx, respectively, and the sums in (35) and (36) taken over ν = 0, . . . , 2(da−1). Note that in (35) the quaternion weighted average (14) is used for the quaternion components of the propagated sigma point vectors and the straightforward weighted average for the non-orientation components. The weights {wmν}ν and {wcν}ν in (35) and (36) are derived from:

wm0 = λ / (λ + (da−1)) ∈ R
wc0 = λ / (λ + (da−1)) + 1 − α² + β ∈ R
wmν = wcν = 1 / (2((da−1) + λ)) ∈ R,   ν = {1, . . . , 2(da−1)}   (37)

where α and β ∈ R are tuning parameters. The ⊟ operator in (36) is defined in accordance with (13) as follows:

χj+1|j,ν ⊟ x̂j+1|j = [χj+1|j,ν,q ⊟ x̂j+1|j,q; χj+1|j,ν,− − x̂j+1|j,−] ∈ Rdx−1   (38)

where x̂j+1|j,q ∈ S3 and x̂j+1|j,− ∈ Rdx−4 represent the quaternion and non-quaternion components of x̂j+1|j ∈ Rdx, respectively. Note that Cηw,k = diag(0dx−7, Cηbω,k, Cηba,k) ∈ R(dx−1)×(dx−1) denotes the covariance matrix of ηw,k as defined in (23).

5) Iterate over batch: If j = k−1, corresponding to the end of the batch, the predicted state estimate x̂k|k−1, the covariance Pk|k−1, and the predicted sigma points {χj+1|j,ν}ν, as defined in (35), (36), and (34), respectively, are passed to the update step (see Section IV-C). Otherwise, x̂j+1|j+1 and Pj+1|j+1 are set to x̂j+1|j and Pj+1|j, respectively. The aggregate prediction algorithm then increments j ← j + 1 and continues by returning to Section IV-B1.

C. Update

1) Calculate Measurement Sigma Points and Their Statistics: At time step k, the predicted sigma points {χk|k−1,ν}ν are passed through the measurement function (26) to calculate the measurement sigma points {ζk,ν ∈ Rdz,k}ν such that

ζk,ν = h(χk|k−1,ν, lw) ∈ Rdz,k   (39)

Considering (27) and (37), the expected value and covariance of {ζk,ν}ν, denoted by ẑk ∈ Rdz,k and Pzk ∈ Rdz,k×dz,k, respectively, are determined as follows:

ẑk = Σν wmν ζk,ν   (40)
Pzk = Σν wcν [ζk,ν − ẑk][ζk,ν − ẑk]⊤ + Cηl,k   (41)

Using (37), (38), and (40), the cross-covariance matrix Pxk,zk ∈ R(dx−1)×dz,k is calculated by

Pxk,zk = Σν wcν [χk|k−1,ν ⊟ x̂k|k−1][ζk,ν − ẑk]⊤   (42)

Note that the operator ⊟ in (42) is defined in (38).
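As an illustration of the quaternion-aware unscented transform in (31)-(33) used throughout Sections IV-B and IV-C, the following hedged sketch (ours) perturbs the quaternion block of the state on S3 via q ⊞ r = qr(r) ⊗ q and the remaining components additively. quat_mul is the helper for (2) above, the state layout [q(4), rest] is an assumption, and a Cholesky factor replaces the SVD-based matrix square root mentioned later in the paper for brevity.

import numpy as np

def rotvec_to_quat(r):
    # qr(r) from Eqs. (7)-(9), written in closed form
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = r / angle
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def boxplus(x, delta):
    # Eq. (32); delta has dimension len(x) - 1 (3-dim rotation vector + additive rest).
    # Calling boxplus with -delta reproduces the boxminus of Eq. (33), since qr(-r) = qr(r)^(-1).
    q_new = quat_mul(rotvec_to_quat(delta[:3]), x[:4])
    return np.concatenate([q_new, x[4:] + delta[3:]])

def sigma_points(x_hat, P, lam):
    # Eq. (31): 2*(d-1) + 1 sigma points, where d - 1 = P.shape[0]
    n = P.shape[0]
    S = np.linalg.cholesky((n + lam) * P)          # column-wise matrix square root
    pts = [x_hat]
    pts += [boxplus(x_hat, S[:, nu]) for nu in range(n)]
    pts += [boxplus(x_hat, -S[:, nu]) for nu in range(n)]
    return pts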

2) Calculate the Current State Estimate: First, the Kalman gain Kk ∈ R(dx−1)×dz,k, based on (42) and (41), is computed as

Kk = Pxk,zk Pzk−1 ∈ R(dx−1)×dz,k   (43)

The correction vector δx̂k ∈ Rdx−1 is then derived, using (40) and (43), as

δx̂k = Kk (zk − ẑk) ∈ Rdx−1   (44)

Representing the rotation and non-rotation components of δx̂k ∈ Rdx−1 as δx̂k,r ∈ R3 and δx̂k,− ∈ Rdx−4, respectively, the updated state estimate x̂k|k is computed as

x̂k|k = x̂k|k−1 ⊞ δx̂k   (45)

Note that the ⊞ operator in (45) has been defined in (32). Finally, the covariance matrix associated with this state estimate is updated, based on (36), (41), and (43), as

Pk|k = Pk|k−1 − Kk Pzk Kk⊤   (46)

D. Iterate and Collect IMU Measurements

Proceed to the next index by setting k ← k + 1. The IMU measurements uk−1−db:k−1 ∈ Rdb×du are collected until image data becomes available at zk, at which point the algorithm proceeds to Section IV-B. If image data is not yet available, the collection of IMU data continues.

V. DEEP LEARNING-BASED ADAPTATION MECHANISM (DLAM)

This section provides a detailed discussion of the design of the proposed DLAM. This mechanism adaptively updates the covariance matrices of the noise parameters used by the quaternion-based UKF-VIN at each time step, leveraging the input data. The DLAM is designed to enhance the filter's performance by dynamically adjusting to changes in noise characteristics, thereby ensuring more accurate and robust state estimation. The proposed DLAM is composed of two neural networks, namely the IMU-Net (Section V-A) and the Vision-Net (Section V-B). The IMU-Net processes the last dGRU ∈ R measurement vectors (see (24)) as input, while the Vision-Net takes the current stereo image measurements as input. Each network produces scaling factors corresponding to its respective sensor covariance matrices. Given that the IMU noise model includes 12 unknown terms (see (20)), and the vision unit covariance matrix contains a single unknown element (see (28)), the IMU-Net and Vision-Net generate outputs of size 12 and 1, respectively. This can be formulated as:

{γi,k}i={1,...,12} = IMUNet(uk−1−dGRU:k−1, WIN)   (47)
γ13,k = VisionNet(imgl, imgr, WVN)   (48)

where uk−1−dGRU:k−1 ∈ RdGRU×du represents the last dGRU measurement vectors, imgl and imgr denote the left and right images, respectively, and WIN and WVN represent the weights and biases of the IMU-Net and Vision-Net, respectively. To explain the use of the scaling parameters {γi,k}i={1,...,13} generated by IMU-Net and Vision-Net, let us define the standard deviation vector ck ∈ R13 using the square roots of the diagonal elements of the covariance matrices in (20) and (28), such that:

ck = [cηω,k⊤, cηa,k⊤, cηbω,k⊤, cηba,k⊤, cηl,k]⊤ ∈ R13   (49)

For each ith element of ck, denoted by ck,i ∈ R, let c̄k,i ∈ R represent its nominal value, obtained through traditional offline tuning methods. Then, at each time step k, using the scaling parameters {γi,k}i={1,...,13} from (47) and (48), the standard deviation vector elements defined in (49) are computed as in [32], [33]:

ck,i = c̄k,i 10^(υ tanh γi,k),   i = {1, 2, . . . , 13}   (50)

where υ ∈ R specifies the degree to which the predicted ck,i may deviate from the nominal value c̄k,i. Considering (50), (49), (20), and (28), the covariance matrices used by the filter are found as follows:

Cηω,k = diag(c²k,1:3),   Cηa,k = diag(c²k,4:6),   Cηbω,k = diag(c²k,7:9),   Cηba,k = diag(c²k,10:12),   Cηl,k = c²k,13 Idz,k   (51)

where ck,i:j ∈ Rj−i+1 denotes the ith to jth components of ck.
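A hedged sketch (ours) of the adaptation rule in (49)-(51) follows: the 13 network outputs stretch or shrink the nominal standard deviations by at most a factor of 10 to the power of plus or minus υ, and the scaled values are packed back into the filter covariance matrices; the variable names are our own.

import numpy as np

def scale_stddevs(gamma, c_bar, upsilon):
    # Eq. (50): c_k,i = cbar_k,i * 10^(upsilon * tanh(gamma_i)), i = 1..13
    return c_bar * 10.0 ** (upsilon * np.tanh(gamma))

def build_covariances(c, d_z):
    # Eq. (51): diagonal IMU-related covariances and an isotropic vision covariance
    return {
        "C_eta_omega":  np.diag(c[0:3] ** 2),
        "C_eta_a":      np.diag(c[3:6] ** 2),
        "C_eta_bomega": np.diag(c[6:9] ** 2),
        "C_eta_ba":     np.diag(c[9:12] ** 2),
        "C_eta_l":      c[12] ** 2 * np.eye(d_z),
    }

# Usage sketch: gamma stacks the IMU-Net (12) and Vision-Net (1) outputs of (47)-(48)
# cov = build_covariances(scale_stddevs(gamma, c_bar, upsilon), d_z=3 * d_l)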

A. IMU-Net

It is assumed that the covariance matrices Cηω,k and Cηa,k can be optimized for each batch by considering the input vector, which comprises the last db IMU measurements; in practice this is a feasible and realizable condition. Recurrent deep learning frameworks, particularly Recurrent Neural Networks (RNNs) and their advanced variants, have proven effective in modeling sequential data due to their capacity to capture dependencies across time steps [37]. While traditional RNNs are foundational, they often struggle with long-term dependencies due to challenges such as vanishing gradients [38], [39], leading to the development of more sophisticated architectures, such as Long Short-Term Memory (LSTM) networks [40] and Gated Recurrent Units (GRUs) [41]. LSTMs and GRUs were specifically designed to address the limitations of standard RNNs by incorporating gating mechanisms that regulate information flow, enabling more stable long-term memory retention. In particular, GRUs offer a streamlined architecture by combining the forget and input gates of LSTMs into a single update gate, making them computationally more efficient while retaining the ability to model complex temporal relationships. GRUs have been shown to outperform LSTMs, particularly when the dataset is small, while also being less computationally intensive [41], [42].

For a GRU cell at time step l, let αl ∈ Rdu denote the GRU cell input vector. Each GRU cell computes its hidden state →hl ∈ Rdh by leveraging three key components: the update gate →zl ∈ Rdh, the reset gate →rl ∈ Rdh, and the candidate hidden state →nl ∈ Rdh, where dh represents the dimensionality of the hidden state. The equations governing these components are as follows:

• Update Gate:

→zl = σ(→Wz αl + →Uz →hl−1 + →bz) ∈ Rdh   (52)

where →Wz ∈ Rdh×du and →Uz ∈ Rdh×dh are weight matrices, and →bz ∈ Rdh is the bias term. Note that the function σ : Rdh → Rdh denotes the sigmoid function. The update gate controls the degree to which the previous hidden state →hl−1 is retained.

• Reset Gate:

→rl = σ(→Wr αl + →Ur →hl−1 + →br) ∈ Rdh   (53)

where →Wr ∈ Rdh×du, →Ur ∈ Rdh×dh, and →br ∈ Rdh are the corresponding parameters for the reset gate, which determines the relevance of the previous hidden state in computing the candidate hidden state.

• Candidate Activation:

→nl = tanh(→Wn αl + →rl ◦ (→Un →hl−1) + →bn)   (54)

where the ◦ operator represents element-wise multiplication, and →Wn ∈ Rdh×du, →Un ∈ Rdh×dh, and →bn ∈ Rdh are the weight matrices and bias vector associated with the candidate hidden state. Note that tanh : Rdh → Rdh denotes the hyperbolic tangent function.

• Hidden State Update:

→hl = →zl ◦ →hl−1 + (1 − →zl) ◦ →nl   (55)

This update equation combines the previous hidden state →hl−1 and the candidate →nl, governed by the update gate →zl.

Equations (52), (53), (54), and (55) can be adapted to calculate the backward hidden vector ←hl by moving in reverse across the sequence, computing each hidden state based on the subsequent hidden vector ←hl+1 and the input vector αl. For each GRU layer, consisting of dGRU GRU cells, both forward and backward passes are computed. The forward and backward hidden vectors for each cell are concatenated to form:

hl = [→hl⊤, ←hl⊤]⊤ ∈ R2dh   (56)

In summary, equations (52), (53), (54), (55), and (56) collectively define the GRU function GRU : Rdu × Rdh → R2dh as follows:

hl = GRU(αl, →hl−1, ←hl+1)

To design IMU-Net, two layers of GRUs are stacked with a fully connected network at the end, with the Rectified Linear Unit (ReLU) activation function between the GRUs and the fully connected network, producing the scaling parameters (see (47)). Note that the input to the first GRU layer (that is, α1, α2, . . . , αdGRU) is set to the last dGRU IMU measurements (uk−1−dGRU:k−1 ∈ RdGRU×du). This process is visualized in Fig. 1.

Fig. 1: IMU-Net Architecture Schematics (two stacked bidirectional GRU layers followed by a ReLU activation and an output layer).

B. Vision-Net

The Vision-Net network is designed to adaptively estimate the measurement covariance matrix based on vision data. The uncertainty in image measurements can be effectively estimated from the current stereo-vision measurements. In Vision-Net, each image is processed through a 2D convolutional layer, followed by 2D max pooling, then a second 2D convolutional layer, and another 2D max pooling layer. The resulting features are flattened, concatenated, and subsequently passed through two fully connected layers, ultimately producing the scaling parameter γ13 ∈ R (see (48)). Each convolutional layer is followed by a ReLU activation function. A visualization of Vision-Net is provided in Fig. 2.

Fig. 2: Vision-Net Architecture Schematics (per image: Conv1+ReLU (16 × 5 × 5), MaxPool (4 × 4), Conv2+ReLU (32 × 5 × 5), MaxPool (4 × 4), Flatten; the concatenated features pass through a 32-unit hidden layer to the single output γ13).

VI. DEEPUKF-VIN TRAINING AND IMPLEMENTATION

In summary, the DLAM-equipped UKF-VIN algorithm, referred to as quaternion-based DeepUKF-VIN, is illustrated in Fig. 3. For IMU-Net, we selected the last dGRU = 10 IMU measurements as input (see (47)). This choice is based on the IMU's 200 Hz sample rate, compared to the image data's 20 Hz sample rate, meaning there are at least 10 IMU measurements between each pair of vision measurements. Although the structure of IMU-Net allows for variable-length time-series input data, using a fixed length of 10 measurements enhances consistency and predictability.

Fig. 3: Summary schematic architecture of quaternion-based DeepUKF-VIN. First, the Aggregate Predict step of the filter is executed, incorporating the last known state information, aggregated IMU data, and the IMU noise covariance computed by IMU-Net. Next, the Update step is performed using the predicted state information and the vision covariance matrix estimated by Vision-Net. Raw IMU and vision data are used as inputs to IMU-Net and Vision-Net, respectively.

To train the models, a loss function must be defined. Let the estimated orientation, position, and velocity at time step k be denoted by q̂k, p̂k, and v̂k, respectively, with their corresponding estimation errors denoted by re,k, pe,k, and ve,k. These errors are defined as follows:

re,k = qk ⊟ q̂k ∈ R3,   pe,k = pk − p̂k ∈ R3,   ve,k = vk − v̂k ∈ R3   (57)

For a set of dmini-batch ∈ R estimations in the i-th mini-batch, the total loss is computed as the weighted sum of the mean square errors (MSE) of the individual errors defined in (57). The loss function for the mini-batch is given by:

Lossi = wq (Σ ∥re,k∥²)/dmini-batch + wp (Σ ∥pe,k∥²)/dmini-batch + wv (Σ ∥ve,k∥²)/dmini-batch   (58)

where wq, wp, and wv ∈ R are weights that determine the relative importance of each term and are tuned offline. As the steady-state performance of the filter is of greater importance than its transient performance, the loss function in (58) is evaluated starting from the 51st data point onward. This ensures that the loss function disregards the first 50 data points, which represent the transient response of the filter.
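The following is a hedged PyTorch-style sketch (ours, not the released implementation) of the mini-batch loss in (58), evaluated on the error sequences of (57) while skipping the first 50 transient points; the weight values are those reported later in this section.

import torch

def vin_loss(r_e, p_e, v_e, wq=1000.0, wp=600.0, wv=100.0, skip=50):
    # Eq. (58): weighted mean square errors of orientation, position, and velocity,
    # computed from the 51st data point onward (steady-state portion only)
    r_e, p_e, v_e = r_e[skip:], p_e[skip:], v_e[skip:]
    return (wq * r_e.pow(2).sum(dim=-1).mean()
            + wp * p_e.pow(2).sum(dim=-1).mean()
            + wv * v_e.pow(2).sum(dim=-1).mean())

# Usage sketch: r_e, p_e, v_e are (d_mini-batch, 3) tensors of the errors in Eq. (57)
# loss_i = vin_loss(r_e, p_e, v_e)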
Fig. 4: Training loss convergence over 30 epochs.

Fig. 5: Matched feature points between the left and right images of a set of stereo image measurements using the EuRoC dataset [43].

The V1 02 medium part of the EuRoC dataset [43] has been utilized for training. This dataset includes IMU measurements, recorded at 200 Hz using the ADIS16448 sensor mounted on an MAV. Stereo images are captured as well by the Aptina MT9V034 global shutter camera at a rate of 20 Hz. Ground truth data is provided at 200 Hz, measured via the Vicon motion capture system. At each epoch, the whole dataset is utilized as a single batch. In other words, given the initial state estimate and covariance matrix, at each time step the filter estimates the state vector based on the IMU and landmark measurements, as well as the covariance matrices found by the IMU and Vision-Nets. To manage computational resources effectively, the data is divided into mini-batches of size 32. After each mini-batch is processed, the loss value is found per (58). The gradient of this loss with respect to the network weights is computed and clipped to one to avoid gradient explosion. These gradients are accumulated through the mini-batches to find the gradient of the batch. After all the mini-batches in a batch are processed, the weights are updated using the Adam optimizer [44]. L2 regularization [45] has been performed during the weight training to avoid overfitting.

Algorithm 1 Training Procedure
Initialization:
1: Set initial values for x̂0 and P0.
2: Create mini-batches of size 32 from uk−1−dGRU:k−1, zk, and xk.
For each i-th epoch:
3:  Initialize Gradient ← 0.
    For each j-th mini-batch:
4:    Initialize Lossmini-batch ← 0.
5:    Compute parameter estimates (see (47) and (48)):
        γ1:12 ← IMUNet(mini-batch, WIN)
        γ13 ← VisionNet(mini-batch, WVN)
6:    Calculate the covariance scaling (see (50) and (51)):
        Covmini-batch ← Cov · 10^(υ tanh(γ))
      For each data point in the mini-batch (Sections IV-B and IV-C):
7:      x̂k|k−1, Pk|k−1 ← Predict(DataPoint, Cov1:12)
        x̂k|k, Pk|k ← Update(x̂k|k−1, zk, Cov13)
8:      Store the current state and covariance estimates.
      End For
9:    Compute the loss for the mini-batch (see (58)):
        Lossmini-batch ← Loss(x̂mini-batch, xmini-batch)
10:   Compute and clip the gradients:
        Gradientmini-batch ← ∂Lossmini-batch/∂Wmodels
        Gradientmini-batch ← clip(Gradientmini-batch, 1)
11:   Accumulate the gradient:
        Gradient ← Gradient + Gradientmini-batch
    End For
12: Update the model weights using the ADAM optimizer:
      Wmodels ← ADAM(Gradient, Wmodels)
13: Reset the gradient: Gradient ← 0
End For
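As a hedged PyTorch sketch (ours) of Algorithm 1, the loop below accumulates clipped gradients over all mini-batches and applies a single Adam update per epoch. Here run_filter is a placeholder for the differentiable Predict/Update pass of Sections IV-B and IV-C with the DLAM-scaled covariances, and vin_loss is the loss sketch of Eq. (58) above. Unlike step 10 of Algorithm 1, the norm clipping below acts on the running accumulated gradient, a simplification.

import torch

def train_epoch(mini_batches, imu_net, vision_net, run_filter, optimizer):
    params = list(imu_net.parameters()) + list(vision_net.parameters())
    optimizer.zero_grad()
    for batch in mini_batches:
        # Filter pass with covariances scaled by the IMU-Net / Vision-Net outputs
        r_e, p_e, v_e = run_filter(batch, imu_net, vision_net)
        loss = vin_loss(r_e, p_e, v_e)
        loss.backward()                                        # gradients accumulate across mini-batches
        torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # guard against gradient explosion
    optimizer.step()                                           # single Adam update per epoch
    optimizer.zero_grad()

# optimizer = torch.optim.Adam(params, weight_decay=1e-4)  # weight decay as an L2-style regularizer (assumed value)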

The IMU and Vision-Networks have 27,276 and 2,901,089 parameters, respectively, with the weights in (58) set to wq = 1000, wp = 600, and wv = 100. The quaternion-based UKF involves eigenvalue decomposition in (14) and the use of singular value decomposition (SVD) for computing the matrix square root in (31). These operations introduce significant challenges in computing the loss gradient, particularly in step 10 of Algorithm 1. To address these challenges, we employed an EKF as the filter during the model training phase. This substitution simplifies gradient computation considerably, thereby enhancing the training efficiency. Despite this modification, we hypothesize that the model can learn the optimal covariance matrices corresponding to sensor uncertainties independently of the filter type used. This hypothesis will be further examined in Section VII. Given these considerations, the implementation was carried out using PyTorch for handling the neural network components and for orientation calculations [46]. The models were trained over 30 epochs, during which the loss function converged to its minimum. The convergence behaviour is illustrated in Fig. 4.

The measurement function is implemented by detecting 2D feature points in each available vision dataset using the KLT algorithm. An example of the results of this step is visualized in Fig. 5. Considering the matched 2D points in the stereo images and the camera calibration data, the feature points in the world coordinate frame {W} are computed using triangulation [47]. These computed points serve as the measurement values in this problem. In summary, the DLAM-equipped UKF-VIN algorithm, named quaternion-based DeepUKF-VIN, is illustrated in Fig. 3. The proposed algorithm leverages IMU-Net and Vision-Net, as described in Sections V-A and V-B, respectively, to compute the covariance matrices of the UKF-VIN, as discussed in Section IV. These components are integrated to enhance the accuracy and performance of the algorithm. The training algorithm is summarized in Algorithm 1.

VII. EXPERIMENTAL VALIDATION

To validate the effectiveness of quaternion-based DeepUKF-VIN, the algorithm is tested using the real-world V1 02 medium EuRoC dataset [43]. For a video of the experiment, visit the following link. The dataset trajectory, as well as the magnitudes of the orientation, position, and velocity errors defined in (57) over time, are visualized in Fig. 6. The errors converge rapidly to near-zero values despite initially high magnitudes, demonstrating quaternion-based DeepUKF-VIN's efficacy. To further examine the results, the individual components of each estimation error are visualized in Fig. 7. It can be observed that all error components converge to near-zero values promptly, further underscoring the effectiveness of the proposed filter.

Fig. 6: Validation results of quaternion-based DeepUKF-VIN: The algorithm is evaluated using the V1 02 medium EuRoC dataset. On the left, the MAV trajectory along with three sample orientations in 3D space is displayed. On the right, the magnitudes of the orientation (top), position (middle), and velocity (bottom) vectors over time are illustrated.

Fig. 7: Components of the orientation (left), position (middle), and velocity (right) estimation error vectors in the V1 02 medium EuRoC dataset experiment using quaternion-based DeepUKF-VIN.

To investigate the effectiveness of the proposed IMU and Vision-Nets, the quaternion-based DeepUKF-VIN was compared to its non-deep counterpart, UKF-VIN, and another Kalman-type filter with a learning component, the DeepEKF. The DLAM algorithm was evaluated in two environments, V1 02 medium and V2 02 medium, which are subsets of the EuRoC dataset [43]. Note that the V2 02 medium dataset was not utilized during the training or validation phases. The dataset was recorded using an Aptina MT9V034 global shutter camera, which captured stereo images at a rate of 20 Hz. Additionally, an ADIS16448 sensor was employed to capture IMU data at 200 Hz, while ground truth data was recorded at 200 Hz using the Vicon motion capture system. To ensure a fair comparison, all filters were configured with the same nominal covariances as DeepUKF-VIN. Furthermore, in each environment, all filters were initialized with the same state vector and covariance matrix. The loss values, as defined in (58), for the aforementioned filters in both experiments are presented in Table II. The proposed DeepUKF-VIN outperformed both its non-deep counterpart (UKF-VIN) and the DeepEKF in terms of loss values across both experiments. The DeepEKF was exposed to data from the first experiment, while the DeepUKF-VIN was never trained on either experiment. Thus, both experiments were entirely novel to the DeepUKF-VIN. To further examine these experiments, the MSE values of the orientation, position, and velocity estimation errors are tabulated in Table III. It can be observed from Table III that across both experiments and all components, quaternion-based DeepUKF-VIN consistently outperformed the non-deep UKF-VIN and the DeepEKF. Specifically, the DeepUKF-VIN yielded lower MSE values across all tested experiments and components, demonstrating its superior performance in orientation, position, and velocity estimation.

TABLE II: Loss Value Comparison of DeepUKF-VIN against UKF-VIN and DeepEKF.
Filter | V1 02 medium | V2 02 medium
DeepEKF | 1918 | 834
UKF-VIN | 132 | 251
DeepUKF-VIN | 88 | 250

TABLE III: Component MSEs for the filters in each experiment.
Filter | MSE Element | V1 02 medium | V2 02 medium
DeepEKF | Orientation | 1.572 | 0.0544
DeepEKF | Position | 9.3188 | 1.1228
DeepEKF | Velocity | 7.5663 | 0.9558
UKF-VIN | Orientation | 0.0015 | 0.0026
UKF-VIN | Position | 0.0929 | 0.3070
UKF-VIN | Velocity | 0.0509 | 0.1319
DeepUKF-VIN | Orientation | 0.0008 | 0.0080
DeepUKF-VIN | Position | 0.0806 | 0.3011
DeepUKF-VIN | Velocity | 0.0282 | 0.0914

VIII. CONCLUSION

In this paper, we proposed an adaptively-tuned Deep Learning Unscented Kalman Filter for 3D Visual-Inertial Navigation (DeepUKF-VIN) to estimate the orientation, position, and velocity of a vehicle with six degrees of freedom (6-DoF) in three-dimensional space. By effectively addressing kinematic nonlinearities through a quaternion-based framework, the algorithm mitigates numerical instabilities commonly associated with Euler-angle representations. DeepUKF-VIN integrates data from a 6-axis Inertial Measurement Unit (IMU) and stereo cameras, achieving robust navigation even in GPS-denied environments. The Deep Learning-based Adaptation Mechanism (DLAM) dynamically adjusts noise covariance matrices based on sensor data, improving estimation accuracy by responding adaptively to varying conditions. Evaluated with real-world data from low-cost sensors operating at low sampling rates, DeepUKF-VIN demonstrated stability and rapid error attenuation. Comparative testing across two experimental setups consistently showed that DeepUKF-VIN outperformed the standard Unscented Kalman Filter (UKF) in all key navigation components. These results underscore the algorithm's superior adaptability, efficacy, and robustness in practical scenarios, validating its potential for accurate and reliable 3D navigation.

Future work could explore the application of the proposed DLAM to other Kalman-type and non-Kalman-type filters developed for the VIN problem. Given that the proposed DLAM was trained using the Extended Kalman Filter (EKF) and validated with the UKF, it is reasonable to hypothesize that integrating DLAM with other algorithms may yield similar benefits. Furthermore, the vision-based component of the proposed DLAM could be adapted for alternative sensor inputs, such as Light Detection and Ranging (LiDAR) and Sound Navigation and Ranging (SONAR), with minimal modifications. Such adaptations have the potential to enhance the performance of algorithms relying on these sensor technologies.

REFERENCES

[1] H. A. Hashim, "Advances in UAV Avionics Systems Architecture, Classification and Integration: A Comprehensive Review and Future Perspectives," Results in Engineering, vol. 25, p. 103786, 2025.
[2] H. A. Hashim, M. Abouheaf, and M. A. Abido, "Geometric Stochastic Filter with Guaranteed Performance for Autonomous Navigation based on IMU and Feature Sensor Fusion," Control Engineering Practice, vol. 116, p. 104926, 2021.
[3] Y.-J. Gong, T. Huang, Y.-N. Ma, S.-W. Jeon, and J. Zhang, "MTrajPlanner: A multiple-trajectory planning algorithm for autonomous underwater vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 3714–3727, 2023.
[4] G. Yang et al., "Homecare robotic systems for healthcare 4.0: Visions and enabling technologies," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 9, pp. 2535–2549, 2020.
[5] X. Bai, Y. Ye, B. Zhang, and S. S. Ge, "Efficient package delivery task assignment for truck and high capacity drone," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 11, pp. 13422–13435, 2024.
[6] J. Hu, H. Niu, J. Carrasco, B. Lennox, and F. Arvin, "Fault-tolerant cooperative navigation of networked UAV swarms for forest fire monitoring," Aerospace Science and Technology, vol. 123, p. 107494, 2022.
[7] K. N. Braun and C. G. Andresen, "Heterogeneity in ice-wedge permafrost degradation revealed across spatial scales," Remote Sensing of Environment, vol. 311, p. 114299, 2024.
[8] F. Chen et al., "Augmented reality navigation for minimally invasive knee surgery using enhanced arthroscopy," Computer Methods and Programs in Biomedicine, vol. 201, p. 105952, 2021.
[9] R. Korkin, I. Oseledets, and A. Katrutsa, "Multiparticle Kalman filter for object localization in symmetric environments," Expert Systems with Applications, vol. 237, p. 121408, 2024.
[10] S. Wattanarungsan, T. Kuwahara, and S. Fujita, "Magnetometer-based attitude determination extended Kalman filter and optimization techniques," IEEE Transactions on Aerospace and Electronic Systems, vol. 59, no. 6, pp. 7993–8004, 2023.
[11] X. Hou and J. Bergmann, "Pedestrian dead reckoning with wearable sensors: A systematic review," IEEE Sensors Journal, vol. 21, no. 1, pp. 143–152, 2021.

[12] H. A. Hashim, A. E. Eltoukhy, and K. G. Vamvoudakis, "UWB Ranging and IMU Data Fusion: Overview and Nonlinear Stochastic Filter for Inertial Navigation," IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 1, pp. 359–369, 2024.
[13] T. M. Roth, F. Freyer, M. Hollick, and J. Classen, "AirTag of the clones: Shenanigans with liberated item finders," 2022 IEEE Security and Privacy Workshops (SPW), pp. 301–311, 2022.
[14] H. A. Hashim, "Exponentially Stable Observer-based Controller for VTOL-UAVs without Velocity Measurements," International Journal of Control, vol. 96, no. 8, pp. 1946–1960, 2023.
[15] P. J. Besl and N. D. McKay, "Method for registration of 3-D shapes," in Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611. SPIE, 1992, pp. 586–606.
[16] A. Myronenko and X. Song, "Point set registration: Coherent point drift," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2262–2275, 2010.
[17] Y. Xu, R. Zheng, S. Zhang, and M. Liu, "Robust inertial-aided underwater localization based on imaging sonar keyframes," IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–12, 2022.
[18] J. A. Christian and S. Cryan, "A survey of LIDAR technology and its use in spacecraft relative navigation," in AIAA Guidance, Navigation, and Control (GNC) Conference, 2013, p. 4641.
[19] H. A. Hashim, "GPS-denied Navigation: Attitude, Position, Linear Velocity, and Gravity Estimation with Nonlinear Stochastic Observer," in 2021 American Control Conference (ACC). IEEE, 2021, pp. 1146–1151.
[20] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
[21] J. Shi et al., "Good features to track," in 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1994, pp. 593–600.
[22] F. Santoso, M. A. Garratt, and S. G. Anavatti, "Visual–inertial navigation systems for aerial robotics: Sensor fusion and technology," IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 260–275, 2016.
[23] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 3565–3572.
[24] K. Sun et al., "Robust stereo visual inertial odometry for fast autonomous flight," IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 965–972, 2018.
[25] A. Odry, R. Fuller, I. J. Rudas, and P. Odry, "Kalman filter for mobile-robot attitude estimation: Novel optimized and adaptive solutions," Mechanical Systems and Signal Processing, vol. 110, pp. 569–589, 2018.
[26] K. Ghanizadegan and H. A. Hashim, "Quaternion-based Unscented Kalman Filter for 6-DoF Vision-based Inertial Navigation in GPS-denied Regions," IEEE Transactions on Instrumentation and Measurement, vol. 74, no. 1, pp. 1–13, 2025.
[27] H. A. Hashim, L. J. Brown, and K. McIsaac, "Nonlinear Stochastic Attitude Filters on the Special Orthogonal Group 3: Ito and Stratonovich," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 9, pp. 1853–1865, 2019.
[28] H. A. Hashim, "Systematic Convergence of Nonlinear Stochastic Estimators on the Special Orthogonal Group SO(3)," International Journal of Robust and Nonlinear Control, vol. 30, no. 10, pp. 3848–3870, 2020.
[29] A. T. Erdem and A. O. Ercan, "Fusing inertial sensor data in an extended Kalman filter for 3D camera tracking," IEEE Transactions on Image Processing, vol. 24, no. 2, pp. 538–548, 2014.
[30] E. A. Wan and R. Van Der Merwe, "The unscented Kalman filter," Kalman Filtering and Neural Networks, pp. 221–280, 2001.
[31] S. Wernitz, E. Chatzi, B. Hofmeister, M. Wolniak, W. Shen, and R. Rolfes, "On noise covariance estimation for Kalman filter-based damage localization," Mechanical Systems and Signal Processing, vol. 170, p. 108808, 2022.
[32] M. Brossard, A. Barrau, and S. Bonnabel, "AI-IMU dead-reckoning," IEEE Transactions on Intelligent Vehicles, vol. 5, no. 4, pp. 585–595, 2020.
[33] H. Zhou et al., "IMU dead-reckoning localization with RNN-IEKF algorithm," in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 11382–11387.
[34] B. Or and I. Klein, "Learning vehicle trajectory uncertainty," Engineering Applications of Artificial Intelligence, vol. 122, p. 106101, 2023.
[35] S. Yan, Y. Liang, and B. Wang, "Multi-level deep learning Kalman filter," in 2023 International Conference on Advanced Robotics and Mechatronics (ICARM). IEEE, 2023, pp. 1113–1118.
[36] H. A. Hashim, "Special Orthogonal Group SO(3), Euler Angles, Angle-axis, Rodriguez Vector and Unit-quaternion: Overview, Mapping and Challenges," arXiv preprint arXiv:1909.06669, 2019.
[37] P. B. Weerakody, K. W. Wong, G. Wang, and W. Ela, "A review of irregular time series data handling with gated recurrent neural networks," Neurocomputing, vol. 441, pp. 161–178, 2021.
[38] T. K. Rusch and S. Mishra, "UnICORNN: A recurrent model for learning very long time dependencies," in International Conference on Machine Learning. PMLR, 2021, pp. 9168–9178.
[39] M. Lechner and R. M. Hasani, "Learning long-term dependencies in irregularly-sampled time series," arXiv preprint arXiv:2006.04418, 2020.
[40] Y. Cheng, J. Wu, H. Zhu, S. W. Or, and X. Shao, "Remaining useful life prognosis based on ensemble long short-term memory neural network," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–12, 2020.
[41] R. Dey and F. M. Salem, "Gate-variants of gated recurrent unit (GRU) neural networks," 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 1597–1600, 2017.
[42] A. N. Shewalkar, D. Nyavanandi, and S. A. Ludwig, "Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU," Journal of Artificial Intelligence and Soft Computing Research, vol. 9, pp. 235–245, 2019.
[43] M. Burri et al., "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
[44] D. P. Kingma, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[45] P. Zhou, X. Xie, Z. Lin, and S. Yan, "Towards understanding convergence and generalization of AdamW," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[46] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari, "Accelerating 3D deep learning with PyTorch3D," arXiv preprint arXiv:2007.08501, 2020.
[47] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
