Attitude Estimation With IMUs using Machine Learning
Preface

This master’s thesis is the final requirement for the degree of Master of Science
in Mechanical Engineering at the Norwegian University of Science and Technol-
ogy. The thesis was written during the spring of 2022, under the supervision of
Professor Olav Egeland.
I would like to thank Olav Egeland for the supervision throughout this thesis. The freedom to explore the subject of this thesis has been appreciated. His feedback and support during the project have been very helpful.
I would also like to thank my family for all the support given during the course of this thesis. I would especially like to thank my girlfriend for all the emotional support given during difficult times in the project.
Abstract
Low cost Micro Electro Mechanical System (MEMS) Inertial Measurement Units
(IMU) measure angular velocities and accelerations using gyroscopes and ac-
celerometers. They are used in a wide variety of applications for estimating
orientation, also referred to as attitude. The signals from the gyroscopes and
accelerometers are typically corrupted by noise and time-varying bias. This reduces the quality of the estimated orientation. Estimating the orientation by open-loop integration typically yields poor estimates that quickly drift from the true value over time.
This thesis builds on an existing method of using a Convolutional Neural Network
(CNN) to obtain reliable estimates for the orientation. This method is compared
to an existing conventional orientation estimation filter to investigate how well the CNN-based method performs. A new method of data augmentation is used
together with the CNN-based method in an attempt to improve the performance
of the CNN. New CNN architectures replacing this existing CNN are also tested,
to further improve the performance of the method.
The existing CNN-based approach was proven to be very effective for orientation
estimation when compared to the conventional orientation estimation filter. The
use of the new technique for data augmentation did not produce consistently bet-
ter results, and was therefore considered to be ineffective in further improving
the performance. New CNN architectures combined with the existing method performed better than the original CNN, further improving the method.
Sammendrag
Contents

Preface
Abstract
Sammendrag
1 Introduction
  1.1 Background and motivation
  1.2 Problem description
2 Preliminaries
  2.1 Rotation matrices, the SO(3)-group
    2.1.1 Rotation matrices
    2.1.2 The Lie algebra so(3)
    2.1.3 The exponential map in SO(3)
    2.1.4 Kinematic differential equation
    2.1.5 The logarithmic map in SO(3)
  2.2 Quaternions
    2.2.1 Definitions and main properties
    2.2.2 Unit quaternions for representing rotations
    2.2.3 The exponential map for quaternions
    2.2.4 Kinematic differential equation
3 Method
  3.1 Overview
  3.2 Modeling of kinematics and IMUs
  3.3 Training a Convolutional Neural Network for estimating IMU corrections
    3.3.1 Overview
    3.3.2 Modeling of noise free angular velocities from the IMU
    3.3.3 Convolutional Neural Networks for time series predictions
    3.3.4 Convolutional Neural Network for corrections of angular velocities
4 Method development
  4.1 Overview
  4.2 Comparison with existing conventional filters
  4.3 Data augmentation
    4.3.1 Data augmentation using virtual rotations
  4.4 Use of different neural network architectures
    4.4.1 Residual neural network
    4.4.2 Dense neural network
6 Conclusion
List of Figures

5.1 AOE, AYE and AIE on the EuRoC dataset using the Madgwick filter
5.2 AOE, AYE and AIE on the TUM VI dataset using the Madgwick filter
5.3 ROE on EuRoC using virtual rotation data augmentation
5.4 ROE on TUM VI using virtual rotation data augmentation

List of Tables
Chapter 1
Introduction
into how well the new method performs. Additionally, methods for improving the performance of the method of [2] are explored. This includes investigating whether different CNN architectures perform better.
Chapter 2
Preliminaries
The theory in this chapter was first presented in a preliminary study on the subject [24], but it is essential for understanding this thesis as well and is therefore also included here.
A rotation matrix $R \in SO(3)$ satisfies
$$R^T R = I \tag{2.1}$$
$$\det(R) = 1 \tag{2.2}$$
$$R^{-1} = R^T \tag{2.3}$$
The exponential map of SO(3) is given by
$$R = \exp(u^\times) = I + \operatorname{sinc}(\|u\|)\, u^\times + \frac{1}{2} \operatorname{sinc}^2\!\left(\frac{\|u\|}{2}\right) u^\times u^\times \tag{2.6}$$
The kinematic differential equation for the rotation matrix is
$$\dot{R} = R\, \omega^\times \in T_R SO(3) \tag{2.8}$$
where $T_R SO(3)$ is the tangent space of the SO(3)-group. Here, $R$ describes the rotation from a spatial frame to the body frame, and $\omega^\times$ is the skew-symmetric representation of the angular velocity in the body frame. The corresponding integration scheme based on increments in angular velocity becomes
$$R_{n+1} = R_n \exp\left((\omega_n\, dt)^\times\right) \tag{2.9}$$
Here, $n = 1, 2, 3, \dots$ denotes the time step and $dt$ is the time increment between measurements. It is assumed that the angular velocity $\omega_n$ is constant during $dt$.
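As an illustration, the exponential map (2.6) and the integration scheme (2.9) can be sketched in a few lines of NumPy. This is a minimal sketch for clarity, not the implementation used in [2]; all function and variable names are chosen here for illustration.

```python
import numpy as np

def skew(u):
    """Skew-symmetric matrix u^x such that (u^x) v = u x v."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def so3_exp(u):
    """Exponential map of SO(3), equation (2.6)."""
    ux = skew(u)
    angle = np.linalg.norm(u)
    # np.sinc(x) computes sin(pi x) / (pi x), hence the division by pi.
    s1 = np.sinc(angle / np.pi)
    s2 = np.sinc(angle / (2.0 * np.pi))
    return np.eye(3) + s1 * ux + 0.5 * s2**2 * (ux @ ux)

def integrate_gyro(R0, omegas, dt):
    """Open-loop integration, equation (2.9): R_{n+1} = R_n exp((w_n dt)^x)."""
    R = R0.copy()
    for w in omegas:
        R = R @ so3_exp(w * dt)
    return R
```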
The rotation angle $\theta$ of $R$ is given by
$$\theta = \arccos\left(\frac{\operatorname{tr} R - 1}{2}\right), \qquad 0 < \theta < \pi \tag{2.10}$$
and
$$\log(R) = \begin{cases} \frac{1}{2}\left(1 + \frac{\theta^2}{6}\right)(R - R^T), & \theta < \delta \\ \frac{\theta}{2\sin\theta}\,(R - R^T), & \delta \le \theta < \pi \end{cases} \tag{2.11}$$
where $\delta$ is chosen such that the error is less than machine precision.
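A corresponding sketch of the logarithmic map (2.10)–(2.11), again purely illustrative, might be:

```python
import numpy as np

def so3_log(R, delta=1e-6):
    """Logarithmic map of SO(3), equations (2.10)-(2.11).
    Returns the skew-symmetric matrix log(R)."""
    # Clipping guards arccos against round-off outside [-1, 1].
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < delta:
        # Taylor expansion of theta / (2 sin(theta)) near theta = 0.
        return 0.5 * (1.0 + theta**2 / 6.0) * (R - R.T)
    return theta / (2.0 * np.sin(theta)) * (R - R.T)
```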
2.2 Quaternions
This section presents the use of quaternions for representing rotations in 3D-space,
and is based on [4].
A quaternion is defined as
$$q \triangleq \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} q_s \\ q_1 \\ q_2 \\ q_3 \end{bmatrix} \in \mathbb{R}^4 \tag{2.12}$$
where $\alpha \in \mathbb{R}$ is the scalar part and $\beta \in \mathbb{R}^3$ is the vector part of the quaternion. The sum and difference of two quaternions $q_1$ and $q_2$ are defined as
$$q_1 \pm q_2 = \begin{bmatrix} \alpha_1 \\ \beta_1 \end{bmatrix} \pm \begin{bmatrix} \alpha_2 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} \alpha_1 \pm \alpha_2 \\ \beta_1 \pm \beta_2 \end{bmatrix} \tag{2.13}$$
The quaternion product is defined as
$$q_1 \circ q_2 = \begin{bmatrix} \alpha_1\alpha_2 - \beta_1^T\beta_2 \\ \alpha_1\beta_2 + \alpha_2\beta_1 + \beta_1^\times\beta_2 \end{bmatrix} \tag{2.14}$$
which can be written in matrix form as
$$q_1 \circ q_2 = Q_L(q_1)\, q_2 = Q_R(q_2)\, q_1 \tag{2.15}$$
where
$$Q_L(q) = \begin{bmatrix} \alpha & -\beta^T \\ \beta & \alpha I + \beta^\times \end{bmatrix} \tag{2.16}$$
and
$$Q_R(q) = \begin{bmatrix} \alpha & -\beta^T \\ \beta & \alpha I - \beta^\times \end{bmatrix} \tag{2.17}$$
The conjugate of a quaternion is
$$q^* = \begin{bmatrix} \alpha \\ -\beta \end{bmatrix} \tag{2.18}$$
The norm of a quaternion follows from the inner product of two quaternions
$$q_1 \cdot q_2 = q_1^T q_2 \tag{2.19}$$
and is given by
$$\|q\| = \sqrt{q \cdot q} = \sqrt{q \circ q^*} = \sqrt{q^T q} \tag{2.20}$$
The inverse of a quaternion is
$$q^{-1} = \frac{q^*}{\|q\|^2} \tag{2.21}$$
A unit quaternion can be written in the form
$$q = \begin{bmatrix} \cos(\theta/2) \\ \sin(\theta/2)\, k \end{bmatrix}$$
where $k$ is a unit vector. From this it can be seen that a unit quaternion can represent a rotation of an angle $\theta$ around an axis $k$, similarly to a rotation matrix $R$. The rotation matrix $R$ corresponding to the unit quaternion is then
$$R = I + 2\alpha\, \beta^\times + 2\, \beta^\times \beta^\times$$
The kinematic differential equation for the quaternion is
$$\dot{q} = \frac{1}{2}\, q \circ \omega \tag{2.25}$$
where $q$ is the quaternion representing the rotation from the spatial frame to the body frame, and $\omega$ is the angular velocity in the body frame, treated as a quaternion with zero scalar part. The corresponding integration scheme based on increments in angular velocity becomes
$$q_{n+1} = q_n \circ \exp\left(\frac{\omega_n\, dt}{2}\right)$$
Chapter 3
Method
3.1 Overview
In this chapter, the method that this study builds on [2] is presented in detail.

A common model for the angular velocity measured by a gyroscope is
$$\omega^{\text{IMU}} = \omega + b + n_1 \tag{3.1}$$
where $\omega^{\text{IMU}}$ is the angular velocity measured by the IMU, $\omega$ is the true angular velocity, $b$ is the bias and $n_1$ is zero-mean white noise. The bias can be described by the Wiener process
$$\dot{b} = n_2 \tag{3.2}$$
where n2 is zero-mean white noise. This is a simple model, and is well suited
for use in attitude estimation filters such as the Multiplicative Extended Kalman
Filter [20]. The method proposed by Brossard et al. [2] includes a more complex
measurement model than this, since they also account for calibration of the IMU.
They also include accelerations measured by the IMU in the measurement model.
The expression in discrete form becomes
$$u_n^{\text{IMU}} = \begin{bmatrix} \omega_n^{\text{IMU}} \\ a_n^{\text{IMU}} \end{bmatrix} = C \begin{bmatrix} \omega_n \\ a_n \end{bmatrix} + b_n + n_n \in \mathbb{R}^6 \tag{3.3}$$
is also optimized during training to negate the effects $C$ has on the measurements in equation (3.3). The loss is calculated based on orientations and not gyro-rates because accurate ground truth angular velocities cannot be obtained, whereas accurate ground truth orientations can [2].
Figure 3.1: Training process for the CNN. Gyro-corrections are computed in
the CNN using measurements of the angular velocities and accelerations from the
IMU as inputs. The gyro-corrections are combined with the angular velocities
from the IMU to produce angular velocities without noise. The angular velocities
are then integrated to orientations in open-loop. These estimated orientations are
compared to the ground truth orientation in order to calculate a value for the
loss. The loss is then used to train the CNN. Figure initially presented in [24].
The corrected angular velocities are computed as
$$\hat{\omega}_n = \hat{C}_\omega\, \omega_n^{\text{IMU}} + \tilde{\omega}_n \tag{3.6}$$
$$\tilde{\omega}_n = \hat{c}_n + \hat{b} \tag{3.7}$$
where ĉn captures time-varying bias and noise, and b̂ captures static bias.
A dilated convolution of an input sequence $x$ with a filter $f$ is defined as
$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \tag{3.8}$$
Here f : {0, ..., k − 1} → R is the filter, k is the kernel size of the filter, d is the
dilation factor and s is the element of the sequence the convolution is operating
on. It is important that the convolutions in the CNN are causal. This means
that each predicted output of the CNN is inferred only from past information
of the inputs and not future information. In order to achieve this in the CNN
architecture, zero-padding is used at the start of the sequence. Zero-padding is implemented by simply adding the required number of zeros to the start of the input vector. The number of zeros needed is computed from the kernel size and dilation of the current layer as
$$\text{padding} = d \cdot (k - 1) \tag{3.9}$$
Figure 3.2 illustrates the use of padding and dilated convolutions and the effect this has on the dimensions of the output vector. As can be seen in the figure, the use of dilations increases the receptive field of the CNN. It is common to increase the dilation exponentially for layers deeper in the network, which quickly gives a large receptive field and ensures that each input to the CNN is included in the computation of the output. From the figure it is also clear that the use of padding is essential to ensure that the CNN is causal. A minimal sketch of such a causal dilated convolution is given below.
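The following is an illustrative PyTorch sketch of a causal dilated convolution using the left zero-padding of equation (3.9). The layer sizes are chosen for illustration and are not those used in [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution made causal by left zero-padding, equation (3.9)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = dilation * (kernel_size - 1)   # padding = d * (k - 1)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))      # zeros added only at the start
        return self.conv(x)

# Example: 6 IMU channels in, 16 feature channels out; the output keeps
# the input length, and no output depends on future time steps.
x = torch.randn(1, 6, 100)
y = CausalConv1d(6, 16, kernel_size=7, dilation=3)(x)   # y.shape == (1, 16, 100)
```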
Figure 3.2: Dilated convolutions. The figure shows how the dilation gap in the convolutions produces a larger receptive field, allowing more values to contribute to calculating the current output. In the first layer the receptive field is 3; in the second layer it has increased to 5. Zero-padding is added to the start of the sequence for each layer to ensure that the convolutions are causal, meaning that no information from future time steps is included in calculating the current time step.
The total number of past inputs that can influence the current output is called the receptive field of the CNN. The receptive field is thus a parameter which can be tuned for the best possible results. It was noted by [2] that keeping the receptive field relatively small mitigated the problem of overfitting to specific patterns in the training data, allowing the CNN to generalize better on unseen test data.
The kinematic equation used for integrating angular velocities to orientations,
equation (2.9), does not include accelerometer measurements in estimating the
orientations. Regardless of this, the CNN uses both angular velocities and accelerations as inputs.
We see that the accelerometer measurements contain some information about the angular velocity in this case. The variations in velocity are not assumed to be small in this method. Regardless, this relation gives some intuition that the accelerometer measurements contain information that can be utilized by the CNN. The relation between the gyro-corrections $\tilde{\omega}_n$ and the function learned by the CNN is
$$\tilde{\omega}_n = f\left(u_{n-N}^{\text{IMU}}, \dots, u_n^{\text{IMU}}\right) \tag{3.12}$$
where $u_{n-N}^{\text{IMU}}, \dots, u_n^{\text{IMU}}$ are the angular velocity and acceleration measurements from the IMU.
The activation function used in the CNN is the non-linear Gaussian Error Linear Unit (GELU) [9], which is applied after each convolution operation. It is defined as
$$\text{GELU}(x) = x\, \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
in real time applications. This should make the method valuable for use in applications where computational resources are scarce.
The datasets used to train the CNN are relatively small, with 45 000 samples
in the TUM VI [26] dataset and 90 000 samples in the EuRoC [3] dataset. The
number of trainable parameters is therefore similar to the number of samples in
the training data. It is therefore necessary to use techniques to avoid overfitting
to the training data. For this reason, dropout and weight decay are utilized during training. Dropout works by giving each channel of a layer in the network a probability p of being set to zero during training. For this implementation the value of p is set to 0.1. This has the effect of limiting co-adaptations between the channels in the network [10], making each unit more independent of the activations of other units in the network. This should make each unit contribute more to making correct predictions from the CNN.
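As a sketch, channel-wise dropout with p = 0.1 and weight decay can be set up in PyTorch as follows. Note that nn.Dropout1d requires a recent PyTorch version, and the weight decay value shown is an arbitrary placeholder, not the value used in the experiments.

```python
import torch
import torch.nn as nn

drop = nn.Dropout1d(p=0.1)      # zeroes whole channels with probability 0.1
x = torch.randn(8, 16, 100)     # (batch, channels, time steps)
y = drop(x)                     # dropout is only active in training mode

# Weight decay is passed to the optimizer; the value below is an
# illustrative placeholder.
model = nn.Conv1d(6, 16, kernel_size=7)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-1)
```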
systems for angular velocities usually have a frequency of 20–120 Hz [2]. Due to this, the loss is instead based on the estimated orientations calculated using equation (2.9). To reduce the risk of overfitting to specific patterns in the training data, rotation increments of the form
$$\delta R_{i,i+j} = R_i^T R_{i+j} = \prod_{k=i}^{i+j-1} \exp\left((\omega_k\, dt)^\times\right) \tag{3.14}$$
are used instead of calculating the loss from the orientation at every time step. The increments are down-sampled so that the IMU frequency is reduced by a factor of j. Brossard et al. [2] stated that one advantage of using rotation increments of this form is that they are invariant to changes in the orientation and yaw-angle: left-multiplying both $R_i$ and $R_{i+j}$ by a fixed rotation $R$ does not change the value of $\delta R_{i,i+j}$. The loss function for these rotation increments is computed as
$$\mathcal{L}_j = \sum_i \rho\left(\log\left(\delta R_{i,i+j}\, \delta\hat{R}_{i,i+j}^T\right)\right) \tag{3.15}$$
where log(·) is the logarithmic map defined in equation (2.11) and ρ(·) is the
Huber loss function [12];
$$\rho(\epsilon) = \begin{cases} \frac{1}{2}\epsilon^2, & |\epsilon| \le \delta \\ \delta\left(|\epsilon| - \frac{1}{2}\delta\right), & \text{otherwise} \end{cases} \tag{3.16}$$
where ϵ is the error between the ground truth and estimated value. This function
is linear when the error is large and quadratic when the error is small. The Huber
loss function is visualized in Figure 3.5, where it is plotted together with the Mean
Square Error (MSE) and the Mean Absolute Error (MAE). Notice that the Huber
function overlaps the MSE for small values and the MAE for large values. This is a desirable property when dealing with datasets containing outliers in the ground truth data. If the MSE were used instead, outliers could have too much influence on the magnitude of the loss and negatively affect the training of the CNN.
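For illustration, the Huber function of equation (3.16) can be written in PyTorch as follows; this is a sketch, not the implementation used in the experiments.

```python
import torch

def huber(eps, delta=0.005):
    """Huber loss rho(eps) from equation (3.16)."""
    quad = 0.5 * eps**2                       # quadratic branch, |eps| <= delta
    lin = delta * (eps.abs() - 0.5 * delta)   # linear branch, otherwise
    return torch.where(eps.abs() <= delta, quad, lin)

# PyTorch's built-in torch.nn.HuberLoss(delta=0.005) implements the same
# piecewise form and can be used instead.
```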
For all experiments, a loss parameter $\delta = 0.005$ is chosen and the loss function used is
$$\mathcal{L} = \mathcal{L}_{16} + \mathcal{L}_{32} \tag{3.17}$$
Figure 3.5: Comparison of the Huber loss function with the Mean Square Error (MSE) and Mean Absolute Error (MAE). For large values the MAE and the Huber loss function coincide; for small values the MSE and the Huber loss function coincide. This characteristic is beneficial on datasets with outliers in the ground truth values. Figure initially presented in [24].
The increments used for computing the loss are thus down-sampled by factors of 16 and 32.
at once took 108 ms on an Nvidia GTX 950m GPU. This corresponds to 1.1 × 10−4 ms per computation, reducing the execution time by a factor of approximately 45 000. This illustrates the significant performance gains from parallelization.
Another measure to improve the execution time concerns the computation of the rotation increments in equation (3.14). Computing this equation directly requires j = 32 matrix multiplications if the loss function in equation (3.17) is used. Instead, Brossard et al. [2] viewed the equation as a tree of matrix multiplications, as shown in Figure 3.6. There it can be seen that it takes only 2 rounds of multiplications to sub-sample by a factor of 4. With this method, the necessary number of sequential computations in equation (3.14) is reduced from j = 32 to log2(j) = log2(32) = 5.
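The doubling idea can be sketched as follows, where dR holds the per-time-step increments as a batch of 3 × 3 matrices. This is an illustrative reimplementation, not the code of [2]:

```python
import torch

def subsample_increments(dR, j):
    """Combine per-time-step increments dR (N, 3, 3) into increments
    spanning j steps using log2(j) rounds of pairwise multiplication."""
    assert j > 0 and j & (j - 1) == 0, "j must be a power of two"
    while j > 1:
        n = (dR.shape[0] // 2) * 2    # drop a trailing unpaired increment
        dR = dR[0:n:2] @ dR[1:n:2]    # one batched round of the tree
        j //= 2
    return dR
```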
The execution time is further improved by calculating the ground truth rotation increments only once, before training, and storing them for use during training. This is possible since the same ground truth increments are used in the loss function every epoch.
in a different reference frame from the ground truth orientations. Zhang and Scaramuzza go into detail on this topic in [28], and the evaluation methods used in [2] are highlighted there. Note that [2] used the openVINS toolbox [7] implemented in the Robot Operating System (ROS). ROS is a framework used for applications related to robotics. Since ROS was not used in this project, the openVINS toolbox was not used for evaluation either. Therefore, the evaluation methods covered in this chapter were completely re-implemented in Python for use in this project.
Several different methods for evaluating a trajectory exist, and each method has its advantages and disadvantages. Two commonly used metrics are the absolute error and the relative error. These methods are detailed here. To achieve an informative evaluation, it is recommended to use both the absolute error and the relative error in order to account for the advantages and disadvantages of each method.
The estimated trajectory is first aligned with the ground truth using the rotation
$$R' = R\, \hat{R}^T \tag{3.18}$$
where $R$ is the ground truth orientation and $\hat{R}$ is the estimated orientation for the first time step. The estimated orientations can now be aligned with the ground truth trajectory by rotating the orientation for each time step using
$$\hat{R}'_n = R'\, \hat{R}_n \tag{3.19}$$
where $\hat{R}'_n$ is the aligned estimated orientation.
The Absolute Orientation Error (AOE) is defined as
$$\text{AOE} = \sqrt{\sum_{n=1}^{M} \frac{1}{M} \left\| \log\left(R_n^T \hat{R}_n\right) \right\|_2^2} \tag{3.20}$$
which is the Root Mean Square Error (RMSE) of the orientations. Here, $R_n$ is the ground truth orientation, $\hat{R}_n$ is the estimated orientation and $M$ is the total number of time steps. $\log(\cdot)$ is defined in equation (2.11).
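For illustration, the AOE can be computed with NumPy as follows, assuming the estimated orientations have already been aligned using equation (3.19). Names are chosen here for illustration.

```python
import numpy as np

def aoe(R_gt, R_est):
    """AOE, equation (3.20). Assumes R_est is already aligned via (3.19).
    R_gt, R_est: iterables of 3x3 rotation matrices."""
    sq_errs = []
    for R, R_hat in zip(R_gt, R_est):
        E = R.T @ R_hat
        # The 2-norm of log(E) equals the rotation angle of E.
        theta = np.arccos(np.clip((np.trace(E) - 1.0) / 2.0, -1.0, 1.0))
        sq_errs.append(theta**2)
    return np.sqrt(np.mean(sq_errs))
```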
One advantage of the AOE is that a single number quantifies the performance of an entire estimated trajectory. This makes the metric easy to use for comparing the quality of several trajectories. Secondly, the
AOE is relatively easy to implement. The main disadvantage of the AOE is that
the magnitude of the error is sensitive to when in the trajectory the error occurs.
Errors occurring earlier in a sequence will have a much greater impact on the
AOE than errors occurring later in the sequence. Zhang et al. [28] cited several
researchers observing this problem.
Then, the yaw-angle $\psi$ is retrieved from the error rotation $\tilde{R}_n$ by utilizing the relation between the rotation matrix and the corresponding roll-pitch-yaw angles $(\phi, \theta, \psi)$,
$$R_n = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} = \begin{bmatrix} c_\theta c_\psi & -c_\phi s_\psi + s_\phi s_\theta c_\psi & s_\phi s_\psi + c_\phi s_\theta c_\psi \\ c_\theta s_\psi & c_\phi c_\psi + s_\phi s_\theta s_\psi & -s_\phi c_\psi + c_\phi s_\theta s_\psi \\ -s_\theta & s_\phi c_\theta & c_\phi c_\theta \end{bmatrix} \tag{3.22}$$
where the notation $c_\theta, s_\theta, \dots$ is short for $\cos\theta, \sin\theta, \dots$. From this, we have that
$$\frac{r_{21}}{r_{11}} = \frac{c_\theta s_\psi}{c_\theta c_\psi} = \tan\psi \tag{3.23}$$
which means that the expression for retrieving the yaw-angle for each time step is given by
$$\psi_n = \tan^{-1}\left(\frac{r_{21}}{r_{11}}\right) \tag{3.24}$$
The Absolute Yaw Error (AYE) is now defined as the RMSE of the yaw errors,
$$\text{AYE} = \sqrt{\sum_{n=1}^{M} \frac{1}{M}\, \psi_n^2} \tag{3.25}$$
As is the case for the AOE, the magnitude of the AYE is also sensitive to when
an error occurs in the trajectory.
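An illustrative NumPy sketch of the AYE follows, assuming the error rotation is formed as the relative rotation between ground truth and estimate, and using atan2 in place of the plain ratio in (3.24) for numerical robustness:

```python
import numpy as np

def aye(R_gt, R_est):
    """AYE, equations (3.24)-(3.25), with the error rotation formed as
    the relative rotation between ground truth and estimate."""
    sq_errs = []
    for R, R_hat in zip(R_gt, R_est):
        E = R.T @ R_hat
        psi = np.arctan2(E[1, 0], E[0, 0])   # yaw of the error rotation
        sq_errs.append(psi**2)
    return np.sqrt(np.mean(sq_errs))
```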
Now use these increments to calculate the ROE for the given sub-trajectory as
$$\text{ROE} = \left\| \log\left(\delta R_{n,g(n)}^T\, \delta\hat{R}_{n,g(n)}\right) \right\|_2 \tag{3.28}$$
Next, take time step n + 1 and find the time step where the IMU has traveled
N meters from this time step, g(n + 1). Align the orientations and calculate the
orientation increments and the ROE for this sub-trajectory. Repeat this until the
ROE has been calculated for all sub-trajectories of this length.
The ROE is thus not a single number representing the performance of the al-
gorithm, as opposed to the AOE, but rather a collection of errors from all the
sub-trajectories. This collection of errors can now be used to compute statistics
such as the mean, median and percentiles, and the results can be visualized using
box-plots.
It is advantageous to calculate the ROE for sub-trajectories of several different
lengths N to get more informative metrics. This can be helpful to assess the qual-
ity of the estimate both over shorter and longer distances. The error over shorter
distances is related to local consistency while the error over longer distances is
related to long-term accuracy [28].
The main advantage of the ROE compared to the AOE is that the ROE is not
sensitive to the time an error occurs in the trajectory. The ROE also offers much
more flexibility in terms of the choice of sub-trajectory lengths used as well as
more informative statistics produced. The main disadvantage is that the ROE is
more complicated to implement than the AOE.
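A sketch of the ROE computation is given below, assuming ground truth positions are available for measuring the traveled distance; names and conventions are chosen here for illustration.

```python
import numpy as np

def roe_errors(R_gt, R_est, pos, length_m=7.0):
    """ROE, equation (3.28), over all sub-trajectories of length_m meters.
    pos: (M, 3) ground truth positions used to find g(n)."""
    steps = np.linalg.norm(np.diff(pos, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(steps)])
    errs = []
    for n in range(len(pos)):
        g = np.searchsorted(dist, dist[n] + length_m)  # first step length_m ahead
        if g >= len(pos):
            break
        dR = R_gt[n].T @ R_gt[g]                       # ground truth increment
        dR_hat = R_est[n].T @ R_est[g]                 # estimated increment
        E = dR.T @ dR_hat
        theta = np.arccos(np.clip((np.trace(E) - 1.0) / 2.0, -1.0, 1.0))
        errs.append(np.degrees(theta))
    return np.array(errs)  # mean, median, percentiles, box plots from this
```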
Chapter 4
Method development
4.1 Overview
This chapter describes further development of the method presented in Chapter 3.
To gain insight into how well the method of [2] and the newly developed methods
perform, a conventional filter [18] is presented for comparison. Then, a technique
of data augmentation based on virtual rotations of the IMUs in the datasets is
detailed. Lastly, two new CNN architectures are presented, which are used to
replace the CNN in the method of [2], one CNN based on ResNet [8] and another
CNN based on DenseNet [11].
The Absolute Inclination Error (AIE) is defined as
$$\text{AIE} = \sqrt{\sum_{n=1}^{M} \frac{1}{M} \left( 2\cos^{-1}\left(\sqrt{q_s^2 + q_3^2}\right) \right)^2} \tag{4.1}$$
This metric gives the error of the rotation excluding the error in the yaw-angle, and is calculated using the quaternion representation. Here, $q_s$ and $q_3$ are the scalar part and third vector component of the error quaternion at time step $n$. The metric is included since the Madgwick filter is expected to perform better with regards to the AIE than
the AOE or AYE when using only a gyroscope and accelerometer as inputs. The
evaluation of how such a filter performs when compared to the method of [2] is
meant to provide valuable insight into how well this method performs compared
to more traditional methods.
A virtual rotation matrix is constructed from roll-pitch-yaw angles drawn randomly within a specified range,
$$R = \begin{bmatrix} c_\theta c_\psi & -c_\phi s_\psi + s_\phi s_\theta c_\psi & s_\phi s_\psi + c_\phi s_\theta c_\psi \\ c_\theta s_\psi & c_\phi c_\psi + s_\phi s_\theta s_\psi & -s_\phi c_\psi + c_\phi s_\theta s_\psi \\ -s_\theta & s_\phi c_\theta & c_\phi c_\theta \end{bmatrix} \tag{4.2}$$
Now, the angular velocities, accelerations and ground truth orientations for all time steps in an entire sequence are rotated by this R, as sketched below.
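An illustrative sketch of this augmentation follows. The transformation convention shown (measurements transformed as ω → Rω, ground truth as R_n → R_n Rᵀ) is one choice that keeps the kinematics consistent; the exact convention used in the experiments may differ.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def virtually_rotate(omega, acc, R_gt, max_angle_deg=0.25):
    """Apply one random virtual rotation to an entire sequence.
    omega, acc: (N, 3) body-frame measurements; R_gt: (N, 3, 3)."""
    rpy = np.random.uniform(-max_angle_deg, max_angle_deg, size=3)
    R = Rotation.from_euler('xyz', rpy, degrees=True).as_matrix()  # eq. (4.2)
    omega_aug = omega @ R.T     # each row is transformed as R @ w
    acc_aug = acc @ R.T
    R_gt_aug = R_gt @ R.T       # R_n' = R_n R^T keeps dR'/dt = R' (R w)^x
    return omega_aug, acc_aug, R_gt_aug
```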
The channel dimensions change within each block. This means that the output of each block has a different channel dimension from the input. Hence, the input and the output can no longer be added together. To mitigate this problem, a 1 × 1 convolutional layer is used to scale up the channel dimension of the skip connections before the addition while maintaining an identity mapping, as shown in Figure 4.2. The scaling up of the channel dimension happens in the first convolutional layer of each block. Every layer of a block then uses the same dilation gap and has the same output channel dimension.
In order to get an output from the CNN that has the correct dimensions for the
gyro-corrections, ω̃ n ∈ R3 , a 1 × 1 convolution layer is applied after the final
CNN-block. This layer changes the dimension of the outputs to R3 and scales
down the channel dimension to 1.
Figure 4.2: Residual block. Each block contains two convolutional layers. The
skip-connections from the input contain a 1 × 1 convolutional layer to ensure
matching channel dimensions when the skip-connections are added to the outputs
of the convolutional layers.
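A minimal PyTorch sketch of such a residual block with causal dilated convolutions is given below; the layer sizes are illustrative, not those of the trained networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two causal dilated convolutions plus a 1x1-projected skip connection."""
    def __init__(self, in_ch, out_ch, kernel_size=7, dilation=1):
        super().__init__()
        self.pad = dilation * (kernel_size - 1)
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.skip = nn.Conv1d(in_ch, out_ch, 1)  # matches channel dimensions

    def forward(self, x):                          # x: (batch, channels, time)
        y = F.gelu(self.conv1(F.pad(x, (self.pad, 0))))
        y = F.gelu(self.conv2(F.pad(y, (self.pad, 0))))
        return y + self.skip(x)                    # add the projected input
```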
network in order to obtain the correct channel dimension. As for the Residual
network, a final 1×1 convolutional layer is applied to ensure the correct dimensions
of the output of the network.
Figure 4.3: Dense network architecture. Every layer in the network is connected
to every subsequent layer. Each layer in the network thus receives information
from every preceding layer.
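A minimal PyTorch sketch of this DenseNet-style connectivity follows, where each layer's output is concatenated with its input before being passed on; dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseLayer(nn.Module):
    """Causal dilated convolution whose output is concatenated with its
    input, so every later layer sees the features of every earlier layer."""
    def __init__(self, in_ch, growth, kernel_size=7, dilation=1):
        super().__init__()
        self.pad = dilation * (kernel_size - 1)
        self.conv = nn.Conv1d(in_ch, growth, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        y = F.gelu(self.conv(F.pad(x, (self.pad, 0))))
        return torch.cat([x, y], dim=1)            # pass everything onward
```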
Chapter 5
Experimental trials, results and discussion
5.1 Overview
This chapter starts with detailing the reproducibility of results when training
neural networks using PyTorch. Then a brief description of the datasets used in
the experiments follows. Experiments regarding the new methods presented in Chapter 4 are thereafter detailed: firstly, results from using the Madgwick filter are presented and compared to the method of [2]. Then an
analysis of the effects of data augmentation on the datasets follows. Lastly, the
use of new CNN architectures replacing the CNN in [2] is studied.
PyTorch uses only deterministic algorithms. These measures have been tested to ensure reproducible results between executions on the same GPU. Using different GPUs between executions of the same program did not produce reproducible results. Therefore, all experiments have been done on the same GPU, an Nvidia Tesla V100. For the specific seed values used in these experiments, the CNN model developed by Brossard et al. achieves slightly worse results here than those obtained in [2]. This detail is not considered to be important; it is much more important that the results are consistent between executions. This ensures that differences in performance observed when modifying the method of [2] are caused only by the modifications made, and not by the random behaviour of training the CNN. The PyTorch version used for all experiments is PyTorch 1.10.2, and the operating system used is CentOS Linux 8.2.2004.
device. For the EuRoC dataset, it is mounted on a rotor-driven MAV. Hence, the IMU measurements in the EuRoC dataset should contain more noise than those in the TUM VI dataset, originating from the rotors. The rotors should produce colored noise [2], which in theory is harder to estimate. The datasets are, however, similar in the movement patterns captured. Both datasets contain mostly translational movement patterns.
of [2] and the Madgwick filter. Note that the metrics provided in the table for the Madgwick filter are the values corresponding to the optimal tuning for each respective metric. In Figures 5.1 and 5.2 it can be seen that the optimal tuning parameter β differs between the AOE, AYE and AIE. It is evident that the Madgwick filter fails to estimate the 3D-orientation when considering the AOE for both datasets. On the EuRoC dataset a slight decrease in the error is observed compared to open loop integration. On the TUM VI dataset the Madgwick filter achieves the exact same error as open loop integration. The reason why the error decreases for the EuRoC dataset and not for the TUM VI dataset is most likely related to the fact that the IMU in TUM VI is calibrated while the IMU in EuRoC is not. Most likely the filter is able to attenuate some of the noise present in the EuRoC dataset, which leads to better results for the AOE compared to open loop integration.
The source of the poor estimates of the 3D-orientation is found in the results for the AYE. Surprisingly, open loop integration actually yielded better performance than the Madgwick filter on both datasets. Figures 5.1 and 5.2 show that the filter does not manage to provide good estimates of the yaw-angle, regardless of the value of β. These poor results stem from the fact that the IMU contains only a gyroscope and an accelerometer. Without another sensor such as a magnetometer, the only direction vector available for correcting the estimate is the direction of gravity measured by the accelerometer. The filter therefore has no observations that can be used to correct the yaw-angle, which leads to poor results with regards to the AYE and, ultimately, also the AOE.
Contrary to the results obtained from open loop integration and the Madgwick
filter, the CNN of [2] achieves very good results for the AOE and AYE, and the
error is reduced substantially on both datasets. The CNN has been trained by
minimizing the loss for the 3D-orientation. It is evident that good estimates of the
3D-orientation rely on good estimates of the yaw-angle as well, and that minimiz-
ing the loss for the orientation generally leads to good estimates of the yaw-angle.
As discussed, the Madgwick filter does not manage to estimate the yaw-angle since it is not able to utilize information from the gyroscopes or accelerometers to infer this angle. The results for the CNN prove that the gyroscope and accelerometer measurements actually contain some information about the yaw-angle. This demonstrates one of the advantages of using a CNN: the CNN is able to find relations in the IMU measurements that are not modeled by the kinematic equations. Some intuition for
how the accelerometer measurements relate to the angular velocities was provided
in equation (3.11), but this relation was derived by assuming negligible velocity
variations between time steps, which is an invalid assumption on these datasets.
The results for the CNN in Table 5.1 indicate that a similar relation exists also
when velocity variations are present between time steps.
The results for the AIE in Table 5.1, which captures the errors in the roll and pitch
angles, are more favourable for the Madgwick filter. Very similar performance between the CNN and the Madgwick filter is achieved on the TUM VI dataset, and slightly better results are achieved for the CNN on EuRoC. As expected, the Madgwick filter does a very good job of estimating the roll and pitch angles when compared to open loop integration. The CNN, however, manages to provide similar or better results for the AIE even though the CNN was trained to estimate the 3D-orientation. A surprising observation is the AIE for open loop integration on the TUM VI dataset, which achieves an error of 1.02 degrees. This indicates that very good estimates of the roll and pitch angles can be obtained simply by open loop integration from a well calibrated IMU.
In light of the results obtained from the Madgwick gradient descent filter, it
becomes evident just how successful the method of [2] is at estimating the 3D-
orientation. Since the method of [2] is based on open-loop integration of denoised gyro-rates, no corrections to the estimated orientation can be applied in real time. The performance is therefore completely dependent on the quality of the gyro-corrections from the CNN, so that drift of the estimated orientations away from the ground truth orientations is avoided. Despite this, the method of [2] manages to achieve excellent results on the TUM VI and EuRoC datasets.
          AOE (deg)                      AYE (deg)                      AIE (deg)
          open loop  CNN [2]  Madgwick   open loop  CNN [2]  Madgwick   open loop  CNN [2]  Madgwick
EuRoC     120.9      2.88     105.8      91.2       1.80     97.2       82.1       2.64     3.27
TUM VI    6.26       1.56     6.26       5.91       1.38     6.22       1.02       0.56     0.55
Table 5.1: Results comparing the Madgwick filter with the CNN from [2] and
open loop integration of angular velocities. The Madgwick filter does not perform
well with regards to the AOE and AYE. Good performance is achieved for the
filter with regards to the AIE. The CNN achieves excellent results for all metrics.
Figure 5.1: AOE, AYE and AIE on the EuRoC dataset using the Madgwick
filter. The optimal tuning of the filter obtained an AOE, AYE and AIE of 105.75,
98.22 and 3.27 degrees respectively. Hence the filter only obtained good results
for the AIE, representing the errors in the roll and pitch angles. The filter failed
to estimate the yaw-angle, and as a result, the full 3D-orientation.
Figure 5.2: AOE, AYE and AIE on the TUM VI dataset using the Madgwick
filter. The optimal tuning of the filter obtained an AOE, AYE and AIE of 6.26,
6.22 and 0.55 degrees respectively. As for the EuRoC dataset, the filter only
obtained good results for the AIE, representing the errors in the roll and pitch
angles. The filter failed to estimate the yaw-angle, and as a result, the full 3D-
orientation.
Figure 5.3: ROE on EuRoC using virtual rotation data augmentation. The best results for the ROE were obtained from the baseline configuration with Gaussian noise on the IMU measurements as data augmentation. Increasing the range of rotations when constructing the virtual rotation led to gradually worse results.
Table 5.2: Results on EuRoC using virtual rotation data augmentation. The best results for the AOE, AYE and mean ROE were obtained from the baseline configuration with Gaussian noise on IMU measurements as data augmentation. All metrics became gradually worse when increasing the range of rotations used for constructing the virtual rotations.
Figure 5.4: ROE on TUM VI using virtual rotation data augmentation. The
baseline configuration performed best without using Gaussian noise on the IMU
measurements as data augmentation. The best performance was obtained using a
slight virtual rotation range of ±0.25◦ . Using varying ranges of virtual rotations
did not consistently perform better than the baseline configuration.
Table 5.3: Results on TUM VI using virtual rotation data augmentation. The
best results for the AOE, AYE and mean ROE were obtained using both Gaussian
noise on the IMU measurements and a slight virtual rotation with a range of
±0.25◦ . Using different ranges of virtual rotation did, however, not consistently
improve the performance of the CNN.
5.6.4 Discussion
Using virtual rotations as a means of data augmentation was not found to consis-
tently improve the performance of the CNN in [2] for either of the two datasets.
On the EuRoC dataset the best results were achieved by using the exact method
in [2]. On the TUM VI dataset, better results were obtained by using a slight
virtual rotation with angles within ±0.25◦ as data augmentation for each epoch.
Using other rotation ranges did, however, not consistently improve the results.
On the EuRoC dataset, using too large a rotation range resulted in substantially worse performance.
It was not expected that the specified range in values for the roll, pitch and yaw
angles would affect the performance as much as it did in these experiments, and
that using a large range for the angles would degrade the performance as much as
it did for the EuRoC dataset. Most likely this happened because the datasets now
appeared to be too diverse. This could have made the CNN put more emphasis
on patterns in the datasets that were only introduced as a result of this data
augmentation technique. Hence, there is a chance that the CNN generalized on
too diverse patterns in the training data, and that the training data no longer
gave a good enough representation of what the IMU measurements used as inputs should look like, leading to the CNN performing worse on the unseen
test data. The reason why data augmentation using virtual rotations performed
worse on the EuRoC dataset than on the TUM VI dataset is also most likely
related to this problem. Since the IMU measurements in the EuRoC dataset are
uncalibrated, and since colored noise is present from the rotors of the MAV, this
dataset potentially contains more diverse characteristics than the TUM VI dataset
without the use of data augmentation.
In conclusion, it was not possible to obtain better results on the EuRoC dataset
using virtual rotation data augmentation. On the TUM VI dataset, better results
were obtained using some of the combinations of Gaussian noise and virtual rota-
tions as data augmentations. Better results were, however, only achieved for some
of the specified ranges of values for the roll, pitch and yaw, and not consistently
over several different rotation ranges. It is therefore concluded that the method
of using virtual rotations as data augmentation is not very effective, and hence it
is not relied on for any of the further experiments.
It was not possible to find a CNN configuration that performed better than the CNN of [2], in this section referred to as baseline, for both the EuRoC and TUM VI datasets simultaneously. It was, however, possible to find two different architectures that performed better than baseline for each of the two datasets. An architecture based on the Dense CNN performed best on the EuRoC dataset. The structure is summarized in Table 5.4. An architecture based on the Residual CNN performed best on the TUM VI dataset, and is summarized in Table 5.5. Both
networks consist of 6 blocks. The receptive fields of the Dense and Residual net-
works are 1701 and 7168 samples respectively. This corresponds to 8.5 s and 35.8
s of past IMU measurements respectively. Note that the channel dimension for
both networks is the same for blocks 5 and 6. When the channel dimension of the
last block was increased to 512, following the pattern of doubling the dimension
per block, the networks required too much time to train. Therefore, the channel
dimension of block 6 was limited to 256 for both architectures.
CNN block            1     2     3     4      5      6      final layer
kernel dimension     7     7     7     7      7      7      1
dilation gap         1     3     9     27     81     243    1
channel dimension    16    32    64    128    256    256    1
Table 5.4: Dense network architecture details. The receptive field of this network is N = max(kernel dimension × dilation gap) = 7 × 243 = 1701 samples, corresponding to 8.5 s of past IMU measurements.
Table 5.5: Residual network architecture details. The receptive field of this network is N = max(kernel dimension × dilation gap) = 7 × 1024 = 7168 samples, corresponding to 35.8 s of past IMU measurements.
The AOE and AYE as well as the mean ROE in deg/m are displayed for each architecture over all sequences in each of the datasets. The results for each network on all sequences are included to provide a more detailed comparison than comparing only the mean values.
Note that the results for the AOE and AYE reported in [2] are better than the
results achieved in these experiments for the method developed in [2], here referred
to as baseline, as well as the results for the new CNN architectures. For the
baseline architecture Brossard et al. [2] achieved an AOE/AYE of 2.10◦ /0.96◦
and 1.28◦ /0.82◦ for the EuRoC and TUM VI datasets respectively. When the
same architecture was evaluated in this project, an AOE/AYE of 2.88◦ /1.80◦ and
1.56◦/1.38◦ was achieved for the same datasets. The reason for this discrepancy is explained in Section 5.2. The procedure regarding reproducibility described in that section was used for these experiments. This ensures that the results obtained from these experiments are comparable with each other, and that any improvements in the results are caused solely by the new methods implemented. For this reason, the results reported in Table 5.6 are not directly comparable to the results reported in [2].
Improvements over baseline were observed with regards to the mean AOE, AYE and ROE when using the Dense network. The most significant performance gains of the Dense network were seen for the ROEs over 35 meters traveled, where the mean ROE decreased by 0.47 degrees compared to baseline. This indicates that the Dense network has a greater long-term accuracy while maintaining a good local consistency. This increase in long-term accuracy is most likely related to the large receptive field of the network. With the receptive field of the Dense network covering 35.8 s of past IMU measurements, it covers almost all measurements over a distance of 35 m for the velocities the IMU has in both datasets. This means that the Dense network can combine more information from past IMU measurements to infer the current gyro-correction.
Table 5.6: Results using different CNN architectures. The Residual network
performed better than baseline on all sequences on the TUM VI dataset for all
metrics except the AOE on room 4. The Dense network performed better than
baseline on the EuRoC dataset on three of the five sequences used, and obtained
a better result for the AOE, AYE and mean ROE than baseline on average.
tested on the BROAD dataset, but it was still not possible to produce acceptable results. Producing good results on the BROAD dataset therefore remains an unsolved issue.
Despite the fact that it was not possible to produce good results on the BROAD dataset, it was possible to improve the method of [2] by using a different CNN architecture and tuning the network to each individual dataset. Slight improvements were made on the EuRoC dataset and more substantial improvements were obtained for the TUM VI dataset. These improvements come at the cost of increased complexity of the CNN model, making the new models require greater computational resources. While the CNN model in [2] took about 3 minutes to train on a V100 GPU, the CNN models presented in this chapter took about 15–20 minutes to train. Therefore, the newly developed CNN models should be considered on systems where computational resources are in abundance, whereas the CNN model of [2] should be preferred on systems where computational resources are scarce. Furthermore, a CNN individually tuned for the specific dataset or IMU in use is preferred.
Chapter 6
Conclusion
In this thesis, the performance of the method proposed in [2] using a CNN to
estimate the orientation has been compared to the performance of the Madgwick
orientation filter [18] on the TUM VI and EuRoC datasets. Compared to this
filter, the method of [2] is extremely successful. The Madgwick filter fails to estimate the 3D-orientation since it cannot estimate the yaw-angle from an IMU containing only a gyroscope and an accelerometer. In contrast, the method of [2] achieves good estimates both for the yaw-angle and the 3D-orientation.
A new method of data augmentation was used together with the method of [2] to
try to increase the performance of the CNN. A thorough analysis of virtually rotating the IMU to create a more diverse dataset was performed. It was not possible to produce better results with this method on the EuRoC dataset. On the TUM VI dataset better results were achieved by applying very slight virtual rotations to the IMU. The results on the TUM VI dataset were, however, not consistent across different amounts of virtual rotation, and were therefore not satisfactory.
In conclusion, using virtual rotations for data augmentation did not improve the
performance of the method from [2].
Replacing the CNN in [2] with new CNN architectures increased the performance
of the method. Using a Dense neural network yielded slight improvements on the
EuRoC dataset. Using a Residual neural network yielded good improvements on
the TUM VI dataset. The combination of a larger receptive field and a more
complex CNN-model individually tuned for each dataset proved to be successful.
References
[1] S. Bai, J. Z. Kolter, and V. Koltun. “An empirical evaluation of generic con-
volutional and recurrent networks for sequence modeling”. In: arXiv preprint
arXiv:1803.01271 (2018).
[2] M. Brossard, S. Bonnabel, and A. Barrau. “Denoising imu gyroscopes with
deep learning for open-loop attitude estimation”. In: IEEE Robotics and
Automation Letters 5.3 (2020), pp. 4796–4803.
[3] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W.
Achtelik, and R. Siegwart. “The EuRoC micro aerial vehicle datasets”. In:
The International Journal of Robotics Research 35.10 (2016), pp. 1157–1163.
[4] O. Egeland. Quaternions, attitude estimation and SLAM. NTNU, 2021.
[5] O. Egeland. Robot Vision. NTNU, 2021.
[6] R. L. Farrenkopf. “Analytic steady-state accuracy solutions for two common
spacecraft attitude estimators”. In: Journal of Guidance and Control 1.4
(1978), pp. 282–284.
[7] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang. “Openvins: A re-
search platform for visual-inertial estimation”. In: 2020 IEEE International
Conference on Robotics and Automation (ICRA). IEEE. 2020, pp. 4666–
4672.
[8] K. He, X. Zhang, S. Ren, and J. Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2016, pp. 770–778.
[9] D. Hendrycks and K. Gimpel. “Gaussian error linear units (gelus)”. In: arXiv
preprint arXiv:1606.08415 (2016).
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhut-
dinov. “Improving neural networks by preventing co-adaptation of feature
detectors”. In: arXiv preprint arXiv:1207.0580 (2012).
[11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. “Densely
connected convolutional networks”. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. 2017, pp. 4700–4708.