
DROID-SLAM: Supplementary Material

A Additional Results

Method            MH01   MH02   MH03   MH04   MH05   V101   V102   V103   V201   V202   V203   Avg
D3VO + DSO [6]    -      -      0.08   -      0.09   -      -      0.11   -      0.05   -      -
ORB-SLAM2 [4]     0.035  0.018  0.028  0.119  0.060  0.035  0.020  0.048  0.037  0.035  -      -
VINS-Fusion [5]   0.540  0.460  0.330  0.780  0.500  0.550  0.230  -      0.230  0.200  -      -
SVO [3]           0.040  0.070  0.270  0.170  0.120  0.040  0.040  0.070  0.050  0.090  0.790  0.159
ORB-SLAM3 [2]     0.029  0.019  0.024  0.085  0.052  0.035  0.025  0.061  0.041  0.028  0.521  0.084
Ours              0.015  0.013  0.035  0.048  0.040  0.037  0.011  0.020  0.018  0.015  0.017  0.024

Table 1: Stereo SLAM on the EuRoC datasets, ATE [m].

We provide stereo results on the EuRoC dataset [1] in Tab. 1 using our network trained on synthetic, monocular video. In the stereo setting, the trajectory of the camera can be recovered with metric scale. Compared to ORB-SLAM3 [2] we reduce the average ATE by 71%.

B Ablations

[Figure 1 panels: Keyframe Image, Keyframe Depth, Optical Flow, X-Confidence, Y-Confidence]

Figure 1: Visualizations of keyframe image, depth, flow and confidence estimates.

[Figure 2 plots: success rate (% of runs) vs. ATE [m]; left legend: Monocular (local), Monocular (full), Stereo (local), Stereo (full); right legend: 1, 2, 3, 5, 8 Keyframes]

Figure 2: (Left) Performance of the system with different inputs (monocular vs. stereo) and with or without global optimization in addition to local BA (local vs. full). (Right) Tracking accuracy as a function of the number of keyframes. We use 5 keyframes (bold) in our experiments.

Ablations  We ablate various design choices regarding our SLAM system and network architecture. Ablations are performed on our validation split of the TartanAir dataset. In Fig. 1 we show visualizations on the validation set of keyframe depth estimates alongside optical flow and associated confidence weights.

In Fig. 2 (left) we show how the system benefits from both stereo video and global optimization. Although our network is only trained on monocular video, it can readily leverage stereo frames if available. In Fig. 2 (right) we show how the number of keyframes affects odometry performance.

In Fig. 3 we ablate components of the network architecture. Fig. 3 (left) shows the impact of using global context in the GRU through spatial pooling, while Fig. 3 (right) demonstrates the importance of the differentiable bundle adjustment (DBA) layer.

[Figure 3 plots: success rate (% of runs) vs. ATE [m]; left legend: No Global Pooling, Global Pooling; right legend: RAFT + BA, Ours]

Figure 3: (Left) Impact of global context in the update operator. (Right) Impact of using the bundle adjustment layer during training vs. training directly on optical flow, then applying BA at test time.

Training with the DBA layer, as opposed to training directly on optical flow and applying BA only at inference, is critical: we find that the SLAM system is unstable and prone to failure if the DBA is not used during training.

C Camera Model and Jacobians


We represent 3D points using homogeneous coordinates $X = (X, Y, Z, W)^T$. An image point $p$ with inverse depth $d$ is re-projected from frame $i$ into frame $j$ according to the warping function
$$p' = \Pi_c\big(G_{ij} \cdot \Pi_c^{-1}(p, d)\big), \qquad G_{ij} = G_j \circ G_i^{-1} \tag{1}$$
where $\Pi_c$ is the pinhole projection function, and $\Pi_c^{-1}$ is the inverse projection
$$\Pi_c(X) = \begin{pmatrix} f_x \frac{X}{Z} + c_x \\[2pt] f_y \frac{Y}{Z} + c_y \end{pmatrix}, \qquad \Pi_c^{-1}(p, d) = \begin{pmatrix} \frac{p_x - c_x}{f_x} \\[2pt] \frac{p_y - c_y}{f_y} \\[2pt] 1 \\ d \end{pmatrix}, \tag{2}$$
given camera intrinsic parameters $c = (f_x, f_y, c_x, c_y)$.
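
As a concrete illustration of Eqs. (1)-(2), the following NumPy sketch implements the projection, inverse projection, and warping function. The function names (`project`, `iproject`, `warp`) and the use of an explicit 4x4 homogeneous matrix for $G_{ij}$ are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def project(X, intrinsics):
    """Pinhole projection Pi_c: homogeneous point (X, Y, Z, W) -> pixel (Eq. 2)."""
    fx, fy, cx, cy = intrinsics
    X_, Y_, Z_, _ = X
    return np.array([fx * X_ / Z_ + cx, fy * Y_ / Z_ + cy])

def iproject(p, d, intrinsics):
    """Inverse projection Pi_c^{-1}: pixel p with inverse depth d -> homogeneous point (Eq. 2)."""
    fx, fy, cx, cy = intrinsics
    px, py = p
    return np.array([(px - cx) / fx, (py - cy) / fy, 1.0, d])

def warp(p, d, G_ij, intrinsics):
    """Warping function (Eq. 1): reproject pixel p from frame i into frame j.

    G_ij is the relative pose G_j o G_i^{-1}, given here as a 4x4 homogeneous matrix.
    """
    X = iproject(p, d, intrinsics)
    return project(G_ij @ X, intrinsics)
```

Note that because the homogeneous point carries $W = d$, points at infinity ($d = 0$) are handled without special cases.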
For optimization, we need the Jacobians with respect to $G_i$, $G_j$, and $d$. We use the local parameterizations $e^{\xi_i} G_i$ and $e^{\xi_j} G_j$ and treat $d$ as a vector in $\mathbb{R}^1$. The Jacobians of the projection and inverse projection functions are given as
$$\frac{\partial \Pi_c(X)}{\partial X} = \begin{pmatrix} \frac{f_x}{Z} & 0 & -f_x \frac{X}{Z^2} & 0 \\[2pt] 0 & \frac{f_y}{Z} & -f_y \frac{Y}{Z^2} & 0 \end{pmatrix}, \qquad \frac{\partial \Pi_c^{-1}(p, d)}{\partial d} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}. \tag{3}$$
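A minimal sketch of the two Jacobians in Eq. (3), again with hypothetical helper names and plain NumPy rather than the authors' code:

```python
import numpy as np

def proj_jacobian(X, intrinsics):
    """Jacobian of the pinhole projection w.r.t. the homogeneous point X (Eq. 3), shape (2, 4)."""
    fx, fy, _, _ = intrinsics
    X_, Y_, Z_, _ = X
    return np.array([
        [fx / Z_, 0.0, -fx * X_ / Z_**2, 0.0],
        [0.0, fy / Z_, -fy * Y_ / Z_**2, 0.0],
    ])

def iproj_jacobian_d():
    """Jacobian of the inverse projection w.r.t. inverse depth d (Eq. 3), shape (4,)."""
    return np.array([0.0, 0.0, 0.0, 1.0])
```

Both expressions can be checked numerically against finite differences of the `project` / `iproject` sketches above.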
Using the local parameterization, we compute the Jacobian of the 3D point transformation
$$X' = \mathrm{Exp}(\xi_j) \cdot G_j \cdot \big(\mathrm{Exp}(\xi_i) \cdot G_i\big)^{-1} \cdot X = \mathrm{Exp}(\xi_j) \cdot G_j \cdot G_i^{-1} \cdot \mathrm{Exp}(-\xi_i) \cdot X \tag{4}$$
using the adjoint operator to move the $\xi_i$ term to the front of the expression
$$X' = \mathrm{Exp}(\xi_j) \cdot \mathrm{Exp}\big(-\mathrm{Adj}_{G_j G_i^{-1}}\, \xi_i\big) \cdot G_j \cdot G_i^{-1} \cdot X \tag{5}$$

allowing us to compute the Jacobians using the generators
$$\frac{\partial X'}{\partial \xi_j} = \begin{pmatrix} W' & 0 & 0 & 0 & Z' & -Y' \\ 0 & W' & 0 & -Z' & 0 & X' \\ 0 & 0 & W' & Y' & -X' & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \tag{6}$$
$$\frac{\partial X'}{\partial \xi_i} = -\begin{pmatrix} W' & 0 & 0 & 0 & Z' & -Y' \\ 0 & W' & 0 & -Z' & 0 & X' \\ 0 & 0 & W' & Y' & -X' & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \mathrm{Adj}_{G_j G_i^{-1}} \tag{7}$$
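
The generator Jacobians in Eqs. (6)-(7) are easy to write down once a twist convention is fixed. The sketch below assumes the ordering $\xi = (\text{translation}, \text{rotation})$ implied by Eq. (6); the helper names are hypothetical.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix [v]_x such that hat(v) @ w == np.cross(v, w)."""
    return np.array([
        [0.0, -v[2], v[1]],
        [v[2], 0.0, -v[0]],
        [-v[1], v[0], 0.0],
    ])

def point_generator_jacobian(Xp):
    """dX'/dxi_j (Eq. 6) for a homogeneous point X' = (X', Y', Z', W'), shape (4, 6)."""
    X_, Y_, Z_, W_ = Xp
    J = np.zeros((4, 6))
    J[:3, :3] = W_ * np.eye(3)          # translation columns
    J[:3, 3:] = -hat([X_, Y_, Z_])      # rotation columns: d(omega x v)/d(omega)
    return J

def adjoint(G):
    """6x6 adjoint of a 4x4 rigid transform G = [[R, t], [0, 1]] (same twist convention)."""
    R, t = G[:3, :3], G[:3, 3]
    A = np.zeros((6, 6))
    A[:3, :3] = R
    A[:3, 3:] = hat(t) @ R
    A[3:, 3:] = R
    return A

def point_jacobians(Xp, G_ij):
    """Jacobians of the transformed point w.r.t. xi_j and xi_i (Eqs. 6-7)."""
    J = point_generator_jacobian(Xp)
    return J, -J @ adjoint(G_ij)
```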

Using the chain rule, we can compute the full Jacobians with respect to the variables
$$\frac{\partial p'}{\partial \xi_j} = \frac{\partial \Pi_c(X')}{\partial X'} \frac{\partial X'}{\partial \xi_j}, \qquad \frac{\partial p'}{\partial \xi_i} = \frac{\partial \Pi_c(X')}{\partial X'} \frac{\partial X'}{\partial \xi_i} \tag{8}$$
$$\frac{\partial p'}{\partial d} = \frac{\partial \Pi_c(X')}{\partial X'} \frac{\partial X'}{\partial X} \frac{\partial \Pi_c^{-1}(p, d)}{\partial d} = \frac{\partial \Pi_c(X')}{\partial X'} \begin{pmatrix} t_x \\ t_y \\ t_z \\ 1 \end{pmatrix} \tag{9}$$
where $(t_x, t_y, t_z)$ is the translation vector of $G_j \circ G_i^{-1}$.
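
Putting the pieces together, here is a sketch of the chain-rule assembly in Eqs. (8)-(9), reusing the hypothetical helpers (`iproject`, `proj_jacobian`, `point_jacobians`) from the previous sketches:

```python
import numpy as np

def reprojection_jacobians(p, d, G_ij, intrinsics):
    """Full Jacobians of the reprojected pixel p' w.r.t. xi_j, xi_i and d (Eqs. 8-9).

    G_ij = G_j o G_i^{-1}, given as a 4x4 homogeneous matrix.
    """
    X = iproject(p, d, intrinsics)       # Pi_c^{-1}(p, d)
    Xp = G_ij @ X                        # transformed homogeneous point X'
    Jpi = proj_jacobian(Xp, intrinsics)  # dPi_c(X')/dX', shape (2, 4)

    dX_dxi_j, dX_dxi_i = point_jacobians(Xp, G_ij)
    J_xi_j = Jpi @ dX_dxi_j              # Eq. (8), shape (2, 6)
    J_xi_i = Jpi @ dX_dxi_i              # Eq. (8), shape (2, 6)

    # Eq. (9): dX'/dd = G_ij @ [0, 0, 0, 1]^T = (t_x, t_y, t_z, 1)
    J_d = Jpi @ np.append(G_ij[:3, 3], 1.0)   # shape (2,)
    return J_xi_j, J_xi_i, J_d
```

These per-point Jacobians are exactly the blocks that enter the linearized system solved by the differentiable bundle adjustment layer.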

D Network Architecture

[Figure 4 diagram; block labels: Conv7x7 (64), ResBlock (64), ResBlock (128) x2, ResBlock (256) x3, Conv3x3 (D)]

Figure 4: Architecture of the feature and context encoders. Both extract features at 1/8 the input image resolution using a set of 6 basic residual blocks. Instance normalization is used in the feature encoder; no normalization is used in the context encoder. The feature encoder outputs features with dimension D=128 while the context encoder outputs features with dimension D=256.
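
For illustration, a PyTorch sketch matching the caption's description (a 7x7 stem, six basic residual blocks, and a 3x3 output convolution producing features at 1/8 resolution). The exact channel schedule and stride placement are not recoverable from the figure here; treat them as assumptions rather than the released architecture.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic residual block; stride 2 downsamples, norm_fn selects the normalization layer."""
    def __init__(self, in_ch, out_ch, stride=1, norm_fn=nn.Identity):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.norm1 = norm_fn(out_ch)
        self.norm2 = norm_fn(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                     if (stride != 1 or in_ch != out_ch) else nn.Identity())

    def forward(self, x):
        y = self.relu(self.norm1(self.conv1(x)))
        y = self.norm2(self.conv2(y))
        return self.relu(y + self.down(x))

def make_encoder(D, norm_fn):
    """Encoder sketch: 7x7 stem + 6 residual blocks + 3x3 output conv, 1/8 resolution.

    The 64 -> 128 -> 256 channel schedule and stride placement are assumptions."""
    return nn.Sequential(
        nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
        ResBlock(64, 64, norm_fn=norm_fn),
        ResBlock(64, 128, stride=2, norm_fn=norm_fn),
        ResBlock(128, 128, norm_fn=norm_fn),
        ResBlock(128, 256, stride=2, norm_fn=norm_fn),
        ResBlock(256, 256, norm_fn=norm_fn),
        ResBlock(256, 256, norm_fn=norm_fn),
        nn.Conv2d(256, D, 3, padding=1),
    )

# Feature encoder (instance norm, D=128) and context encoder (no norm, D=256), per the caption.
feature_encoder = make_encoder(128, nn.InstanceNorm2d)
context_encoder = make_encoder(256, nn.Identity)
```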

[Figure 5 diagram; block labels: Context, Corr, and Flow input branches, 3x3 ConvGRU (128), Conv7x7 (128), Conv3x3 (128), Conv3x3 (64), Conv3x3 (2), Sigmoid]

Figure 5: Architecture of the update operator. During each iteration, context, correlation, and flow
features get injected into the GRU. The revision (r) and confidence weights (w) are predicted from
the updated hidden state.
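
A standard 3x3 ConvGRU cell of the kind described in the caption is sketched below. The concatenated input dimension and the widths of the revision and weight heads are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """3x3 ConvGRU cell with hidden_dim=128; the input x is the concatenation of
    context, correlation, and flow features (input_dim is an assumed value)."""
    def __init__(self, hidden_dim=128, input_dim=128 + 128 + 64):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                       # update gate
        r = torch.sigmoid(self.convr(hx))                       # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q                              # updated hidden state

# Heads predicting the flow revision (r) and confidence weights (w) from the hidden state.
# The 2-channel outputs and the sigmoid on the weights follow Figure 5; layer widths are assumed.
revision_head = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
                              nn.Conv2d(128, 2, 3, padding=1))
weight_head = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
                            nn.Conv2d(128, 2, 3, padding=1), nn.Sigmoid())
```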

References
[1] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157-1163, 2016.
[2] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. arXiv preprint arXiv:2007.11898, 2020.
[3] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249-265, 2016.
[4] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255-1262, 2017.
[5] T. Qin and S. Shen. Online temporal calibration for monocular visual-inertial systems. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3662-3669. IEEE, 2018.
[6] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1281-1292, 2020.
