DROID-SLAM Supplemental
A Additional Results
Table 1: Absolute trajectory error (ATE, m) on EuRoC stereo sequences; dashes denote unavailable results.

Method            MH01   MH02   MH03   MH04   MH05   V101   V102   V103   V201   V202   V203   Avg
D3VO + DSO [6]    -      -      0.08   -      0.09   -      -      0.11   -      0.05   -      -
ORB-SLAM2 [4]     0.035  0.018  0.028  0.119  0.060  0.035  0.020  0.048  0.037  0.035  -      -
VINS-Fusion [5]   0.540  0.460  0.330  0.780  0.500  0.550  0.230  -      0.230  0.200  -      -
SVO [3]           0.040  0.070  0.270  0.170  0.120  0.040  0.040  0.070  0.050  0.090  0.790  0.159
ORB-SLAM3 [2]     0.029  0.019  0.024  0.085  0.052  0.035  0.025  0.061  0.041  0.028  0.521  0.084
Ours              0.015  0.013  0.035  0.048  0.040  0.037  0.011  0.020  0.018  0.015  0.017  0.024
We provide stereo results on the EuRoC dataset [1] in Tab. 1 using our network trained on synthetic, monocular video. In the stereo setting, the trajectory of the camera can be recovered at metric scale. Compared to ORB-SLAM3 [2], we reduce the average ATE by 71%.
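As a sanity check, the averages in Tab. 1 and the reported 71% reduction can be reproduced directly from the two complete per-sequence rows (a short script using the table values above; not part of the evaluation pipeline):

```python
# Per-sequence ATE (m) from Tab. 1 for the two rows without missing entries.
orb_slam3 = [0.029, 0.019, 0.024, 0.085, 0.052, 0.035, 0.025, 0.061, 0.041, 0.028, 0.521]
ours      = [0.015, 0.013, 0.035, 0.048, 0.040, 0.037, 0.011, 0.020, 0.018, 0.015, 0.017]

avg_orb = sum(orb_slam3) / len(orb_slam3)   # 0.084 after rounding
avg_ours = sum(ours) / len(ours)            # 0.024 after rounding
reduction = 1.0 - avg_ours / avg_orb        # relative improvement

print(round(avg_orb, 3), round(avg_ours, 3), round(100 * reduction))  # 0.084 0.024 71
```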
B Ablations
[Two plots of fraction of runs vs. ATE [m]. Left legend: Monocular (local), Monocular (full), Stereo (local), Stereo (full). Right legend: 1, 2, 3, 5, 8 keyframes.]
Figure 2: (Left) We show the performance of the system with different inputs (monocular vs. stereo) and whether global optimization is performed in addition to local BA (local vs. full). (Right) Tracking accuracy as a function of the number of keyframes. We use 5 keyframes (bold) in our experiments.
Ablations We ablate various design choices regarding our SLAM system and network architecture. Ablations are performed on our validation split of the TartanAir dataset. In Fig. 1 we show visualizations on the validation set of keyframe depth estimates alongside optical flow and associated confidence weights.
In Fig. 2 (left) we show how the system benefits from both stereo video and global optimization. Although our network is only trained on monocular video, it can readily leverage stereo frames if available. In Fig. 2 (right) we show how the number of keyframes affects odometry performance.
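For reference, the ATE plotted in these ablations is the RMSE of translational error after rigidly aligning the estimated trajectory to ground truth. A minimal sketch of that metric using standard Kabsch/Umeyama alignment without scale (an illustrative re-implementation, not the exact evaluation script):

```python
import numpy as np

def ate_rmse(gt, est):
    """RMSE of translational error after SE(3) alignment of the
    estimated trajectory to ground truth. gt, est: (N, 3) arrays."""
    mu_g, mu_e = gt.mean(axis=0), est.mean(axis=0)
    # Cross-covariance between centered estimate and ground truth.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections in the recovered rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    err = est @ R.T + t - gt
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))
```

A trajectory that differs from ground truth only by a rigid transform has zero ATE under this definition, which is why monocular results additionally require a scale alignment step.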
In Fig. 3 we ablate components of the network architecture. Fig. 3 (left) shows the impact of using global context in the GRU through spatial pooling, while Fig. 3 (right) demonstrates the importance of
[Two plots of fraction of runs vs. ATE [m]. Left legend: No Global Pooling, Global Pooling. Right legend: RAFT + BA, Ours.]
Figure 3: (Left) Impact of global context in the update operator. (Right) Impact of using the bundle adjustment layer during training vs. training directly on optical flow, then applying BA at test time.
training with DBA as opposed to training on flow and applying BA at inference. We find that the SLAM system is unstable and prone to failure if the DBA is not used during training.
C Jacobians

Using the local parameterization, we compute the Jacobian of the 3D point transformation

X' = \mathrm{Exp}(\xi_j) \cdot G_j \cdot (\mathrm{Exp}(\xi_i) \cdot G_i)^{-1} \cdot X = \mathrm{Exp}(\xi_j) \cdot G_j \cdot G_i^{-1} \cdot \mathrm{Exp}(-\xi_i) \cdot X    (4)
using the adjoint operator to move the \xi_i term to the front of the expression

X' = \mathrm{Exp}(\xi_j) \cdot \mathrm{Exp}(-\mathrm{Adj}_{G_j G_i^{-1}} \xi_i) \cdot G_j \cdot G_i^{-1} \cdot X    (5)
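The adjoint identity used here, G \cdot \mathrm{Exp}(\xi) = \mathrm{Exp}(\mathrm{Adj}_G\,\xi) \cdot G, can be checked numerically. A small sketch using a truncated power series for the matrix exponential and a (translation, rotation) twist ordering (both conventions are assumptions of this sketch, not taken from the derivation above):

```python
import numpy as np

def hat(w):
    """so(3) hat operator: 3-vector -> skew-symmetric 3x3 matrix."""
    return np.array([[0., -w[2], w[1]],
                     [w[2], 0., -w[0]],
                     [-w[1], w[0], 0.]])

def twist(xi):
    """4x4 se(3) matrix for xi = (tau, phi), translation first."""
    T = np.zeros((4, 4))
    T[:3, :3] = hat(xi[3:])
    T[:3, 3] = xi[:3]
    return T

def mexp(A, terms=30):
    """Matrix exponential by truncated power series (fine for small A)."""
    out, P = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        P = P @ A / k
        out = out + P
    return out

def adjoint(G):
    """6x6 adjoint of G in SE(3) for (tau, phi) ordering."""
    R, t = G[:3, :3], G[:3, 3]
    A = np.zeros((6, 6))
    A[:3, :3] = R
    A[3:, 3:] = R
    A[:3, 3:] = hat(t) @ R
    return A

# Check G * Exp(xi) == Exp(Adj_G xi) * G on random elements.
rng = np.random.default_rng(1)
G = mexp(twist(rng.standard_normal(6) * 0.5))
xi = rng.standard_normal(6) * 0.5
lhs = G @ mexp(twist(xi))
rhs = mexp(twist(adjoint(G) @ xi)) @ G
print(np.allclose(lhs, rhs, atol=1e-8))  # True
```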
Evaluating the derivatives at \xi_i = \xi_j = 0 with X' = (x', y', z', 1)^T gives

\frac{\partial X'}{\partial \xi_j} = \begin{pmatrix} 1 & 0 & 0 & 0 & z' & -y' \\ 0 & 1 & 0 & -z' & 0 & x' \\ 0 & 0 & 1 & y' & -x' & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}    (6)

\frac{\partial X'}{\partial \xi_i} = -\frac{\partial X'}{\partial \xi_j} \, \mathrm{Adj}_{G_j G_i^{-1}}    (7)
Using the chain rule, we can compute the full Jacobians with respect to the variables

\frac{\partial p'}{\partial \xi_j} = \frac{\partial \Pi_c(X')}{\partial X'} \frac{\partial X'}{\partial \xi_j}, \qquad \frac{\partial p'}{\partial \xi_i} = \frac{\partial \Pi_c(X')}{\partial X'} \frac{\partial X'}{\partial \xi_i}    (8)
and with respect to the inverse depth

\frac{\partial p'}{\partial d} = \frac{\partial \Pi_c(X')}{\partial X'} \frac{\partial X'}{\partial X} \frac{\partial \Pi_c^{-1}(p, d)}{\partial d} = \frac{\partial \Pi_c(X')}{\partial X'} \begin{pmatrix} t_x \\ t_y \\ t_z \\ 1 \end{pmatrix}    (9)

where (t_x, t_y, t_z) is the translation vector of G_j \circ G_i^{-1}.
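The last factor in Eqn. 9 can be sanity-checked by finite differences, assuming the inverse projection parameterizes a pixel with inverse depth d as the homogeneous point \Pi_c^{-1}(p, d) = (\bar{x}, \bar{y}, 1, d)^T in normalized image coordinates (this parameterization is an assumption of the sketch):

```python
import numpy as np

# A 4x4 rigid transform standing in for G_j * G_i^{-1}; its translation
# is (tx, ty, tz). Identity rotation keeps the example short -- the last
# column, and hence the derivative, is (tx, ty, tz, 1) for any rotation.
t = np.array([0.3, -0.1, 0.8])
A = np.eye(4)
A[:3, 3] = t

def X_of_d(d, xb=0.2, yb=-0.4):
    """Transformed point X' = A * (x_bar, y_bar, 1, d)^T."""
    return A @ np.array([xb, yb, 1.0, d])

# Central finite difference of X' with respect to inverse depth d.
eps = 1e-6
num = (X_of_d(0.5 + eps) - X_of_d(0.5 - eps)) / (2 * eps)
print(np.allclose(num, np.array([t[0], t[1], t[2], 1.0])))  # True
```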
D Network Architecture
[Diagram: Conv7x7 (64), six ResBlocks (64–256), Conv3x3 (D).]
Figure 4: Architecture of the feature and context encoders. Both extract features at 1/8 the input image resolution using a set of 6 basic residual blocks followed by a final Conv3x3 (D). Instance normalization is used in the feature encoder; no normalization is used in the context encoder. The feature encoder outputs features with dimension D=128, while the context encoder outputs features with dimension D=256.
[Diagram: context, correlation, and flow feature branches (Conv7x7/Conv3x3 stacks, sigmoid on the confidence head) feeding a 3x3 ConvGRU (128).]
Figure 5: Architecture of the update operator. During each iteration, context, correlation, and flow features are injected into the GRU. The revision (r) and confidence weights (w) are predicted from the updated hidden state.
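The gating inside the 3x3 ConvGRU follows the standard convolutional GRU update (as in RAFT, on which this operator is based; the exact form here is an assumption). Writing $h_{t-1}$ for the hidden state and $x_t$ for the concatenated context, correlation, and flow features:

```latex
\begin{aligned}
z_t &= \sigma\!\left(\mathrm{Conv}_{3\times3}([h_{t-1},\, x_t];\, W_z)\right) \\
r_t &= \sigma\!\left(\mathrm{Conv}_{3\times3}([h_{t-1},\, x_t];\, W_r)\right) \\
\tilde{h}_t &= \tanh\!\left(\mathrm{Conv}_{3\times3}([r_t \odot h_{t-1},\, x_t];\, W_h)\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The revision $r$ and confidence $w$ are then decoded from $h_t$ by small convolutional heads.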
References
[1] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, 2016.
[2] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. arXiv preprint arXiv:2007.11898, 2020.
[3] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.
[4] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[5] T. Qin and S. Shen. Online temporal calibration for monocular visual-inertial systems. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3662–3669. IEEE, 2018.
[6] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1281–1292, 2020.