ORB-SLAM: A Versatile and Accurate Monocular SLAM System
I. INTRODUCTION
Bundle adjustment (BA) is known to provide accurate
estimates of camera localizations as well as a sparse geometrical reconstruction [1], [2], given that a strong network of
matches and good initial guesses are provided. For a long time,
this approach was considered unaffordable for real-time applications such as visual simultaneous localization and mapping
(visual SLAM). Visual SLAM has the goal of estimating the
camera trajectory while reconstructing the environment. Now,
we know that to achieve accurate results at nonprohibitive computational cost, a real-time SLAM algorithm has to provide BA
with the following.
1) Corresponding observations of scene features (map
points) among a subset of selected frames (keyframes).
2) As complexity grows with the number of keyframes, their
selection should avoid unnecessary redundancy.
3) A strong network configuration of keyframes and points
to produce accurate results, that is, a well spread set of
keyframes observing points with significant parallax and
with plenty of loop closure matches.
Manuscript received April 28, 2015; accepted July 27, 2015. This paper was
recommended for publication by Associate Editor D. Scaramuzza and Editor
D. Fox upon evaluation of the reviewers' comments. This work was supported
by the Dirección General de Investigación of Spain under Project DPI2012-32168, the Ministerio de Educación Scholarship FPU13/04175, and Gobierno
de Aragón Scholarship B121/13.
The authors are with the Instituto de Investigación en Ingeniería de Aragón (I3A),
Universidad de Zaragoza, 50018 Zaragoza, Spain (e-mail: [email protected];
[email protected]; [email protected]).
Fig. 1. ORB-SLAM system overview, showing all the steps performed by the
tracking, local mapping, and loop closing threads. The main components of the
place recognition module and the map are also shown.
2 https://github.com/dorian3d/DBoW2
$$x_c = H_{cr}\, x_r, \qquad x_c^T F_{cr}\, x_r = 0 \qquad (1)$$
with the normalized DLT and eight-point algorithms, respectively, as explained in [2], inside a RANSAC scheme.
To make the procedure homogeneous for both models, the number of RANSAC iterations is fixed in advance and is the same for both models, along with the points used at each iteration: eight for the fundamental matrix, and four of them for the homography. At each iteration, we compute a score S_M for each model M (H for the homography, F for the fundamental matrix)
$$S_M = \sum_i \left( \rho_M\!\left(d^2_{cr}(x^i_c, x^i_r, M)\right) + \rho_M\!\left(d^2_{rc}(x^i_c, x^i_r, M)\right) \right)$$
$$\rho_M(d^2) = \begin{cases} \Gamma - d^2, & \text{if } d^2 < T_M \\ 0, & \text{if } d^2 \geq T_M \end{cases} \qquad (2)$$
where d²_cr and d²_rc are the symmetric transfer errors [2] from one frame to the other, T_M is the outlier rejection threshold based on the χ² test at 95% (T_H = 5.99, T_F = 3.84, assuming a standard deviation of 1 pixel in the measurement error), and Γ is set equal to T_H so that both models score equally for the same d in their inlier region.
We then compute the ratio
$$R_H = \frac{S_H}{S_H + S_F} \qquad (3)$$
and select the homography if R_H > 0.45, which adequately captures the planar and low-parallax cases; otherwise, we select the fundamental matrix (a sketch of this model selection is given after these steps).
4) Motion and structure from motion recovery: Once a model is selected, we retrieve the associated motion hypotheses. In the case of the homography, we retrieve eight motion hypotheses using the method of Faugeras and Lustman [23]. The method proposes cheirality tests to select the valid solution. However, these tests fail under low parallax, as points easily fall in front of or behind the cameras, which could lead to the selection of a wrong solution. We propose instead to directly triangulate the eight solutions and check whether there is one solution with most points seen with parallax, in front of both cameras, and with low reprojection error. If there is no clear winner, we do not initialize and continue from step 1. This technique to disambiguate the solutions makes our initialization robust under low parallax and the twofold ambiguity configuration, and could be considered the key to the robustness of our method.
In the case of the fundamental matrix, we convert it into an essential matrix using the calibration matrix K:
$$E_{rc} = K^T F_{rc} K \qquad (4)$$
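To make the scoring and model-selection heuristic above concrete, the following minimal Python sketch (illustrative only, not the ORB-SLAM implementation; the helper names and the precomputed squared transfer errors are assumptions) scores each model with the truncated error of (2), selects a model with the ratio of (3), and forms the essential matrix of (4) when the fundamental matrix wins:

```python
import numpy as np

# chi-square thresholds at 95% for 1-pixel measurement noise; Gamma = T_H
T_H, T_F, GAMMA = 5.99, 3.84, 5.99

def model_score(d2_cr, d2_rc, T_M):
    """Eq. (2): truncated symmetric transfer error score for one model."""
    d2 = np.concatenate([d2_cr, d2_rc])
    return np.sum(np.where(d2 < T_M, GAMMA - d2, 0.0))

def select_model(errors_H, errors_F):
    """Eq. (3): pick 'H' if R_H > 0.45, otherwise 'F'.
    errors_* are pairs (d2_cr, d2_rc) of squared transfer errors per match."""
    S_H = model_score(*errors_H, T_H)
    S_F = model_score(*errors_F, T_F)
    R_H = S_H / (S_H + S_F)
    return 'H' if R_H > 0.45 else 'F'

def essential_from_fundamental(F_rc, K):
    """Eq. (4): E_rc = K^T F_rc K, with K the camera calibration matrix."""
    return K.T @ F_rc @ K
```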
V. TRACKING
In this section, we describe the steps of the tracking thread that are performed with every frame from the camera. The camera pose optimizations, mentioned in several steps, consist of motion-only BA, which is described in the Appendix.
A. ORB Extraction
We extract FAST corners at eight scale levels with a scale factor of 1.2. For image resolutions from 512 × 384 to 752 × 480 pixels, we found it suitable to extract 1000 corners; for higher resolutions, such as the 1241 × 376 images in the KITTI dataset [40], we extract 2000 corners. In order to ensure a homogeneous distribution, we divide each scale level into a grid, trying to extract at least five corners per cell. We then detect corners in each cell, adapting the detector threshold if not enough corners are found. The amount of corners retained per cell is also adapted if some cells contain no corners (textureless or low contrast). The orientation and ORB descriptor are then computed on the retained FAST corners. The ORB descriptor is used in all feature matching, in contrast with the search by patch correlation in PTAM.
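A minimal sketch of this grid-adapted extraction using OpenCV is shown below. It is our own illustrative code, not the ORB-SLAM implementation; the grid size, detector thresholds, and per-cell budget are assumptions, and only a single pyramid level is handled.

```python
import cv2

def extract_orb_grid(image, target_corners=1000, grid=(8, 8),
                     init_thresh=20, min_thresh=7):
    """Grid-adapted FAST detection followed by ORB description.
    Illustrative sketch for one pyramid level; the full system repeats this
    over eight levels with a 1.2 scale factor."""
    h, w = image.shape[:2]
    cell_h, cell_w = h // grid[0], w // grid[1]
    per_cell = max(5, target_corners // (grid[0] * grid[1]))
    keypoints = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            y0, x0 = r * cell_h, c * cell_w
            cell = image[y0:y0 + cell_h, x0:x0 + cell_w]
            fast = cv2.FastFeatureDetector_create(threshold=init_thresh)
            kps = fast.detect(cell, None)
            if len(kps) < per_cell:          # adapt threshold in weak cells
                fast.setThreshold(min_thresh)
                kps = fast.detect(cell, None)
            kps = sorted(kps, key=lambda k: k.response, reverse=True)[:per_cell]
            # shift cell coordinates back to full-image coordinates
            keypoints += [cv2.KeyPoint(k.pt[0] + x0, k.pt[1] + y0, k.size,
                                       k.angle, k.response, k.octave)
                          for k in kps]
    orb = cv2.ORB_create()                    # orientation + binary descriptor
    keypoints, descriptors = orb.compute(image, keypoints)
    return keypoints, descriptors
```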
matched in others; therefore, it is projected in the rest of connected keyframes, and correspondences are searched as detailed
in Section V-D.
D. Local Bundle Adjustment
The local BA optimizes the currently processed keyframe
Ki , all the keyframes connected to it in the covisibility graph
Kc , and all the map points seen by those keyframes. All other
keyframes that see those points but are not connected to the
currently processed keyframe are included in the optimization
but remain fixed. Observations that are marked as outliers are
discarded at the middle and at the end of the optimization. See
the Appendix for more details about this optimization.
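The selection of variables for this optimization can be outlined as follows. This is a hypothetical, solver-agnostic sketch; the keyframe and map-point attributes are assumptions, not the ORB-SLAM API.

```python
def build_local_ba(K_i, covisibility_graph):
    """Collect optimized and fixed variables for local BA around keyframe K_i."""
    # K_i, its covisible keyframes K_c, and all map points they observe are optimized.
    local_kfs = {K_i} | set(covisibility_graph.neighbors(K_i))
    local_points = {p for kf in local_kfs for p in kf.map_points}
    # Keyframes that observe the local points but are not covisible with K_i
    # contribute observations, yet their poses remain fixed.
    fixed_kfs = {kf for p in local_points for kf in p.observing_keyframes} - local_kfs
    observations = [(kf, p, kf.keypoint_of(p))
                    for p in local_points
                    for kf in p.observing_keyframes
                    if kf in local_kfs or kf in fixed_kfs]
    return local_kfs, fixed_kfs, local_points, observations
```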
E. Local Keyframe Culling
In order to maintain a compact reconstruction, the local mapping tries to detect redundant keyframes and delete them. This is beneficial because BA complexity grows with the number of keyframes, but also because it enables lifelong operation in the same environment, as the number of keyframes will not grow unbounded unless the visual content of the scene changes. We discard all keyframes in Kc at least 90% of whose map points have been seen in at least three other keyframes at the same or finer scale. The scale condition ensures that map points keep the keyframes from which they are measured with most accuracy. This policy was inspired by the one proposed in the work of Tan et al. [24], where keyframes were discarded after a process of change detection.
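A minimal sketch of this redundancy test, under the same hypothetical interfaces as the previous sketch (the observation-scale accessor is an assumption):

```python
def is_redundant(kf, redundancy_ratio=0.9, min_observers=3):
    """A keyframe is redundant if at least 90% of its map points are seen by
    at least three other keyframes at the same or finer (smaller) scale."""
    points = [p for p in kf.map_points if p is not None]
    if not points:
        return True
    redundant = 0
    for p in points:
        scale = kf.observation_scale(p)
        others = sum(1 for other in p.observing_keyframes
                     if other is not kf and other.observation_scale(p) <= scale)
        if others >= min_observers:
            redundant += 1
    return redundant >= redundancy_ratio * len(points)

def cull_local_keyframes(K_c):
    """Apply the test to the covisible keyframes K_c of the current keyframe."""
    return [kf for kf in K_c if is_redundant(kf)]
```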
VII. LOOP CLOSING
The loop closing thread takes Ki, the last keyframe processed by the local mapping, and tries to detect and close loops. The steps are described next.
A. Loop Candidates Detection
First, we compute the similarity between the bag-of-words vector of Ki and all its neighbors in the covisibility graph (θ_min = 30) and retain the lowest score s_min. Then, we query the recognition database and discard all those keyframes whose score is lower than s_min. This is similar to the score normalization in DBoW2, which gains robustness by normalizing with the score of the previous image, but here we use covisibility information instead. In addition, all keyframes directly connected to Ki are discarded from the results. To accept a loop candidate, we must detect three consecutive loop candidates that are consistent (keyframes connected in the covisibility graph). There can be several loop candidates if there are several places with similar appearance to Ki.
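The gist of this candidate selection can be sketched as follows. The database, covisibility-graph, and consistency-tracking interfaces below are hypothetical; DBoW2 itself only provides the bag-of-words scoring.

```python
def detect_loop_candidates(K_i, covis, db, prev_consistent_groups, required=3):
    """Loop candidate detection for keyframe K_i (hypothetical interfaces)."""
    # 1. Lowest BoW score among covisible neighbours (theta_min = 30 shared points).
    neighbours = set(covis.neighbors(K_i, min_shared_points=30))
    s_min = min(db.score(K_i.bow, kf.bow) for kf in neighbours)

    # 2. Query the recognition database; drop low scores and direct neighbours.
    candidates = [kf for kf, s in db.query(K_i.bow)
                  if s >= s_min and kf not in neighbours]

    # 3. Require each candidate (or its covisible group) to be re-detected in
    #    three consecutive keyframes before accepting it.
    accepted, new_groups = [], []
    for cand in candidates:
        group = set(covis.neighbors(cand)) | {cand}
        streak = 1 + max((n for g, n in prev_consistent_groups if g & group),
                         default=0)
        new_groups.append((group, streak))
        if streak >= required:
            accepted.append(cand)
    return accepted, new_groups
```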
B. Compute the Similarity Transformation
In monocular SLAM, there are seven DoFs in which the map can drift: three translations, three rotations, and a scale factor [6]. Therefore, to close a loop, we need to compute a similarity transformation from the current keyframe Ki to the loop keyframe Kl that informs us about the error accumulated in the loop.
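For reference, one standard way to compute such a transformation from matched 3-D map points is the Horn/Umeyama closed-form solution sketched below. This is a generic estimator, not necessarily the exact procedure used by the system, and the RANSAC loop over feature matches that would wrap it is omitted.

```python
import numpy as np

def estimate_sim3(P, Q):
    """Closed-form similarity (s, R, t) with Q ~ s * R @ P + t, where P and Q
    are 3xN matrices of matched 3-D points (Horn/Umeyama style)."""
    mu_p, mu_q = P.mean(axis=1, keepdims=True), Q.mean(axis=1, keepdims=True)
    Pc, Qc = P - mu_p, Q - mu_q
    U, D, Vt = np.linalg.svd(Qc @ Pc.T)
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                 # avoid a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = (D * np.diag(S)).sum() / (Pc ** 2).sum()  # scale factor
    t = mu_q - s * (R @ mu_p)
    return s, R, t
```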
Fig. 5. Map before and after a loop closure in the NewCollege sequence. The loop closure match is drawn in blue, the trajectory in green, and the local map for the tracking at that moment in red. The local map is extended along both sides of the loop after it is closed.

TABLE I
TRACKING AND MAPPING TIMES IN NEWCOLLEGE

Operation              | Median (ms) | Mean (ms) | Std (ms)
TRACKING
  ORB extraction       | 11.10       | 11.42     | 1.61
  Initial Pose Est.    | 3.38        | 3.45      | 0.99
  Track Local Map      | 14.84       | 16.01     | 9.98
  Total                | 30.57       | 31.60     | 10.39
LOCAL MAPPING
  KeyFrame Insertion   | 10.29       | 11.88     | 5.03
  Map Point Culling    | 0.10        | 3.18      | 6.70
  Map Point Creation   | 66.79       | 72.96     | 31.48
  Local BA             | 296.08      | 360.41    | 171.11
  KeyFrame Culling     | 8.07        | 15.79     | 18.98
  Total                | 383.59      | 464.27    | 217.89
closure. The whole map after processing the full sequence at its
real frame rate is shown in Fig. 6. The big loop on the right
does not perfectly align because it was traversed in opposite
directions, and the place recognizer was not able to find loop
closures.
We have extracted statistics of the time spent by each thread in this experiment. Table I shows the results for the tracking and the local mapping. Tracking works at frame rates around 25-30 Hz, the most demanding task being to track the local map. If needed, this time could be reduced by limiting the number of keyframes included in the local map. In the local mapping thread, the most demanding task is local BA. The local BA time varies depending on whether the robot is exploring or moving in a well-mapped area, because during exploration BA is interrupted if tracking inserts a new keyframe, as explained in Section V-E. When new keyframes are not needed, local BA performs a generous number of prefixed iterations.
Table II shows the results for each of the six loop closures found. It can be seen that the loop detection time increases sublinearly with the number of keyframes. This is due to the efficient querying of the database, which only compares the subset of images with words in common and demonstrates the potential of bags of words for place recognition. Our Essential Graph includes around five times as many edges as keyframes, which makes it a quite sparse graph.
TABLE II
LOOP CLOSING TIMES IN NEWCOLLEGE

Loop | KeyFrames | Essential Graph Edges | Candidates Detection (ms) | Similarity Transformation (ms) | Fusion (s) | Essential Graph Optimization (s) | Total (s)
1    | 287       | 1347                  | 4.71                      | 20.77                          | 0.20       | 0.26                             | 0.51
2    | 1082      | 5950                  | 4.14                      | 17.98                          | 0.39       | 1.06                             | 1.52
3    | 1279      | 7128                  | 9.82                      | 31.29                          | 0.95       | 1.26                             | 2.27
4    | 2648      | 12547                 | 12.37                     | 30.36                          | 0.97       | 2.30                             | 3.33
5    | 3150      | 16033                 | 14.71                     | 41.28                          | 1.73       | 2.80                             | 4.60
6    | 4496      | 21797                 | 13.52                     | 48.68                          | 0.97       | 3.62                             | 4.69
TABLE III
KEYFRAME LOCALIZATION ERROR COMPARISON IN THE TUM RGB-D BENCHMARK [38]

Absolute KeyFrame Trajectory RMSE (cm)

Sequence          | ORB-SLAM           | PTAM         | RGBD-SLAM
fr1_xyz           | 0.90               | 1.15         | 1.34 (1.34)
fr2_xyz           | 0.30               | 0.20         | 2.61 (1.42)
fr1_floor         | 2.99               | X            | 3.51 (3.51)
fr1_desk          | 1.69               | X            | 2.58 (2.52)
fr2_360_kidnap    | 3.81               | 2.63         | 393.3 (100.5)
fr2_desk          | 0.88               | X            | 9.50 (3.94)
fr3_long_office   | 3.45               | X            | -
fr3_nstr_tex_far  | ambiguity detected | 4.92 / 34.74 | -
fr3_nstr_tex_near | 1.39               | 2.74         | -
fr3_str_tex_far   | 0.77               | 0.93         | -
fr3_str_tex_near  | 1.58               | 1.04         | -
fr2_desk_person   | 0.63               | X            | 6.97 (2.00)
fr3_sit_xyz       | 0.79               | 0.83         | -
fr3_sit_halfsph   | 1.34               | X            | -
fr3_walk_xyz      | 1.24               | X            | -
fr3_walk_halfsph  | 1.74               | X            | -

Results for ORB-SLAM, PTAM, and LSD-SLAM are the median over five executions in each sequence. The trajectories have been aligned with 7 DoFs with the ground truth. Trajectories for RGBD-SLAM are taken from the benchmark website, only available for fr1 and fr2 sequences, and have been aligned with 6 DoFs and 7 DoFs (results in brackets). X means that tracking is lost at some point and a significant portion of the sequence is not processed by the system.
TABLE IV
RESULTS FOR THE RELOCALIZATION EXPERIMENTS

                              Initial Map         Relocalization
Sequence        | System   | KFs | RMSE (cm) | Recall (%) | RMSE (cm) | Max. Error (cm)
fr2_xyz         | PTAM     | 37  | 0.19      | 34.9       | 0.26      | 1.52
fr2_xyz         | ORB-SLAM | 24  | 0.19      | 78.4       | 0.38      | 1.67
fr3_walking_xyz | PTAM     | 34  | 0.83      | 0.0        | -         | -
fr3_walking_xyz | ORB-SLAM | 31  | 0.82      | 77.9       | 1.32      | 4.95
total number of keyframes in the map, and Fig. 10(b) shows for
each keyframe its frame of creation and destruction, showing
how long the keyframes have survived in the map. It can be seen
that during the first two sequences, the map size grows as all the
views of the scene are being seen for the first time. In Fig. 10(b),
we can see that several keyframes created during these two
first sequences are maintained in the map during the whole
experiment. During the sequences sitting_rpy and walking_xyz, the map does not grow, because the map created so far explains the scene well. In contrast, during the last two sequences, more keyframes are inserted, showing that there are some novelties in the scene that were not yet represented, probably due to
dynamic changes. Finally, Fig. 10(c) shows a histogram of the
keyframes according to the time they have survived with respect
to the remaining time of the sequence from its moment of
creation. It can be seen that most of the keyframes are destroyed
by the culling procedure soon after creation, and only a small
subset survive until the end of the experiment. On one hand,
this shows that our system has a generous keyframe spawning
policy, which is very useful when performing abrupt motions
in exploration. On the other hand, the system is eventually able
to select a small representative subset of those keyframes.
In these lifelong experiments, we have shown that our map grows with the content of the scene but not with time, and that it is able to store the dynamic changes of the scene, which could be useful to perform some scene understanding by accumulating experience in an environment.
E. Large-Scale and Large Loop Closing in the KITTI Dataset
The odometry benchmark from the KITTI dataset [40] contains 11 sequences from a car driven around a residential area with accurate ground truth from GPS and a Velodyne laser scanner. This is a very challenging dataset for monocular vision due to fast rotations, areas with a lot of foliage that make data association more difficult, and relatively high car speed, the sequences being recorded at 10 frames/s. We play the sequences at the real frame rate at which they were recorded, and ORB-SLAM is able to process all the sequences with the exception of sequence 01, which is a highway with few trackable close objects. Sequences 00, 02, 05, 06, 07, and 09 contain loops that were correctly detected and closed by our system. Sequence 09 contains a loop that can be detected only in a few frames at the end of the sequence, and our system does not always detect it (the results provided are for the executions in which it was detected).
Qualitative comparisons of our trajectories and the ground
truth are shown in Figs. 11 and 12. As in the TUM RGB-D
benchmark, we have aligned the keyframe trajectories of our
system and the ground truth with a similarity transformation.
We can qualitatively compare our results from Figs. 11 and 12 with the results provided for sequences 00, 05, 06, 07, and 08 by the recent monocular SLAM approach of Lim et al. [25, Fig. 10]. ORB-SLAM produces clearly more accurate trajectories for all those sequences with the exception of sequence 08, in which they seem to suffer less drift.
Table V shows the median RMSE error of the keyframe trajectory over five executions in each sequence. We also provide the
Fig. 11. Sequences 00, 05, and 07 from the odometry benchmark of the KITTI dataset. (Left) Points and keyframe trajectory. (Center) Trajectory and ground truth. (Right) Trajectory after 20 iterations of full BA. The output of our system is quite accurate, while it can be slightly improved with some iterations of BA.
Fig. 12. ORB-SLAM keyframe trajectories in sequences 02, 03, 04, 06, 08, 09, and 10 from the odometry benchmark of the KITTI dataset. Sequence 08 does not contain loops, and drift (especially in scale) is not corrected. (a) Sequence 02. (b) Sequence 03. (c) Sequence 04. (d) Sequence 06. (e) Sequence 08. (f) Sequence 09. (g) Sequence 10.
TABLE V
RESULTS OF OUR SYSTEM IN THE KITTI DATASET

          |                 | ORB-SLAM          | + Global BA (20 its.)
Sequence  | Dimension (m×m) | KFs  | RMSE (m)   | RMSE (m) | Time BA (s)
KITTI 00  | 564 × 496       | 1391 | 6.68       | 5.33     | 24.83
KITTI 01  | 1157 × 1827     | X    | X          | X        | X
KITTI 02  | 599 × 946       | 1801 | 21.75      | 21.28    | 30.07
KITTI 03  | 471 × 199       | 250  | 1.59       | 1.51     | 4.88
KITTI 04  | 0.5 × 394       | 108  | 1.79       | 1.62     | 1.58
KITTI 05  | 479 × 426       | 820  | 8.23       | 4.85     | 15.20
KITTI 06  | 23 × 457        | 373  | 14.68      | 12.34    | 7.78
KITTI 07  | 191 × 209       | 351  | 3.36       | 2.26     | 6.28
KITTI 08  | 808 × 391       | 1473 | 46.58      | 46.68    | 25.60
KITTI 09  | 465 × 568       | 653  | 7.62       | 6.62     | 11.33
KITTI 10  | 671 × 177       | 411  | 8.68       | 8.80     | 7.64
high-frequency texture like asphalt [45]. Their denser reconstructions, as compared with the sparse point map of our system or PTAM, could be more useful for other tasks than just camera localization.
However, direct methods have their own limitations. First, these methods assume a surface reflectance model that, in real scenes, produces its own artifacts. Second, the photometric consistency limits the baseline of the matches, which is typically narrower than the baseline that features allow. This has a great impact on reconstruction accuracy, which requires wide-baseline observations to reduce depth uncertainty. Third, if these effects are not correctly modeled, direct methods are quite affected by rolling-shutter, autogain, and autoexposure artifacts (as in the TUM RGB-D benchmark). Finally, because direct methods are, in general, computationally very demanding, the map is just incrementally expanded as in DTAM, or map optimization is reduced to a pose graph, discarding all sensor measurements as in LSD-SLAM.
TABLE VI
COMPARISON OF LOOP CLOSING STRATEGIES IN KITTI 09

Method             | Time (s) | Essential Graph Edges | RMSE (m)
-                  | -        | -                     | 48.77
BA (20)            | 14.64    | -                     | 49.90
BA (100)           | 72.16    | -                     | 18.82
EG (200)           | 0.38     | 890                   | 8.84
EG (100)           | 0.48     | 1979                  | 8.36
EG (50)            | 0.59     | 3583                  | 8.95
EG (15)            | 0.94     | 6663                  | 8.88
EG (100) + BA (20) | 13.40    | 1979                  | 7.22

The first row shows results without loop closing. The number in brackets for BA is the number of Levenberg-Marquardt (LM) iterations, while for EG (essential graph) it is θ_min used to build the essential graph. All EG optimizations perform ten LM iterations.
$$\pi_i(T_{iw}, X_{w,j}) = \begin{bmatrix} f_{i,u}\,\dfrac{x_{i,j}}{z_{i,j}} + c_{i,u} \\[4pt] f_{i,v}\,\dfrac{y_{i,j}}{z_{i,j}} + c_{i,v} \end{bmatrix} \qquad (5)$$
$$\begin{bmatrix} x_{i,j} & y_{i,j} & z_{i,j} \end{bmatrix}^T = R_{iw}\, X_{w,j} + t_{iw} \qquad (6)$$
where $\pi_i$ is the projection function of keyframe $i$, with focal length $(f_{i,u}, f_{i,v})$ and principal point $(c_{i,u}, c_{i,v})$, and $R_{iw} \in SO(3)$ and $t_{iw}$ are the rotation and translation of the world-to-camera transformation $T_{iw}$.
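A direct NumPy transcription of this projection model and the reprojection residual it induces (an illustrative sketch; the function names are ours):

```python
import numpy as np

def project(T_iw, X_wj, f_u, f_v, c_u, c_v):
    """Project world point X_wj into keyframe i, Eqs. (5)-(6).
    T_iw = (R_iw, t_iw) is the world-to-camera rigid transformation."""
    R_iw, t_iw = T_iw
    x, y, z = R_iw @ X_wj + t_iw          # Eq. (6): camera coordinates
    return np.array([f_u * x / z + c_u,   # Eq. (5): pinhole projection
                     f_v * y / z + c_v])

def reprojection_error(x_obs, T_iw, X_wj, intrinsics):
    """Residual of the kind minimized by BA: observed keypoint minus projection."""
    return x_obs - project(T_iw, X_wj, *intrinsics)
```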
Fig. 13. Comparison of different loop closing strategies in KITTI 09. (a)
Without Loop Closing. (b) BA (20). (c) EG (100). (d) EG (100) + BA (20).
$$e_{1,i} = x_{1,i} - \pi_1\!\left(S_{12}\, X_{2,i}\right), \qquad e_{2,i} = x_{2,i} - \pi_2\!\left(S_{12}^{-1}\, X_{1,i}\right) \qquad (10)$$
$$C = \sum_i \left( \rho_h\!\left(e_{1,i}^T\, \Lambda_{1,i}^{-1}\, e_{1,i}\right) + \rho_h\!\left(e_{2,i}^T\, \Lambda_{2,i}^{-1}\, e_{2,i}\right) \right) \qquad (11)$$
where Λ1,i and Λ2,i are the covariance matrices associated with the scale at which the keypoints in images 1 and 2 were detected, and ρ_h is the Huber robust cost. In this optimization, the points are fixed.
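Given the error terms of (10), their evaluation for a candidate relative similarity can be sketched as follows (illustrative; project1 and project2 stand for the projection functions π1 and π2 and are assumptions):

```python
import numpy as np

def sim3_errors(S12, X1_i, X2_i, x1_i, x2_i, project1, project2):
    """Reprojection errors of a matched map point in both images for a
    candidate relative similarity S12 = (s12, R12, t12); points stay fixed."""
    s12, R12, t12 = S12
    e1 = x1_i - project1(s12 * (R12 @ X2_i) + t12)   # error in image 1
    s21 = 1.0 / s12                                  # inverse similarity
    R21 = R12.T
    t21 = -s21 * (R21 @ t12)
    e2 = x2_i - project2(s21 * (R21 @ X1_i) + t21)   # error in image 2
    return e1, e2
```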
REFERENCES
[1] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment - A modern synthesis," in Vision Algorithms: Theory and Practice. New York, NY, USA: Springer, 2000, pp. 298-372.
[2] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[3] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd, "Real time localization and 3D reconstruction," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog., 2006, vol. 1, pp. 363-370.
[4] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. IEEE ACM Int. Symp. Mixed Augmented Reality, Nara, Japan, Nov. 2007, pp. 225-234.
[5] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Trans. Robot., vol. 28, no. 5, pp. 1188-1197, Oct. 2012.
[6] H. Strasdat, J. M. M. Montiel, and A. J. Davison, "Scale drift-aware large scale monocular SLAM," presented at the Robot.: Sci. Syst. Conf., Zaragoza, Spain, Jun. 2010.
[7] H. Strasdat, A. J. Davison, J. M. M. Montiel, and K. Konolige, "Double window optimisation for constant time visual SLAM," in Proc. IEEE Int. Conf. Comput. Vision, Barcelona, Spain, Nov. 2011, pp. 2352-2359.
[8] C. Mei, G. Sibley, and P. Newman, "Closing loops without places," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Taipei, Taiwan, Oct. 2010, pp. 3738-3744.
[9] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vision, Barcelona, Spain, Nov. 2011, pp. 2564-2571.
[10] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in Proc. Eur. Conf. Comput. Vision, Zurich, Switzerland, Sep. 2014, pp. 834-849.
[11] R. Mur-Artal and J. D. Tardós, "Fast relocalisation and loop closing in keyframe-based SLAM," in Proc. IEEE Int. Conf. Robot. Autom., Hong Kong, Jun. 2014, pp. 846-853.
[12] R. Mur-Artal and J. D. Tardós, "ORB-SLAM: Tracking and mapping recognizable features," presented at the MVIGRO Workshop, Robot. Sci. Syst. Conf., Berkeley, CA, USA, Jul. 2014.
[13] B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. D. Tardós, "A comparison of loop closing techniques in monocular SLAM," Robot. Auton. Syst., vol. 57, no. 12, pp. 1188-1197, 2009.
[14] D. Nistér and H. Stewénius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog., New York, NY, USA, Jun. 2006, vol. 2, pp. 2161-2168.