RGB-D Mapping: Using Depth Cameras for Dense 3D Modeling of Indoor Environments
Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, Dieter Fox
Abstract RGB-D cameras are novel sensing systems that capture RGB images
along with per-pixel depth information. In this paper we investigate how such cam-
eras can be used in the context of robotics, specifically for building dense 3D maps
of indoor environments. Such maps have applications in robot navigation, manip-
ulation, semantic mapping, and telepresence. We present RGB-D Mapping, a full
3D mapping system that utilizes a novel joint optimization algorithm combining
visual features and shape-based alignment. Visual and depth information are also
combined for view-based loop closure detection, followed by pose optimization to
achieve globally consistent maps. We evaluate RGB-D Mapping on two large indoor
environments, and show that it effectively combines the visual and shape informa-
tion available from RGB-D cameras.
1 Introduction
Building rich 3D maps of environments is an important task for mobile robotics,
with applications in navigation, manipulation, semantic mapping, and telepresence.
Most 3D mapping systems contain three main components: first, the spatial align-
ment of consecutive data frames; second, the detection of loop closures; third, the
globally consistent alignment of the complete data sequence. While 3D point clouds
are extremely well suited for frame-to-frame alignment and for dense 3D reconstruc-
tion, they ignore valuable information contained in images. Color cameras, on the
other hand, capture rich visual information and are becoming more and more the
sensor of choice for loop closure detection [21, 16, 30]. However, it is extremely
hard to extract dense depth from camera data alone, especially in indoor environ-
ments with very dark or sparsely textured areas.
Fig. 1 (left) RGB image and (right) depth information captured by an RGB-D camera. Recent
systems can capture images at a resolution of up to 640x480 pixels at 30 frames per second. White
pixels in the right image have no depth value, mostly due to occlusion, max distance, relative
surface angle, or surface material.
RGB-D cameras are sensing systems that capture RGB images along with per-
pixel depth information. RGB-D cameras rely on either active stereo [15, 25] or
time-of-flight sensing [3, 12] to generate depth estimates at a large number of pix-
els. While sensor systems with these capabilities have been custom-built for years,
only now are they being packaged in form factors that make them attractive for re-
search outside specialized computer vision groups. In fact, the key drivers for the
most recent RGB-D camera systems are computer gaming and home entertainment
applications [25].
RGB-D cameras allow the capture of reasonably accurate mid-resolution depth
and appearance information at high data rates. In our work we use a camera de-
veloped by PrimeSense [25], which captures 640x480 registered image and depth
points at 30 frames per second. This camera is equivalent to the visual sensors in
the recently available Microsoft Kinect [20]. Fig. 1 shows an example frame ob-
served with this RGB-D camera. As can be seen, the sensor provides dense depth
estimates. However, RGB-D cameras have some important drawbacks with respect
to 3D mapping: they provide depth only up to a limited distance (typically less than
5m), their depth estimates are very noisy, and their field of view (∼60°) is far more
constrained than that of the specialized cameras and laser scanners commonly used
for 3D mapping (∼180°).
In this paper we introduce RGB-D Mapping, a framework for using RGB-D
cameras to generate dense 3D models of indoor environments. RGB-D Mapping
exploits the integration of shape and appearance information provided by these sys-
tems. Alignment between frames is computed by jointly optimizing over both ap-
pearance and shape matching. Our approach detects loop closures by matching data
frames against a subset of previously collected frames. To generate globally consis-
tent alignments we use TORO, an optimization tool developed for SLAM [9]. The
overall system can accurately align and map large indoor environments in near-real
time and is capable of handling situations such as featureless corridors and com-
pletely dark rooms.
RGB-D Mapping maintains and updates a global model using small planar col-
ored surface patches called surfels [23]. This representation enables the approach to
efficiently reason about occlusions, to estimate the appropriate color extracted for
each part of the environment, and to provide good visualizations of the resulting
model. Furthermore, surfels automatically adapt the resolution of the representation
to the resolution of data available for each patch.
After discussing related work, we introduce RGB-D Mapping in Section 3. Ex-
periments are presented in Section 4, followed by a discussion.
2 Related Work
The robotics and computer vision communities have developed many techniques for
3D mapping using range scans [31, 32, 19, 21], stereo cameras [1, 16], monocular
cameras [5], and even unsorted collections of photos [30, 7]. Most mapping sys-
tems require the spatial alignment of consecutive data frames, the detection of loop
closures, and the globally consistent alignment of all data frames.
The solution to the frame alignment problem strongly depends on the data be-
ing used. For 3D laser data, the iterated closest point (ICP) algorithm and variants
thereof are popular techniques [2, 31, 19, 27]. The ICP algorithm iterates between
associating each point in one time frame to the closest point in the other frame and
computing the rigid transformation that minimizes distance between the point pairs.
The robustness of ICP in 3D has been improved by, e.g., incorporating point-to-
plane associations or point reflectance values [4, 28, 19].
Passive stereo systems can extract depth information for only a subset of feature
points in each stereo pair. These feature points can then be aligned over consec-
utive frames using an optimization similar to a single iteration of ICP, with the
additional advantage that appearance information can be used to solve the data as-
sociation problem more robustly, typically via RANSAC [16, 1]. Monocular SLAM
and mapping based on unsorted image sets are similar to stereo SLAM in that sparse
features are extracted from images to solve the correspondence problem. Projective
geometry is used to define the spatial relationship between features [22, 5, 30], a
much harder problem to solve than correspondence in ICP.
For the loop closure problem, most recent approaches to 3D mapping rely on fast
image matching techniques [30, 5, 16, 21]. Once a loop closure is detected, the new
correspondence between data frames can be used as an additional constraint in the
graph describing the spatial relationship between frames. Optimization of this pose
graph results in a globally aligned set of frames [9].
While RGB-D Mapping follows the overall structure of recent 3D mapping tech-
niques, it differs from existing approaches in the way it performs frame-to-frame
matching. While pure laser-based ICP is extremely robust for the 3D point clouds
collected by 3D laser scanning systems such as panning SICK scanners or 3D Velo-
dyne scanners [21, 28], RGB-D cameras provide depth and color information for a
small field of view (∼60° in contrast to ∼180°) and with less depth precision (3cm at
3m depth) [25]. The limited field of view can cause problems due to a lack of spatial
structure needed to constrain ICP alignments.
Fig. 2 Overview of RGB-D Mapping. The algorithm uses both sparse visual features and dense
point clouds for frame-to-frame alignment and loop closure detection. The surfel representation is
updated incrementally.
There has been relatively little attention devoted to the problem of combining
shape and visual information for scan alignment. Ramos and colleagues introduce
CRF-matching [26], which uses conditional random fields and adds visual informa-
tion into the matching of 2D laser scans. While this approach provides excellent
results, it is computationally expensive and does not scale to large 3D clouds. May
and colleagues [19] use laser reflectance values to improve ICP but do not take full
advantage of the improved data association provided by visual features. In contrast,
we use the visual channels to locate point features, constrain their correspondences,
and incorporate them into scan matching.
A common addition to ICP is to augment each point in the two point clouds
with additional attributes. The correspondence selection step acts in this higher-
dimensional space. This approach has been applied to point color [13], geometric
descriptors [29], and point-wise reflectance values [19]. In comparison, our algo-
rithm uses rich visual features along with RANSAC verification to add fixed data
associations into the ICP optimization. Additionally, the RANSAC associations act
as an initialization for ICP, which is a local optimizer.
Our objective is not only alignment and registration, but also building 3D models
with both shape and appearance information. In one recent work, Kim et al [14] used
a set of time-of-flight cameras in a fixed calibrated configuration and with no tem-
poral alignment of sensor streams. In contrast, we use a single freely moving camera
to build dense models for large indoor environments. In the vision community, there
has been a large amount of work on dense reconstruction from videos (e.g. [24])
and photos (e.g. [6, 8]), mostly on objects or outdoor scenes. One interesting line of
work [7] attacks the harder problem of indoor reconstruction, using a Manhattan-
world assumption to fit simple geometric models for visualization purposes. Such
approaches are computationally demanding and not very robust in feature-sparse
environments.
Fig. 3 Example frame for RGB-D frame alignment. Left, the locations of SIFT features in the
image. Right, the same SIFT features shown in their position in the point cloud.
3 RGB-D Mapping
This section describes the different components of RGB-D Mapping. A flow chart
of the overall system is shown in Fig. 2.
To align the current frame to the previous frame, the alignment step uses RGBD-
ICP, our enhanced ICP algorithm that takes advantage of the combination of RGB
and depth information. After this alignment step, the new frame is added to the dense
3D model. This step also updates the surfels used for visualization and occlusion
reasoning. A parallel loop closure detection thread uses the sparse feature points
to match the current frame against previous observations, taking spatial constraints
into account. If a loop closure is detected, a constraint is added to the pose graph
and a global alignment process is triggered.
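As an illustration of the feature-matching step used when checking a loop-closure candidate, the following Python sketch applies a nearest-neighbor ratio test to reject ambiguous descriptor matches. It is not the implementation used in RGB-D Mapping; the function and parameter names are hypothetical, and the full system additionally verifies candidate matches with RANSAC and adds accepted closures as pose-graph constraints.

```python
# Hypothetical sketch: ratio-test matching of sparse feature descriptors
# between the current frame and a previously collected frame.
import numpy as np
from scipy.spatial import cKDTree

def match_features(desc_a, desc_b, ratio=0.7):
    """desc_a, desc_b: NxD arrays of feature descriptors (e.g. 128-D SIFT).
    Returns index pairs (i, j) that pass the nearest-neighbor ratio test."""
    tree = cKDTree(desc_b)
    dists, idx = tree.query(desc_a, k=2)        # two nearest neighbors in desc_b
    keep = dists[:, 0] < ratio * dists[:, 1]    # keep only unambiguous matches
    return [(int(i), int(idx[i, 0])) for i in np.flatnonzero(keep)]
```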
3.1 RGBD-ICP
In the Iterative Closest Point (ICP) algorithm [2], points in a source cloud Ps are
matched with their nearest neighboring points in a target cloud Pt and a rigid trans-
formation is found by minimizing the n-D error between associated points. This
transformation may change the nearest neighbors for points in Ps , so the two steps of
association and optimization are alternated until convergence. ICP has been shown
to be effective when the two clouds are already nearly aligned. Otherwise, the un-
known data association between Ps and Pt can lead to convergence at an incorrect
local minimum.
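To make the alternation between association and optimization concrete, the following is a minimal point-to-point ICP sketch in Python/NumPy. It is illustrative only; the dense alignment in RGBD-ICP uses a point-to-plane error (described below), and all names here are assumptions rather than the authors' code.

```python
# Minimal point-to-point ICP sketch (illustrative; not the paper's code).
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst (SVD/Horn)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - src_c).T @ (dst - dst_c))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dst_c - R @ src_c

def icp(source, target, max_iters=30, tol=1e-6):
    """Align source (Nx3) to target (Mx3); returns the rigid transform (R, t)."""
    tree = cKDTree(target)
    R, t = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(max_iters):
        moved = source @ R.T + t
        dists, idx = tree.query(moved)                      # association step
        R, t = best_rigid_transform(source, target[idx])    # optimization step
        err = dists.mean()
        if abs(prev_err - err) < tol:   # stop when the error stops improving
            break
        prev_err = err
    return R, t
```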
Alignment of images, by contrast, is typically done using sparse feature-point
matching. A key advantage of visual features is that they can provide alignments
without requiring initialization. One widely used feature detector and descriptor is
the Scale Invariant Feature Transform (SIFT) [18]. Though feature descriptors are
very distinctive, they must be matched heuristically and false matches may be
selected. The RANSAC algorithm is often used to determine a subset of feature pairs
corresponding to a consistent rigid transformation.
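The sketch below shows one common way such a RANSAC step can be realized for 3D feature pairs: repeatedly fit a rigid transform to a minimal sample and keep the hypothesis with the most inliers. The sampling scheme, thresholds, and names are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative RANSAC over matched 3D feature locations (not the paper's code).
import numpy as np

def fit_rigid(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - src_c).T @ (dst - dst_c))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dst_c - R @ src_c

def ransac_rigid(src_pts, dst_pts, iters=500, inlier_thresh=0.03, seed=0):
    """src_pts, dst_pts: Nx3 matched feature positions (meters).
    Returns (R, t, inlier_mask) for the most-supported rigid transform."""
    rng = np.random.default_rng(seed)
    n = len(src_pts)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(iters):
        sample = rng.choice(n, size=3, replace=False)       # minimal sample
        R, t = fit_rigid(src_pts[sample], dst_pts[sample])
        residuals = np.linalg.norm(src_pts @ R.T + t - dst_pts, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() >= 3:            # refit on all inliers of the best model
        R, t = fit_rigid(src_pts[best_inliers], dst_pts[best_inliers])
    return R, t, best_inliers
```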
The joint error function minimized by RGBD-ICP(Ps, Pt) consists of two parts: the
first computes the average distance between the visually associated feature points,
and the second computes a similar error term for the dense point associations. For the dense points we
employ a point-to-plane error term that minimizes the distance error along each
target point's normal. These normals, {n_t^j}, are computed efficiently by principal
component analysis over a small neighborhood of each target point. Point-to-plane
ICP has been shown to generate more accurate alignments than point-to-point ICP
due to an improved interpolation between points [28]. Finally, the two components
are weighted by a scalar factor α. Since the point-to-plane error metric has no
closed-form solution and thus requires a nonlinear optimizer, RGBD-ICP
performs the minimization using Levenberg-Marquardt.
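A minimal sketch of this joint minimization is shown below, using SciPy's Levenberg-Marquardt solver. The 6-DOF parametrization, the square-root weighting by the factor α, and all names are assumptions made for illustration; the paper's exact formulation may differ.

```python
# Sketch of the joint error minimization (illustrative parametrization and
# weighting; not the authors' exact formulation).
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def apply_transform(x, pts):
    """x = [rotation vector (3), translation (3)]; pts: Nx3."""
    return Rotation.from_rotvec(x[:3]).apply(pts) + x[3:]

def joint_residuals(x, feat_s, feat_t, dense_s, dense_t, dense_n, alpha):
    # Point-to-point residuals for the fixed, RANSAC-verified feature pairs.
    r_feat = (apply_transform(x, feat_s) - feat_t).ravel()
    # Point-to-plane residuals for the dense associations: signed distance
    # along each target point's normal.
    r_dense = np.einsum('ij,ij->i', apply_transform(x, dense_s) - dense_t, dense_n)
    # Square-root weights so the summed squared errors are weighted by alpha.
    return np.concatenate([np.sqrt(alpha) * r_feat,
                           np.sqrt(1.0 - alpha) * r_dense])

def refine_alignment(x0, feat_s, feat_t, dense_s, dense_t, dense_n, alpha=0.5):
    result = least_squares(joint_residuals, x0, method='lm',
                           args=(feat_s, feat_t, dense_s, dense_t, dense_n, alpha))
    return result.x   # refined 6-DOF transform
```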
The loop exits after the error no longer decreases significantly or a maximum
number of iterations is reached. Otherwise, the dense data associations are recom-
puted using the most recent transformation. Note that feature point data associations
are not recomputed after the RANSAC procedure. This prevents the dense ICP
component from causing the point clouds to drift apart, which can happen in under-
constrained cases such as large flat walls.
We find that downsampling the source and target clouds given to ICP by a factor
of 4 to 10 gives the best compromise between matching speed and accuracy.
If RANSAC fails to find a large number of inliers, we initialize the ICP transfor-
mation using a constant-velocity motion model: we assume the motion between frames
n and n + 1 is similar to that between frames n − 1 and n.
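With 4x4 homogeneous camera poses, this fallback could be realized as in the following short snippet; the pose convention and names are illustrative assumptions.

```python
# Constant-velocity prediction sketch: reuse the previous relative motion.
import numpy as np

def predict_next_pose(T_prev, T_curr):
    """T_prev, T_curr: 4x4 world-from-camera poses of frames n-1 and n.
    Returns a predicted pose of frame n+1 to initialize the next alignment."""
    relative = T_curr @ np.linalg.inv(T_prev)   # motion from frame n-1 to frame n
    return relative @ T_curr                    # apply the same motion once more
```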
4 Experiments
We performed several experiments to evaluate different aspects of RGB-D Mapping.
Specifically, we demonstrate the ability of our system to build consistent maps of
large scale indoor environments, we show that our joint ICP algorithm improves
accuracy of frame to frame alignment, and we illustrate the advantageous properties
of the surfel representation.
Fig. 4 (left) 3D maps generated by RGB-D Mapping for large loops in the Intel lab (upper) and
the Allen Center (lower). (right) Accuracy: Maps (red) overlaid on a 2D laser scan map of the Intel
Lab and on a floor plan of the Allen Center. For clarity, most floor and ceiling points were removed
from the 3D maps.
As can be seen, by combining RGB and depth into a joint ICP algorithm, RGBD-
ICP achieves accuracy that is superior to using either component individually. The
results are robust to different weightings between the SIFT and dense point com-
ponents. The larger error of SIFT alone is due to failed alignments in the dark section
of the environment. However, when incorporated into RGBD-ICP, it provides addi-
tional information that improves the alignment accuracy. Though on this sequence
ICP alone performs nearly as well as RGBD-ICP, it should be pointed out that
ICP requires more iterations when not initialized with RANSAC, and that frames
containing a single textured flat wall provide no constraints to ICP.
In a second experiment, we collected a ground-truth dataset with the camera
mounted on a moving Barrett WAM arm, which has accurate motion estimation.
Errors are the average distance between the camera poses determined by the arm's
odometry and those estimated by the different algorithms. Table 3 summarizes
the results achieved by the different alignment algorithms and the additional optimization after loop closure.
Fig. 5 Surfel updates: (a) the initial surfels from the first frame; (b, c) as the camera moves closer,
more surfels are added and existing ones are refined; (d) the completed surfel model from 95
aligned RGB-D images.
This experiment confirms our findings on the large-scale marker sequence, where
RGBD-ICP outperforms the individual components. Furthermore, it shows that loop
closure followed by global optimization via TORO further improves results.
Fig. 6 Surfel representation: (left) raw point clouds (56.6 million points); (right) corresponding
surfel representation (2.2 million surfels).
The surfel model shown in Fig. 5(d) was built from 95 aligned frames, each containing roughly 250,000 RGB-D points. Simply merging
the point clouds would result in a representation containing roughly 23,750,000
points. Furthermore, these points duplicate much information, and are not consistent
with respect to color and pose. In contrast, the surfel representation consists of only
730,000 surfels. This amounts to a reduction in size by a factor of 32. The color
for each surfel is selected from the frame whose viewing direction is most in line with the surfel's normal. The final
surfel representation shown in Fig. 5(d) has concisely and faithfully combined the
information from all input frames.
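A minimal surfel structure consistent with this description might look as follows; the field names and the update rule are illustrative assumptions, not the authors' data structure.

```python
# Hypothetical minimal surfel with the color-selection rule described above.
import numpy as np
from dataclasses import dataclass

@dataclass
class Surfel:
    position: np.ndarray        # 3D patch center
    normal: np.ndarray          # unit surface normal
    radius: float               # patch size, adapted to the data resolution
    color: np.ndarray           # RGB, taken from the best-aligned view so far
    best_alignment: float = -1.0

    def update_color(self, view_dir, color):
        """Keep the color from the frame whose viewing direction is most in
        line with the surfel normal (view_dir points from camera to surfel)."""
        alignment = float(np.dot(self.normal, -view_dir))
        if alignment > self.best_alignment:
            self.best_alignment = alignment
            self.color = color
```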
In a second demonstration of surfels from an indoor map, in Fig. 6, the repre-
sentation on the left shows all points from each cloud, totaling 56.6 million points.
Our surfel representation on the right consists of only 2.2 million surfels, a factor of
26 reduction. In addition to being a more efficient representation, the surfel model
exhibits better color and geometric consistency.
We have created software that allows surfel maps to be navigated in real time,
including a stereoscopic 3D mode which creates an immersive experience that we believe
is well suited for telepresence and augmented reality applications.
Videos of our results can be found at
http://www.cs.washington.edu/robotics/projects/rgbd-mapping/ .
RANSAC is the faster and more reliable alignment component when considered
individually. However, there are situations where it fails and the joint optimization
is required. For some frames, many detected visual features are out of range of the
depth sensor, so those features have no associated 3D points and do not participate
in the RANSAC procedure. Also, when the majority of the features lie in a small
region of the image, they do not provide very strong constraints on the motion. For
example, in badly lit halls it is common to see features only on one side wall. It is
in situations such as these that RGBD-ICP provides notably better alignment.
5 Conclusion
Building accurate, dense models of indoor environments has many applications in
robotics, telepresence, gaming, and augmented reality. Limited lighting, lack of dis-
tinctive features, repetitive structures, and the demand for rich detail are inherent
to indoor environments, and handling these issues has proved a challenging task
for both the robotics and computer vision communities. Laser scanning approaches are
typically expensive and slow and need additional registration to add appearance in-
formation (visual details). Vision-only approaches to dense 3D reconstruction often
require a prohibitive amount of computation, suffer from lack of robustness, and
cannot yet provide dense, accurate 3D models.
We investigate how potentially inexpensive depth cameras developed mainly for
gaming and entertainment applications can be used for building dense 3D maps of
indoor environments. The key insights of this investigation are, first, that existing
frame matching techniques are not sufficient to provide robust visual odometry with
these cameras; second, that a tight integration of depth and color information can
yield robust frame matching and loop closure detection; third, that building on best
practices in SLAM and computer graphics makes it possible to build and visualize
accurate and extremely rich 3D maps with such cameras; and, fourth, that it will be
feasible to build complete robot navigation and interaction systems solely based on
inexpensive depth cameras.
We introduce RGB-D Mapping, a framework that can generate dense 3D maps of
indoor environments despite the limited depth precision and field of view provided
by RGB-D cameras. At the core of this framework is RGBD-ICP, a novel ICP variant
that takes advantage of the richness of information contained in RGB-D data. RGB-
D Mapping also incorporates a surfel representation to enable occlusion reasoning
and visualization.
Given that RGB-D cameras will soon be available to the public at a low price (we
believe less than 100 dollars), an RGB-D-based modeling system will potentially
have a huge impact on everyday life, allowing people to build 3D models of arbi-
trary indoor environments. Furthermore, our results indicate that RGB-D cameras
could be used to build robust robotic mapping, navigation, and interaction systems.
Along with the potential decrease in cost of the resulting navigation platform, the
application of RGB-D cameras might be an important step to enable the develop-
ment of useful, affordable robot platforms.
Despite these encouraging results, our system has several shortcomings that de-
serve future effort. Our current implementation of RGB-D Mapping is not real-time,
but we believe that an efficient implementation taking advantage of modern GPU
hardware can certainly achieve the speedup needed to operate online. The global
alignment process of RGB-D Mapping is still limited. Instead of optimizing over
camera poses only, a joint optimization over camera poses and 3D points could
result in even more consistent reconstructions. The computer graphics community
has developed extremely sophisticated visualization techniques, and incorporating
these into RGB-D mapping could improve the visual quality of the 3D maps. An-
other interesting avenue for research is the extraction of object representations from
the rich information contained in dense 3D maps. Other areas for future research
include the development of exploration techniques for building complete 3D maps
and the extension to dynamic environments.
Acknowledgments We would like to thank Louis LeGrand and Brian Mayton for their support with
the PrimeSense camera, and Marvin Cheng for his work on visualization. This work
was funded in part by an Intel grant, by ONR MURI grants number N00014-07-1-
0749 and N00014-09-1-1052, and by the NSF under contract number IIS-0812671.
Part of this work was also conducted through collaborative participation in the
Robotics Consortium sponsored by the U.S. Army Research Laboratory under the
Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-
2-0016.
References
10. G. Grisetti, C. Stachniss, S. Grzonka, and W. Burgard. A tree parameterization for efficiently
computing maximum likelihood maps using gradient descent. In Proc. of Robotics: Science
and Systems (RSS), 2007.
11. B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. J. Opt.
Soc. Am. A, 4(4):629–642, 1987.
12. Mesa Imaging. http://www.mesa-imaging.ch/.
13. A. Johnson and S. B. Kang. Registration and integration of textured 3-d data. In International
Conference on Recent Advances in 3-D Digital Imaging and Modeling (3DIM '97), pages 234–241, May 1997.
14. Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Micusik, and S. Thrun. Multi-view image
and ToF sensor fusion for dense 3D reconstruction. In Workshop on 3-D Digital Imaging and
Modeling (3DIM), 2009.
15. K. Konolige. Projected texture stereo. In Proc. of the IEEE International Conference on
Robotics & Automation (ICRA), 2010.
16. K. Konolige and M. Agrawal. FrameSLAM: From bundle adjustment to real-time visual
mapping. IEEE Transactions on Robotics, 25(5), 2008.
17. M. Krainin, P. Henry, X. Ren, and D. Fox. Manipulator and object tracking for in hand 3D
object modeling. Technical Report UW-CSE-10-09-01, University of Washington, 2010.
http://www.cs.washington.edu/ai/Mobile_Robotics/projects/hand_tracking/.
18. D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal
of Computer Vision, 60(2), 2004.
19. S. May, D. Droeschel, D. Holz, S. Fuchs, E. Malis, A. Nüchter, and J. Hertzberg. Three-
dimensional mapping with time-of-flight cameras. Journal of Field Robotics (JFR), 26(11-
12), 2009.
20. Microsoft. http://www.xbox.com/en-US/kinect, 2010.
21. P. Newman, G. Sibley, M. Smith, M. Cummins, A. Harrison, C. Mei, I. Posner, R. Shade,
D. Schroter, L. Murphy, W. Churchill, D. Cole, and I. Reid. Navigating, recognising and
describing urban spaces with vision and laser. International Journal of Robotics Research
(IJRR), 28(11-12), 2009.
22. D. Nister. An efficient solution to the five-point relative pose problem. IEEE Transactions on
Pattern Analysis and Machine Intelligence (PAMI), 26(6):756–770, 2004.
23. H. Pfister, M. Zwicker, J. van Baar, and M. Gross. Surfels: Surface elements as rendering
primitives. In ACM Transactions on Graphics (Proc. of SIGGRAPH), 2000.
24. M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels,
D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewe-
nius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3D reconstruction from
video. International Journal of Computer Vision, 72(2):143–167, 2008.
25. PrimeSense. http://www.primesense.com/.
26. F. Ramos, D. Fox, and H. Durrant-Whyte. CRF-matching: Conditional random fields for
feature-based scan matching. In Proc. of Robotics: Science and Systems (RSS), 2007.
27. S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In Third International
Conference on 3D Digital Imaging and Modeling, 2001.
28. A. Segal, D. Haehnel, and S. Thrun. Generalized-ICP. In Proc. of Robotics: Science and
Systems (RSS), 2009.
29. G. C. Sharp, S. W. Lee, and D. K. Wehe. ICP registration using invariant features. IEEE
Transactions on Pattern Analysis and Machine Intelligence (PAMI), 24(1):90–102, 2002.
30. N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In
ACM Transactions on Graphics (Proc. of SIGGRAPH), 2006.
31. S. Thrun, W. Burgard, and D. Fox. A real-time algorithm for mobile robot mapping with
applications to multi-robot and 3D mapping. In Proc. of the IEEE International Conference
on Robotics & Automation (ICRA), 2000.
32. R. Triebel and W. Burgard. Improving simultaneous mapping and localization in 3D using
global constraints. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2005.
33. C. Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/ccwu/siftgpu, 2007.