0% found this document useful (0 votes)
73 views24 pages

Monocular Vision For Mobile Robot Localization and Autonomous Navigation

1. This paper presents a localization system for autonomous mobile robot navigation using monocular vision. 2. The system uses a three step process: first, the robot is manually guided to build a 3D map from video data using structure from motion; second, this map is used for real-time localization; third, the robot navigates autonomously by following the learned path or a similar path. 3. Experimental results over two years in various environments show the accuracy and robustness of the vision algorithms for map building and localization.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
73 views24 pages

Monocular Vision For Mobile Robot Localization and Autonomous Navigation

1. This paper presents a localization system for autonomous mobile robot navigation using monocular vision. 2. The system uses a three step process: first, the robot is manually guided to build a 3D map from video data using structure from motion; second, this map is used for real-time localization; third, the robot navigates autonomously by following the learned path or a similar path. 3. Experimental results over two years in various environments show the accuracy and robustness of the vision algorithms for map building and localization.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

International Journal of Computer Vision 74(3), 237–260, 2007


c 2007 Springer Science + Business Media, LLC. Manufactured in the United States.
DOI: 10.1007/s11263-006-0023-y

Monocular Vision for Mobile Robot Localization and Autonomous Navigation

ERIC ROYER,∗ MAXIME LHUILLIER, MICHEL DHOME AND JEAN-MARC LAVEST


LASMEA, UMR6602 CNRS and Université Blaise Pascal, 24 Avenue des Landais, 63177 Aubière, France
[email protected]

Received October 14, 2005; Accepted December 11, 2006


First online version published in January, 2007

Abstract. This paper presents a new real-time localization system for a mobile robot. We show that autonomous
navigation is possible in outdoor situation with the use of a single camera and natural landmarks. To do that, we use
a three step approach. In a learning step, the robot is manually guided on a path and a video sequence is recorded
with a front looking camera. Then a structure from motion algorithm is used to build a 3D map from this learning
sequence. Finally in the navigation step, the robot uses this map to compute its localization in real-time and it follows
the learning path or a slightly different path if desired. The vision algorithms used for map building and localization are
first detailed. Then a large part of the paper is dedicated to the experimental evaluation of the accuracy and robustness
of our algorithms based on experimental data collected during two years in various environments.

Keywords: localization, navigation, vision, mobile robot, structure from motion

1. Introduction In order to navigate autonomously, a robot needs to


know where it must go. It also needs some knowledge
Localization is a key component in mobile robotics. The about the environment so that it is able to compute its cur-
most popular localization sensor for outdoor robotic ve- rent localization. In our application, all these information
hicles is the GPS receiver. A Real Time Kinematic (RTK) are given to the robot in a simple way: the user drives the
GPS allows a localization accurate to 1 cm. Such an ac- robot manually on a trajectory. After that the robot is able
curacy is possible if enough satellites are visible from the to follow the same trajectory in autonomous navigation.
receiver. Unfortunately, in dense urban areas, buildings To do that we use only monocular vision and there is no
can mask some satellites and in this case the accuracy artificial landmark. The robot is equipped with a wide
of the localization drops considerably. For this reason, it angle camera in a front looking position. An overview of
is necessary to develop other localization sensors. The the vision system is presented on Fig. 1. In the learning
use of vision is very attractive to solve this problem be- step, the robot is manually driven and the camera records
cause in places where the GPS is difficult to use such a learning video sequence. This sequence is processed off
as city centers or even indoors, there are usually a lot of line to build a map of the environment with a structure
visual features. So a localization system based on vision from motion algorithm. Then the robot is able to localize
could make a good complementary sensor to the GPS. itself in real-time in the neighborhood of the reference
Of course, for satisfactory reliability of the localization trajectory. It allows the robot to follow the same path as
in a real world application, several sensors should be used in the learning step or to follow a slightly different path
(see for example, Georgiev and Allen, 2004). But each which is useful to avoid an obstacle for example.
sensor itself should offer the best possible performance.
Our purpose in this paper is to show what can be done 1.1. Related Work
with monocular vision only. We did not use odometry
because it may not be available (for a hand held camera Several approaches for robot localization using vision
for example). We focused on monocular vision as op- have been proposed. Simultaneous Localization And
posed to stereo vision because it simplifies the hardware Mapping (SLAM) is attractive because localization is
at the expense of a more complex software of course. We possible as soon as the system is started. But map build-
think, this is a good way to reduce the cost and size of ing is something complex, so real time SLAM using only
the localization system. monocular vision is a difficult task. Some approaches
238 Royer et al.

technique. After that, a displacement vector is computed


so that the robot can reach the next key frame.
The last approach consists in building a map of the en-
vironment first in an off line learning step. After that, the
robot is able to localize itself with this map. The main ad-
vantage of these approaches is that they allow the robot
to localize itself even if it is slightly off the path that
was learnt. Since map building is done offline, compu-
tationally intensive algorithms can be used, resulting in
an accurate and rich map. This is not always the case in
SLAM approaches because map building and localiza-
tion must be done simultaneously in real-time. The map
can be built by using different techniques and different
sensors. Vacchetti et al. (2003) use a CAD model given
by the user and develop a very efficient localization sys-
tem. Cobzas et al. (2003) use a rotating camera along with
a laser range finder to build a set of panoramic images
enhanced with 3D information. With this map, a single
Figure 1. An overview of our vision system. 2D image is enough to localize the camera. Kidono et al.
(2002) build a 3D reconstruction from a stereo camera
and an odometer under the assumption that the ground is
planar. Then they use the map generated in the 3D recon-
using more sensors such as stereo vision and odometry struction process to localize the robot in real-time. Our
have been presented by Se et al. (2002). Real-time SLAM system works on the same principle, but we don’t make
using only monocular vision has been achieved by Davi- the assumption that the ground is planar and we use only
son (2003), but only for a small environment with less one calibrated camera.
than 100 landmarks. This would work well in a room for
example, but it is not suitable if the robot needs to travel 1.2. Paper Structure and Contribution
for hundreds of meters. It is also possible to compute ego
motion by using only visual data as done by Nistér et This article is a synthesis of previously published confer-
al. (2004) or, in the specific case of urban environment ence papers (Royer et al., 2005a,b,c) with new theoreti-
(Simond and Rives, 2004). In this case, maintaining a cal developments and experimental data. The main added
large map is not required but the localization accuracy contribution of this paper is the real-time computation of
decreases as the distance traveled increases. Moreover, the localization uncertainty and its experimental valida-
in two successive navigation experiments, the robot may tion. Navigation on a closed loop sequence or on a path
not use the same landmarks and the resulting trajectory different from the learning path are also discussed.
may be different. Section 2 describes the off-line map building process
Another approach for achieving robot navigation ac- and details the structure from motion algorithm. The
cording to a human guided experience consists in repre- method used for matching images, which is nearly the
senting the trajectory as a set of key images. Then the same for the reconstruction and the localization, is also
robot has to go from on key image to the next. With this detailed here. Section 3 describes how the localization of
approach, the robot will go through a number of well de- the robot is computed from a video frame. The computa-
fined positions and the trajectory is repeatable from one tion of localization uncertainty is presented in Section 4.
navigation experiment to the next. The first possibility to The experimental vehicle and the control law are pre-
do that is to go from one key frame to the next by visual sented in Section 5. Then, in Section 6 we evaluate the
servoing as done by Matsumoto et al. (1996), Remazeilles performance of the algorithms with experimental data
et al. (2004) and Blanc et al. (2005). In those algorithms, obtained with a robotic vehicle equipped with the local-
the current pose of the robot is not computed, so it is not ization system presented in this paper.
possible to have the robot follow a path different from
the one which was learnt. A variant of these approaches 2. Map Building
which uses an omnidirectional camera is described by
Argyros et al. (). Another possibility was presented by 2.1. Overview
Goedemé et al. (2005). In this algorithm, a relative lo-
calization of the robot with reference to the key frame is The goal of the map building process is to obtain the
computed from features matched using a wide baseline pose of a subset of the cameras in the reference sequence
Monocular Vision for Mobile Robot Localization and Autonomous Navigation 239

as well as a set of landmarks and their 3D location in Correlation over a 11 × 11 pixels window. All the pairs
a global coordinate system. This must be done from a with a correlation score above 0.8 are sorted. The best
monocular image sequence. The structure from motion one is kept as a good match and the unicity constraint
problem has been studied for several years and multiple is used to reject matches which have become impossi-
algorithms have been proposed depending on the assump- ble. Then the second best score is kept and so on until
tions we can make (Hartley and Zisserman, 2000). For all the potential matches have been kept or discarded.
our experiments, the camera was calibrated using a planar This image matching scheme is used for the localization
calibration pattern (Lavest et al., 1998). Radial distortion part and as the first step in the reconstruction part. In the
is modelled with a fifth order polynomial. Camera cal- reconstruction process, a second matching is done after
ibration is important especially when using wide angle the epipolar constraint has been computed. In the second
lenses with a strong radial distortion. We have used two matching step, the search region is a narrow strip around
lenses: a wide angle lens with 60◦ field of view and a the epipolar line and the correlation score threshold is
fish-eye lenses with 130◦ field of view. Both lenses per- lowered to 0.82 .
form equally well for reconstruction, but a fish-eye is
preferable for localization because it reduces the chances
of occultations. With a calibrated camera, the structure 2.3. Key Frame Selection
from motion algorithm is more robust and the accuracy
of the reconstruction is increased. This is important for Before computing camera motion, it is necessary to ex-
us because in our video sequences with a front looking tract key frames from the reference video sequence. If
camera, the triangulation of the points must be done with there is not enough camera motion between two frames,
very small angles. In spite of this difficulty, we chose the computation of the epipolar geometry is an ill con-
a forward looking camera because there is always some ditioned problem. So we select images so that there is
free space in front of the vehicle, which means the field of as much camera motion as possible between key frames
views contain both near and far objects. If the camera was while still being able to find enough point correspon-
looking to the side, and the vehicle was moving along a dences between the key frames. Some algorithms have
wall very close to the camera, features would move very been proposed to do that. For example, Nistér (2001)
fast in the image which would make image matching very considers the whole video sequence then drops redun-
difficult (motion blur could also be a problem). dant frames. Another method is possible based on the Ge-
In the first step of the reconstruction, we extract a set of ometric Robust Information Criterion (GRIC) proposed
key frames from the reference sequence. Then we com- by Torr et al. (1999). Our method is much simpler and less
pute the camera pose for every key frame by using interest general than these ones but works well for our purpose.
points matched between key frames. Additionally, the in- The first image of the sequence is always selected as
terest points are reconstructed in 3D. These points will the first key frame I1 . The second key frame I2 is chosen
be the landmarks used for the localization process. so that there are as many video frames between I1 and
I2 while there are at least M common interest points
between I1 and I2 . When key frames I1 . . . In are chosen,
2.2. Image Matching we select In+1 so that:

Every step in the reconstruction as well as the localiza- • there are as many video frames as possible between In
tion relies on image matching. In order to match two and In+1 ,
images, interest points are detected in each image with • there are at least M interest points in common between
Harris corner detector (Harris and Stephens, 1988). Cor- In+1 and In ,
ner response R is computed as in Harris work and local • there are at least N common points between In+1 and
maxima of R are potential interest points. In order to have In−1 .
interest points in each part of the image, we have divided
the image area in 64 buckets. In each frame (512 × 384 In our experiments we choose M = 400 and N = 300.
pixels), we keep the 20 best local maxima in each bucket Figure 2 shows consecutive key frames extracted from the
plus the 500 best maxima on the whole image. So we city sequence.
get at most 1780 interest points in each frame. For an
interest point M 1 at coordinates (x, y) in image 1, we
define a search region in image 2. The search region is a 2.4. Initial Camera Motion Computation
rectangle whose center has coordinates (x, y). For each
interest point Mi2 inside the search region in image 2, We compute an initial solution for camera motion and a
we compute a correlation score between the neighbor- hierarchical bundle adjustment is used to refine this initial
hoods of M 1 and Mi2 . We use a Zero Normalized Cross estimation.
240 Royer et al.

Figure 2. Consecutive key frames extracted from the city sequence.

For the first image triplet, the computation of the cam- 2.5. Hierarchical Bundle Adjustment
era motion is done with the method described in Nistér
(2003) for three views. It involves computing the essential The computation of camera C N depends on the results
matrix between the first and last images of the triplet using of the previous cameras and errors can build up over
a sample of 5 point correspondences. This gives at most the sequence. In order to greatly reduce this problem,
40 solutions for camera motion. The solutions for which we use a bundle adjustment. The bundle adjustment is a
at least one of the 5 points is not reconstructed in front of Levenberg-Marquardt minimization of the cost function
both cameras are discarded. Then the pose of the remain- f (C E1 , . . . , C EN , X 1 , . . . , X M ) where C Ei are the external
ing camera is computed with 3 out of the 5 points in the parameters of camera i, and X j are the world coordinates
sample. This process is done with a RANSAC (Fischler of point j. For this minimization, the radial distortion of
and Bolles, 1981) approach: each 5 point sample pro- the 2D point coordinates is corrected beforehand. The
duces a number of hypothesis for the 3 cameras. The best cost function is the sum of the reprojection errors of all
one is chosen by computing the reprojection error over the inlier reprojections in all the images:
the 3 views for all the matched interest points and keeping  
the one with the higher number of inlier matches. With a f C E1 , . . . , C EN , X 1 , . . . , X M
calibrated camera, three 3D points whose projections in  
N M
 j 
the image are known are enough to compute the pose = d 2 xi , Pi X j
of the second camera. Several methods are compared i=1 j=1, j∈Ji
in Haralick et al. (1994). We chose Grunert’s method
j
with RANSAC. where d 2 (xi , Pi X j ) is the squared euclidian distance
For the next image triplets, we use a different method between Pi X j the projection of point X j by camera i,
j
for computing camera motion. Assume we know the lo- and xi is the corresponding detected point. Pi is the 3 × 4
cation of cameras C1 through C N , we can compute cam- projection matrix built from the parameters values in C Ei
era C N +1 by using the location of cameras C N −1 and and the known internal parameters of the camera. And Ji
C N and point correspondences over the image triplet is the set of points whose reprojection error in image i is
(N − 1, N , N + 1). We match a set of points P i whose less than 2 pixels at the beginning of the minimization.
projections are known in each image of the triplet. From After a few iteration steps, Ji is computed again and more
the projections in images N − 1 and N , we can compute minimization iterations are done. This inlier selection
the 3D coordinates of point P i . Then from the set of P i process is repeated as long as the number of inliers
and their projections in image N + 1, we use Grunert’s increases.
calibrated pose estimation algorithm to compute the lo- Computing all the camera locations and using the
cation of camera C N +1 . In addition the 3D locations of bundle adjustment only once on the whole sequence
the reconstructed interest points are stored because they could cause problems. this is because increasing errors
will be the landmarks used for the localization process. could produce an initial solution too far from the op-
The advantage of this iterative pose estimation process timal one for the bundle adjustment to converge. Thus
is that it can deal with virtually planar scenes. After the it is necessary to use the bundle adjustment throughout
pose computation, a second matching step is done with the reconstruction of the sequence. So we use a hierar-
the epipolar constraint based on the pose that had just chical bundle adjustment as in Hartley and Zisserman
been computed. This second matching step allows to in- (2000). A large sequence is divided into two parts with
crease the number of correctly reconstructed 3D points an overlap of two frames in order to be able to merge
by about 20%. This is important for us for two reasons. the sequence. Each subsequence is recursively divided
Those 3D points are used in the computation of the next in the same way until each final subsequence contains
camera, and they are also the landmarks used in the lo- only three images. Each image triplet is processed as
calization process. So we need to get as many as possible described in Section 2.4. After each triplet has been
which wouldn’t be the case if the goal was just to recover computed we run a bundle adjustment over its three
the motion of the camera. frames.
Monocular Vision for Mobile Robot Localization and Autonomous Navigation 241

tion. So we need to compare the current image to every


key frame to find the best match. This is done by match-
ing interest points between the two images and computing
a camera pose with RANSAC. The pose obtained with
the higher number of inliers is a good estimation of the
camera pose for the first image. This step requires a few
seconds but is needed only at the start. After this step, we
always have an approximate pose for the camera, so we
only need to update the pose and this can be done much
faster.
The localization process can be divided in four steps.
First, a set of the landmarks which should be visible
is selected by finding the closest key frame. Then an
approximate 2D position of each landmark in the current
frame is computed based on the pose of the camera of
the previous frame. With this information the landmarks
are matched to the interest points detected in the current
frame. And finally the pose is computed with these
matches. Figure 4 shows interest points matched between
Figure 3. Top view of the 3D reconstruction computed from the city the current video frame and the closest key frame in
sequence. The squares are the camera position for each key frame. The the reference sequence recorded with different weather
dots are the reconstructed interest points projected on an horizontal conditions.
plane. The current image is denoted I . First we assume that
the camera movement between two successive frames is
small. So an approximate camera pose (we note the as-
In order to merge two sequences S 1 and S 2 , we use the sociated camera matrix P0 ) for image I is the same as
last 2 cameras S N1 −1 and S N1 of S 1 and the first 2 cameras the pose computed for the preceding image. Based on P0
S12 and S22 of S 2 . As the images are the same, the cameras we select the closest key frame Ik in the sense of shortest
associated after merging must be the same. So we apply euclidian distance between the camera centers. Ik gives
a rotation and a translation to S 2 so that S N1 and S22 have us a set of interest points Ak reconstructed in 3D. We
the same position and orientation. Then the scale fac- detect interest points in I and we match them with Ak .
tor is computed so that d(S N1 −1 , S N1 ) = d(S12 , S22 ), where To do that, for each point in Ak , we compute a correla-
j
d(Sni , Sm ) is the euclidian distance between the optical tion score with all the interest points detected in I which
j are in the search region. For each interest point in Ak we
centers of the cameras associated with Sni and Sm . This
doesn’t ensure that S N1 −1 and S12 are the same, so a bundle know a 3D position, so we can project it with P0 so we
adjustment is used on the result of the merging operation. know approximately its 2D position in the current frame
Merging is done until the whole sequence has been re- I . This is illustrated on Fig. 5. In the matching process
constructed. The reconstruction ends with a global bun- the search region is centered around the expected position
dle adjustment. The number of points used in the bun- and its size is small (20 × 12 pixels). After this matching
dle adjustment is on the order of several thousands. For is done, we have a set of 2D points in image I matched
example, Fig. 3 shows the result of the 3D reconstruc- with 2D points in image Ik which are themselves linked
tion computed for the city sequence. There are 308 key to a 3D point obtained during the reconstruction process.
frames and 30584 points. The path is about 600 meters With these 3D/2D matches a better pose is computed us-
long. ing Grunert’s method through RANSAC to reject outliers.
This gives us the camera matrix P1 for I . Then the pose
is refined using the iterative method proposed by Araújo
3. Real Time Localization et al. (1998) with some modifications in order to deal
with outliers. This is a minimization of the reprojection
3.1. Camera Localization error for all the points using Newton’s method. At each
iteration we solve the linear system J δ = e in order to
The output of the learning process is a 3D reconstruction compute a vector of corrections δ to be subtracted from
of the scene: we have the pose of the camera for each the pose parameters. e is the error vector formed with the
key frame and a set of 3D points associated with their 2D reprojection error of each point in x and y. J is the Jaco-
positions in the key frames. At the start of the localization bian matrix of the error. In our implementation, the points
process, we have no assumption on the vehicle localiza- used in the minimization process are computed at each
242 Royer et al.

Figure 4. Matching interest points between the reference key frame learned with snow (left) and the current video frame without snow (right).

3D points

Learning path

Reference frame
Current frame
Target path

Figure 5. Computing an expected position for the landmarks in the current frame.

iteration. We keep only the points whose reprojection er- center of the first camera in the reference sequence, with
ror is less than 2 pixels. As the pose converges towards one axis along the optical axis. Moreover, there is no
the optimal pose, some inliers can become outliers and scale information. In order to control the robot, we need
conversely. Usually, less than five iterations are enough. to provide a position for the robot in a metric coordi-
It would be relatively easy to introduce a motion model nate system. We achieve that by entering manually the
of the vehicle or odometry measurements in this frame- length of the path to set the scale factor. The position
work. For each new frame, we consider that the pose of the camera on the robot has been measured so we
of the camera is the same as the previous pose in or- can enter directly the rigid transformation between the
der to match features. Incorporating odometry would camera and the robot. Once the camera pose has been
allow to have a better initial estimate of the camera computed, we compute the robot position and orienta-
pose. We have not done that in order to keep the pos- tion with 3 degrees of freedom only because the robot
sibility to use the localization algorithm with hand held can be controlled only along three dimensions (transla-
cameras. tion on the ground plane and rotation around the verti-
At this point we have a real-time algorithm that is able cal axis). To do that, we assume that the ground is lo-
to compute the pose of the camera. We always keep track cally planar and horizontal at the current position of the
of the camera pose with 6 degrees of freedom, even if the robot.
control law works in a ground plane. It allows us to track
interest points even if the ground is irregular.
4. Computing the Localization Uncertainty

3.2. Robot Localization Knowing the uncertainty of the robot localization com-
puted for the current frame has some important applica-
Since the localization algorithm is based only on vi- tions. It allows data fusion with other localization sensors
sion the coordinate system has its center at the optical (such as an odometer). It could also be used for temporal
Monocular Vision for Mobile Robot Localization and Autonomous Navigation 243

filtering of the localization result if vision is the only minimum, with:


sensor.
Uncertainty in the localization can come from several G(C, X 1 , . . . , X n )2
sources. The main one comes from the uncertainty in the ⎛ ⎞T ⎛ ⎞
position of interest points. Interest points can be detected n ... ...
= e Tj −1 j ej +
−1 ⎝
⎝ X j − X 0j ⎠ Cov3d X j − X 0j ⎠
with an accuracy on the order of one pixel. In the structure
from motion algorithm, this causes uncertainty in the 3D
j=1 ... ...
positions of the cameras and the points. When computing (1)
the localization of the camera, the result is affected by the
uncertainty of the landmarks positions in 3D space and In this expression, we assume that the n + 1 random vec-
by the uncertainty on the positions of the interest points tors e1 , . . . , en , X are independent. Obviously, we have:
detected in the current frame. This is the only source of
uncertainty we considered in our computation. So the G(C, X 1 , . . . , X n )
first step is to compute the uncertainty in the 3D recon-  −1 −1 − 1 T
struction which is detailed in Section 4.2, and we use = e1T 1 2 , . . . , enT n 2 , (X − X 0 )T Cov3d2 (2)
that to compute the uncertainty for the localization result
in Section 4.1. In both cases, we have a minimization −1 −1
Computing  j 2 and Cov3d2 is possible because  j
problem and the covariance matrix is obtained from the and Cov3d are real symetric definite positive matrices.
inverse or the pseudo-inverse of the hessian (Hartley and The right member of Eq. (2) is a random variable with
Zisserman, 2000). Propagating the uncertainties through- the normal distribution N (0, I5n ). Once the minimum is
out the reconstruction and the localization process is pos- found, The result on the backward tranport of covari-
sible but very time consuming. Since we want to be able ance given in Hartley and Zisserman (2000) gives us the
to compute the localization uncertainty in real-time, we covariance matrix (JGT JG )−1 with JG the Jacobian ma-
had to make some trade-offs between speed and accu- trix of G. However, even by exploiting the sparseness of
racy. Therefore, some sources of uncertainty were not the matrices, doing a full minimization with these equa-
taken into account in our computation. These include the tions would be too slow for the real-time localization.
inaccuracy of the internal parameters of the camera for So, as we have seen in Section 3, the pose is computed
the vision part, and for the robot localization part, un- by simply minimizing G̃(C)2 = n T −1
j=1 e j  j e j . Sev-
certainty in the measurement of the transformation be- eral iterations can be done in a few milliseconds when
tween the camera coordinate system and the robot coor- using this approximation. The more complex formula-
dinate system (we assumed a rigid transformation here, tion of Eq. (1) is used only for the covariance matrix
but in reality it is not rigid because of the suspension computation because it needs to be done only once per
system). frame.
Computing JG gives a sparse matrix :

4.1. Uncertainty in the Localization ⎛ ⎞


A1 B1 0 ... 0
⎜ .. ⎟
Let’s assume that the uncertainties of the 3D points have ⎜ A2 0 B2 . 0⎟
⎜ ⎟
been computed (the method used to compute them will be ⎜ .. .. .. .. ⎟
⎜ . . . . 0⎟
detailed in Section 4.2). We want to compute the covari- ⎜ ⎟
JG = ⎜ An 0 0 0 Bn ⎟ (3)
ance matrix Covcam associated to the pose of the camera ⎜ ⎟
⎜0 ... ... ... . . .⎟
computed from n 3D/2D correspondences. The projec- ⎜ ⎟
⎜0 ... ... Ci, j . . .⎟
tion of point X j is detected in the current frame at the ⎝ ⎠
..
2D position m j . The reprojection error for point X j is . ... ... ... ...
e j = π (C X j ) − m j where π(C X j ) denotes the projec-
− 1 ∂e ∂e
tion of point X j with the camera parameters C computed with A j =  j 2 ∂Cj a 2 × 6 matrix, B j = ∂ Xjj a 2 × 3
for the current frame. We assume that e j follows the nor- − 12
matrix, Ci, j a 3 × 3 matrix. Cov3d is a 3n × 3n matrix
mal distribution N (0,  j ). The vector made with the 3D defined by blocks Ci, j .
points visible in the current frame is X = (. . . , X j , . . . )T . The covariance matrix Covcam associated to the current
This vector follows a normal distribution with mean camera pose is the 6 × 6 upper left block of (JGT JG )−1 .
X 0 = (. . . , X 0j , . . . )T given by the bundle adjustment Computing JGT JG gives:
and covariance matrix Cov3d (which will be computed
in Section 4.2). Computing the maximum likelihood es-
timation of the camera pose means finding the best val- U W
JGT JG = (4)
ues for (C, X 1 , . . . , X n ) so that G(C, X 1 , . . . , X n )2 is WT V
244 Royer et al.

with U a 6 × 6 block, W a 6 × 3n block and V a 3n × 3n the full covariance matrix relating all the points and all
matrix: the cameras.
We use the general method of gauge constraints to
U = A T A = A1T A1 + A2T A2 + · · · + AnT An (5) fix the reconstruction coordinate system and choose our
 T  covariance matrix (Triggs et al., 2000). Among other
W = A1 B1 A2 B2 . . . An Bn
T T
(6)
⎛ T ⎞ choices, the symmetric constraint on the reference cam-
B1 B1 0 0 0 era location (e.g. Lhuillier et al., 2006) is chosen since it
⎜ 0 T
B2 B2 0 0 ⎟ usually spreads evenly and minimizes the uncertainties
⎜ ⎟
V =⎜⎜ ⎟ + Cov −1 (7) on all reference cameras. Many details about a practical
.. ⎟ 3d
⎝ 0 0 . 0 ⎠ computation of points and cameras covariances for com-
0 0 0 BnT Bn plex reconstructions like ours are given in Lhuillier et al.
(2006).
Finally, the covariance matrix associated with the cam- A simplification of this method can also be used if we
era pose is: make the additional assumption that there is no uncer-
tainty for the cameras in the 3D reconstruction. In this
Covcam = (U − W V −1 W T )−1 (8) case, the covariance matrix for each point can be com-
puted independently. We consider a point X j seen by n
cameras Ci , i ∈ {1, . . . , n}. We want to compute its co-
In the general case, V is not a sparse matrix and the
variance matrix C j, j . We assume that the n reprojection
computation of W V −1 W T can take a lot of time. In or-
errors of this point denoted ei in image i follow indepen-
der to be able to compute the localization uncertainty in
dent and normal distributions N (0, i ). Furthermore, we
real-time, an additional hypothesis should be done. If we
assume that i = σ 2 I2 , with σ 2 an unknown parameter.
assume that the positions of the 3D points are independent
Thus, the probability density function of the statistical
random variables, then Cov3d becomes a block-diagonal
model is:
matrix and matrix V can be rewritten :
⎛ ⎞ 1

i ei 
2
B1T B1 + C1,1
T
C1,1 0 0 f (X j , σ 2 ) = e− 2σ 2 (10)
⎜ .. ⎟ (2π σ 2 )n
V =⎝ 0 . 0 ⎠
0 0 BnT Bn + Cn,n
T
Cn,n The MLE of X j and σ 2 are Xˆ j minimizing X j →
(9) e 2
i ei  and σ̂ = i 2n i . We replace
2 2

the MLE of
i ei 
2
σ by the unbiased estimator σ̂ = 2n−3 , with 2n the
2 2
With this assumption Covcam can be computed very number of observations, and 3 the number of degrees
quickly. Additionally, the computation of the covariance of freedom of X j . This gives us i = σ̂ 2 I2 . Finally,
matrix of the 3D points can also be simplified because we obtain the estimation of C j, j by the inverse of JFT JF
we need to compute only the blocks on the diagonal. Sec- where JF is the jacobian matrix of the vector error F with
tion 4.2 details the computation of the covariance matrix −1 −1
F = (e1T 1 2 , . . . , enT n 2 )T .
of the 3D points, then Section 4.3 compares the various
Figure 6 shows the confidence ellipsoids computed
methods that can be used.
with this method for the points visible in one key frame of
the loop sequence (see Fig. 20). Because of the forward
4.2. Uncertainty in the 3D Reconstruction motion of the camera, most ellipsoids have a major axis
much longer than the minor axis.
Note that the reconstruction of the reference sequence is
obtained by minimizing the sum of squared reprojection
errors using bundle adjustment. We obtain a maximum 4.3. Comparison
likelihood estimation (MLE) of 3D points and extrin-
sic camera parameters assuming that the reprojection er- In order to choose between all the possible trade-offs we
rors obeys independent and identical zero mean Gaussian made a comparison of four methods based on different
distributions. This perturbation of errors propagates to a assumptions:
Gaussian perturbation of the estimated parameters, such
that the covariance matrix may be estimated (Hartley and • In method 1, Covcam is computed with the full matrix
Zisserman, 2000). Thus, for each key-frame of the ref- Cov3d .
erence sequence, we estimate the covariance matrix of • In method 2, Cov3d is computed with the general
the 3D points seen by the key-frame (named Cov3d in method, but only the diagonal blocks are used in the
the previous paragraph) as a 3n × 3n diagonal block of computation of Covcam .
Monocular Vision for Mobile Robot Localization and Autonomous Navigation 245

Figure 6. 99% confidence ellipsoids computed for the points visible in one frame (view from the top).

• In method 3, Cov3d is computed with the assumption 5. Experimental Vehicle and Control Law
that the camera have no uncertainty in the reconstruc-
tion process. This part is presented only for the completeness of the
• In method 4, we simply ignore the uncertainty of the paper. It refers to a work made by another team of the
3D positions of the points and the only source of un- laboratory (see Thuilot et al. (2004) for more details).
certainty comes from the positions of the 2D points in
the current frame.
5.1. Vehicle Modeling
Method 1 takes too much time to be used in a real-
The experimental vehicle is called the Cycab. With its
time system, but we included it in this comparison to
small dimensions (length: 1.90 m, width: 1.20 m), it can
know if using the non diagonal blocks of Cov3d was
transport simultaneously two passengers. This vehicle is
mandatory. We computed the length of the major semi-
entirely under computer control. For the experimenta-
axis of the 90% confidence ellipsoid associated to the
tions detailed in this paper, only the front wheels are
camera positions for all four methods. The result is
steered, so the mobility is the same as a common car.
shown on Fig. 7 for a few frames of the video used in
A classical kinematic model, the tricycle model, where
Section 6.6.
the two actual front wheels are merged as a unique virtual
As we could expect, taking into account the uncer-
wheel is used (see Fig. 8). The vehicle configuration can
tainty of the 3D points increases the size of the ellip-
be described without ambiguity by a 3 dimensional state
soids. Taking into account the uncertainty of the cam-
vector composed of s, curvilinear coordinate along the
eras in the reconstruction has the same effect. However,
reference path of the projection of vehicle rear axle center,
the difference between method 1 and method 2 is small,
and of y and θ̃ , vehicle lateral and angular deviations
and since method 2 can be used in real-time, it seems
with respect to this path. On the other hand, the control
that it is a good method for computing localization un-
vector is defined by the vehicle linear velocity and the
certainty. Computation time is the same for methods 2
front wheel steering angle, denoted respectively v and δ.
and 3, the difference lies in the complexity of the imple-
Vehicle model is then given (see e.g. de Wit (1996)) by:
mentation for computing the diagonal elements of Cov3d .
With the simplification made in method 3, the size of the ⎧
⎪ cos θ̃
confidence ellipsoid is slightly underestimated, but the ⎪
⎪ ṡ = v

⎪ 1 − yc(s)
result is still much closer to the one obtained with the ⎪

full computation than with method 4. A comparison be-
ẏ = v sin θ̃ (11)
tween the size of the confidence ellipsoid and the local- ⎪


⎪  
ization error measured with the RTK GPS is also done in ⎪
⎪ tan δ c(s) cos θ̃
⎩ θ̃˙ = v −
Section 6.6.3. l 1 − y c(s)
246 Royer et al.

0.13
method 1
method 2
0.12 method 3
method 4

0.11

0.1

Major semiaxis (meters)


0.09

0.08

0.07

0.06

0.05

0.04

0.03
0 5 10 15 20 25 30 35
Frame number

Figure 7. Major semiaxis (in meters) of the 90% confidence ellipsoid with four computation methods for a few frames localized with reference to
a given key-frame. Frame to frame variations are mostly due to changes of the 3D-2D correspondences.

Figure 8. Tricycle model description.

where l is the vehicle wheelbase and c(s) denotes the ref- absolute position and heading, provided by the vision
erence path curvature at coordinate s. It is assumed that: algorithm, with the reference path. Via invertible state
y = c(s)
1
(i.e. the vehicle is not on the reference path cur- and control transformations, the nonlinear vehicle model
vature center) and θ̃ = π2 [π]. In practical situations, if the (11) can be converted, in an exact way, into the so-
vehicle is well initialized, such difficulties never arise. called chained form (see Samson, 1995). (a1 , a2 , a3 ) =
(s, y, (1 − yc(s)) tan θ̃) is the chained state vector and
M = (m 1 , m 2 )T = ϒ(v, δ)T is the chained control vec-
5.2. Control Law tor. From this, a large part of linear systems theory can be
used (but, since the transformations are exact, it is not re-
The control objective is to bring and maintain state vari- quired that the vehicle state is in a specific configuration,
ables y and θ̃ to 0, relying uniquely on control variable contrarily to tangent linearization techniques). More pre-
δ (v is considered as a possibly varying free parame- cisely, it can be noticed that path following (i.e. control
ter). The whole vehicle state vector (s, y, θ̃) is avail- of a2 and a3 ) can be achieved by designing only m 2 as a
able with a satisfactory accuracy by comparing vehicle linear proportional derivative controller. The expression
Monocular Vision for Mobile Robot Localization and Autonomous Navigation 247

of the actual control variable δ can then be obtained by reported in this article are measured in an horizontal plane
inverting the chained control transformation. Computa- only. In any case, vertical errors could be interesting for
tions, detailed in Thuilot et al. (2004), lead to: a flying robot but not for our application.
 
cos3 θ̃ d c(s) 6.1.2. Reconstruction and Localization Error. We
δ(y, θ̃) = arctan l y tan θ̃ want to distinguish between the error that is attributed to
(1 − c(s) y)2 ds
the reconstruction process and the error coming from the
− K d (1 − c(s) y) tan θ̃ − K p y
  localization algorithm. So we define two errors to mea-
c(s) cos θ̃ sure the reconstruction and the localization accuracy. The
+ c(s) (1 − c(s) y) tan θ̃ +
2
1 − c(s) y reconstruction error is the average distance between the
(12) camera positions computed in the structure from motion
algorithm and the true positions given by the RTK GPS
with K p , K d > 0 the proportional derivative gains. (after the two trajectories have been expressed in the same
coordinate system). This error is mostly caused by a slow
drift of the reconstruction process. It increases with the
6. Experimental Results length and complexity of the trajectory. That means the
3D model we build is not perfectly matched to the real
6.1. Methods for Performance Evaluation 3D world.
The definition of the localization error is a bit more
6.1.1. Using GPS Data as the Ground Truth. Most of complex. To understand why, suppose the robot is exactly
the results presented in this paper, show the accuracy of on the reference path and the localization algorithm indi-
the system by comparing the results of the algorithms cates that the robot is on the reference path. In this case,
with the data recorded by a Real Time Kinematic (RTK) the localization error should be zero, and controlling the
GPS mounted on the Cycab. But comparing data obtained robot so that this error is kept zero would result in the
by two different sensors is not completely straightfor- robot following the reference path. But if we computed
ward. This paragraph explains how results from the vi- directly the difference between the 3D position given by
sion algorithms are compared to GPS measurements. the localization algorithm and the position given by the
Two operations are needed so that both data sets can GPS, we would get the reconstruction error which is not
be compared. First the GPS sensor is not mounted on zero in most cases. In fact, in our application, a global
the vehicle at the same place as the camera. The GPS localization is not necessary, only a relative position with
receiver is located above the mid-point between the rear respect to the reference trajectory is needed.
wheels of the car, while the camera is in front of the ve- We define the localization error in order to measure the
hicle (1.15 meters in front of the GPS sensor). So the two error we make in computing this relative localization with
sensors don’t have the same trajectory. From the GPS the vision algorithm. First we compute the lateral devia-
positions, we computed a “virtual” GPS which indicates tion between the current robot position G 1 and the closest
what a GPS would record if it was mounted on the Cycab robot position G 0 on the reference trajectory. This is il-
at the same place as the camera. In addition, the 3D re- lustrated on Fig. 9. The robot position is always defined
construction is done in an arbitrary euclidian coordinate by the position of the middle point of the rear axle of the
system, whereas the GPS positions are given in another vehicle. This position is directly given by the RTK GPS.
coordinate system. So the whole 3D reconstruction has When working with vision it must be computed from the
to be transformed using a global rotation, translation and camera position and orientation. The 3D reconstruction
scale change. The approach described by Faugeras and used as the reference has already been transformed so that
Herbert (1986) is used to compute this global transforma- it is in the same metric coordinate system as the GPS data.
tion. After this transformation has been made, the cam- We start with the localization of the camera C1 given by
era and GPS positions are available in the same metric the localization part of the vision algorithm. From C1
coordinate system. This process is done for the 3D recon- we compute the corresponding GPS position G 1 (it is
struction of the reference video sequence. After that, the possible because we measured the positions of the GPS
localization algorithm gives directly the camera positions receiver and the camera on the vehicle). Then we find the
in the same metric coordinate system. closest GPS position in the reference trajectory: we call it
The GPS sensor we use is a Real Time Kinematics G 0 . At point G 0 of the reference trajectory, we compute

→ −→
Differential GPS (Thalès Sagitta model). It is accurate the tangent T and normal N to the trajectory. The lat-
−−−→ − →
to 1 cm (standard deviation) in an horizontal plane when eral deviation computed with vision is yv = G 0 G 1 · N .
enough satellites are available. The accuracy on a vertical The lateral deviation is computed from the GPS mea-
axis is only 20 cm on our hardware platform. So we dis- surements as well and we get yg (in this case we have
card the vertical readings and all the localization errors directly G 0 and G 1 ). yg and yv are the same physical
248 Royer et al.

T Four sequences called outdoor1 through outdoor4


were recorded by driving manually the vehicle on a 80 m
long trajectory. The four sequences were made approxi-
mately on the same trajectory ( with at most a 1 m lateral
deviation), the same day. The lens used was a standard
C0 C1 wide angle lens with a field of view of roughly 60◦ . We
computed a 3D reconstruction from each of the four se-
quences. Depending on the sequence, the automatic key
1.15 m frame selection gave between 113 and 121 key frames.
G1
And at the end of the reconstruction there were between
G0 N
y
14323 and 15689 3D points. A few images extracted
from outdoor1 are shown on Fig. 10. The positions of
the key frames computed from this sequence are shown
on Fig. 11 (as seen from the top) along with the trajectory
Reference trajectory
recorded by the GPS. The reconstruction error for each
Replayed trajectory
of the sequences was 25, 40, 34 and 24 cm for a 80 m
long trajectory with two large turns. This error is mostly
Figure 9. Computing the lateral deviation from the reference trajec-
tory.
caused by a slow drift of the reconstruction process.

6.3. Localization Accuracy


distance measured with two different sensors. Then the
localization error is defined as = yv − yg . 6.3.1. Positional Accuracy. The localization accuracy
was computed from the same sequences outdoor1
through outdoor4 we used in Section 6.2. Each sequence
was used in turn as the reference sequence. We computed
6.2. 3D Reconstruction Results
a localization for outdoori using outdoor j as the refer-
ence sequence for each j ∈ {1, 2, 3, 4} and i = j. So we
The problem we want to solve is robot localization.
made twelve experiments.
Our goal is not to build the best reconstruction of the
The localization error was computed with the method
environment. In fact, we are convinced that an accurate
explained in Section 6.1.2. From this we can compute
reconstruction is not absolutely necessary for robot nav-
the standard deviation of for a whole trajectory : we
igation, especially if the robot stays on the learning path.
call this the average localization error. We computed the
The map should be accurate on a small scale (less than
average localization error for each of the twelve experi-
about 50 meters) but long term drift is acceptable. We will
ments : the smallest was 1.4 cm, the largest was 2.2 cm
discuss this point in more detail when we examine the
and the mean over the twelve videos was 1.9 cm. Fig. 13
case of closed trajectories. Nevertheless, it is interesting
shows the lateral deviation and localization error for one
to see how accurate our reconstruction is, and to make
experiment with a 1.9 cm average localization error. The
sure it is accurate on a small scale. The results presented
error measured in this experiment is very close to the GPS
in this section concern only the reconstruction of the
accuracy (1 cm of standard deviation in the best condi-
camera motion. The accuracy of the structure was not
tions). So the noise in the GPS measurements contributes
measured.
to the localization error given in this paragraph.

Figure 10. A few images from outdoor1 .


Monocular Vision for Mobile Robot Localization and Autonomous Navigation 249

Figure 11. Position of the key frames (circles) with reference to the trajectory recorded by the RTK GPS (continuous line). Units in meters.

1
Vision
GPS
0.8

0.6

0.4
Lateral deviation (meters)

0.2

20 40 60 80 100 120 140


Time (seconds)

Figure 12. Lateral deviation measured with the RTK GPS yg (blue) or with vision yv (red). The two curves are nearly the same.

6.3.2. Rotational Accuracy. In order to evaluate the ro- a 2◦ increment. The reference trajectory was a straight
tational accuracy, we made an indoor experiment bec- line (1 m long) oriented along the optical axis (which
ause the GPS can’t give accurate angular measurements. was in the 0◦ direction). The result of this experiment ap-
The camera was mounted on a rotating platform, with pears on Fig. 14. The algorithm was not able to provide
the optical center on the rotational axis. The angle of the a reliable pose of the camera when the angle reached 95◦
platform can be read with about ±0.1◦ accuracy. We com- because there were not enough point correspondences.
pared the orientation α provided by the localization part The angular accuracy measured with this setup is about
of the vision algorithm to the angle α0 given by the plat- ±0.1◦ , which is about the same as what can be read on
form. For this experiment (and the following ones till the the platform. The algorithm provides a useful angular in-
end of the article), we used a fish eye lens providing a 130◦ formation for a deviation up to 94◦ on either side with
field of view (in the diagonal) and we made a measure- this camera. Of course, with such an angular deviation
ment for each angle from α0 = −94◦ to α0 = 94◦ with from the reference frame, the part of the image which
250 Royer et al.

0.08

0.06

0.04
Localization error (meters)

0.02

20 40 60 80 100 120 140


Time (seconds)

Figure 13. Localization error .

0.4

0.3

0.2
Angular error (degrees)

0.1

-0.1

-0.2

-0.3

-0.4
-80 -60 -40 -20 0 20 40 60 80

Rotation angle (degrees)

Figure 14. Angular error.


Monocular Vision for Mobile Robot Localization and Autonomous Navigation 251

Figure 15. Images taken at −90◦ , −60◦ , −30◦ , 30◦ , 60◦ and 90◦ orientation, with interest points correctly matched.

can be used is very small, and the localization becomes of the vision algorithms) to provide the vehicle state vec-
impossible if there is an occultation in this area. Images tor which is used in the control law. Lateral deviations
captured at different orientations are shown on Fig. 15. from the reference path recorded during 3 of these ex-
periments are displayed with the same scale on Fig. 18.
6.4. Autonomous Navigation Letters enable to identify each part of the trajectory, with
respect to the letters shown on Fig. 16.
We also made some experiments in which we tested the whole system. A reference video sequence was recorded and a 3D reconstruction was computed from it. Then the robot had to follow the same path in autonomous navigation. We recorded GPS measurements at the same time as the learning video sequence, and we also recorded GPS measurements in autonomous navigation. The learning path was 127 meters long. It was chosen so that there are both straight lines and tight turns, and because the buildings are sometimes far away (fewer visual features) and sometimes closer. Vehicle speed was kept constant and equal to 2 km/h. The result of the structure from motion algorithm is displayed on Fig. 16 as seen from the top. There were 182 key frames and 16295 points correctly reconstructed.

The reference video sequence was recorded on a cloudy day. The first two navigation experiments were made a few days later with cloudy weather too, but the second set of two was made on a clear day with the sun low in the sky and sometimes in the field of view of the camera. A few images from the video recorded during the last navigation experiment, as well as the corresponding key frames, are displayed on Fig. 17. The last image outlines the necessity of having a wide field of view and local visual features all over the frame. It shows the interest points which are used in the localization. The center of the image is completely overexposed because the sun is in front of the camera, but the building on the left can still be used for computing the camera pose.

For comparison purposes, a fifth experiment has also been performed, relying on the RTK GPS sensor (instead of the vision system).

Figure 16. 3D reconstruction computed from the reference video sequence (top view). Black squares are the position of the key frames. The landmarks appear as dots. Letters indicate different parts of the trajectory.

Figure 17. Three frames taken during the autonomous navigation (bottom) and the corresponding key frames (top).
Table 1. Mean of the lateral deviation in straight lines.

         Sunny 1   Sunny 2   Cloudy 1   Cloudy 2   GPS
  B      3.5 cm    4.8 cm    3.4 cm     2.8 cm     2.7 cm
  D      2.4 cm    1.9 cm    1.8 cm     2.3 cm     1.8 cm

Table 2. Maximum and minimum deviation in curves.

         Sunny 1    Sunny 2    Cloudy 1   Cloudy 2   GPS
  C max  22.0 cm    26.8 cm    20.1 cm    20.4 cm    37.9 cm
  C min  −20.2 cm   −25.4 cm   −22.2 cm   −21.1 cm   −14.3 cm
  E max  29.1 cm    35.4 cm    30.0 cm    29.2 cm    13.9 cm
  E min  −16.5 cm   −19.7 cm   −16.5 cm   −16.1 cm   −16.3 cm

Figure 18. Lateral deviation from the reference trajectory (lateral deviations in meters versus curvilinear abscissa along the reference trajectory in meters, for the Cloudy 2, Sunny 1 and GPS runs; letters A to E mark the different parts of the trajectory).

It can be observed that the vision algorithms detailed in this paper appear as a very attractive alternative to the RTK GPS sensor, since they provide roughly the same guidance accuracy. It can also be noticed that these vision algorithms are reliable with respect to outdoor applications, since they appear robust to weather conditions: guidance accuracy is not significantly altered even in conditions as harsh as the sunny ones. More precisely, guidance performances during straight lines and curve following are investigated separately in Tables 1 and 2. Table 1 reports the mean value of the lateral deviation (|y|) during the straight-line parts of the trajectory, denoted B and D. In the most favorable situation (cloudy weather), the vision algorithms meet the performances obtained with the RTK GPS sensor. In the worst case (sunny weather), performances are slightly degraded, but are still very satisfactory. Table 2 displays the extremum values of y recorded during the curved parts of the trajectory, denoted C and E. Once more, it can be observed that guidance performances are roughly similar.
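To make the evaluation protocol behind Tables 1 and 2 concrete, the short sketch below computes the mean absolute lateral deviation on straight segments and the extrema on curves from logged (curvilinear abscissa, lateral deviation) samples. It is only a sketch: the function name and the example segment bounds are illustrative and are not taken from the paper.

    import numpy as np

    def segment_stats(abscissa, y, segments):
        """Per-segment guidance statistics from a logged run.

        abscissa : curvilinear abscissa along the reference path (m)
        y        : lateral deviation at each sample (m)
        segments : {label: (s_start, s_end)} intervals in meters
        """
        stats = {}
        for label, (s0, s1) in segments.items():
            d = y[(abscissa >= s0) & (abscissa < s1)]
            stats[label] = {
                "mean_abs": float(np.mean(np.abs(d))),   # reported for straight lines
                "max": float(d.max()),                   # reported for curves
                "min": float(d.min()),
            }
        return stats

    # Hypothetical segment bounds (the paper only labels the parts A to E):
    # stats = segment_stats(s, y, {"B": (20.0, 45.0), "C": (45.0, 60.0)})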
For these experiments, guidance accuracy seems to be limited by the ability of the control law to keep the robot on the trajectory. This explains the similarity between the two vision-based lateral deviations ("Cloudy 2" and "Sunny 1" on Fig. 18). The difference between the RTK GPS and vision-based control may be explained by the fact that vehicle heading is not computed in the same way with both sensors. This is especially visible in curves. When relying on the RTK GPS sensor, vehicle heading is inferred from velocity measurements (obtained by differentiating successive position data) under non-slipping assumptions, smoothed via a Kalman filter. The filter introduces a delay and the result may not be as accurate as the orientation computed with vision.
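The following minimal sketch illustrates that heading computation: finite differences of successive RTK GPS positions give a heading under a non-slipping assumption, and a simple exponential smoother (standing in here for the Kalman filter, whose details are not given in this section) shows where the filtering delay comes from. The gain value is purely illustrative.

    import numpy as np

    def gps_heading(xy, alpha=0.3):
        """Heading (rad) from successive planar GPS positions xy of shape (N, 2)."""
        d = np.diff(xy, axis=0)                    # finite-difference "velocities"
        raw = np.arctan2(d[:, 1], d[:, 0])         # heading under a non-slipping assumption
        smoothed = np.empty_like(raw)
        smoothed[0] = raw[0]
        for k in range(1, len(raw)):
            # wrap the innovation to (-pi, pi] before filtering
            delta = np.angle(np.exp(1j * (raw[k] - smoothed[k - 1])))
            smoothed[k] = smoothed[k - 1] + alpha * delta   # low-pass: smoother but delayed
        return smoothed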
6.5. Navigation on a Closed Loop

The case of a closed loop is very interesting because it is a good way to visualize the influence of the reconstruction error on robot navigation. Of course, it would be possible to treat specifically the case of closed loops by searching point correspondences between the last image of the sequence and the first one in order to reduce the reconstruction error. But that is not our purpose. We want to use closed loops as a way to illustrate what happens in the presence of reconstruction error. An example of a closed loop trajectory appears on Fig. 19 with a few images from the video sequence on Fig. 20. The position of the first and last cameras should be the same. But because of some drift in the reconstruction process, the reconstruction of the loop is not exactly closed (the gap measures 1.5 m). In that case, some points are reconstructed twice: once when they are seen at the beginning of the sequence and once when they are seen at the end of the loop. In spite of this reconstruction error, the robot could navigate continuously on this trajectory (for several loops), without a hint of jerky motion at the seam.

Figure 19. 3D Reconstruction for the loop sequence.



Figure 20. Some frames from the loop sequence.

Figure 21. Lateral deviation computed at the beginning or at the end of a loop.

To understand why, we have to recall which variables are used in the control law. Along with the curvature of the path, the two other variables are the lateral deviation from the reference path y_v and the angular deviation. Figure 21 helps to understand what happens. The localization is computed with reference to the point cloud visible at one moment by the camera (the points which appear on only one key frame). The absolute localization is not used for the computation of the lateral and angular deviations. What is important is the localization in a local reference frame whose center is the position of the robot at the key frame. On Fig. 21, this local reference frame is drawn for the beginning of the loop (I_1, T_1, N_1) and for the end of the loop (I_N, T_N, N_N), where T and N are the unit vectors tangent and normal to the reference trajectory. The change from the end of a loop to the beginning of the next one corresponds to a change of the local reference frame which affects simultaneously the reference trajectory, the current localization, and the 3D points currently visible. In this case, the lateral deviation y_v has no discontinuity. The same is true for the angular deviation.
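A minimal 2D sketch of these two control inputs, assuming planar motion (the names and sign conventions are illustrative, not the authors' implementation). Because the current pose and the key frame come from the same local part of the reconstruction, both quantities are insensitive to a global drift such as the 1.5 m gap at the loop closure.

    import numpy as np

    def deviations_in_keyframe_frame(p, heading, I, T):
        """Lateral and angular deviation in the local frame of a key frame.

        p       : current robot position (2D, ground plane)
        heading : unit vector of the robot's forward direction
        I       : position of the key frame on the reference path
        T       : unit tangent of the reference path at I
        """
        N = np.array([-T[1], T[0]])                          # normal so that (T, N) is direct
        y_v = float(np.dot(p - I, N))                        # lateral deviation
        cross = T[0] * heading[1] - T[1] * heading[0]        # 2D cross product
        ang = float(np.arctan2(cross, np.dot(T, heading)))   # angular deviation
        return y_v, ang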
This characteristic suggests that it should be possible to link several short trajectories (of about 100 m) to provide the robot with enough information to travel for several kilometers. A global bundle adjustment would be too costly for trajectories of several kilometers, but we see here that a global optimization at this scale should not be necessary. Good accuracy at the small scale (50 m to 100 m) is enough.

6.6. Navigation on a Path Different from the Learning Path

6.6.1. Experiment. From a practical point of view, exactly following the path that was recorded can cause some problems. Ideally, if there is an obstacle on the trajectory, the robot should be able to go around it. That's where the expense of computing the 3D structure of the environment pays off. Knowing the 3D position of the interest points allows the robot to localize itself outside the learning path and to navigate in the neighborhood of this path. We made an experiment to see how much the robot could depart from the learning trajectory.

First, a learning video sequence was recorded on a 70 meter long trajectory. From this video sequence, a map was computed. A few images from the learning video sequence are shown on Fig. 22. This sequence was reconstructed with 102 key frames and 9579 points. For this experiment we defined a new trajectory (the target trajectory), slightly different from the learning trajectory. The robot had to follow the target trajectory while using the learning trajectory and video sequence to localize itself. The experiment was conducted two weeks after the learning trajectory was recorded.

Figure 22. Images extracted from the learning video sequence.

Figure 23. 3D reconstruction of the learning trajectory (black squares) and target trajectory (broad green line) viewed from the top.

Figure 23 shows the 3D reconstruction of the learning sequence (black squares) and the target trajectory that was defined in a graphical editor (broad green line). The target trajectory was defined so that it reproduces what would happen if the robot had to avoid two obstacles. The target path departs from the learning path on the right at the beginning and then on the left a few meters later. The maximum lateral deviation from the learning trajectory is about 3 meters. The angular deviation is at most 20°. We parked a vehicle on the learning trajectory to add some occultations (see Fig. 24), simulating the images the robot would use if it had to avoid a real obstacle. Some pedestrians were also passing by, occulting some parts of the image from time to time. The target trajectory was simply defined with the mouse in a graphical editor. Defining a new trajectory doesn't require expensive computations, so this could be done online if necessary. For example, we could use a sensor (radar, sonar or laser range finder) to detect obstacles in front of the robot. If an obstacle was detected and localized by this sensor in the robot's coordinate system, it would be possible to modify the trajectory in 3D space in the same coordinate system without stopping the vehicle.

6.6.2. Localization and Navigation Results. Let's examine the path that was really followed by the robot during this experiment. We will call it the result path. It was accurately recorded by the RTK GPS sensor. The robot nearly followed the target path. This is visible on Fig. 25. The blue curve shows the lateral deviation of the target path with reference to the learning path. The red curve is the lateral deviation of the result path with reference to the learning path. The red curve was obtained by using the GPS measurements. The lateral deviation from the learning path directly computed from the result of the localization algorithm is displayed on the top row of Fig. 26. At this time, we don't use any filter to smooth the result of the localization algorithm. The result path deviates slightly from the target path in the turns. That is probably because we defined the target path in a graphical editor without knowing if the robot could steer fast enough to really follow the path. The lateral deviation computed with the vision algorithm matches the GPS measurements well in those parts of the trajectory.
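For reference, the lateral deviation curves of Fig. 25 compare each recorded position to the learning path; a brute-force sketch of such a point-to-polyline signed deviation is given below. This is an illustration, not the paper's implementation, and it assumes consecutive path points are distinct.

    import numpy as np

    def lateral_deviation(path_xy, query_xy):
        """Signed lateral deviation of query points w.r.t. a reference polyline.

        path_xy  : (N, 2) reference path (e.g. the learning path)
        query_xy : (M, 2) positions to evaluate (e.g. GPS samples of the result path)
        """
        out = np.empty(len(query_xy))
        for i, q in enumerate(query_xy):
            best = None
            for a, b in zip(path_xy[:-1], path_xy[1:]):
                ab = b - a
                t = np.clip(np.dot(q - a, ab) / np.dot(ab, ab), 0.0, 1.0)
                foot = a + t * ab                      # closest point on this segment
                d = np.linalg.norm(q - foot)
                if best is None or d < best[0]:
                    # sign: positive when q lies on the left of the path direction
                    sign = np.sign(ab[0] * (q - a)[1] - ab[1] * (q - a)[0])
                    best = (d, sign if sign != 0 else 1.0)
            out[i] = best[0] * best[1]
        return out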

Figure 24. Interest points correctly matched between the reference key frame and the current image. Left: when the robot is nearly on the learning
path. Right: when the lateral deviation is maximal. The reference frame is on top.

Figure 25. Lateral deviation of the target trajectory (blue), and lateral deviation measured by the GPS during the navigation experiment (red).

Figure 26. Lateral deviation computed by the vision algorithm (top), lateral deviation error (middle) and number of interest points correctly matched (bottom), plotted against time in seconds; zones A, B and C are marked on the bottom plot.

The localization is more noisy when the robot departs from the learning trajectory, but it is still accurate enough to drive the robot. We recorded the ground truth with the GPS, so we are able to measure the error that was made by the localization algorithm when computing the lateral deviation from the target path. This error appears on the middle row of Fig. 26. When the robot is on the learning path, the error is less than 2 centimeters (the same accuracy as our GPS). When the robot is farther from the learning path, this error increases to more than 10 centimeters. The reason lies in the number of points matched.

Figure 24 shows the interest points that were correctly matched between the current frame and the reference frame for two robot positions during the navigation experiment. We can see that the algorithm matches fewer points when the robot is far from the learning path. This is also visible at the bottom of Fig. 26, which displays the number of points used in the localization process for the whole experiment. When the robot is on the learning path, more than 200 points are matched. This number drops to 100 when the robot is 3 meters away. What is also important to notice is that the points close to the camera are lost first, and those are the points which provide the best positional accuracy. Points at infinity, on the other hand, can only give some rotational information. Table 3 shows how localization accuracy is related to the number of points matched. The graph of Fig. 26 was divided in three parts. Zones A and C correspond to parts where the robot was far from the learning trajectory and zone B corresponds to a part where the robot was on the reference trajectory. For each zone, the average number of points matched was computed as well as the standard deviation of the error of the vision algorithm.

Table 3. Number of points matched and localization error.

  Zone                       A          B          C
  Time limits                20−70 s    70−115 s   115−170 s
  Average number of points   119        220        110
  Localization error         8 cm       1 cm       8 cm

The robot can depart from the learning trajectory up to a limit which depends on the number of points that can be matched between the reference video sequence and the current image. Points can be lost because they may be seen from a different point of view and the correlation score is not high enough to make a valid match. This is the case if the lateral deviation is large. When the angular deviation is large, points are lost because they go out of the field of view of the camera. Nothing can be done to correct that except going to an omnidirectional camera. On the other hand, we could take into account the difference in point of view by using wide baseline matching techniques or by predicting the appearance of the landmarks based on the camera position.

6.6.3. Validation of the Localization Uncertainty. The aim of this analysis is to make sure that the size of the confidence ellipsoids varies in the same way as the localization errors measured with the RTK GPS. The experiment presented in Section 6.6 gives us an opportunity to make an experimental validation of the computation of the localization uncertainty. In the experiments described in paragraph 6.3.1, the confidence ellipsoids were too small compared to the accuracy of the RTK GPS to give meaningful results.

Figure 27. Comparison between the major semiaxis of the 90% confidence ellipsoid computed with the vision algorithm (black) and localization
error measured with the RTK GPS (grey). Vertical scale in meters.

For each video frame taken during the autonomous navigation, we computed the length a of the major semiaxis of the 90% confidence ellipsoid with methods 2, 3 and 4, as explained in Section 4. Method 1 was not included in this comparison because it is not a real-time method. The localization error was also computed by using the RTK GPS measurements, in a way slightly different from what is explained in Section 6.1.2. In Section 6.1.2, the measurements from both sensors were brought into the same coordinate system and the position of the cycab was computed. Then the localization error was measured along the normal to the trajectory. Here, the same computations are done to bring the data sets from both sensors into the same coordinate system, but the localization error is the distance between the position computed with vision and the position computed with the GPS. It includes the error along the normal and the error along the tangent to the trajectory. Figure 27 shows the localization error compared to the length a_2 of the major semiaxis of the 90% confidence ellipsoid computed with method 2. The major semiaxis computed with method 2 is roughly of the same size as the localization error and both quantities vary in the same way. This experiment shows that method 2 is a valid method for computing the localization uncertainty.
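As a reference for the quantity being compared here, the sketch below shows the final step of such an uncertainty computation: extracting the major semiaxis a of the 90% confidence ellipsoid from a 3 x 3 position covariance matrix. How that covariance is obtained, and which uncertainties it includes, is what distinguishes methods 2, 3 and 4; that propagation is not shown.

    import numpy as np

    CHI2_3DOF_90 = 6.2514  # 90% quantile of the chi-square distribution with 3 dof

    def major_semiaxis_90(cov_xyz):
        """Major semiaxis of the 90% confidence ellipsoid of a 3D Gaussian
        position estimate with covariance cov_xyz (3x3, symmetric)."""
        lam_max = np.linalg.eigvalsh(cov_xyz)[-1]      # largest eigenvalue
        return float(np.sqrt(CHI2_3DOF_90 * lam_max))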
A comparison can be made between the three methods used to compute the localization uncertainty. Figure 28 shows, for each method, the histogram of log2(a) for every frame in the video sequence. If we assume zero uncertainty on the cameras in the 3D reconstruction (in method 3), the length of the major semiaxis is nearly half of the length computed with method 2. Ignoring the uncertainty on the 3D points (in method 4) leads to an even larger underestimation of the size of the ellipsoid.

6.7. Computation Times

The computation times given here were measured on a 3 GHz Pentium 4 processor with an image size of 512 × 384 pixels. The code for interest point detection and matching uses the SSE2 instruction set. Off-line map building for a sequence with 100 key frames such as the one on Fig. 23 takes about 15 min. For a larger sequence such as the one on Fig. 3 with approximately 300 key frames, map building is done in 60 min. Most of the time is spent in the bundle adjustment. The localization runs at 15 frames per second (which is also the frame rate of our camera). The computation time can be roughly divided as: 35 ms for interest point detection, 15 ms for matching, 10 ms for the pose computation and 5 ms for the uncertainty computation.

6.8. Database Management

The memory needed to store the 3D model of the environment is very reasonable because we don't need to keep the whole key frames. We only need to keep the image data around the interest points to compute the ZNCC, as well as the 2D and 3D coordinates of the points. With about 150 bytes per interest point and about 500 interest points per key frame, each key frame takes less than 100 KB of memory. Since the model is built with one key frame per meter on average, storing a 1 kilometer long trajectory needs about 100 MB. Computers can handle this amount of data without difficulty.
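The memory budget above can be made concrete with a sketch of a per-key-frame record. The patch size is an assumption (this section only gives the total of about 150 bytes per point), so the numbers below are indicative.

    import numpy as np

    PATCH = 11        # assumed side of the grey-level window kept for the ZNCC
    N_POINTS = 500    # interest points per key frame (order of magnitude from the text)

    point_dtype = np.dtype([
        ("patch", np.uint8, (PATCH, PATCH)),   # 121 bytes of image data around the point
        ("uv",    np.float32, 2),              # 2D position in the key frame
        ("xyz",   np.float32, 3),              # 3D position in the map
    ])                                         # ~141 bytes per interest point

    bytes_per_keyframe = N_POINTS * point_dtype.itemsize    # ~70 KB, below the quoted 100 KB
    bytes_per_km = 1000 * bytes_per_keyframe                # 1 key frame per meter
    print(point_dtype.itemsize, bytes_per_keyframe // 1024, bytes_per_km // 2**20)

With these assumed sizes, one kilometer of trajectory comes to roughly 70 MB, the same order of magnitude as the 100 MB quoted above.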
Figure 28. Histogram of log2(a) for methods 2, 3 and 4.

From a practical point of view, keeping the model of the environment up to date is needed even if localization is possible with some modifications of the environment (see Royer et al. (2005c) or Fig. 24 for example). We have made a lot of experiments during the past two years in different places. We found that once a model is built, it can be used for a few weeks or more, depending on the season and the kind of place. Buildings don't change much, but trees can change quickly in spring and fall. We have not developed a method to update the model yet, but we have some ideas about how to do that. During autonomous navigation, the pose of the camera is computed and the interest points are detected for each frame. We can store this data every time the camera is near a key frame, so that at the end of a navigation experiment we have all the necessary information to update the database. Then we could match the interest points in the recorded sequence and compute the 3D position of the points. A bundle adjustment could be used to refine the structure. New points could be added to the model, and points which have not been used for a long time could be removed.
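A small bookkeeping sketch of the add/prune policy outlined above (the class, field names and the age threshold are illustrative, not the authors' implementation):

    import time

    class LandmarkDB:
        """Track when each 3D point was last matched, so stale points can be pruned."""

        def __init__(self):
            self.points = {}   # point_id -> {"xyz": (x, y, z), "last_used": timestamp}

        def add_point(self, point_id, xyz, t=None):
            # new landmarks triangulated from data stored near a key frame
            self.points[point_id] = {"xyz": xyz, "last_used": time.time() if t is None else t}

        def mark_used(self, point_id, t=None):
            # called whenever the landmark is matched during autonomous navigation
            self.points[point_id]["last_used"] = time.time() if t is None else t

        def prune(self, max_age_days=30.0, now=None):
            # drop landmarks that have not been matched for a long time
            now = time.time() if now is None else now
            limit = max_age_days * 86400.0
            stale = [pid for pid, p in self.points.items() if now - p["last_used"] > limit]
            for pid in stale:
                del self.points[pid]
            return stale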
7. Conclusion

We have presented a sensing device which enables a robotic vehicle to follow a trajectory obtained from a human-guided experience, relying uniquely on monocular vision. When following the same path, the localization accuracy is approximately 2 cm, nearly the same as a RTK GPS sensor. The vision system also provides an orientation accurate to 0.1°. In practice, both sensors allow the robot to navigate autonomously with the same accuracy. When the robot departs from the learning path, the performance of the vision algorithm is somewhat degraded but still satisfactory, in the case of obstacle avoidance for example. These two sensing devices appear complementary: autonomous navigation in urban environments cannot be satisfactorily addressed by RTK GPS sensors, since tall buildings can disturb satellite reception. These buildings, however, offer a lot of visual features which can be used to feed vision algorithms. The main difficulty with the vision algorithm is to keep a map of the environment up to date. Even if the experiments presented in this paper have shown that the localization algorithm is robust to some changes, it may not be enough for an ever-changing environment. For example, in a city, cars parked along the side of the road change from day to day, trees evolve according to the season, and some buildings are destroyed while others are built or modified. So our goal is to have a method to update the map automatically in order to take these modifications into account.

Acknowledgment

This work was done as part of the ROBEA CNRS Bodega project and the PREDIT MobiVip project. We are grateful to Heudiasyc at Université de Technologie de Compiègne for providing some of the video sequences used in this paper.

References

1. Araújo, H., Carceroni, R.J., and Brown, C.M. 1998. A fully projective formulation to improve the accuracy of Lowe's pose estimation algorithm. Computer Vision and Image Understanding, 70(2):227–238.
2. Argyros, A., Bekris, K., Orphanoudakis, S., and Kavraki, L. 2005. Robot homing by exploiting panoramic vision. Journal of Autonomous Robots, 19(1):7–25.
3. Blanc, G., Mezouar, Y., and Martinet, P. 2005. Indoor navigation of a wheeled mobile robot along visual routes. In IEEE International Conference on Robotics and Automation, ICRA'05, Barcelona, Spain.
4. Cobzas, D., Zhang, H., and Jagersand, M. 2003. Image-based localization with depth-enhanced image map. In International Conference on Robotics and Automation.
5. Davison, A.J. 2003. Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the 9th International Conference on Computer Vision, Nice.
6. de Wit, C.C., Siciliano, B., and Bastin, G. 1996. The Zodiac, Theory of Robot Control. Springer Verlag.
7. Faugeras, O. and Hebert, M. 1986. The representation, recognition, and locating of 3-d objects. International Journal of Robotic Research, 5(3):27–52.
8. Fischler, M.A. and Bolles, R.C. 1981. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Communications of the Association for Computing Machinery, 24:381–395.
9. Georgiev, A. and Allen, P.K. 2004. Localization methods for a mobile robot in urban environments. IEEE Transactions on Robotics, 20(5):851–864.
10. Goedemé, T., Tuytelaars, T., Van Gool, L., Vanacker, G., and Nuttin, M. 2005. Feature based omnidirectional sparse visual following. In International Conference on Intelligent Robots and Systems, pp. 1003–1008.
11. Haralick, R., Lee, C., Ottenberg, K., and Nolle, M. 1994. Review and analysis of solutions of the three point perspective pose estimation problem. International Journal of Computer Vision, 13(3):331–356.
12. Harris, C. and Stephens, M. 1988. A combined corner and edge detector. In Alvey Vision Conference, pp. 147–151.
13. Hartley, R. and Zisserman, A. 2000. Multiple View Geometry in Computer Vision. Cambridge University Press.
14. Kidono, K., Miura, J., and Shirai, Y. 2002. Autonomous visual navigation of a mobile robot using a human-guided experience. Robotics and Autonomous Systems, 40(2–3):124–1332.
15. Lavest, J.M., Viala, M., and Dhome, M. 1998. Do we need an accurate calibration pattern to achieve a reliable camera calibration? In European Conference on Computer Vision, pp. 158–174.
16. Lhuillier, M. and Perriollat, M. 2006. Uncertainty ellipsoids calculations for complex 3d reconstructions. In International Conference on Robotics and Automation.
17. Matsumoto, Y., Inaba, M., and Inoue, H. 1996. Visual navigation using view-sequenced route representation. In International Conference on Robotics and Automation, pp. 83–88.
18. Nistér, D. 2001. Frame decimation for structure and motion. In 2nd Workshop on Structure from Multiple Images of Large Environments, Springer Lecture Notes on Computer Science, vol. 2018, pp. 17–34.
19. Nistér, D. 2003. An efficient solution to the five-point relative pose problem. In Conference on Computer Vision and Pattern Recognition, pp. 147–151.
20. Nistér, D., Naroditsky, O., and Bergen, J. 2004. Visual odometry. In Conference on Computer Vision and Pattern Recognition, pp. 652–659.
21. Remazeilles, A., Chaumette, F., and Gros, P. 2004. Robot motion control from a visual memory. In International Conference on Robotics and Automation, vol. 4, pp. 4695–4700.
22. Royer, E., Bom, J., Dhome, M., Thuilot, B., Lhuillier, M., and Marmoiton, F. 2005a. Outdoor autonomous navigation using monocular vision. In International Conference on Intelligent Robots and Systems, pp. 3395–3400.
23. Royer, E., Lhuillier, M., Dhome, M., and Chateau, T. 2005b. Localization in urban environments: Monocular vision compared to a differential GPS sensor. In International Conference on Computer Vision and Pattern Recognition, CVPR.
24. Royer, E., Lhuillier, M., Dhome, M., and Lavest, J.-M. 2005c. Performance evaluation of a localization system relying on monocular vision and natural landmarks. In Proceedings of the ISPRS Workshop BenCOS (Towards Benchmarking Automated Calibration, Orientation and Surface Reconstruction from Images).
25. Samson, C. 1995. Control of chained systems. Application to path following and time-varying point stabilization of mobile robots. IEEE Transactions on Automatic Control, 40.
26. Se, S., Lowe, D., and Little, J. 2002. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. International Journal of Robotic Research, 21(8):735–760.
27. Simond, N. and Rives, P. 2004. Trajectography of an uncalibrated stereo rig in urban environments. In International Conference on Intelligent Robots and Systems, pp. 3381–3386.
28. Thuilot, B., Bom, J., Marmoiton, F., and Martinet, P. 2004. Accurate automatic guidance of an urban vehicle relying on a kinematic GPS sensor. In Symposium on Intelligent Autonomous Vehicles IAV04.
29. Torr, P., Fitzgibbon, A., and Zisserman, A. 1999. The problem of degeneracy in structure and motion recovery from uncalibrated image sequences. International Journal of Computer Vision, 32(1):27–44.
30. Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A. 2000. Bundle adjustment—A modern synthesis. In Vision Algorithms: Theory and Practice, B. Triggs, A. Zisserman, and R. Szeliski (Eds.), Lecture Notes in Computer Science, pp. 298–375. Springer Verlag.
31. Vacchetti, L., Lepetit, V., and Fua, P. 2003. Stable 3-d tracking in real-time using integrated context information. In Conference on Computer Vision and Pattern Recognition, Madison, WI.
