Monocular Vision For Mobile Robot Localization and Autonomous Navigation
© 2007 Springer Science + Business Media, LLC. Manufactured in the United States.
DOI: 10.1007/s11263-006-0023-y
Abstract. This paper presents a new real-time localization system for a mobile robot. We show that autonomous navigation is possible outdoors with the use of a single camera and natural landmarks. To do that, we use a three-step approach. In a learning step, the robot is manually guided on a path and a video sequence is recorded with a front-looking camera. Then a structure from motion algorithm is used to build a 3D map from this learning sequence. Finally, in the navigation step, the robot uses this map to compute its localization in real time and follows the learning path, or a slightly different path if desired. The vision algorithms used for map building and localization are first detailed. Then a large part of the paper is dedicated to the experimental evaluation of the accuracy and robustness of our algorithms, based on experimental data collected over two years in various environments.
as well as a set of landmarks and their 3D location in a global coordinate system. This must be done from a monocular image sequence. The structure from motion problem has been studied for several years and multiple algorithms have been proposed depending on the assumptions we can make (Hartley and Zisserman, 2000). For our experiments, the camera was calibrated using a planar calibration pattern (Lavest et al., 1998). Radial distortion is modelled with a fifth order polynomial. Camera calibration is important especially when using wide angle lenses with a strong radial distortion. We have used two lenses: a wide angle lens with a 60° field of view and a fish-eye lens with a 130° field of view. Both lenses perform equally well for reconstruction, but the fish-eye is preferable for localization because it reduces the chances of occultations. With a calibrated camera, the structure from motion algorithm is more robust and the accuracy of the reconstruction is increased. This is important for us because in our video sequences with a front looking camera, the triangulation of the points must be done with very small angles. In spite of this difficulty, we chose a forward looking camera because there is always some free space in front of the vehicle, which means the field of view contains both near and far objects. If the camera was looking to the side and the vehicle was moving along a wall very close to the camera, features would move very fast in the image, which would make image matching very difficult (motion blur could also be a problem).

In the first step of the reconstruction, we extract a set of key frames from the reference sequence. Then we compute the camera pose for every key frame by using interest points matched between key frames. Additionally, the interest points are reconstructed in 3D. These points will be the landmarks used for the localization process.

2.2. Image Matching

Every step in the reconstruction as well as the localization relies on image matching. In order to match two images, interest points are detected in each image with the Harris corner detector (Harris and Stephens, 1988). The corner response R is computed as in Harris' work and local maxima of R are potential interest points. In order to have interest points in each part of the image, we have divided the image area into 64 buckets. In each frame (512 × 384 pixels), we keep the 20 best local maxima in each bucket plus the 500 best maxima on the whole image. So we get at most 1780 interest points in each frame. For an interest point $M^1$ at coordinates (x, y) in image 1, we define a search region in image 2. The search region is a rectangle whose center has coordinates (x, y). For each interest point $M_i^2$ inside the search region in image 2, we compute a correlation score between the neighborhoods of $M^1$ and $M_i^2$. We use a Zero Normalized Cross Correlation over an 11 × 11 pixel window. All the pairs with a correlation score above 0.8 are sorted. The best one is kept as a good match and the unicity constraint is used to reject matches which have become impossible. Then the second best score is kept, and so on until all the potential matches have been kept or discarded. This image matching scheme is used for the localization part and as the first step in the reconstruction part. In the reconstruction process, a second matching is done after the epipolar constraint has been computed. In this second matching step, the search region is a narrow strip around the epipolar line and the correlation score threshold is lowered.

2.3. Key Frame Selection

Before computing camera motion, it is necessary to extract key frames from the reference video sequence. If there is not enough camera motion between two frames, the computation of the epipolar geometry is an ill conditioned problem. So we select images so that there is as much camera motion as possible between key frames while still being able to find enough point correspondences between the key frames. Some algorithms have been proposed to do that. For example, Nistér (2001) considers the whole video sequence then drops redundant frames. Another method is possible based on the Geometric Robust Information Criterion (GRIC) proposed by Torr et al. (1999). Our method is much simpler and less general than these ones but works well for our purpose.

The first image of the sequence is always selected as the first key frame I1. The second key frame I2 is chosen so that there are as many video frames as possible between I1 and I2 while there are at least M common interest points between I1 and I2. When key frames I1 ... In are chosen, we select In+1 so that:

• there are as many video frames as possible between In and In+1,
• there are at least M interest points in common between In+1 and In,
• there are at least N common points between In+1 and In−1.

In our experiments we choose M = 400 and N = 300. Figure 2 shows consecutive key frames extracted from the city sequence.
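As a concrete illustration of the matching scheme of Section 2.2 (which is also what the key-frame criterion above counts), here is a minimal sketch of ZNCC scoring with best-first matching under the unicity constraint. It is not the authors' implementation: the interest points are assumed to be already detected (Harris detector with bucketing), and the search-region size is an assumed value, not a parameter given in the paper.

```python
import numpy as np

WIN = 5          # 11 x 11 correlation window -> half-width 5 (Section 2.2)
THRESHOLD = 0.8  # correlation score threshold of the first matching step

def zncc(img1, img2, p1, p2, w=WIN):
    """Zero-mean Normalized Cross-Correlation between two square windows."""
    a = img1[p1[1]-w:p1[1]+w+1, p1[0]-w:p1[0]+w+1].astype(float)
    b = img2[p2[1]-w:p2[1]+w+1, p2[0]-w:p2[0]+w+1].astype(float)
    if a.shape != (2*w+1, 2*w+1) or b.shape != (2*w+1, 2*w+1):
        return -1.0                       # window falls outside the image; skip
    a -= a.mean(); b -= b.mean()
    denom = np.sqrt((a*a).sum() * (b*b).sum())
    return (a*b).sum() / denom if denom > 0 else -1.0

def match(img1, img2, pts1, pts2, search_radius=48):
    """Best-first matching with the unicity constraint (one match per point).
    search_radius defines the rectangular search region centred on each point
    of image 1 (assumed value, not taken from the paper)."""
    candidates = []
    for i, p1 in enumerate(pts1):
        for j, p2 in enumerate(pts2):
            if abs(p2[0]-p1[0]) > search_radius or abs(p2[1]-p1[1]) > search_radius:
                continue
            s = zncc(img1, img2, p1, p2)
            if s > THRESHOLD:
                candidates.append((s, i, j))
    candidates.sort(reverse=True)         # best correlation scores first
    used1, used2, matches = set(), set(), []
    for s, i, j in candidates:
        if i in used1 or j in used2:      # unicity: reject matches that became impossible
            continue
        used1.add(i); used2.add(j)
        matches.append((i, j))
    return matches
```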
2.4. Initial Camera Motion Computation

We compute an initial solution for camera motion, and a hierarchical bundle adjustment is used to refine this initial estimation.

For the first image triplet, the computation of the camera motion is done with the method described in Nistér (2003) for three views. It involves computing the essential matrix between the first and last images of the triplet using a sample of 5 point correspondences. This gives at most 40 solutions for camera motion. The solutions for which at least one of the 5 points is not reconstructed in front of both cameras are discarded. Then the pose of the remaining camera is computed with 3 out of the 5 points in the sample. This process is done with a RANSAC (Fischler and Bolles, 1981) approach: each 5 point sample produces a number of hypotheses for the 3 cameras. The best one is chosen by computing the reprojection error over the 3 views for all the matched interest points and keeping the one with the highest number of inlier matches. With a calibrated camera, three 3D points whose projections in the image are known are enough to compute the pose of the second camera. Several methods are compared in Haralick et al. (1994). We chose Grunert's method with RANSAC.

For the next image triplets, we use a different method for computing camera motion. Assuming we know the location of cameras C1 through CN, we can compute camera CN+1 by using the location of cameras CN−1 and CN and point correspondences over the image triplet (N − 1, N, N + 1). We match a set of points Pi whose projections are known in each image of the triplet. From the projections in images N − 1 and N, we can compute the 3D coordinates of point Pi. Then from the set of Pi and their projections in image N + 1, we use Grunert's calibrated pose estimation algorithm to compute the location of camera CN+1. In addition, the 3D locations of the reconstructed interest points are stored because they will be the landmarks used for the localization process. The advantage of this iterative pose estimation process is that it can deal with virtually planar scenes. After the pose computation, a second matching step is done with the epipolar constraint based on the pose that has just been computed. This second matching step increases the number of correctly reconstructed 3D points by about 20%. This is important for us for two reasons. Those 3D points are used in the computation of the next camera, and they are also the landmarks used in the localization process. So we need to get as many as possible, which wouldn't be the case if the goal was just to recover the motion of the camera.

2.5. Hierarchical Bundle Adjustment

The computation of camera CN depends on the results of the previous cameras, and errors can build up over the sequence. In order to greatly reduce this problem, we use a bundle adjustment. The bundle adjustment is a Levenberg-Marquardt minimization of the cost function f(C_E1, ..., C_EN, X^1, ..., X^M) where C_Ei are the external parameters of camera i, and X^j are the world coordinates of point j. For this minimization, the radial distortion of the 2D point coordinates is corrected beforehand. The cost function is the sum of the reprojection errors of all the inlier reprojections in all the images:

$$f(C_{E_1}, \ldots, C_{E_N}, X^1, \ldots, X^M) = \sum_{i=1}^{N} \; \sum_{j=1,\, j \in J_i}^{M} d^2\!\left(x_i^j, P_i X^j\right)$$

where d²(x_i^j, P_i X^j) is the squared Euclidean distance between P_i X^j, the projection of point X^j by camera i, and x_i^j, the corresponding detected point. P_i is the 3 × 4 projection matrix built from the parameter values in C_Ei and the known internal parameters of the camera, and J_i is the set of points whose reprojection error in image i is less than 2 pixels at the beginning of the minimization. After a few iteration steps, J_i is computed again and more minimization iterations are done. This inlier selection process is repeated as long as the number of inliers increases.

Computing all the camera locations and using the bundle adjustment only once on the whole sequence could cause problems. This is because increasing errors could produce an initial solution too far from the optimal one for the bundle adjustment to converge. Thus it is necessary to use the bundle adjustment throughout the reconstruction of the sequence. So we use a hierarchical bundle adjustment as in Hartley and Zisserman (2000). A large sequence is divided into two parts with an overlap of two frames in order to be able to merge the sequences. Each subsequence is recursively divided in the same way until each final subsequence contains only three images. Each image triplet is processed as described in Section 2.4. After each triplet has been computed, we run a bundle adjustment over its three frames.
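To make the cost of Section 2.5 concrete, here is a minimal sketch of the reprojection-error objective with the inlier sets J_i, written for a generic Levenberg-Marquardt solver. The pinhole projection, the 2-pixel inlier threshold and the camera/point parameter layout follow the text; the array shapes, the placeholder calibration matrix and the helper names are assumptions for this sketch, not the authors' code (which also repeats the inlier selection as long as the inlier count increases).

```python
import numpy as np
from scipy.optimize import least_squares

K = np.eye(3)  # known internal parameters (placeholder calibration matrix)

def project(cam_params, X):
    """Pinhole projection of 3D points X (n,3) by one camera (Rodrigues rvec + tvec)."""
    rvec, tvec = cam_params[:3], cam_params[3:]
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        k = rvec / theta
        Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * Kx @ Kx
    x = (K @ (R @ X.T + tvec[:, None])).T
    return x[:, :2] / x[:, 2:3]

def residuals(params, n_cams, n_pts, observations):
    """observations: list of (camera index i, point index j, measured 2D position x_ij)."""
    cams = params[:6 * n_cams].reshape(n_cams, 6)
    pts = params[6 * n_cams:].reshape(n_pts, 3)
    res = [project(cams[i], pts[j][None, :])[0] - np.asarray(xij)
           for i, j, xij in observations]
    return np.concatenate(res)

def bundle_adjust(cams0, pts0, observations, inlier_px=2.0):
    """One round of the scheme: keep observations whose initial reprojection error
    is below 2 pixels (the sets J_i), then minimize the summed squared errors with
    a Levenberg-Marquardt solver (requires more residuals than parameters)."""
    params0 = np.hstack([cams0.ravel(), pts0.ravel()])
    r0 = residuals(params0, len(cams0), len(pts0), observations).reshape(-1, 2)
    inliers = [obs for obs, e in zip(observations, np.linalg.norm(r0, axis=1))
               if e < inlier_px]
    sol = least_squares(residuals, params0, method="lm",
                        args=(len(cams0), len(pts0), inliers))
    n6 = 6 * len(cams0)
    return sol.x[:n6].reshape(-1, 6), sol.x[n6:].reshape(-1, 3)
```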
Figure 4. Matching interest points between the reference key frame learned with snow (left) and the current video frame without snow (right).
Figure 5. Computing an expected position for the landmarks in the current frame.
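Figure 5 illustrates how expected image positions are obtained for the known 3D landmarks before matching: the landmarks associated with the reference key frame are projected into the current frame using the last computed camera pose (the previous pose is used as a prediction, as explained in the text below). The sketch that follows shows one way such a projection step could be written; the pose representation (R, t), the calibration matrix K and the visibility test are assumptions for this illustration, not the paper's actual data structures.

```python
import numpy as np

def expected_positions(points_3d, R, t, K, image_size):
    """Project 3D landmarks (n,3) into the current frame with pose (R, t) and
    calibration K, keeping only those that fall inside the image. The returned
    2D positions are the centres of the search regions used when matching the
    current frame's interest points against the landmarks."""
    cam = R @ points_3d.T + t.reshape(3, 1)        # landmarks in camera coordinates
    in_front = cam[2] > 0                          # discard points behind the camera
    proj = K @ cam[:, in_front]
    uv = (proj[:2] / proj[2]).T                    # perspective division
    w, h = image_size
    visible = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[visible], np.flatnonzero(in_front)[visible]
```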
iteration. We keep only the points whose reprojection error is less than 2 pixels. As the pose converges towards the optimal pose, some inliers can become outliers and conversely. Usually, less than five iterations are enough.

It would be relatively easy to introduce a motion model of the vehicle or odometry measurements in this framework. For each new frame, we consider that the pose of the camera is the same as the previous pose in order to match features. Incorporating odometry would give a better initial estimate of the camera pose. We have not done that in order to keep the possibility of using the localization algorithm with hand-held cameras.

At this point we have a real-time algorithm that is able to compute the pose of the camera. We always keep track of the camera pose with 6 degrees of freedom, even if the control law works in a ground plane. This allows us to track interest points even if the ground is irregular.

3.2. Robot Localization

Since the localization algorithm is based only on vision, the coordinate system has its center at the optical center of the first camera in the reference sequence, with one axis along the optical axis. Moreover, there is no scale information. In order to control the robot, we need to provide a position for the robot in a metric coordinate system. We achieve that by entering manually the length of the path to set the scale factor. The position of the camera on the robot has been measured, so we can enter directly the rigid transformation between the camera and the robot. Once the camera pose has been computed, we compute the robot position and orientation with 3 degrees of freedom only, because the robot can be controlled only along three dimensions (translation on the ground plane and rotation around the vertical axis). To do that, we assume that the ground is locally planar and horizontal at the current position of the robot.

4. Computing the Localization Uncertainty

Knowing the uncertainty of the robot localization computed for the current frame has some important applications. It allows data fusion with other localization sensors (such as an odometer). It could also be used for temporal
with U a 6 × 6 block, W a 6 × 3n block and V a 3n × 3n matrix:

$$U = A^T A = A_1^T A_1 + A_2^T A_2 + \cdots + A_n^T A_n \qquad (5)$$

$$W = \left( A_1^T B_1 \;\; A_2^T B_2 \;\; \ldots \;\; A_n^T B_n \right) \qquad (6)$$

$$V = \begin{pmatrix} B_1^T B_1 & 0 & 0 & 0 \\ 0 & B_2^T B_2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & B_n^T B_n \end{pmatrix} + \mathrm{Cov}_{3d}^{-1} \qquad (7)$$

Finally, the covariance matrix associated with the camera pose is:

$$\mathrm{Cov}_{cam} = \left(U - W V^{-1} W^T\right)^{-1} \qquad (8)$$

In the general case, V is not a sparse matrix and the computation of W V^{-1} W^T can take a lot of time. In order to be able to compute the localization uncertainty in real-time, an additional hypothesis must be made. If we assume that the positions of the 3D points are independent random variables, then Cov_3d becomes a block-diagonal matrix and matrix V can be rewritten as:

$$V = \begin{pmatrix} B_1^T B_1 + C_{1,1}^{-1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & B_n^T B_n + C_{n,n}^{-1} \end{pmatrix} \qquad (9)$$

With this assumption Cov_cam can be computed very quickly. Additionally, the computation of the covariance matrix of the 3D points can also be simplified because we need to compute only the blocks on the diagonal. Section 4.2 details the computation of the covariance matrix of the 3D points, then Section 4.3 compares the various methods that can be used.

4.2. Uncertainty in the 3D Reconstruction

Note that the reconstruction of the reference sequence is obtained by minimizing the sum of squared reprojection errors using bundle adjustment. We obtain a maximum likelihood estimation (MLE) of 3D points and extrinsic camera parameters assuming that the reprojection errors obey independent and identical zero mean Gaussian distributions. This perturbation of errors propagates to a Gaussian perturbation of the estimated parameters, such that the covariance matrix may be estimated (Hartley and Zisserman, 2000). Thus, for each key-frame of the reference sequence, we estimate the covariance matrix of the 3D points seen by the key-frame (named Cov_3d in the previous paragraph) as a 3n × 3n diagonal block of the full covariance matrix relating all the points and all the cameras.

We use the general method of gauge constraints to fix the reconstruction coordinate system and choose our covariance matrix (Triggs et al., 2000). Among other choices, the symmetric constraint on the reference camera location (e.g. Lhuillier et al., 2006) is chosen since it usually spreads evenly and minimizes the uncertainties on all reference cameras. Many details about a practical computation of point and camera covariances for complex reconstructions like ours are given in Lhuillier et al. (2006).

A simplification of this method can also be used if we make the additional assumption that there is no uncertainty for the cameras in the 3D reconstruction. In this case, the covariance matrix for each point can be computed independently. We consider a point X^j seen by n cameras C_i, i ∈ {1, ..., n}. We want to compute its covariance matrix C_{j,j}. We assume that the n reprojection errors of this point, denoted e_i in image i, follow independent normal distributions N(0, Σ_i). Furthermore, we assume that Σ_i = σ² I_2, with σ² an unknown parameter. Thus, the probability density function of the statistical model is:

$$f(X^j, \sigma^2) = \frac{1}{(2\pi\sigma^2)^n} \, e^{-\frac{\sum_i \|e_i\|^2}{2\sigma^2}} \qquad (10)$$

The MLE of X^j and σ² are X̂^j minimizing X^j ↦ Σ_i ||e_i||² and σ̂² = Σ_i ||e_i||² / (2n). We replace the MLE of σ² by the unbiased estimator σ̂² = Σ_i ||e_i||² / (2n − 3), with 2n the number of observations and 3 the number of degrees of freedom of X^j. This gives us Σ_i = σ̂² I_2. Finally, we obtain the estimation of C_{j,j} as the inverse of J_F^T J_F, where J_F is the Jacobian matrix of the error vector F with F = (e_1^T Σ_1^{-1/2}, ..., e_n^T Σ_n^{-1/2})^T.

Figure 6 shows the confidence ellipsoids computed with this method for the points visible in one key frame of the loop sequence (see Fig. 20). Because of the forward motion of the camera, most ellipsoids have a major axis much longer than the minor axis.

Figure 6. 99% confidence ellipsoids computed for the points visible in one frame (view from the top).

4.3. Comparison

In order to choose between all the possible trade-offs, we made a comparison of four methods based on different assumptions:

• In method 1, Cov_cam is computed with the full matrix Cov_3d.
• In method 2, Cov_3d is computed with the general method, but only the diagonal blocks are used in the computation of Cov_cam.
• In method 3, Cov_3d is computed with the assumption that the cameras have no uncertainty in the reconstruction process.
• In method 4, we simply ignore the uncertainty of the 3D positions of the points and the only source of uncertainty comes from the positions of the 2D points in the current frame.
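As a sketch of how the real-time camera covariance of Eqs. (8) and (9) (the variant used by method 2) could be evaluated, the code below forms the Schur complement with a block-diagonal V. The 2 × 6 and 2 × 3 Jacobian blocks and the 3 × 3 point covariances are assumed to be already available; this illustrates the linear algebra only and is not the authors' implementation.

```python
import numpy as np

def camera_covariance(A_blocks, B_blocks, C_blocks):
    """Covariance of the 6-DOF camera pose (Eq. 8) under the independence assumption (Eq. 9).

    A_blocks: (2,6) Jacobians of each observation w.r.t. the camera parameters.
    B_blocks: (2,3) Jacobians of each observation w.r.t. its 3D point.
    C_blocks: (3,3) covariances of the 3D points (the diagonal blocks of Cov_3d).
    """
    U = np.zeros((6, 6))
    S = np.zeros((6, 6))                      # accumulates W V^-1 W^T block by block
    for A, B, C in zip(A_blocks, B_blocks, C_blocks):
        U += A.T @ A                          # Eq. (5)
        Wi = A.T @ B                          # one 6x3 block of W, Eq. (6)
        Vi = B.T @ B + np.linalg.inv(C)       # corresponding diagonal block of V, Eq. (9)
        S += Wi @ np.linalg.solve(Vi, Wi.T)
    return np.linalg.inv(U - S)               # Eq. (8)
```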
Method 1 takes too much time to be used in a real-time system, but we included it in this comparison to know whether using the non-diagonal blocks of Cov_3d was mandatory. We computed the length of the major semi-axis of the 90% confidence ellipsoid associated with the camera positions for all four methods. The result is shown on Fig. 7 for a few frames of the video used in Section 6.6.

As we could expect, taking into account the uncertainty of the 3D points increases the size of the ellipsoids. Taking into account the uncertainty of the cameras in the reconstruction has the same effect. However, the difference between method 1 and method 2 is small, and since method 2 can be used in real-time, it seems that it is a good method for computing localization uncertainty. Computation time is the same for methods 2 and 3; the difference lies in the complexity of the implementation for computing the diagonal elements of Cov_3d. With the simplification made in method 3, the size of the confidence ellipsoid is slightly underestimated, but the result is still much closer to the one obtained with the full computation than with method 4. A comparison between the size of the confidence ellipsoid and the localization error measured with the RTK GPS is also done in Section 6.6.3.

5. Experimental Vehicle and Control Law

This part is presented only for the completeness of the paper. It refers to work made by another team of the laboratory (see Thuilot et al. (2004) for more details).

5.1. Vehicle Modeling

The experimental vehicle is called the Cycab. With its small dimensions (length: 1.90 m, width: 1.20 m), it can transport two passengers simultaneously. This vehicle is entirely under computer control. For the experiments detailed in this paper, only the front wheels are steered, so the mobility is the same as that of a common car. A classical kinematic model, the tricycle model, where the two actual front wheels are merged as a unique virtual wheel, is used (see Fig. 8). The vehicle configuration can be described without ambiguity by a 3-dimensional state vector composed of s, the curvilinear coordinate along the reference path of the projection of the vehicle rear axle center, and of y and θ̃, the vehicle lateral and angular deviations with respect to this path. On the other hand, the control vector is defined by the vehicle linear velocity and the front wheel steering angle, denoted respectively v and δ. The vehicle model is then given (see e.g. de Wit (1996)) by:

$$\begin{cases} \dot{s} = v \, \dfrac{\cos\tilde\theta}{1 - y\,c(s)} \\[2mm] \dot{y} = v \sin\tilde\theta \\[2mm] \dot{\tilde\theta} = v \left( \dfrac{\tan\delta}{l} - \dfrac{c(s)\cos\tilde\theta}{1 - y\,c(s)} \right) \end{cases} \qquad (11)$$

where l is the vehicle wheelbase and c(s) denotes the reference path curvature at coordinate s. It is assumed that y ≠ 1/c(s) (i.e. the vehicle is not on the reference path curvature center) and θ̃ ≠ π/2 [π]. In practical situations, if the vehicle is well initialized, such difficulties never arise.
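For illustration, the tricycle model of Eq. (11) can be integrated numerically as in the sketch below. The explicit Euler step, the state layout, the curvature callback c(s) and the numerical values in the example are assumptions made for this sketch only, not parameters of the actual vehicle.

```python
import math

def tricycle_step(state, v, delta, curvature, wheelbase, dt):
    """One explicit Euler step of the kinematic model of Eq. (11).

    state     : (s, y, theta_t) -- curvilinear abscissa, lateral and angular deviation
    v, delta  : linear velocity and front wheel steering angle (control inputs)
    curvature : function s -> c(s), curvature of the reference path at abscissa s
    """
    s, y, theta_t = state
    c = curvature(s)
    denom = 1.0 - y * c                 # assumed non-zero (vehicle not on curvature center)
    s_dot = v * math.cos(theta_t) / denom
    y_dot = v * math.sin(theta_t)
    theta_dot = v * (math.tan(delta) / wheelbase - c * math.cos(theta_t) / denom)
    return (s + dt * s_dot, y + dt * y_dot, theta_t + dt * theta_dot)

# Example: straight reference path (c = 0), small initial lateral offset,
# speed of about 2 km/h; wheelbase value is a placeholder.
state = (0.0, 0.3, 0.0)
for _ in range(100):
    state = tricycle_step(state, v=0.55, delta=-0.05,
                          curvature=lambda s: 0.0, wheelbase=1.2, dt=0.1)
```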
Figure 7. Major semiaxis (in meters) of the 90% confidence ellipsoid with four computation methods for a few frames localized with reference to
a given key-frame. Frame to frame variations are mostly due to changes of the 3D-2D correspondences.
5.2. Control Law

The control objective is to bring and maintain state variables y and θ̃ to 0, relying uniquely on control variable δ (v is considered as a possibly varying free parameter). The whole vehicle state vector (s, y, θ̃) is available with a satisfactory accuracy by comparing vehicle absolute position and heading, provided by the vision algorithm, with the reference path. Via invertible state and control transformations, the nonlinear vehicle model (11) can be converted, in an exact way, into the so-called chained form (see Samson, 1995): (a_1, a_2, a_3) = (s, y, (1 − y c(s)) tan θ̃) is the chained state vector and M = (m_1, m_2)^T = ϒ(v, δ)^T is the chained control vector. From this, a large part of linear systems theory can be used (but, since the transformations are exact, it is not required that the vehicle state is in a specific configuration, contrarily to tangent linearization techniques). More precisely, it can be noticed that path following (i.e. control of a_2 and a_3) can be achieved by designing only m_2 as a linear proportional derivative controller. The expression
of the actual control variable δ can then be obtained by inverting the chained control transformation. Computations, detailed in Thuilot et al. (2004), lead to:

$$\delta(y, \tilde\theta) = \arctan\!\left( l \left[ \frac{\cos^3\tilde\theta}{(1 - c(s)\,y)^2} \left( \frac{d\,c(s)}{ds}\, y \tan\tilde\theta - K_d (1 - c(s)\,y) \tan\tilde\theta - K_p\, y \right) + c(s)(1 - c(s)\,y)\tan^2\tilde\theta + \frac{c(s)\cos\tilde\theta}{1 - c(s)\,y} \right] \right) \qquad (12)$$

with K_p, K_d > 0 the proportional derivative gains.

6. Experimental Results

6.1. Methods for Performance Evaluation

6.1.1. Using GPS Data as the Ground Truth. Most of the results presented in this paper show the accuracy of the system by comparing the results of the algorithms with the data recorded by a Real Time Kinematic (RTK) GPS mounted on the Cycab. But comparing data obtained by two different sensors is not completely straightforward. This paragraph explains how results from the vision algorithms are compared to GPS measurements.

Two operations are needed so that both data sets can be compared. First, the GPS sensor is not mounted on the vehicle at the same place as the camera. The GPS receiver is located above the mid-point between the rear wheels of the car, while the camera is in front of the vehicle (1.15 meters in front of the GPS sensor). So the two sensors don't have the same trajectory. From the GPS positions, we computed a "virtual" GPS which indicates what a GPS would record if it was mounted on the Cycab at the same place as the camera. In addition, the 3D reconstruction is done in an arbitrary euclidian coordinate system, whereas the GPS positions are given in another coordinate system. So the whole 3D reconstruction has to be transformed using a global rotation, translation and scale change. The approach described by Faugeras and Herbert (1986) is used to compute this global transformation. After this transformation has been made, the camera and GPS positions are available in the same metric coordinate system. This process is done for the 3D reconstruction of the reference video sequence. After that, the localization algorithm gives directly the camera positions in the same metric coordinate system.

The GPS sensor we use is a Real Time Kinematic Differential GPS (Thalès Sagitta model). It is accurate to 1 cm (standard deviation) in a horizontal plane when enough satellites are available. The accuracy on the vertical axis is only 20 cm on our hardware platform. So we discard the vertical readings, and all the localization errors reported in this article are measured in a horizontal plane only. In any case, vertical errors could be interesting for a flying robot but not for our application.

6.1.2. Reconstruction and Localization Error. We want to distinguish between the error that is attributed to the reconstruction process and the error coming from the localization algorithm. So we define two errors to measure the reconstruction and the localization accuracy. The reconstruction error is the average distance between the camera positions computed in the structure from motion algorithm and the true positions given by the RTK GPS (after the two trajectories have been expressed in the same coordinate system). This error is mostly caused by a slow drift of the reconstruction process. It increases with the length and complexity of the trajectory. That means the 3D model we build is not perfectly matched to the real 3D world.

The definition of the localization error is a bit more complex. To understand why, suppose the robot is exactly on the reference path and the localization algorithm indicates that the robot is on the reference path. In this case, the localization error should be zero, and controlling the robot so that this error is kept zero would result in the robot following the reference path. But if we computed directly the difference between the 3D position given by the localization algorithm and the position given by the GPS, we would get the reconstruction error, which is not zero in most cases. In fact, in our application, a global localization is not necessary; only a relative position with respect to the reference trajectory is needed.

We define the localization error in order to measure the error we make in computing this relative localization with the vision algorithm. First we compute the lateral deviation between the current robot position G_1 and the closest robot position G_0 on the reference trajectory. This is illustrated on Fig. 9. The robot position is always defined by the position of the middle point of the rear axle of the vehicle. This position is directly given by the RTK GPS. When working with vision it must be computed from the camera position and orientation. The 3D reconstruction used as the reference has already been transformed so that it is in the same metric coordinate system as the GPS data. We start with the localization of the camera C_1 given by the localization part of the vision algorithm. From C_1 we compute the corresponding GPS position G_1 (it is possible because we measured the positions of the GPS receiver and the camera on the vehicle). Then we find the closest GPS position in the reference trajectory: we call it G_0. At point G_0 of the reference trajectory, we compute the tangent $\vec{T}$ and normal $\vec{N}$ to the trajectory. The lateral deviation computed with vision is $y_v = \overrightarrow{G_0 G_1} \cdot \vec{N}$. The lateral deviation is computed from the GPS measurements as well and we get y_g (in this case we have directly G_0 and G_1). y_g and y_v are the same physical
Figure 11. Position of the key frames (circles) with reference to the trajectory recorded by the RTK GPS (continuous line). Units in meters.
Figure 12. Lateral deviation measured with the RTK GPS yg (blue) or with vision yv (red). The two curves are nearly the same.
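The lateral deviation compared in Fig. 12 is the quantity defined in Section 6.1.2 ($y_v = \overrightarrow{G_0 G_1} \cdot \vec{N}$). A minimal sketch of that computation, assuming the reference trajectory is available as an ordered polyline of 2D points (an assumption of this sketch, not a statement about the paper's data format), could look like this:

```python
import numpy as np

def lateral_deviation(robot_xy, reference_xy):
    """Signed lateral deviation of robot_xy from an ordered 2D reference trajectory.

    reference_xy: (n,2) array of points along the reference path.
    Returns y = (G1 - G0) . N, where G0 is the closest reference point and
    N is the unit normal of the path at G0 (left of the direction of travel)."""
    ref = np.asarray(reference_xy, dtype=float)
    g1 = np.asarray(robot_xy, dtype=float)
    i0 = int(np.argmin(np.linalg.norm(ref - g1, axis=1)))    # closest reference point G0
    i_next = min(i0 + 1, len(ref) - 1)
    i_prev = max(i0 - 1, 0)
    tangent = ref[i_next] - ref[i_prev]                       # finite-difference tangent T
    tangent /= np.linalg.norm(tangent)
    normal = np.array([-tangent[1], tangent[0]])              # N: T rotated by +90 degrees
    return float(np.dot(g1 - ref[i0], normal))
```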
6.3.2. Rotational Accuracy. In order to evaluate the rotational accuracy, we made an indoor experiment because the GPS can't give accurate angular measurements. The camera was mounted on a rotating platform, with the optical center on the rotational axis. The angle of the platform can be read with about ±0.1° accuracy. We compared the orientation α provided by the localization part of the vision algorithm to the angle α_0 given by the platform. For this experiment (and the following ones till the end of the article), we used a fish-eye lens providing a 130° field of view (in the diagonal) and we made a measurement for each angle from α_0 = −94° to α_0 = 94° with a 2° increment. The reference trajectory was a straight line (1 m long) oriented along the optical axis (which was in the 0° direction). The result of this experiment appears on Fig. 14. The algorithm was not able to provide a reliable pose of the camera when the angle reached 95° because there were not enough point correspondences. The angular accuracy measured with this setup is about ±0.1°, which is about the same as what can be read on the platform. The algorithm provides useful angular information for a deviation up to 94° on either side with this camera. Of course, with such an angular deviation from the reference frame, the part of the image which
[Figure 14: localization error (meters) and angular error (degrees) as a function of the platform angle.]
Figure 15. Images taken at −90◦ , −60◦ , −30◦ , 30◦ , 60◦ and 90◦ orientation, with interest points correctly matched.
can be used is very small, and the localization becomes impossible if there is an occultation in this area. Images captured at different orientations are shown on Fig. 15.

6.4. Autonomous Navigation

We also made some experiments where we tested the whole system. A reference video sequence was recorded and a 3D reconstruction was computed from it. Then the robot had to follow the same path in autonomous navigation. We recorded GPS measurements at the same time we recorded the learning video sequence. Then we also recorded GPS measurements in autonomous navigation. The learning path was 127 meters long. It was chosen so that there are both straight lines and tight turns, and because the buildings are sometimes far (fewer visual features) and sometimes closer. Vehicle speed was chosen constant and equal to 2 km/h. The result of the structure from motion algorithm is displayed on Fig. 16 as seen from the top. There were 182 key frames and 16295 points correctly reconstructed.

The reference video sequence was recorded on a cloudy day. The first two navigation experiments were made a few days later with cloudy weather too. But the second set of two was made on a clear day with the sun low in the sky and sometimes in the field of view of the camera. A few images from the video recorded during the last navigation experiment, as well as the corresponding key frames, are displayed on Fig. 17. The last image outlines the necessity of having a wide field of view and local visual features all over the frame. It shows the interest points which are used in the localization. The center of the image is completely overexposed because the sun is in front of the camera, but the building on the left can still be used for computing the camera pose.

For comparison purposes, a fifth experiment has also been performed, relying on the RTK GPS sensor (instead of the vision algorithms) to provide the vehicle state vector which is used in the control law. Lateral deviations from the reference path recorded during 3 of these experiments are displayed with the same scale on Fig. 18. Letters identify each part of the trajectory, with respect to the letters shown on Fig. 16.

It can be observed that the vision algorithms detailed in this paper appear as a very attractive alternative to the RTK GPS sensor, since they can provide roughly the same guidance accuracy. It can also be noticed that these vision algorithms are reliable with respect to outdoor applications since they appear robust to weather conditions: guidance accuracy is not significantly altered in conditions as harsh as the sunny ones. More precisely, guidance performances during straight-line and curve following are investigated separately in Tables 1 and 2. Table 1 reports the mean value of the lateral deviation (|y|) during the straight-line parts of the trajectory, denoted B and D. In the most favorable situation (cloudy weather), the vision algorithms meet the performances obtained with the RTK GPS sensor. In the worst case (sunny weather), performances are slightly degraded, but are still very satisfactory. Table 2 displays the extremum values of y recorded during the curved parts of the trajectory, denoted C and E. Once more, it can be observed that guidance performances are roughly similar.

For these experiments, guidance accuracy seems to be limited by the ability of the control law to keep the robot on the trajectory. This explains the similarity between the two vision based lateral deviations ("Cloudy 2" and "Sunny 1" on Fig. 18). The difference between the RTK GPS and vision based control may be explained because vehicle heading is not computed in the same way with both sensors. This is especially visible in curves. When relying on the RTK GPS sensor, vehicle heading is inferred from velocity measurements (obtained by differentiating successive position data) under non-slipping
Figure 16. 3D reconstruction computed from the reference video sequence (top view). Black squares are the position of the key frames. The
landmarks appear as dots. Letters indicate different parts of the trajectory.
Figure 17. Three frames taken during the autonomous navigation (bottom) and the corresponding key frames (top).
assumptions, smoothed via a Kalman filter. The filter introduces a delay and the result may not be as accurate as the orientation computed with vision.

Table 2. Maximum and minimum deviation in curves.

Figure 18. Lateral deviation from the reference trajectory (horizontal axis: curvilinear abscissa along the reference trajectory in meters; letters A to E mark the parts of the trajectory shown on Fig. 16).

6.5. Navigation on a Closed Loop

The case of a closed loop is very interesting because it is a good way to visualize the influence of the reconstruction error on robot navigation. Of course, it would be possible to treat specifically the case of closed loops by searching point correspondences between the last image of the sequence and the first one in order to reduce reconstruction error. But that's not our purpose. We want to use closed loops as a way to illustrate what happens in the presence of reconstruction error. An example of a closed loop trajectory appears on Fig. 19 with a few images from the video sequence on Fig. 20. The position of the first and last cameras should be the same. But because of some drift in the reconstruction process, the reconstruction of the loop is not exactly closed (the gap measures 1.5 m). In that case, some points are reconstructed twice: once when they are seen at the beginning of the sequence and once when they are seen at the end of the loop. In spite of this reconstruction error, the robot could navigate continuously on this trajectory (for several loops), without a hint of jerky motion at the seam.
Figure 21. Lateral deviation computed at the beginning or at the end of a loop.
To understand why, we have to recall which variables are used in the control law. Along with the curvature of the path, the two other variables are the lateral deviation from the reference path y_v and the angular deviation. Figure 21 helps to understand what happens. The localization is computed with reference to the point cloud visible at one moment by the camera (the points which appear on only one key frame). The absolute localization is not used for the computation of the lateral and angular deviations. What is important is the localization in a local reference frame whose center is the position of the robot at the key frame. On Fig. 21, this local reference frame is drawn for the beginning of the loop (I_1, $\vec{T_1}$, $\vec{N_1}$) and for the end of the loop (I_N, $\vec{T_N}$, $\vec{N_N}$), where $\vec{T}$ and $\vec{N}$ are the unit vectors tangent and normal to the reference trajectory. The change from the end of a loop to the beginning of the next one corresponds to a change of the local reference frame which affects simultaneously the reference trajectory, the current localization, and the 3D points currently visible. In this case, the lateral deviation y_v has no discontinuity. The same is true for the angular deviation.

This characteristic suggests that it should be possible to link several short trajectories (of about 100 m) to provide the robot with enough information to travel for several kilometers. A global bundle adjustment would be too costly for trajectories of several kilometers, but we see here that a global optimization at this scale should not be necessary. Good accuracy on the small scale (50 m to 100 m) is enough.

6.6. Navigation on a Path Different from the Learning Path

6.6.1. Experiment. From a practical point of view, exactly following the path that was recorded can cause some problems. Ideally, if there is an obstacle on the trajectory, the robot should be able to go around it. That's where the expense of computing the 3D structure of the environment pays off. Knowing the 3D position of the interest points allows the robot to localize itself outside the learning path and to navigate in the neighborhood of this path. We made an experiment to see how much the robot could depart from the learning trajectory.

First, a learning video sequence was recorded on a 70-meter-long trajectory. From this video sequence, a map was computed. A few images from the learning video sequence are shown on Fig. 22. This sequence was reconstructed with 102 key frames and 9579 points. For this experiment we defined a new trajectory (target trajectory), slightly different from the learning trajectory. The robot had to follow the target trajectory while using the learning trajectory and video sequence to localize itself. The experiment was conducted two weeks after the
Figure 23. 3D reconstruction of the learning trajectory (black squares) and target trajectory (broad green line) viewed from the top.
learning trajectory was recorded. Figure 23 shows the 3D reconstruction of the learning sequence (black squares) and the target trajectory that was defined in a graphic editor (broad green line). The target trajectory was defined so that it reproduces what would happen if the robot had to avoid two obstacles. The target path departs from the learning path on the right at the beginning and then on the left a few meters later. The maximum lateral deviation from the learning trajectory is about 3 meters. The angular deviation is at most 20°. We parked a vehicle on the learning trajectory to add some occultations (see Fig. 24) to simulate the images the robot would use if it had to avoid a real obstacle. Some pedestrians were also passing by, occulting some parts of the image from time to time. The target trajectory was simply defined with the mouse in a graphical editor. Defining a new trajectory doesn't require expensive computations, so this could be done online if necessary. For example, we could use a sensor (radar, sonar or laser range finder) to detect obstacles in front of the robot. If an obstacle was detected and localized by this sensor in the robot's coordinate system, it would be possible to modify the trajectory in 3D space in the same coordinate system without stopping the vehicle.

6.6.2. Localization and Navigation Results. Let's examine the path that was really followed by the robot during this experiment. We will call it the result path. It was accurately recorded by the RTK GPS sensor. The robot nearly followed the target path. This is visible on Fig. 25. The blue curve shows the lateral deviation of the target path with reference to the learning path. The red curve is the lateral deviation of the result path with reference to the learning path. The red curve was obtained by using the GPS measurements. The lateral deviation from the learning path directly computed from the result of the localization algorithm is displayed on the top row of Fig. 26. At this time, we don't use any filter to smooth the result of the localization algorithm. The result path deviates slightly from the target path in the turns. That is probably because we defined the target path in a graphical editor without knowing if the robot could steer fast enough to really follow the path. The lateral deviation computed with the vision algorithm matches the GPS measurements well in those parts of the trajectory.

The localization is noisier when the robot departs from the learning trajectory, but the localization is still accurate enough to drive the robot. We recorded the ground truth with the GPS, so we are able to measure the error
Figure 24. Interest points correctly matched between the reference key frame and the current image. Left: when the robot is nearly on the learning
path. Right: when the lateral deviation is maximal. The reference frame is on top.
Figure 25. Lateral deviation of the target trajectory (blue), and lateral deviation measured by the GPS during the navigation experiment (red).
Figure 26. Lateral deviation computed by the vision algorithm (top), Lateral deviation error (middle) and number of interest points correctly matched
(bottom).
that was made by the localization algorithm when computing the lateral deviation from the target path. This error appears in the middle of Fig. 26. When the robot is on the learning path, the error is less than 2 centimeters (the same accuracy as our GPS). When the robot is farther from the learning path, this error increases to more than 10 centimeters. The reason lies in the number of points matched. Figure 24 shows the interest points that were correctly matched between the current frame and the reference frame for two robot positions during the navigation experiment. We can see that the algorithm matches fewer points when the robot is far from the learning path. This is also visible at the bottom of Fig. 26, which displays the number of points used in the localization process for the whole experiment. When the robot is on the learning path, more than 200 points are matched. This number drops to 100 when the robot is 3 meters away. What is also important to notice is that the points close to the camera are lost first, and those points are the ones which provide the best positional accuracy. Points at infinity, on the other hand, can only give some rotational information. Table 3 shows how localization accuracy is related to the number of points matched. The graph of Fig. 26 was divided into three parts. Zones A and C correspond to parts where the robot was far from the learning trajectory and zone B corresponds to a part where the robot was on the reference trajectory. For each zone, the average number of points matched was computed as well as the standard deviation of the error of the vision algorithm.

Table 3. Number of points matched and localization error.

Zone                        A           B            C
Time limits                 20−70 s     70−115 s     115−170 s
Average number of points    119         220          110
Localization error          8 cm        1 cm         8 cm

The robot can depart from the learning trajectory up to a limit which depends on the number of points that can be matched between the reference video sequence and the current image. Points can be lost because they may be seen from a different point of view and the correlation score is not high enough to make a valid match. This is the case if the lateral deviation is large. When the angular deviation is important, points are lost because they go out of the field of view of the camera. Nothing can be done to correct that except going to an omnidirectional camera. On the other hand, we could take into account the difference in point of view by using wide baseline matching techniques or by predicting the appearance of the landmarks based on the camera position.

6.6.3. Validation of the Localization Uncertainty. The aim of this analysis is to make sure that the size of the confidence ellipsoids varies in the same way as the localization errors measured with the RTK GPS. The experiment presented in Section 6.6 gives us an opportunity to make an experimental validation of the computation of the localization uncertainty. In the experiments described in paragraph 6.3.1, the confidence ellipsoids were too small
Figure 27. Comparison between the major semiaxis of the 90% confidence ellipsoid computed with the vision algorithm (black) and localization
error measured with the RTK GPS (grey). Vertical scale in meters.
compared to the accuracy of the RTK GPS to have meaningful results.

For each video frame taken during the autonomous navigation, we computed the length a of the major semi-axis of the 90% confidence ellipsoid with methods 2, 3 and 4 as explained in Section 4. Method 1 was not included in this comparison because it is not a real-time method. The localization error was also computed by using the RTK GPS measurements in a way slightly different from what is explained in Section 6.1.2. In Section 6.1.2, the measurements from both sensors were brought into the same coordinate system and the position of the Cycab was computed. Then the localization error was measured along the normal to the trajectory. Here, the same computations are done to bring the data sets from both sensors into the same coordinate system, but the localization error is the distance between the position computed with vision and the position computed with the GPS. It includes the error along the normal and the error along the tangent to the trajectory. Figure 27 shows the localization error compared to the length a_2 of the major semiaxis of the 90% confidence ellipsoid computed with method 2. The major semiaxis computed with method 2 is roughly of the same size as the localization error, and both quantities vary in the same way. This experiment shows that method 2 is a valid method for computing the localization uncertainty.

A comparison can be made between the three methods used to compute the localization uncertainty. Figure 28 shows for each method the histogram of log2(a/a_2) for every frame in the video sequence. If we assume zero uncertainty on the cameras in the 3D reconstruction (in method 3), the length of the major semiaxis is nearly half of the length computed with method 2. Ignoring the uncertainty on the 3D points (in method 4) leads to an even larger underestimation of the size of the ellipsoid.

6.7. Computation Times

The computation times given here were measured on a 3 GHz Pentium 4 processor with an image size of 512 × 384 pixels. The code of the interest point detection and matching uses the SSE2 instruction set. Off-line map building for a sequence with 100 key frames such as the one on Fig. 23 takes about 15 min. For a larger sequence such as the one on Fig. 3 with approximately 300 key frames, map building is done in 60 min. Most of the time is spent in the bundle adjustment. The localization runs at 15 frames per second (which is also the frame rate of our camera). The computation time can be roughly divided as: 35 ms for interest point detection, 15 ms for matching, 10 ms for the pose computation and 5 ms for the uncertainty computation.

6.8. Database Management

The memory needed to store the 3D model of the environment is very reasonable because we don't need to keep the whole key frames. We need to keep only image data around the interest points to compute the ZNCC, as well as the 2D and 3D coordinates of the points. With about 150 bytes per interest point and about 500 interest points per key frame, each key frame takes less than 100 Kb of memory. Since the model is built with one key frame per meter on average, storing a 1 kilometer long trajectory needs about 100 Mb. Computers can handle this amount of data without difficulty.

From a practical point of view, keeping the model of the environment up to date is needed even if localization is possible with some modifications of the environment (see Royer et al. (2005c) or Fig. 24 for example). We have made a lot of experiments during the past two years
[Figure 28: histograms for method 2, method 3 and method 4 (horizontal axis from −4 to 4).]
in different places. We found that once a model is built, it can be used for a few weeks or more depending on the season and the kind of place. Buildings don't change much, but trees can change quickly in spring and fall. We have not developed a method to update the model yet, but we have some ideas about how to do that. During autonomous navigation, the pose of the camera is computed and the interest points are detected for each frame. We can store this data every time the camera is near a key frame, so that at the end of a navigation experiment we have all the necessary information to update the database. Then we could match the interest points in the recorded sequence and compute the 3D position of the points. A bundle adjustment could be used to refine the structure. New points could be added to the model, and points which have not been used for a long time could be removed.

7. Conclusion

same accuracy. When the robot departs from the learning path, the performance of the vision algorithm is somewhat degraded but still satisfactory in the case of obstacle avoidance, for example. These two sensing devices appear complementary: autonomous navigation in urban environments cannot satisfactorily be addressed by RTK GPS sensors since tall buildings can disturb satellite receiving. These buildings, however, offer a lot of visual features which can be used to feed vision algorithms.

The main difficulty with the vision algorithm is to keep a map of the environment up to date. Even if the experiments presented in this paper have shown that the localization algorithm is robust to some changes, it may not be enough for an ever changing environment. For example, in a city, cars parked along the side of the road change from day to day, trees evolve according to the season, some buildings are destroyed while others are built or modified. So our goal is to have a method to update the map automatically in order to take these modifications into account.
References

1. Araújo, H., Carceroni, R.J., and Brown, C.M. 1998. A fully projective formulation to improve the accuracy of Lowe's pose estimation algorithm. Computer Vision and Image Understanding, 70(2):227–238.
2. Argyros, A., Bekris, K., Orphanoudakis, S., and Kavraki, L. 2005. Robot homing by exploiting panoramic vision. Journal of Autonomous Robots, 19(1):7–25.
3. Blanc, G., Mezouar, Y., and Martinet, P. 2005. Indoor navigation of a wheeled mobile robot along visual routes. In IEEE International Conference on Robotics and Automation, ICRA'05, Barcelona, Spain.
4. Cobzas, D., Zhang, H., and Jagersand, M. 2003. Image-based localization with depth-enhanced image map. In International Conference on Robotics and Automation.
5. Davison, A.J. 2003. Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the 9th International Conference on Computer Vision, Nice.
6. de Wit, C.C., Siciliano, B., and Bastin, G. 1996. The Zodiac, Theory of Robot Control. Springer Verlag.
7. Faugeras, O. and Herbert, M. 1986. The representation, recognition, and locating of 3-d objects. International Journal of Robotic Research, 5(3):27–52.
8. Fischler, M.A. and Bolles, R. 1981. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Communications of the Association for Computing Machinery, 24:381–395.
9. Georgiev, A. and Allen, P.K. 2004. Localization methods for a mobile robot in urban environments. IEEE Transactions on Robotics, 20(5):851–864.
10. Goedemé, T., Tuytelaars, T., Van Gool, L., Vanacker, G., and Nuttin, M. 2005. Feature based omnidirectional sparse visual following. In International Conference on Intelligent Robots and Systems, pp. 1003–1008.
11. Haralick, R., Lee, C., Ottenberg, K., and Nolle, M. 1994. Review and analysis of solutions of the three point perspective pose estimation problem. International Journal of Computer Vision, 13(3):331–356.
12. Harris, C. and Stephens, M. 1988. A combined corner and edge detector. In Alvey Vision Conference, pp. 147–151.
13. Hartley, R. and Zisserman, A. 2000. Multiple View Geometry in Computer Vision. Cambridge University Press.
14. Kidono, K., Miura, J., and Shirai, Y. 2002. Autonomous visual navigation of a mobile robot using a human-guided experience. Robotics and Autonomous Systems, 40(2–3):124–1332.
15. Lavest, J.M., Viala, M., and Dhome, M. 1998. Do we need an accurate calibration pattern to achieve a reliable camera calibration? In European Conference on Computer Vision, pp. 158–174.
16. Lhuillier, M. and Perriollat, M. 2006. Uncertainty ellipsoids calculations for complex 3d reconstructions. In International Conference on Robotics and Automation.
17. Matsumoto, Y., Inaba, M., and Inoue, H. 1996. Visual navigation using view-sequenced route representation. In International Conference on Robotics and Automation, pp. 83–88.
18. Nistér, D. 2001. Frame decimation for structure and motion. In 2nd Workshop on Structure from Multiple Images of Large Environments, Springer Lecture Notes on Computer Science, vol. 2018, pp. 17–34.
19. Nistér, D. 2003. An efficient solution to the five-point relative pose problem. In Conference on Computer Vision and Pattern Recognition, pp. 147–151.
20. Nistér, D., Naroditsky, O., and Bergen, J. 2004. Visual odometry. In Conference on Computer Vision and Pattern Recognition, pp. 652–659.
21. Remazeilles, A., Chaumette, F., and Gros, P. 2004. Robot motion control from a visual memory. In International Conference on Robotics and Automation, vol. 4, pp. 4695–4700.
22. Royer, E., Bom, J., Dhome, M., Thuilot, B., Lhuillier, M., and Marmoiton, F. 2005a. Outdoor autonomous navigation using monocular vision. In International Conference on Intelligent Robots and Systems, pp. 3395–3400.
23. Royer, E., Lhuillier, M., Dhome, M., and Chateau, T. 2005b. Localization in urban environments: Monocular vision compared to a differential gps sensor. In International Conference on Computer Vision and Pattern Recognition, CVPR.
24. Royer, E., Lhuillier, M., Dhome, M., and Lavest, J.-M. 2005c. Performance evaluation of a localization system relying on monocular vision and natural landmarks. In Proceedings of the ISPRS Workshop BenCOS (Towards Benchmarking Automated Calibration, Orientation and Surface Reconstruction from Images).
25. Samson, C. 1995. Control of chained systems. Application to path following and time-varying point stabilization of mobile robots. IEEE Transactions on Automatic Control, 40.
26. Se, S., Lowe, D., and Little, J. 2002. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. International Journal of Robotic Research, 21(8):735–760.
27. Simond, N. and Rives, P. 2004. Trajectography of an uncalibrated stereo rig in urban environments. In International Conference on Intelligent Robots and Systems, pp. 3381–3386.
28. Thuilot, B., Bom, J., Marmoiton, F., and Martinet, P. 2004. Accurate automatic guidance of an urban vehicle relying on a kinematic gps sensor. In Symposium on Intelligent Autonomous Vehicles IAV04.
29. Torr, P., Fitzgibbon, A., and Zisserman, A. 1999. The problem of degeneracy in structure and motion recovery from uncalibrated image sequences. International Journal of Computer Vision, 32(1):27–44.
30. Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A. 2000. Bundle adjustment—A modern synthesis. In Vision Algorithms: Theory and Practice, W. Triggs, A. Zisserman, and R. Szeliski (Eds.), Lecture Notes in Computer Science, pp. 298–375. Springer Verlag.
31. Vacchetti, L., Lepetit, V., and Fua, P. 2003. Stable 3-d tracking in real-time using integrated context information. In Conference on Computer Vision and Pattern Recognition, Madison, WI.