Parallel Tracking and Mapping for Small AR Workspaces

Georg Klein David Murray

Active Vision Laboratory


Department of Engineering Science
University of Oxford

ABSTRACT

This paper presents a method of estimating camera pose in an unknown scene. While this has previously been attempted by adapting SLAM algorithms developed for robotic exploration, we propose a system specifically designed to track a hand-held camera in a small AR workspace. We propose to split tracking and mapping into two separate tasks, processed in parallel threads on a dual-core computer: one thread deals with the task of robustly tracking erratic hand-held motion, while the other produces a 3D map of point features from previously observed video frames. This allows the use of computationally expensive batch optimisation techniques not usually associated with real-time operation: the result is a system that produces detailed maps with thousands of landmarks which can be tracked at frame-rate, with an accuracy and robustness rivalling that of state-of-the-art model-based systems.
1 INTRODUCTION

The majority of Augmented Reality (AR) systems operate with prior knowledge of the user's environment, i.e. some form of map. This could be a map of a city, a CAD model of a component requiring maintenance, or even a sparse map of fiducials known to be present in the scene. The application then allows the user to interact with this environment based on prior information on salient parts of this model (e.g. "this location is of interest" or "remove this nut from this component"). If the map or model provided is comprehensive, registration can be performed directly from it, and this is the common approach to camera-based AR tracking.

Figure 1: Typical operation of the system. Here, a desktop is tracked. The on-line generated map contains close to 3000 point features, of which the system attempted to find 1000 in the current frame. The 660 successful observations are shown as dots. Also shown is the map's dominant plane, drawn as a grid, on which virtual characters can interact. This frame was tracked in 18ms.
Unfortunately, a comprehensive map is often not available; often a small map of only an object of interest is available - for example, a single physical object in a room or a single fiducial marker. Tracking is then limited to the times when this known feature can be measured by some sensor, and this limits range and quality of registration. This has led to the development of a class of techniques known (in the AR context) as extensible tracking [21, 14, 4, 28, 2] in which the system attempts to add previously unknown scene elements to its initial map, and these then provide registration even when the original map is out of sensing range. In [4], the initial map is minimal, consisting only of a template which provides metric scale; later versions of this monocular SLAM algorithm can now operate without this initialisation template.

The logical extension of extensible tracking is to track in scenes without any prior map, and this is the focus of this paper. Specifically, we aim to track a calibrated hand-held camera in a previously unknown scene without any known objects or initialisation target, while building a map of this environment. Once a rudimentary map has been built, it is used to insert virtual objects into the scene, and these should be accurately registered to real objects in the environment.

Since we do not use a prior map, the system has no deep understanding of the user's environment and this precludes many task-based AR applications. One approach for providing the user with meaningful augmentations is to employ a remote expert [4, 16] who can annotate the generated map. In this paper we take a different approach: we treat the generated map as a sandbox in which virtual simulations can be created. In particular, we estimate a dominant plane (a virtual ground plane) from the mapped points - an example of this is shown in Figure 1 - and allow this to be populated with virtual characters. In essence, we would like to transform any flat (and reasonably textured) surface into a playing field for VR simulations (at this stage, we have developed a simple but fast-paced action game). The hand-held camera then becomes both a viewing device and a user interface component.

To further provide the user with the freedom to interact with the simulation, we require fast, accurate and robust camera tracking, all while refining the map and expanding it if new regions are explored. This is a challenging problem, and to simplify the task somewhat we have imposed some constraints on the scene to be tracked: it should be mostly static, i.e. not deformable, and it should be small. By small we mean that the user will spend most of his or her time in the same place: for example, at a desk, in one corner of a room, or in front of a single building. We consider this to be compatible with a large number of workspace-related AR applications, where the user is anyway often tethered to a computer. Exploratory tasks such as running around a city are not supported.

The next section outlines the proposed method and contrasts this to previous methods. Subsequent sections describe in detail the method used, present results and evaluate the method's performance.
2 METHOD OVERVIEW IN THE CONTEXT OF SLAM

Our method can be summarised by the following points:

- Tracking and Mapping are separated, and run in two parallel threads.
- Mapping is based on keyframes, which are processed using batch techniques (Bundle Adjustment).
- The map is densely initialised from a stereo pair (5-Point Algorithm).
- New points are initialised with an epipolar search.
- Large numbers (thousands) of points are mapped.

To put the above in perspective, it is helpful to compare this approach to the current state-of-the-art. To our knowledge, the two most convincing systems for tracking-while-mapping a single hand-held camera are those of Davison et al. [5] and Eade and Drummond [8, 7]. Both systems can be seen as adaptations of algorithms developed for SLAM in the robotics domain (respectively, these are EKF-SLAM [26] and FastSLAM 2.0 [17]) and both are incremental mapping methods: tracking and mapping are intimately linked, so current camera pose and the position of every landmark are updated together at every single video frame.

Here, we argue that tracking a hand-held camera is more difficult than tracking a moving robot: firstly, a robot usually receives some form of odometry; secondly, a robot can be driven at arbitrarily slow speeds. By contrast, this is not the case for hand-held monocular SLAM and so data-association errors become a problem, and can irretrievably corrupt the maps generated by incremental systems. For this reason, both monocular SLAM methods mentioned go to great lengths to avoid data association errors. Starting with covariance-driven gating (active search), they must further perform binary inlier/outlier rejection with Joint Compatibility Branch and Bound (JCBB) [19] (in the case of [5]) or Random Sample Consensus (RANSAC) [10] (in the case of [7]). Despite these efforts, neither system provides the robustness we would like for AR use.

This motivates a split between tracking and mapping. If these two processes are separated, tracking is no longer probabilistically slaved to the map-making procedure, and any robust tracking method desired can be used (here, we use a coarse-to-fine approach with a robust estimator). Indeed, data association between tracking and mapping need not even be shared. Also, since modern computers now typically come with more than one processing core, we can split tracking and mapping into two separately-scheduled threads. Freed from the computational burden of updating a map at every frame, the tracking thread can perform more thorough image processing, further increasing performance.
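In implementation terms, this split maps naturally onto two operating-system threads sharing the map under a lock, with the tracker handing completed keyframes to the mapper through a queue. The following is a minimal sketch of that arrangement only; the names (SharedMap, TrackingLoop, MappingLoop) and the queue-based hand-off are our own assumptions and are not taken from the released code.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Keyframe  { /* image pyramid, pose estimate, measurements ... */ };
struct SharedMap { /* point features and keyframes, guarded by map_mutex */ };

std::mutex map_mutex;                 // guards the shared map
std::queue<Keyframe> keyframe_queue;  // tracker -> mapper hand-off
std::condition_variable queue_cv;
std::atomic<bool> running{true};

// Tracking thread: must keep up with the 30 Hz video input.
void TrackingLoop(SharedMap& map) {
    while (running) {
        Keyframe frame;   // grab a camera frame, build the pyramid, detect corners
        {
            std::lock_guard<std::mutex> lock(map_mutex);
            // project map points, search patches, update the pose estimate
        }
        bool keyframe_needed = false;   // see Section 6.2 for the actual criteria
        if (keyframe_needed) {
            std::lock_guard<std::mutex> lock(map_mutex);
            keyframe_queue.push(frame);
            queue_cv.notify_one();
        }
    }
}

// Mapping thread: free to run expensive batch optimisation between keyframes.
void MappingLoop(SharedMap& map) {
    while (running) {
        std::unique_lock<std::mutex> lock(map_mutex);
        queue_cv.wait(lock, [] { return !keyframe_queue.empty() || !running; });
        while (!keyframe_queue.empty()) {
            Keyframe kf = keyframe_queue.front();
            keyframe_queue.pop();
            // integrate keyframe: measure features, run the epipolar search
        }
        lock.unlock();
        // bundle adjustment and data-association refinement run here
    }
}

int main() {
    SharedMap map;
    std::thread tracker(TrackingLoop, std::ref(map));
    std::thread mapper(MappingLoop, std::ref(map));
    tracker.join();
    mapper.join();
}
```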
Next, if mapping is not tied to tracking, it is not necessary to use every single video frame for mapping. Many video frames contain redundant information, particularly when the camera is not moving. While most incremental systems will waste their time re-filtering the same data frame after frame, we can concentrate on processing some smaller number of more useful keyframes. These new keyframes then need not be processed within strict real-time limits (although processing should be finished by the time the next keyframe is added) and this allows operation with a larger numerical map size. Finally, we can replace incremental mapping with a computationally expensive but highly accurate batch method, i.e. bundle adjustment.

While bundle adjustment has long been a proven method for off-line Structure-from-Motion (SfM), we are more directly inspired by its recent successful applications to real-time visual odometry and tracking [20, 18, 9]. These methods build an initial map from five-point stereo [27] and then track a camera using local bundle adjustment over the N most recent camera poses (where N is selected to maintain real-time performance), achieving exceptional accuracy over long distances. While we adopt the stereo initialisation, and occasionally make use of local bundle updates, our method is different in that we attempt to build a long-term map in which features are constantly re-visited, and we can afford expensive full-map optimisations. Finally, in our hand-held camera scenario, we cannot rely on long 2D feature tracks being available to initialise features and we replace this with an epipolar feature search.

3 FURTHER RELATED WORK

Efforts to improve the robustness of monocular SLAM have recently been made by [22] and [3]. [22] replace the EKF typical of many SLAM problems with a particle filter which is resilient to rapid camera motions; however, the mapping procedure does not in any way consider feature-to-feature or camera-to-feature correlations. An alternative approach is taken by [3], who replace correlation-based search with a more robust image descriptor which greatly reduces the probabilities of outlier measurements. This allows the system to operate with large feature search regions without compromising robustness. The system is based on the unscented Kalman filter, which scales poorly (O(N^3)) with map size, and hence no more than a few dozen points can be mapped. However, the replacement of intensity-patch descriptors with a more robust alternative appears to have merit.

Extensible tracking using batch techniques has previously been attempted by [11, 28]. An external tracking system or fiducial markers are used in a learning stage to triangulate new feature points, which can later be used for tracking. [11] employs classic bundle adjustment in the training stage and achieves respectable tracking performance when later tracking the learned features, but no attempt is made to extend the map after the learning stage. [28] introduces a different estimator which is claimed to be more robust and accurate; however, this comes at a severe performance penalty, slowing the system to unusable levels. It is not clear if the latter system continues to grow the map after the initial training phase.

Most recently, [2] triangulate new patch features on-line while tracking a previously known CAD model. The system is most notable for the evident high-quality patch tracking, which uses a high-DOF minimisation technique across multiple scales, yielding convincingly better patch tracking results than the NCC search often used in SLAM. However, it is also computationally expensive, so the authors simplify map-building by discarding feature-feature covariances, effectively an attempt at FastSLAM 2.0 with only a single particle.

We notice that [15] have recently described a system which also employs SfM techniques to map and track an unknown environment - indeed, it also employs two processors, but in a different way: the authors decouple 2D feature tracking from 3D pose estimation. Robustness to motion is obtained through the use of inertial sensors and a fish-eye lens. Finally, our implementation of an AR application which takes place on a planar playing field may invite a comparison with [25], in which the authors specifically choose to track and augment a planar structure: it should be noted that while the AR game described in this system uses a plane, the focus lies on the tracking and mapping strategy, which makes no fundamental assumption of planarity.
4 THE MAP

This section describes the system's representation of the user's environment. Section 5 will describe how this map is tracked, and Section 6 will describe how the map is built and updated.

The map consists of a collection of M point features located in a world coordinate frame W. Each point feature represents a locally planar textured patch in the world. The jth point in the map (p_j) has coordinates p_{jW} = (x_{jW}  y_{jW}  z_{jW}  1)^T in coordinate frame W. Each point also has a unit patch normal n_j and a reference to the patch source pixels.

The map also contains N keyframes: these are snapshots taken by the hand-held camera at various points in time. Each keyframe has an associated camera-centred coordinate frame, denoted K_i for the ith keyframe. The transformation between this coordinate frame and the world is then E_{K_i W}. Each keyframe also stores a four-level pyramid of greyscale 8bpp images; level zero stores the full 640x480 pixel camera snapshot, and this is sub-sampled down to level three at 80x60 pixels.

The pixels which make up each patch feature are not stored individually; rather, each point feature has a source keyframe - typically the first keyframe in which this point was observed. Thus each map point stores a reference to a single source keyframe, a single source pyramid level within this keyframe, and a pixel location within this level. In the source pyramid level, patches correspond to 8x8 pixel squares; in the world, the size and shape of a patch depends on the pyramid level, distance from the source keyframe camera centre, and orientation of the patch normal.

In the examples shown later the map might contain some M = 2000 to 6000 points and N = 40 to 120 keyframes.
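As a concrete, though hypothetical, illustration of this representation, the map, points and keyframes might be laid out roughly as follows. The field names are ours and plain structs stand in for the TooN/libCVD types actually used; this is a sketch of the data layout described above, not the released source.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kPyramidLevels = 4;           // 640x480 at level 0 down to 80x60 at level 3

struct GreyImage {                          // placeholder for an 8bpp greyscale image
    int width = 0, height = 0;
    std::vector<std::uint8_t> pixels;
};

struct SE3 {                                // rigid-body transform (rotation + translation)
    std::array<double, 9> rotation{};       // row-major 3x3
    std::array<double, 3> translation{};
};

struct Keyframe {
    SE3 E_kw;                                       // pose of this keyframe, E_{K_i W}
    std::array<GreyImage, kPyramidLevels> pyramid;  // level 0 is the full camera frame
};

struct MapPoint {
    std::array<double, 3> p_w;       // position in world frame W
    std::array<double, 3> normal;    // unit patch normal n_j
    int source_keyframe = -1;        // keyframe the 8x8 source patch was taken from
    int source_level = 0;            // pyramid level of the source patch
    int source_u = 0, source_v = 0;  // pixel location within that level
};

struct Map {
    std::vector<MapPoint> points;    // typically 2000-6000 points
    std::vector<Keyframe> keyframes; // typically 40-120 keyframes
};
```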
5 TRACKING

This section describes the operation of the point-based tracking system, with the assumption that a map of 3D points has already been created. The tracking system receives images from the hand-held video camera and maintains a real-time estimate of the camera pose relative to the built map. Using this estimate, augmented graphics can then be drawn on top of the video frame. At every frame, the system performs the following two-stage tracking procedure:

1. A new frame is acquired from the camera, and a prior pose estimate is generated from a motion model.
2. Map points are projected into the image according to the frame's prior pose estimate.
3. A small number (50) of the coarsest-scale features are searched for in the image.
4. The camera pose is updated from these coarse matches.
5. A larger number (1000) of points is re-projected and searched for in the image.
6. A final pose estimate for the frame is computed from all the matches found.

5.1 Image acquisition

Images are captured from a Unibrain Fire-i video camera equipped with a 2.1mm wide-angle lens. The camera delivers 640x480 pixel YUV411 frames at 30Hz. These frames are converted to 8bpp greyscale for tracking and an RGB image for augmented display.

The tracking system constructs a four-level image pyramid as described in Section 4. Further, we run the FAST-10 [23] corner detector on each pyramid level. This is done without non-maximal suppression, resulting in blob-like clusters of corner regions.

A prior for the frame's camera pose is estimated. We use a decaying velocity model; this is similar to a simple alpha-beta constant velocity model, but lacking any new measurements, the estimated camera slows and eventually stops.
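A decaying constant-velocity prior of this kind can be sketched as below. The decay and blending constants and the use of a 6-vector velocity in the tangent space of SE(3) are our own assumptions for illustration; they are not values quoted in the paper.

```cpp
#include <array>

// 6-vector in the tangent space of SE(3): translation and rotation per frame.
using Twist = std::array<double, 6>;

struct DecayingVelocityModel {
    Twist velocity{};      // current estimate of inter-frame camera motion
    double decay = 0.9;    // assumed per-frame decay factor
    double blend = 0.5;    // assumed blending weight for new measurements

    // Motion to apply to the previous pose before tracking the new frame.
    Twist Predict() const { return velocity; }

    // After tracking, update the velocity from the measured inter-frame motion;
    // with no measurement available the velocity simply decays towards zero,
    // so the estimated camera slows and eventually stops.
    void Update(const Twist* measured_motion) {
        for (int i = 0; i < 6; ++i) {
            if (measured_motion)
                velocity[i] = decay * ((1.0 - blend) * velocity[i] +
                                       blend * (*measured_motion)[i]);
            else
                velocity[i] *= decay;
        }
    }
};
```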
5.2 Camera pose and projection

To project map points into the image plane, they are first transformed from the world coordinate frame to the camera-centred coordinate frame C. This is done by left-multiplication with a 4x4 matrix denoted E_CW, which represents camera pose:

p_{jC} = E_{CW} \, p_{jW}    (1)

The subscript CW may be read as "frame C from frame W". The matrix E_CW contains a rotation and a translation component and is a member of the Lie group SE(3), the set of 3D rigid-body transformations.

To project points in the camera frame into the image, a calibrated camera projection model CamProj() is used:

\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \mathrm{CamProj}(E_{CW} \, p_{iW})    (2)

We employ a pin-hole camera projection function which supports lenses exhibiting barrel radial distortion. The radial distortion model which transforms r to r' is the FOV-model of [6]. The camera parameters for focal length (f_u, f_v), principal point (u_0, v_0) and distortion (omega) are assumed to be known:

\mathrm{CamProj}\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} u_0 \\ v_0 \end{pmatrix} + \frac{r'}{r} \begin{bmatrix} f_u & 0 \\ 0 & f_v \end{bmatrix} \begin{pmatrix} x/z \\ y/z \end{pmatrix}    (3)

r = \sqrt{\frac{x^2 + y^2}{z^2}}    (4)

r' = \frac{1}{\omega} \arctan\left( 2 r \tan\frac{\omega}{2} \right)    (5)

A fundamental requirement of the tracking (and also the mapping) system is the ability to differentiate Eq. 2 with respect to changes in camera pose E_CW. Changes to camera pose are represented by left-multiplication with a 4x4 camera motion M:

E'_{CW} = M \, E_{CW} = \exp(\mu) \, E_{CW}    (6)

where the camera motion is also a member of SE(3) and can be minimally parametrised with a six-vector mu using the exponential map. Typically the first three elements of mu represent a translation and the latter three elements represent a rotation axis and magnitude. This representation of camera state and motion allows for trivial differentiation of Eq. 6, and from this, partial differentials of Eq. 2 of the form \partial u / \partial \mu_i, \partial v / \partial \mu_i are readily obtained in closed form. Details of the Lie group SE(3) and its representation may be found in [29].
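Equations (2)-(5) translate almost directly into code. The sketch below assumes known calibration values, omits the derivative computation, and is an illustration of the projection model rather than the authors' implementation.

```cpp
#include <cmath>

struct Pixel { double u, v; };

struct FovCamera {
    double fu, fv;   // focal lengths in pixels
    double u0, v0;   // principal point
    double omega;    // FOV distortion parameter of the model of [6]

    // Project a point already expressed in the camera frame C (Eqs. 2-5).
    Pixel CamProj(double x, double y, double z) const {
        const double xn = x / z, yn = y / z;             // normalised image plane
        const double r = std::sqrt(xn * xn + yn * yn);   // Eq. (4)
        // Eq. (5); as r -> 0 the ratio r'/r tends to 2*tan(omega/2)/omega.
        const double rp_over_r =
            (r > 1e-9) ? std::atan(2.0 * r * std::tan(omega / 2.0)) / (omega * r)
                       : 2.0 * std::tan(omega / 2.0) / omega;
        return { u0 + fu * rp_over_r * xn,               // Eq. (3)
                 v0 + fv * rp_over_r * yn };
    }
};
```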
5.3 Patch Search

To find a single map point p in the current frame, we perform a fixed-range image search around the point's predicted image location. To perform this search, the corresponding patch must first be warped to take account of viewpoint changes between the patch's first observation and the current camera position. We perform an affine warp characterised by a warping matrix A, where

A = \begin{bmatrix} \partial u_c / \partial u_s & \partial u_c / \partial v_s \\ \partial v_c / \partial u_s & \partial v_c / \partial v_s \end{bmatrix}    (7)

and {u_s, v_s} correspond to horizontal and vertical pixel displacements in the patch's source pyramid level, and {u_c, v_c} correspond to pixel displacements in the current camera frame's zeroth (full-size) pyramid level. This matrix is found by back-projecting unit pixel displacements in the source keyframe pyramid level onto the patch's plane, and then projecting these into the current (target) frame. Performing these projections ensures that the warping matrix compensates (to first order) not only changes in perspective and scale but also the variations in lens distortion across the image.

The determinant of matrix A is used to decide at which pyramid level of the current frame the patch should be searched. The determinant of A corresponds to the area, in square pixels, a single source pixel would occupy in the full-resolution image; det(A)/4 is the corresponding area in pyramid level one, and so on. The target pyramid level l is chosen so that det(A)/4^l is closest to unity, i.e. we attempt to find the patch in the pyramid level which most closely matches its scale.

An 8x8-pixel patch search template is generated from the source level using the warp A/2^l and bilinear interpolation. The mean pixel intensity is subtracted from individual pixel values to provide some resilience to lighting changes. Next, the best match for this template within a fixed radius around its predicted position is found in the target pyramid level. This is done by evaluating zero-mean SSD scores at all FAST corner locations within the circular search region and selecting the location with the smallest difference score. If this is beneath a preset threshold, the patch is considered found.

In some cases, particularly at high pyramid levels, an integer pixel location is not sufficiently accurate to produce smooth tracking results. The located patch position can be refined by performing an iterative error minimisation. We use the inverse compositional approach of [1], minimising over translation and mean patch intensity difference. However, this is too computationally expensive to perform for every patch tracked.
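The level-selection rule and the zero-mean SSD score can be sketched as follows. This is an illustration under our own naming; the warped 8x8 template is assumed to have been generated already.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <limits>

constexpr int kPatch = 8;
using Patch = std::array<std::uint8_t, kPatch * kPatch>;

// Choose the target pyramid level l so that det(A)/4^l is closest to unity.
int ChooseSearchLevel(double det_A, int num_levels) {
    int best_level = 0;
    double best_dist = std::numeric_limits<double>::max();
    for (int l = 0; l < num_levels; ++l) {
        const double dist = std::abs(det_A / std::pow(4.0, l) - 1.0);
        if (dist < best_dist) { best_dist = dist; best_level = l; }
    }
    return best_level;
}

// Zero-mean SSD between the warped template and an 8x8 patch around a FAST corner.
double ZeroMeanSSD(const Patch& a, const Patch& b) {
    double mean_a = 0, mean_b = 0;
    for (int i = 0; i < kPatch * kPatch; ++i) { mean_a += a[i]; mean_b += b[i]; }
    mean_a /= kPatch * kPatch;
    mean_b /= kPatch * kPatch;
    double ssd = 0;
    for (int i = 0; i < kPatch * kPatch; ++i) {
        const double d = (a[i] - mean_a) - (b[i] - mean_b);
        ssd += d * d;
    }
    // The corner with the smallest score is accepted if it is below a preset threshold.
    return ssd;
}
```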
5.4 Pose update

Given a set S of successful patch observations, a camera pose update can be computed. Each observation yields a found patch position (u v)^T (referred to level-zero pixel units) and is assumed to have measurement noise of sigma^2 = 2^{2l} times the 2x2 identity matrix (again in level-zero pixel units). The pose update is computed iteratively by minimising a robust objective function of the reprojection error:

\mu' = \underset{\mu}{\operatorname{argmin}} \sum_{j \in S} \mathrm{Obj}\left( \frac{|e_j|}{\sigma_j}, \sigma_T \right)    (8)

where e_j is the reprojection error vector:

e_j = \begin{pmatrix} u_j \\ v_j \end{pmatrix} - \mathrm{CamProj}\left( \exp(\mu) \, E_{CW} \, p_j \right)    (9)

Obj(., sigma_T) is the Tukey biweight objective function [13] and sigma_T a robust (median-based) estimate of the distribution's standard deviation derived from all the residuals. We use ten iterations of reweighted least squares to allow the M-estimator to converge from any one set of measurements.
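The robust part of Eq. (8) reduces to a per-measurement weight inside an iteratively reweighted least-squares loop. A minimal sketch of the Tukey weight and of a median-based scale estimate is given below; the constants 4.685 and 1.4826 are the standard tuning values for the Tukey biweight and the MAD scale estimate (our assumption, not figures from the paper), and the 6-DOF solve itself is only indicated in comments since it requires a small linear-algebra library such as TooN.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Robust (median-based) estimate of the residual standard deviation, sigma_T.
double RobustSigma(std::vector<double> abs_residuals) {
    std::nth_element(abs_residuals.begin(),
                     abs_residuals.begin() + abs_residuals.size() / 2,
                     abs_residuals.end());
    return 1.4826 * abs_residuals[abs_residuals.size() / 2];
}

// Tukey biweight: the weight applied to a residual in reweighted least squares.
// Residuals beyond c * sigma receive zero weight and are flagged as outliers.
double TukeyWeight(double residual, double sigma, double c = 4.685) {
    const double x = residual / (c * sigma);
    if (std::abs(x) >= 1.0) return 0.0;
    const double t = 1.0 - x * x;
    return t * t;
}

// Pose update loop (structure only, ten iterations as in the text):
//   for (int it = 0; it < 10; ++it) {
//     compute reprojection errors e_j for the current pose;        // Eq. (9)
//     sigma = RobustSigma(|e_j| / sigma_j);
//     w_j   = TukeyWeight(|e_j| / sigma_j, sigma);
//     solve the weighted 6x6 normal equations for mu, then E_CW = exp(mu) * E_CW;
//   }
```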
5.5 Two-stage coarse-to-fine tracking

To increase the tracking system's resilience to rapid camera accelerations, patch search and pose update are done twice. An initial coarse search searches only for 50 map points which appear at the highest levels of the current frame's image pyramid, and this search is performed (with subpixel refinement) over a large search radius. A new pose is then calculated from these measurements. After this, up to 1000 of the remaining potentially visible image patches are re-projected into the image, and now the patch search is performed over a far tighter search region. Subpixel refinement is performed only on a high-level subset of patches. The final frame pose is calculated from both coarse and fine sets of image measurements together.
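Putting the preceding subsections together, one frame of tracking can be organised roughly as follows. This is a structural sketch only: the helper functions are hypothetical stand-ins for the machinery of Sections 5.1-5.4 (here given trivial stub bodies), and the two search radii are illustrative values of ours, not numbers from the paper.

```cpp
#include <vector>

struct SE3         { /* pose, world -> camera */ };
struct MapPoint    { /* see Section 4 */ };
struct Measurement { /* map point id, found (u, v), pyramid level */ };

// Stubs standing in for the components described in Sections 5.1-5.4.
SE3 ApplyMotionModel(const SE3& last_pose) { return last_pose; }
std::vector<const MapPoint*> SelectCoarsePoints(const SE3&, int) { return {}; }
std::vector<const MapPoint*> SelectFinePoints(const SE3&, int)   { return {}; }
std::vector<Measurement> SearchPatches(const std::vector<const MapPoint*>&,
                                       const SE3&, int /*radius_px*/,
                                       bool /*subpixel*/) { return {}; }
SE3 PoseFromMeasurements(const SE3& prior, const std::vector<Measurement>&) { return prior; }

SE3 TrackFrame(const SE3& last_pose) {
    // 1. Prior pose from the decaying-velocity motion model.
    SE3 prior = ApplyMotionModel(last_pose);

    // 2-4. Coarse stage: ~50 coarse-level points, wide radius, subpixel refinement.
    auto coarse_pts = SelectCoarsePoints(prior, 50);
    auto coarse_obs = SearchPatches(coarse_pts, prior, /*radius*/ 20, /*subpixel*/ true);
    SE3 coarse_pose = PoseFromMeasurements(prior, coarse_obs);

    // 5-6. Fine stage: up to 1000 points, tight radius; the final pose uses all matches.
    auto fine_pts = SelectFinePoints(coarse_pose, 1000);
    auto fine_obs = SearchPatches(fine_pts, coarse_pose, /*radius*/ 5, /*subpixel*/ false);
    fine_obs.insert(fine_obs.end(), coarse_obs.begin(), coarse_obs.end());
    return PoseFromMeasurements(coarse_pose, fine_obs);
}
```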
5.6 Tracking quality and failure recovery

Despite efforts to make tracking as robust as possible, eventual tracking failure can be considered inevitable. For this reason, the tracking system estimates the quality of tracking at every frame, using the fraction of feature observations which have been successful.

If this fraction falls below a certain threshold, tracking quality is considered poor. Tracking continues as normal, but the system is not allowed to send new keyframes to the map. Such frames would likely be of poor quality, i.e. compromised by motion blur, occlusion, or an incorrect position estimate.

If the fraction falls below an even lower threshold for more than a few frames (during which the motion model might successfully bridge untrackable frames) then tracking is considered lost, and a tracking recovery procedure is initiated. We implement the recovery method of [30]. After this method has produced a pose estimate, tracking proceeds as normal.

6 MAPPING

This section describes the process by which the 3D point map is built. Map-building occurs in two distinct stages: first, an initial map is built using a stereo technique. After this, the map is continually refined and expanded by the mapping thread as new keyframes are added by the tracking system. The operation of the mapping thread is illustrated in Figure 2. The map-making steps are now individually described in detail.

Figure 2: The asynchronous mapping thread. After initialisation, this thread runs in an endless loop, occasionally receiving new frames from the tracker.

6.1 Map initialisation

When the system is first started, we employ the five-point stereo algorithm of [27] to initialise the map in a manner similar to [20, 18, 9]. User cooperation is required: the user first places the camera above the workspace to be tracked and presses a key. At this stage, the system's first keyframe is stored in the map, and 1000 2D patch-tracks are initialised in the lowest pyramid level at salient image locations (maximal FAST corners). The user then smoothly translates (and possibly rotates) the camera to a slightly offset position and makes a further key-press. The 2D patches are tracked through the smooth motion, and the second key-press thus provides a second keyframe and feature correspondences from which the five-point algorithm and RANSAC can estimate an essential matrix and triangulate the base map. The resulting map is refined through bundle adjustment.
This initial map has an arbitrary scale and is aligned with one camera at the origin. To enable augmentations in a meaningful place and scale, the map is first scaled to metric units. This is done by assuming that the camera translated 10cm between the stereo pair. Next, the map is rotated and translated so that the dominant detected plane lies at z=0 in the world. This is done by RANSAC: many sets of three points are randomly selected to hypothesise a plane while the remaining points are tested for consensus. The winning hypothesis is refined by evaluating the spatial mean and variance of the consensus set, and the smallest eigenvector of the covariance matrix forms the detected plane normal.

Including user interaction, map initialisation takes around three seconds.
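The plane-detection step can be illustrated with a generic three-point RANSAC sketch. The iteration count and inlier tolerance below are illustrative choices of ours, and the final eigenvector refinement of the consensus set is only indicated in a comment since it needs an eigen-solver.

```cpp
#include <array>
#include <cmath>
#include <cstdlib>
#include <vector>

using Vec3 = std::array<double, 3>;

static Vec3 Sub(const Vec3& a, const Vec3& b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
static Vec3 Cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static double Dot(const Vec3& a, const Vec3& b) { return a[0]*b[0]+a[1]*b[1]+a[2]*b[2]; }

struct Plane { Vec3 normal; double d; };   // points x satisfying Dot(normal, x) + d = 0

// Three-point RANSAC for the dominant plane in the map's point cloud.
Plane FindDominantPlane(const std::vector<Vec3>& pts, int iterations = 200,
                        double inlier_tol = 0.02 /* assumed, in map units */) {
    Plane best{{0, 0, 1}, 0};
    int best_inliers = -1;
    for (int it = 0; it < iterations; ++it) {
        const Vec3& a = pts[std::rand() % pts.size()];
        const Vec3& b = pts[std::rand() % pts.size()];
        const Vec3& c = pts[std::rand() % pts.size()];
        Vec3 n = Cross(Sub(b, a), Sub(c, a));
        const double len = std::sqrt(Dot(n, n));
        if (len < 1e-9) continue;                      // degenerate sample
        for (double& v : n) v /= len;
        const Plane hyp{n, -Dot(n, a)};
        int inliers = 0;
        for (const Vec3& p : pts)
            if (std::abs(Dot(hyp.normal, p) + hyp.d) < inlier_tol) ++inliers;
        if (inliers > best_inliers) { best_inliers = inliers; best = hyp; }
    }
    // The winning hypothesis would then be refined: the plane normal becomes the
    // eigenvector of the consensus set's covariance with the smallest eigenvalue,
    // and the map is rotated/translated so that this plane lies at z = 0.
    return best;
}
```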
6.2 Keyframe insertion and epipolar search

The map initially contains only two keyframes and describes a relatively small volume of space. As the camera moves away from its initial pose, new keyframes and map features are added to the system, to allow the map to grow.

Keyframes are added whenever the following conditions are met: tracking quality must be good; time since the last keyframe was added must exceed twenty frames; and the camera must be a minimum distance away from the nearest keyframe already in the map. The minimum distance requirement avoids the common monocular SLAM problem of a stationary camera corrupting the map, and ensures a stereo baseline for new feature triangulation. The minimum distance used depends on the mean depth of observed features, so that keyframes are spaced closer together when the camera is very near a surface, and further apart when observing distant walls.
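These criteria amount to a few simple checks per tracked frame. The sketch below follows the conditions just listed; the factor scaling the mean depth is an illustrative value of ours, not one quoted in the paper.

```cpp
struct TrackerState {
    bool   tracking_quality_good;          // fraction of successful observations is high
    int    frames_since_last_keyframe;
    double distance_to_nearest_keyframe;   // camera-centre distance, in map units
    double mean_observed_depth;            // mean depth of features seen in this frame
};

// Decide whether the current tracked frame should be handed to the mapper as a
// keyframe. Scaling the distance threshold by the mean observed depth keeps
// keyframes close together near surfaces and further apart for distant structure.
bool ShouldAddKeyframe(const TrackerState& s) {
    const double min_distance = 0.1 * s.mean_observed_depth;   // assumed factor
    return s.tracking_quality_good &&
           s.frames_since_last_keyframe > 20 &&
           s.distance_to_nearest_keyframe > min_distance;
}
```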
Each keyframe initially assumes the tracking system's camera pose estimate, and all feature measurements made by the tracking system. Owing to real-time constraints, the tracking system may only have measured a subset of the potentially visible features in the frame; the mapping thread therefore re-projects and measures the remaining map features, and adds successful observations to its list of measurements.

The tracking system has already calculated a set of FAST corners for each pyramid level of the keyframe. Non-maximal suppression and thresholding based on the Shi-Tomasi [24] score are now used to narrow this set to the most salient points in each pyramid level. Next, salient points near successful observations of existing features are discarded. Each remaining salient point is a candidate to be a new map point.

New map points require depth information. This is not available from a single keyframe, and triangulation with another view is required. We select the closest (in terms of camera position) keyframe already existing in the map as the second view. Correspondences between the two views are established using epipolar search: pixel patches around corner points which lie along the epipolar line in the second view are compared to the candidate map points using zero-mean SSD. No template warping is performed, and matches are searched in equal pyramid levels only. Further, we do not search an infinite epipolar line, but use a prior hypothesis on the likely depth of new candidate points (which depends on the depth distribution of existing points in the new keyframe). If a match has been found, the new map point is triangulated and inserted into the map.
6.3 Bundle adjustment

Associated with the ith keyframe in the map is a set S_i of image measurements. For example, the jth map point measured in keyframe i would have been found at (u_{ji}  v_{ji})^T with a standard deviation of sigma_{ji} pixels. Writing the current state of the map as {E_{K_1 W}, ..., E_{K_N W}} and {p_1, ..., p_M}, each image measurement also has an associated reprojection error e_{ji} calculated as for equation (9). Bundle adjustment iteratively adjusts the map so as to minimise the robust objective function:

\{\mu_2 \ldots \mu_N\}, \{p'_1 \ldots p'_M\} = \underset{\{\{\mu\},\{p\}\}}{\operatorname{argmin}} \sum_{i=1}^{N} \sum_{j \in S_i} \mathrm{Obj}\left( \frac{|e_{ji}|}{\sigma_{ji}}, \sigma_T \right)    (10)

Apart from the inclusion of the Tukey M-estimator, we use an almost textbook implementation of Levenberg-Marquardt bundle adjustment (as described in Appendix 6.6 of [12]).

Full bundle adjustment as described above adjusts the pose for all keyframes (apart from the first, which is a fixed datum) and all map point positions. It exploits the sparseness inherent in the structure-from-motion problem to reduce the complexity of cubic-cost matrix factorisations from O((N + M)^3) to O(N^3), and so the system ultimately scales with the cube of keyframes; in practice, for the map sizes used here, computation is in most cases dominated by O(N^2 M) camera-point-camera outer products. One way or the other, it becomes an increasingly expensive computation as map size increases: for example, tens of seconds are required for a map with more than 150 keyframes to converge. This is acceptable if the camera is not exploring (i.e. the tracking system can work with the existing map) but becomes quickly limiting during exploration, when many new keyframes and map features are initialised (and should be bundle adjusted) in quick succession.

For this reason we also allow the mapping thread to perform local bundle adjustment; here only a subset of keyframe poses are adjusted. Writing the set of keyframes to adjust as X, a further set of fixed keyframes Y and a subset of map points Z, the minimisation (abbreviating the objective function) becomes

\{\mu_{x \in X}\}, \{p'_{z \in Z}\} = \underset{\{\{\mu\},\{p\}\}}{\operatorname{argmin}} \sum_{i \in X \cup Y} \; \sum_{j \in Z \cap S_i} \mathrm{Obj}(i, j).    (11)

This is similar to the operation of constantly-exploring visual odometry implementations [18] which optimise over the last (say) 3 frames using measurements from the last 7 before that. However, there is an important difference in the selection of parameters which are optimised, and the selection of measurements used for constraints.

The set X of keyframes to optimise consists of five keyframes: the newest keyframe and the four other keyframes nearest to it in the map. All of the map points visible in any of these keyframes form the set Z. Finally, Y contains any keyframe for which a measurement of any point in Z has been made. That is, local bundle adjustment optimises the pose of the most recent keyframe and its closest neighbours, and all of the map points seen by these, using all of the measurements ever made of these points.

The complexity of local bundle adjustment still scales with map size, but does so at approximately O(NM) in the worst case, and this allows a reasonable rate of exploration. Should a keyframe be added to the map while bundle adjustment is in progress, adjustment is interrupted so that the new keyframe can be integrated in the shortest possible time.
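The selection of the sets X, Y and Z can be written down directly. The sketch below assumes simple index-based bookkeeping of which keyframes measure which points; it illustrates the selection rule described above and is not taken from the released code.

```cpp
#include <set>
#include <vector>

struct LocalBASets {
    std::set<int> X;   // keyframes whose poses are adjusted
    std::set<int> Y;   // keyframes held fixed but contributing measurements
    std::set<int> Z;   // map points being adjusted
};

// measured_points[k] lists the map-point indices measured in keyframe k;
// four_nearest_neighbours holds the four keyframes closest to the newest one.
LocalBASets SelectLocalBA(int newest_keyframe,
                          const std::vector<std::vector<int>>& measured_points,
                          const std::vector<int>& four_nearest_neighbours) {
    LocalBASets s;

    // X: the newest keyframe plus its four nearest neighbours in the map.
    s.X.insert(newest_keyframe);
    s.X.insert(four_nearest_neighbours.begin(), four_nearest_neighbours.end());

    // Z: every map point visible in any keyframe of X.
    for (int k : s.X)
        s.Z.insert(measured_points[k].begin(), measured_points[k].end());

    // Y: any other keyframe that has measured any point of Z.
    for (int k = 0; k < static_cast<int>(measured_points.size()); ++k) {
        if (s.X.count(k)) continue;
        for (int p : measured_points[k])
            if (s.Z.count(p)) { s.Y.insert(k); break; }
    }
    return s;
}
```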
6.4 Data association refinement

When bundle adjustment has converged and no new keyframes are needed - i.e. when the camera is in well-explored portions of the map - the mapping thread has free time which can be used to improve the map. This is primarily done by making new measurements in old keyframes, either to measure newly created map features in older keyframes, or to re-measure outlier measurements.

When a new feature is added by epipolar search, measurements for it initially exist only in two keyframes. However, it is possible that this feature is visible in other keyframes as well. If this is the case then measurements are made and, if they are successful, added to the map.

Likewise, measurements made by the tracking system may be incorrect. This frequently happens in regions of the world containing repeated patterns. Such measurements are given low weights by the M-estimator used in bundle adjustment. If they lie in the zero-weight region of the Tukey estimator, they are flagged as outliers. Each outlier measurement is given a second chance before deletion: it is re-measured in the keyframe using the feature's predicted position and a far tighter search region than used for tracking. If a new measurement is found, this is re-inserted into the map. Should such a measurement still be considered an outlier, it is permanently removed from the map.

These data association refinements are given a low priority in the mapping thread, and are only performed if there is no other more pressing task at hand. Like bundle adjustment, they are interrupted as soon as a new keyframe arrives.

6.5 General implementation notes

The system described above was implemented on a desktop PC with an Intel Core 2 Duo 2.66 GHz processor running Linux. Software was written in C++ using the libCVD and TooN libraries. It has not been highly optimised, although we have found it beneficial to implement two tweaks: a row look-up table is used to speed up access to the array of FAST corners at each pyramid level, and the tracker only re-calculates full nonlinear point projections and Jacobians every fourth M-estimator iteration (this is still multiple times per single frame).

Some aspects of the current implementation of the mapping system are rather low-tech: we use a simple set of heuristics to remove outliers from the map; further, patches are initialised with a normal vector parallel to the imaging plane of the first frame they were observed in, and the normal vector is currently not optimised.

Like any method attempting to increase a tracking system's robustness to rapid motion, the two-stage approach described in Section 5.5 can lead to increased tracking jitter. We mitigate this with the simple method of turning off the coarse tracking stage when the motion model believes the camera to be nearly stationary. This can be observed in the results video by a colour change in the reference grid.

7 RESULTS

Evaluation was mostly performed during live operation using a hand-held camera; however, we also include comparative results using a synthetic sequence read from disk. All results were obtained with identical tunable parameters.

7.1 Tracking performance on live video

An example of the system's operation is provided in the accompanying video file.[1] The camera explores a cluttered desk and its immediate surroundings over 1656 frames of live video input. The camera performs various panning motions to produce an overview of the scene, and then zooms closer to some areas to increase detail in the map. The camera then moves rapidly around the mapped scene. Tracking is purposefully broken by shaking the camera, and the system recovers from this. This video represents the size of a typical working volume which the system can handle without great difficulty. Figure 3 illustrates the map generated during tracking. At the end of the sequence the map consists of 57 keyframes and 4997 point features: from finest level to coarsest level, the feature distributions are 51%, 33%, 9% and 7% respectively.

[1] This video file can also be obtained from http://www.robots.ox.ac.uk/~gk/videos/klein_07_ptam_ismar.avi

The sequence can mostly be tracked in real-time. Figure 4 shows the evolution of tracking time with frame number. Also plotted is the size of the map. For most of the sequence, tracking can be performed in around 20ms despite the map increasing in size. There are two apparent exceptions: tracking is lost around frame 1320, and the system attempts to relocalise for several frames, which takes up to 90ms per frame. Also, around frame 1530, tracking takes around 30ms per frame during normal operation; this is when the camera moves far away from the desk at the end of the sequence, and a very large number of features appear in the frame. Here, the time taken to decide which features to measure becomes significant.

Table 1 shows a break-down of the time required to track a typical frame. Keyframe preparation includes frame capture, YUV411 to greyscale conversion, building the image pyramid and detecting FAST corners. Feature projection is the time taken to initially project all features into the frame, decide which are visible, and decide which features to measure. This step scales linearly with map size, but is also influenced by the number of features currently in the camera's field-of-view, as the determinant of the warping matrix is calculated for these points. The bulk of a frame's budget is spent on the 1000 image searches for patch correspondences, and the time spent for this is influenced by corner density in the image, and the distribution of features over pyramid levels.

Table 1: Tracking timings for a map of size M=4000.

  Keyframe preparation     2.2 ms
  Feature projection       3.5 ms
  Patch search             9.8 ms
  Iterative pose update    3.7 ms
  Total                   19.2 ms

7.2 Mapping scalability

While the tracking system scales fairly well with increasing map size, this is not the case for the mapping thread. The largest map we have produced is a full 360-degree map of a single office containing 11000 map points and 280 keyframes. This is beyond our small workspace design goal, and at this map size the system's ability to add new keyframes and map points is impaired (but tracking still runs at frame-rate). A more practical limit at which the system remains well usable is around 6000 points and 150 keyframes.

Timings of individual mapping steps are difficult to obtain; they vary wildly not only with map size but also scene structure (both global and local); further, the asynchronous nature of our method does not facilitate obtaining repeatable results from disk sequences. Nevertheless, typical timings for bundle adjustment are presented in Table 2.

Table 2: Bundle adjustment timings with various map sizes.

  Keyframes                  2-49     50-99    100-149
  Local Bundle Adjustment    170ms    270ms    440ms
  Global Bundle Adjustment   380ms    1.7s     6.9s

The above timings are mean quantities. As the map grows beyond 100 keyframes, global bundle adjustment cannot keep up with exploration and is almost always aborted, converging only when the camera remains stationary (or returns to a well-mapped area) for some time. Global convergence for maps larger than 150 keyframes can require tens of seconds.

Compared with bundle adjustment, the processing time required for epipolar search and occasional data association refinement is small. Typically all other operations needed to insert a keyframe require less than 40ms.
Figure 3: The map and keyframes produced in the desk video. Top: two views of the map with point features and keyframes drawn. Certain parts of the scene are clearly distinguishable, e.g. the keyboard and the frisbee. Bottom: the 57 keyframes used to generate the map.

Figure 4: Map size (right axis) and tracking timings (left axis) for the desk video included in the video attachment. The timing spike occurs when tracking is lost and is attempting relocalisation. (Plot: "Map size and processing time for the desk video"; tracking time in ms and map size plotted against frame number, 0 to 1600.)

Figure 5: Comparison with EKF-SLAM on a synthetic sequence. The left image shows the map produced by the system described here; the centre image shows the map produced by an up-to-date implementation of EKF-SLAM [30]. Trajectories compared to ground truth are shown on the right (plot: "3D Trajectories from synthetic data (all axes metric)"; legend: ground truth, EKF-SLAM, proposed method). NB the different scale of the z-axis, as ground truth lies on z=3.

7.3 Synthetic comparison with EKF-SLAM

To evaluate the system's accuracy, we compare it to an implementation [30] of EKF-SLAM based on Davison's SceneLib library with up-to-date enhancements such as JCBB [19]. We use a synthetic scene produced in a 3D rendering package. The scene consists of two textured walls at right angles, plus the initialisation target for the SLAM system. The camera moves sideways along one wall toward the corner, then along the next wall, for a total of 600 frames at 600x480 resolution.

The synthetic scenario tested here (continual exploration with no re-visiting of old features, pauses or SLAM wiggles) is neither system's strong point, and not typical usage in an AR context. Nevertheless it effectively demonstrates some of the differences in the systems' behaviours. Figure 5 illustrates the maps output by the two systems. The method proposed here produces a relatively dense map of 6600 features, of which several are clearly outliers. By contrast, EKF-SLAM produces a sparse map of 114 features with fully accessible covariance information (our system also implicitly encodes the full covariance, but it is not trivial to access), of which all appear to be inliers.

To compare the calculated trajectories, these are first aligned so as to minimise their sum-squared error to ground truth. This is necessary because our system uses an (almost) arbitrary coordinate frame and scale. Both trajectories are aligned by minimising error over a 6-DOF rigid body transformation and a 1-DOF scale change. The resulting trajectories are shown in the right panel of Figure 5. For both trajectories, the error is predominantly in the z-direction (whose scale is exaggerated in the plot) although EKF-SLAM also fractionally underestimates the angle between the walls. Numerically, the standard deviation from ground truth is 135mm for EKF-SLAM and 6mm for our system (the camera travels 18.2m through the virtual sequence). Frames are tracked in a relatively constant 20ms by our system, whereas EKF-SLAM scales quadratically from 3ms when the map is empty to 40ms at the end of the sequence (although of course this includes mapping as well as tracking).

7.4 Subjective comparison with EKF-SLAM

When used on live video with a hand-held camera, our system handles quite differently than iterative SLAM implementations, and this affects the way in which an operator will use the system to achieve effective mapping and tracking.

This system does not require the SLAM wiggle: incremental systems often need continual smooth camera motion to effectively initialise new features at their correct depth. If the camera is stationary, tracking jitter can initialise features at the wrong depth. By contrast, our system works best if the camera is focused on a point of interest, the user then pauses briefly, and then proceeds (not necessarily smoothly) to the next point of interest, or a different view of the same point.

The use of multiple pyramid levels greatly increases the system's tolerance to rapid motions and associated motion blur. Further, it allows mapped points to be useful across a wide range of distances. In practice, this means that our system allows a user to zoom in much closer (and more rapidly) to objects in the environment. This is illustrated in Figure 6 and also in the accompanying video file. At the same time, the use of a larger number of features reduces visible tracking jitter and improves performance when some features are occluded or otherwise corrupted.

Figure 6: The system can easily track across multiple scales. Here, the map is initialised at the top-right scale; the user moves closer in and places a label, which is still accurately registered when viewed from far away.

The system scales with map size in a different way. In EKF-SLAM, the frame-rate will start to drop; in our system, the frame-rate is not as affected, but the rate at which new parts of the environment can be explored slows down.

7.5 AR with a hand-held camera

To investigate the suitability of the proposed system for AR tasks, we have developed two simple table-top applications. Both assume a flat operating surface, and use the hand-held camera as a tool for interaction. AR applications are usable as soon as the map has been initialised from stereo; mapping proceeds in the background in a manner transparent to the user, unless particularly rapid exploration causes tracking failure.

The first application is Ewok Rampage, which gives the player control over Darth Vader, who is assaulted by a horde of ewoks. The player can control Darth Vader's movements using the keyboard, while a laser pistol can be aimed with the camera: the projection of the camera's optical axis onto the playing surface forms the player's cross-hairs. This game demonstrates the system's ability to cope with fast camera translations as the user rapidly changes aim.

The second application simulates the effects of a virtual magnifying glass and sun. A virtual convex lens is placed at the camera centre and simple ray-tracing is used to render the caustics onto the playing surface. When the light converges onto a small enough dot - i.e., when the user has the camera at the correct height and angle - virtual burn-marks (along with smoke) are added to the surface. In this way the user can annotate the environment using just the camera. This game demonstrates tracking accuracy.

These applications are illustrated in Figure 7 and are also demonstrated in the accompanying video file.

Figure 7: Sample AR applications using tracking as a user interface. Left: Darth Vader's laser gun is aimed by the camera's optical axis to defend against a rabid ewok horde. Right: the user employs a virtual camera-centred magnifying glass and the heat of a virtual sun to burn a tasteful image onto a CD-R. These applications are illustrated in the accompanying video.

8 LIMITATIONS AND FUTURE WORK

This section describes some of the known issues with the system presented. The system requires fairly powerful computing hardware and this has so far limited live experiments to a single office; we expect that with some optimisations we will be able to run at frame-rate on mobile platforms and perform experiments in a wider range of environments. Despite current experimental limitations, some failure modes and some avenues for further work have become evident.
8.1 Failure modes

There are various ways in which tracking can fail. Some of these are due to the system's dependence on corner features: rapid camera motions produce large levels of motion blur which can decimate most corner features in the image, and this will cause tracking failure. In general, tracking can only proceed when the FAST corner detector fires, and this limits the types of textures and environments supported. Future work might aim to include other types of features; for example, image intensity edges are not as affected by motion blur, and often conveniently delineate geometric entities in the map.

The system is somewhat robust to repeated structure and lighting changes (as illustrated by figures showing a keyboard and CD-ROM disc being tracked) but this is purely the happy result of the system using many features with a robust estimator. Repeated structure in particular still produces large numbers of outliers in the map (due to the epipolar search making incorrect correspondences) and can make the whole system fragile: if tracking falls into a local minimum and a keyframe is then inserted, the whole map could be corrupted.

We experience three types of mapping failure: the first is a failure of the initial stereo algorithm. This is merely a nuisance, as it is immediately noticed by the user, who then just repeats the procedure; nevertheless it is an obstacle to a fully automatic initialisation of the whole system. The second is the insertion of incorrect information into the map. This happens if tracking has failed (or reached an incorrect local minimum as described above) but the tracking quality heuristics have not detected this failure. A more robust tracking quality assessment might prevent such failures; alternatively, a method of automatically removing outlier keyframes from the map might be viable. Finally, while the system is very tolerant of temporary partial occlusions, it will fail if the real-world scene is substantially and permanently changed.

8.2 Mapping inadequacies

Currently, the system's map consists only of a point cloud. While the statistics of feature points are linked through common observations in the bundle adjustment, the system currently makes little effort to extract any geometric understanding from the map: after initially extracting the dominant plane as an AR surface, the map becomes purely a tool for camera tracking. This is not ideal: virtual entities should be able to interact with features in the map in some way. For example, out-of-plane real objects should block and occlude virtual characters running into or behind them. This is a very complex and important area for future research.

Several aspects of mapping could be improved to aid tracking performance: the system currently has no notion of self-occlusion by the map. While the tracking system is robust enough to track a map despite self-occlusion, the unexplained absence of features it expects to be able to measure impacts tracking quality estimates, and may unnecessarily remove features as outliers. Further, an efficient on-line estimation of patch normals would likely be of benefit (our initial attempts at this have been too slow).

Finally, the system is not designed to close large loops in the SLAM sense. While the mapping module is statistically able to handle loop closure (and loops can indeed be closed by judicious placement of the camera near the boundary), the problem lies in the fact that the tracker's M-estimator is not informed of feature-map uncertainties. In practical AR use, this is not an issue.

9 CONCLUSION

This paper has presented an alternative to the SLAM approaches previously employed to track and map unknown environments. Rather than being limited by the frame-to-frame scalability of incremental mapping approaches which mandate a sparse map of high-quality features [5], we implement the alternative approach, using a far denser map of lower-quality features.

Results show that on modern hardware, the system is capable of providing tracking quality adequate for small-workspace AR applications - provided the scene tracked is reasonably textured, relatively static, and not substantially self-occluding. No prior model of the scene is required, and the system imposes only a minimal initialisation burden on the user (the procedure takes three seconds). We believe the level of tracking robustness and accuracy we achieve significantly advances the state-of-the-art.

Nevertheless, performance is not yet good enough for any untrained user to simply pick up and use in an arbitrary environment. Future work will attempt to address some of the shortcomings of the system and expand its potential applications.

ACKNOWLEDGEMENTS

This work was supported by EPSRC grant GR/S97774/01.

REFERENCES

[1] S. Baker and I. Matthews. Equivalence and efficiency of image alignment algorithms. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'01), Hawaii, Dec 2001.
[2] G. Bleser, H. Wuest, and D. Stricker. Online camera pose estimation in partially known and dynamic scenes. In Proc. 5th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'06), San Diego, CA, October 2006.
[3] D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, and A. Calway. Real-time and robust monocular SLAM using predictive multi-resolution descriptors. In 2nd International Symposium on Visual Computing, November 2006.
[4] A. Davison, W. Mayol, and D. Murray. Real-time localisation and mapping with wearable active vision. In Proc. 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'03), Tokyo, October 2003.
[5] A. Davison, I. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. To appear in IEEE Trans. Pattern Analysis and Machine Intelligence, 2007.
[6] F. Devernay and O. D. Faugeras. Straight lines have to be straight. Machine Vision and Applications, 13(1):14-24, 2001.
[7] E. Eade and T. Drummond. Edge landmarks in monocular SLAM. In Proc. British Machine Vision Conference (BMVC'06), Edinburgh, September 2006. BMVA.
[8] E. Eade and T. Drummond. Scalable monocular SLAM. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 469-476, New York, NY, 2006.
[9] C. Engels, H. Stewenius, and D. Nister. Bundle adjustment rules. In Photogrammetric Computer Vision (PCV'06), 2006.
[10] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, June 1981.
[11] Y. Genc, S. Riedel, F. Souvannavong, C. Akinlar, and N. Navab. Marker-less tracking for AR: A learning-based approach. In Proc. IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'02), Darmstadt, Germany, September 2002.
[12] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[13] P. Huber. Robust Statistics. Wiley, 1981.
[14] B. Jiang and U. Neumann. Extendible tracking by line auto-calibration. In Proc. IEEE and ACM International Symposium on Augmented Reality (ISAR'01), pages 97-103, New York, October 2001.
[15] R. Koch, K. Koeser, B. Streckel, and J.-F. Evers-Senne. Markerless image-based 3D tracking for real-time augmented reality applications. In WIAMIS, Montreux, 2005.
[16] T. Kurata, N. Sakata, M. Kourogi, H. Kuzuoka, and M. Billinghurst. Remote collaboration using a shoulder-worn active camera/laser. In 8th International Symposium on Wearable Computers (ISWC'04), pages 62-69, Arlington, VA, USA, 2004.
[17] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proc. International Joint Conference on Artificial Intelligence, pages 1151-1156, 2003.
[18] E. Mouragnon, F. Dekeyser, P. Sayd, M. Lhuillier, and M. Dhome. Real time localization and 3D reconstruction. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 363-370, New York, NY, 2006.
[19] J. Neira and J. Tardos. Data association in stochastic mapping using the joint compatibility test. In IEEE Trans. on Robotics and Automation, 2001.
[20] D. Nister, O. Naroditsky, and J. R. Bergen. Visual odometry. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'04), pages 652-659, Washington, D.C., June 2005. IEEE Computer Society.
[21] J. Park, S. You, and U. Neumann. Natural feature tracking for extendible robust augmented realities. In Proc. Int. Workshop on Augmented Reality, 1998.
[22] M. Pupilli and A. Calway. Real-time camera tracking using a particle filter. In Proc. British Machine Vision Conference (BMVC'05), pages 519-528, Oxford, September 2005. BMVA.
[23] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Proc. 9th European Conference on Computer Vision (ECCV'06), Graz, May 2006.
[24] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'94), pages 593-600. IEEE Computer Society, 1994.
[25] G. Simon, A. Fitzgibbon, and A. Zisserman. Markerless tracking using planar structures in the scene. In Proc. IEEE and ACM International Symposium on Augmented Reality (ISAR'00), Munich, October 2000.
[26] R. C. Smith and P. Cheeseman. On the representation and estimation of spatial uncertainty. International Journal of Robotics Research, 5(4):56-68, 1986.
[27] H. Stewenius, C. Engels, and D. Nister. Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing, 60:284-294, June 2006.
[28] R. Subbarao, P. Meer, and Y. Genc. A balanced approach to 3D tracking from image streams. In Proc. 4th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'05), pages 70-78, Vienna, October 2005.
[29] V. Varadarajan. Lie Groups, Lie Algebras and Their Representations. Number 102 in Graduate Texts in Mathematics. Springer-Verlag, 1974.
[30] B. Williams, G. Klein, and I. Reid. Real-time SLAM relocalisation. In Proc. 11th IEEE International Conference on Computer Vision (ICCV'07), Rio de Janeiro, October 2007.
