Visual Localisation of A Robot With An External RGBD Sensor
b_i = \begin{cases} 0 & \text{if } I_i < \mu - 0.675\sigma \\ 1 & \text{if } \mu - 0.675\sigma \le I_i < \mu \\ 2 & \text{if } \mu \le I_i < \mu + 0.675\sigma \\ 3 & \text{if } I_i \ge \mu + 0.675\sigma \end{cases}    (1)
where b is the index of the intensity bin, i is the index of the sampled pixel with intensity I_i, and \mu and \sigma are the mean and standard deviation of the sampled patch intensities.
The quantisation bins are selected to distribute the samples into each range approximately evenly, assuming a normal distribution of intensities among the 64 sampled pixels in the patch, for which the mean intensity and standard deviation are computed.
To optimise the build time for the descriptors on the mobile phone, we applied ARM NEON SIMD instructions to the feature patch selection and quantisation process. The 64-bit quantised bins were computed using data parallelism, treating each row of the feature patch intensities as a vector of eight 8-bit elements.
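As an illustration of this data-parallel quantisation, the following is a minimal sketch (not the paper's exact implementation) of how one 8-pixel row of the patch might be compared against the three intensity thresholds with NEON intrinsics. The threshold arguments t_lo, t_mid and t_hi stand for \mu - 0.675\sigma, \mu and \mu + 0.675\sigma and are assumed to have been computed beforehand.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Quantise one 8-pixel row of the sampled patch against the three
 * intensity thresholds. Each output lane holds the bin index 0..3 for
 * the corresponding pixel (the number of thresholds it meets or exceeds). */
static inline uint8x8_t quantise_row(const uint8_t *row,
                                     uint8_t t_lo, uint8_t t_mid, uint8_t t_hi)
{
    uint8x8_t pix    = vld1_u8(row);                   /* load 8 pixels    */
    uint8x8_t ge_lo  = vcge_u8(pix, vdup_n_u8(t_lo));  /* 0xFF if >= t_lo  */
    uint8x8_t ge_mid = vcge_u8(pix, vdup_n_u8(t_mid)); /* 0xFF if >= t_mid */
    uint8x8_t ge_hi  = vcge_u8(pix, vdup_n_u8(t_hi));  /* 0xFF if >= t_hi  */
    uint8x8_t one    = vdup_n_u8(1);
    return vadd_u8(vadd_u8(vand_u8(ge_lo, one), vand_u8(ge_mid, one)),
                   vand_u8(ge_hi, one));
}
```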
The descriptors for each corner in the Kinect frame are
computed for a 3x3 patch of centre locations and aggregated
by being ORed together to create a descriptor that will toler-
ate small deviations in translation, rotation, scale or shear.
K(x, y) = \bigvee_{i, j \in \{-1, 0, 1\}} k(x + i, y + j)    (2)
This ORing process creates a descriptor which is the union
of all nine appearances of the simple descriptor.
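A minimal sketch of this aggregation follows, assuming the 256-bit descriptor layout implied by the 32-byte descriptors described below and that the nine single-centre descriptors have already been built.

```c
#include <stdint.h>

/* A 256-bit HIPS descriptor stored as four 64-bit words. */
typedef struct { uint64_t bits[4]; } Descriptor;

/* Equation (2): the tolerant descriptor K is the bitwise OR (set union)
 * of the nine simple descriptors k computed at the 3x3 grid of centre
 * locations around the corner. */
Descriptor or_aggregate(const Descriptor k[9])
{
    Descriptor K = { { 0, 0, 0, 0 } };
    for (int n = 0; n < 9; ++n)          /* offsets i, j in {-1, 0, 1} */
        for (int w = 0; w < 4; ++w)
            K.bits[w] |= k[n].bits[w];
    return K;
}
```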
The mobile camera's feature locations and descriptors are sent to the Kinect-CPU server via wireless LAN, where they are matched to the Kinect's current database of features. Sparse features are sent over the wireless link rather than full mobile phone camera images to reduce the amount of transmitted data. For each feature, a 32-byte descriptor and 8 bytes of coordinates are sent. There are typically 800 features in an image from our smartphone camera, which gives approximately 32 KB of data to be sent to the Kinect-CPU server. Sending data over the Wi-Fi link causes the largest latency in the system.
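One possible representation of the per-feature payload is sketched below; the field types (a 32-byte bitset plus two 4-byte coordinates) are an assumption consistent with the 40 bytes per feature quoted above, not a specification of the actual wire format.

```c
#include <stdint.h>

/* One feature as sent from the phone to the Kinect-CPU server:
 * 32 bytes of descriptor plus 8 bytes of corner coordinates,
 * i.e. 40 bytes per feature, so ~800 features is roughly 32 KB. */
typedef struct {
    uint8_t descriptor[32];  /* 64 sampled pixels x 4 intensity bins   */
    float   u;               /* corner x coordinate in the phone image */
    float   v;               /* corner y coordinate in the phone image */
} FeaturePacket;
```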
3.4 Feature Matching
A computationally efficient comparison between a test descriptor and a descriptor in the database can be performed by counting bit differences. The error function, E, between a test descriptor from the mobile camera, c, and a database descriptor from the Kinect image, K, is expressed as:

E = \mathrm{Error}(K, c) = \mathrm{bitcount}(c \wedge \lnot K)    (3)
This is the number of bits set in c and not set in K, and thus counts how many pixels in c have an intensity that was never observed in any of the nine k descriptors that were ORed together to form K. Whilst the ORing operation reduces the error function between correct matches where the viewpoint differs slightly, some descriptors are rather unstable in appearance and end up with a large number of bits set, which causes them to acquire too many correspondences at matching time. To prevent this, a threshold on the range of appearance is used to exclude unstable descriptors from the matching stage:
accept K if (bitcount(K) < threshold) (4)
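A sketch of the matching error of Equation (3), assuming the descriptor is stored as four 64-bit words; __builtin_popcountll is the GCC/Clang population-count intrinsic.

```c
#include <stdint.h>

/* Equation (3): count the bits set in the camera descriptor c that were
 * never set in the ORed Kinect descriptor K. Descriptors are assumed to
 * be stored as four 64-bit words (256 bits). */
static inline int hips_error(const uint64_t c[4], const uint64_t K[4])
{
    int err = 0;
    for (int w = 0; w < 4; ++w)
        err += __builtin_popcountll(c[w] & ~K[w]);
    return err;
}
```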
In order to perform rapid matching, a fast tree-based index of the Kinect HIPS descriptors is formed using the or-tree method described in [Taylor and Drummond, 2009]. The tree is built once per frame, allowing the system to deal with dynamic scenes. Building the tree takes only approximately 20 ms on our system.
Once the tree index has been built, it can be used to rapidly find matches for the mobile camera corners. On our system, this typically takes 10 ms to find matches for 2000 corners. In order to tune the system for speed, only the match that minimises the error function for each mobile camera corner (provided it is below an error threshold) is retained, rather than admitting all matches below the threshold. This results in a set of correspondences between corners in the mobile camera image and the Kinect image as shown in figure 3 below.
Figure 3: All matches between the Kinect RGB image (left)
and the mobile camera RGB image (right) produced by HIPS
3.5 Pose Estimation with Lie Algebra
The pose of the mobile camera relative to the coordinate
frame of the Kinect can be expressed as a rigid body trans-
formation in SE(3):
T_{ck} = \begin{pmatrix} R & t \\ \mathbf{0}^T & 1 \end{pmatrix}, \quad \text{where } R \in SO(3) \text{ and } t \in \mathbb{R}^3    (5)
The subscript ck indicates the camera coordinate frame relative to the Kinect frame.
A point at a known position in the Kinect frame, p_k = (x, y, z, 1)^T, can be expressed in the camera frame by applying the rigid transformation to obtain p_c = T_{ck} p_k. p_k is projectively equivalent to (x/z, y/z, 1, 1/z)^T = (u, v, 1, q)^T, and it is more convenient to solve for pose using this formulation as (u, v) are a linear function of pixel position; our cameras exhibit almost no radial distortion and we assume a linear camera model for both cameras, and the value reported in the depth image from the Kinect is linearly related to q_k.
This gives:

\begin{pmatrix} u_c \\ v_c \\ 1 \\ q_c \end{pmatrix} \simeq T_{ck} \begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix} = \begin{pmatrix} R \begin{pmatrix} u_k \\ v_k \\ 1 \end{pmatrix} + q_k t \\ q_k \end{pmatrix}    (6)
q_c is unobserved, so only two constraints are provided per correspondence and thus three correspondences are needed to compute T_{ck}. When this has been computed, the same equation can be used to calculate q_c, thus making it possible to transform the Kinect depth map into the mobile camera frame should that be required (e.g. to build an obstacle map from the robot's perspective).
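For illustration, Equation (6) transcribes directly into code; in this sketch R and t are assumed to be the rotation and translation of T_{ck}, and the output is normalised so that the third homogeneous coordinate is 1.

```c
/* Equation (6): map a Kinect point given in (u, v, q) coordinates into
 * the mobile camera frame using the pose (R, t), normalising so that the
 * third homogeneous coordinate is 1. */
void transform_uvq(const double R[3][3], const double t[3],
                   double uk, double vk, double qk,
                   double *uc, double *vc, double *qc)
{
    double m[3];                          /* R * (u_k, v_k, 1)^T + q_k * t */
    for (int r = 0; r < 3; ++r)
        m[r] = R[r][0] * uk + R[r][1] * vk + R[r][2] + qk * t[r];
    *uc = m[0] / m[2];                    /* fourth coordinate is q_k, so  */
    *vc = m[1] / m[2];                    /* divide everything by m[2]     */
    *qc = qk   / m[2];
}
```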
Pose is solved iteratively from an initial assumption T^0_{ck} = I, i.e. that the mobile and Kinect coordinate frames are coincident. At each iteration, the error in T_{ck} is parameterised using the exponential map and T_{ck} is updated by:

T^{j+1}_{ck} = \exp\left( \sum_i \alpha_i G_i \right) T^j_{ck}    (7)

where the G_i are the generators of SE(3). The six coefficients \alpha_i are then estimated by considering the projection of the Kinect coordinates into the mobile camera frame.
\begin{pmatrix} u_c \\ v_c \\ 1 \\ q_c \end{pmatrix} \simeq \exp\left( \sum_i \alpha_i G_i \right) T^j_{ck} \begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix},    (8)
Writing

\begin{pmatrix} u'_k \\ v'_k \\ 1 \\ q'_k \end{pmatrix} \simeq T^j_{ck} \begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix},    (9)

then gives the Jacobian J of partial derivatives of (u_c, v_c)^T with respect to the six \alpha_i as:

J = \begin{pmatrix} q'_k & 0 & -u'_k q'_k & -u'_k v'_k & 1 + u'^2_k & -v'_k \\ 0 & q'_k & -v'_k q'_k & -(1 + v'^2_k) & u'_k v'_k & u'_k \end{pmatrix},    (10)
where the first three columns of J correspond to derivatives with respect to translations in x, y and z (and thus depend on q) and the last three columns correspond to rotations about the x, y and z axes respectively. This simple form of J is a convenient consequence of working in (u, v, q) coordinates rather than (x, y, z).
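The Jacobian of Equation (10) can be filled in directly; in the sketch below (u, v, q) stand for the Kinect coordinates after transformation by the current pose estimate, i.e. the primed quantities of Equation (9).

```c
/* Equation (10): the 2x6 Jacobian of (u_c, v_c) with respect to the six
 * motion parameters, evaluated at the transformed point (u, v, q).
 * Columns 0-2: translations in x, y, z; columns 3-5: rotations about x, y, z. */
void fill_jacobian(double u, double v, double q, double J[2][6])
{
    J[0][0] = q;    J[0][1] = 0.0;  J[0][2] = -u * q;
    J[0][3] = -u * v;          J[0][4] = 1.0 + u * u;  J[0][5] = -v;
    J[1][0] = 0.0;  J[1][1] = q;    J[1][2] = -v * q;
    J[1][3] = -(1.0 + v * v);  J[1][4] = u * v;        J[1][5] = u;
}
```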
From three correspondences between (u_k, v_k, q_k) and (u_c, v_c) it is possible to stack the three Jacobians to obtain a 6x6 matrix which can be inverted to find the linearised solution for the \alpha_i from the error between the observed coordinates in the mobile camera frame and the transformed Kinect coordinates: (u_c, v_c) - (u'_k, v'_k). With more than three correspondences, it is possible to compute a least squares solution to obtain the \alpha_i.
When the \alpha_i have been computed, iteration proceeds by setting T^{j+1}_{ck} = \exp(\sum_i \alpha_i G_i) T^j_{ck} and recomputing the \alpha_i. In practice, we found that 10 iterations are enough for this to converge to acceptable accuracy.
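The sketch below illustrates one linearised step under these equations: it accumulates the normal equations J^T J \alpha = J^T r over the correspondences and solves for the six coefficients by Gaussian elimination. It reuses the transform_uvq and fill_jacobian sketches above and omits the exponential-map update of Equation (7); it is an illustration of the linear step, not the paper's implementation.

```c
#include <math.h>
#include <stddef.h>

void transform_uvq(const double R[3][3], const double t[3],
                   double uk, double vk, double qk,
                   double *uc, double *vc, double *qc);
void fill_jacobian(double u, double v, double q, double J[2][6]);

/* One linearised step: accumulate J^T J alpha = J^T r over n correspondences
 * and solve for the six coefficients alpha by Gaussian elimination with
 * partial pivoting. Applying exp(sum alpha_i G_i) to T_ck is not shown. */
int solve_update(const double R[3][3], const double t[3],
                 const double *uk, const double *vk, const double *qk,
                 const double *uc, const double *vc, size_t n,
                 double alpha[6])
{
    double A[6][7] = { { 0.0 } };                 /* augmented normal equations */
    for (size_t i = 0; i < n; ++i) {
        double up, vp, qp, J[2][6];
        transform_uvq(R, t, uk[i], vk[i], qk[i], &up, &vp, &qp);
        fill_jacobian(up, vp, qp, J);
        double r[2] = { uc[i] - up, vc[i] - vp }; /* (u_c, v_c) - (u'_k, v'_k) */
        for (int a = 0; a < 6; ++a)
            for (int row = 0; row < 2; ++row) {
                for (int b = 0; b < 6; ++b)
                    A[a][b] += J[row][a] * J[row][b];
                A[a][6] += J[row][a] * r[row];
            }
    }
    for (int col = 0; col < 6; ++col) {           /* forward elimination */
        int piv = col;
        for (int row = col + 1; row < 6; ++row)
            if (fabs(A[row][col]) > fabs(A[piv][col]))
                piv = row;
        if (A[piv][col] == 0.0)
            return -1;                            /* degenerate configuration */
        for (int k = 0; k < 7; ++k) {
            double tmp = A[col][k]; A[col][k] = A[piv][k]; A[piv][k] = tmp;
        }
        for (int row = col + 1; row < 6; ++row) {
            double f = A[row][col] / A[col][col];
            for (int k = col; k < 7; ++k)
                A[row][k] -= f * A[col][k];
        }
    }
    for (int row = 5; row >= 0; --row) {          /* back substitution */
        double s = A[row][6];
        for (int k = row + 1; k < 6; ++k)
            s -= A[row][k] * alpha[k];
        alpha[row] = s / A[row][row];
    }
    return 0;
}
```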
Because the correspondences obtained using HIPS contain many outliers, RANSAC was used (sampling on triplets of correspondences) to obtain the set of inliers. The final pose was then calculated through an optimisation over all inliers. This process typically takes only about 0.5 ms. Figure 4 shows all the inlier correspondences between the Kinect and mobile camera images computed in this way.
Figure 4: The inlier matches between the Kinect RGB im-
age and the mobile camera RGB image as determined by
RANSAC
3.6 Robot Localisation on a 2D Planar Map
Since the robot is restricted to movement on the ground surface, we define the ground plane using the Kinect's depth data to give the robot a 2D map of its operating environment. This is done through a calibration process where a user manually selects three points on the ground plane in the Kinect image. The depth values at these three points are then used to generate a ground plane of the form ax + by + cz = d, where the constants a, b and c are normalised to produce a unit vector for the plane normal.
Points in the Kinect depth image are classified into three categories (a minimal classification sketch is given below the list):
• Points within a noise threshold of the ground plane are considered to be part of the ground plane.
• Points above the ground but below the height of the robot are marked as obstacles.
• Other points are disregarded.
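The classification could be expressed as follows, under the plane convention ax + by + cz = d with the unit normal n assumed to point away from the floor; noise_threshold and robot_height are hypothetical tuning parameters, not values from the paper.

```c
#include <math.h>

typedef enum { GROUND, OBSTACLE, IGNORED } PointClass;

/* Classify a Kinect point p against the calibrated plane ax + by + cz = d,
 * with unit normal n = (a, b, c) assumed to point away from the floor. */
PointClass classify_point(const double p[3], const double n[3], double d,
                          double noise_threshold, double robot_height)
{
    /* signed height of the point above the ground plane */
    double h = n[0] * p[0] + n[1] * p[1] + n[2] * p[2] - d;
    if (fabs(h) < noise_threshold)
        return GROUND;
    if (h > 0.0 && h <= robot_height)
        return OBSTACLE;
    return IGNORED;
}
```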
Projection of a point [x, y, z]^T to the ground plane with plane normal n = [a, b, c]^T is computed by:

\text{Projected Point} = \left[ I_{3\times3} - n\,n^T \right] \begin{pmatrix} x \\ y \\ z \end{pmatrix} + d\,n,    (11)
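Equation (11) translates directly into code for a unit normal n:

```c
/* Equation (11): orthogonal projection of a point onto the ground plane
 * n . x = d, where n is the unit plane normal. */
void project_to_plane(const double p[3], const double n[3], double d,
                      double out[3])
{
    double ndotp = n[0] * p[0] + n[1] * p[1] + n[2] * p[2];
    for (int i = 0; i < 3; ++i)
        out[i] = p[i] - (ndotp - d) * n[i];   /* (I - n n^T) p + d n */
}
```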
The projected ground plane points (x', y', z')^T are transformed onto an xy plane to give an overhead 2D map. The transformation required is a rotation to align the plane normal, n, and two other perpendicular vectors lying on the plane, x_{plane} and y_{plane}, with the Kinect's camera coordinate frame. This is followed by a translation in the plane normal direction by an arbitrary amount \lambda. The transformation can be expressed as:

\begin{pmatrix} x_{map} \\ y_{map} \\ z_{map} \end{pmatrix} = \begin{pmatrix} x_{plane}^T \\ y_{plane}^T \\ n^T \end{pmatrix} \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} - \lambda\,n    (12)

The z_{map} coordinate is constant over all points in the xy plane and is discarded.
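A sketch of Equation (12) as given above; x_plane and y_plane are the two perpendicular in-plane unit vectors and lambda the arbitrary offset along the normal.

```c
/* Equation (12): express a projected ground-plane point p in the map frame
 * spanned by the in-plane unit vectors x_plane and y_plane and the normal n,
 * offset by the arbitrary amount lambda along the normal. map[2] (z_map) is
 * constant and discarded by the caller. */
void to_map_frame(const double p[3], const double x_plane[3],
                  const double y_plane[3], const double n[3],
                  double lambda, double map[3])
{
    const double *basis[3] = { x_plane, y_plane, n };
    for (int row = 0; row < 3; ++row) {
        map[row] = -lambda * n[row];
        for (int i = 0; i < 3; ++i)
            map[row] += basis[row][i] * p[i];
    }
}
```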
To obtain the final map localisation, the pose estimate is projected from the Kinect's frame onto the 2D map. The translation component of the pose estimate is projected in a similar manner to a 3D point, as described above. The orientation requires multiplying the rotation component with a unit vector along the Kinect's optical axis before projection. The result is the localisation of robot position as well as orientation. Figure 5 displays the position and orientation of the robot in a map filled with obstacles.
Figure 5: Dynamic map generated using Kinect. Left: Kinect's point-of-view showing obstacles in red. Right: Kinect data projected as a 2D planar map.
4 Experiments
4.1 Localisation Accuracy for a Static Scene
Figure 6: Experimental setup to measure the accuracy of the localisation system. The masking tape on the floor is used for ground truth measurements.
To measure the precision and repeatability of the localisation system, the robot was localised at a number of marked positions and orientations as shown in figure 6. This is done by projecting the 6 DoF robot pose onto the 2D planar map. The distance between the marked positions was measured to 2 mm accuracy with a measuring tape to provide ground truth data. 10 data points were measured for each location to provide a mean and standard deviation for each position. The obtained results are shown in figure 7. Each cluster of points represents a set of measurements for a single location.
Figure 7: Measured Position Localisation
This data was analysed in a number of ways:
• Because the full 3D 6 DoF pose is calculated, the constraint that the robot sits on the ground plane is not respected. In order to provide a measure of the global accuracy of the system, a plane was fitted to all the computed mobile camera positions. The residual sum-squared error between all computed positions and this plane was then calculated and divided by (n - 3), where n is the number of measured positions, to obtain an unbiassed estimate for the variance of position error in the direction of the plane normal. This analysis gave a standard deviation in this direction of 0.012 m (1.2 cm).
• A second analysis was performed on each line of data using a similar method. A line was fitted to each of the three subsets of the data, and the residual squared distance to the line was computed and divided by (n - 2) to give unbiassed estimates of the standard deviation of 0.010 m for the right line, 0.016 m for the centre line and 0.021 m for the left line. The reason for this variation in accuracy from left to right is unknown, but is likely to be due to the specific geometry of landmark data used in the experiments. The standard deviation of the orientation was also computed to be 0.03 radians (all points on the same line were assumed to have the same ground truth orientation).
The accuracy of the localisation is largely dependent on the number of correct feature matches between the Kinect and the robot's camera. With a cluttered scene where there are numerous texture features, the system demonstrates results consistent to within 16 mm of the ground truth. As the proportion, p, of correct matches drops, the probability of selecting the correct hypothesis for a given number of RANSAC iterations, n, decreases. For the three point pose estimation, we require three correct correspondences. The condition for at most a 0.01 probability of testing n randomly sampled hypotheses and not selecting a correct inlier set is:
(1 - p^3)^n < 0.01    (13)
Our system samples n = 200 hypotheses, requiring an inlier rate of 0.283 for a 0.99 chance of selecting the correct pose. This places a limit on the field of view overlap required between the Kinect and the robot's camera for correct localisation. A smaller view overlap and the likely increase in false positive correspondences may diminish the chance of obtaining the correct pose.
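As a quick numerical check of Equation (13), solving (1 - p^3)^n = 0.01 for p with n = 200 reproduces the 0.283 inlier rate quoted above.

```c
#include <math.h>
#include <stdio.h>

/* Equation (13): the minimum inlier rate p for which n = 200 hypotheses
 * give a 0.99 chance of sampling at least one all-inlier triplet,
 * i.e. (1 - p^3)^n < 0.01 solved for p. */
int main(void)
{
    const double n = 200.0;
    double p = cbrt(1.0 - pow(0.01, 1.0 / n));
    printf("minimum inlier rate: %.3f\n", p);   /* prints 0.283 */
    return 0;
}
```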
Another source of mismatches arises from large viewpoint differences between the Kinect and the robot. Despite our HIPS features having slight affine invariance, sufficient viewpoint deviation will cause structural corners to change appearance significantly due to their changing background.
4.2 Localisation in Dynamic Scenes
It is difficult to perform visual localisation in dynamic scenes, as visual features may move unexpectedly or become occluded on a regular basis. Our system operates robustly in dynamic scenes because moving objects still contribute visual features to pose estimation. Occlusions have little effect because they only matter when features visible in one view are occluded in the other, which is typically only a small portion of the features.
The video submission accompanying this paper (http://www.youtube.com/watch?v=N5PXHWk8nCE) shows the robot localising repeatedly in a dynamic environment. Still frames of the robotic experiment taken from a handycam and screen captures of the 2D planar map showing the robot's pose are available at the end of this paper.
5 Conclusion and Future Work
We have presented a robot and external RGB-D sensor system that performs 6 DoF pose estimation in real time. The system achieves up to 10 fps despite the processing limitation of the smartphone on the robot and the bandwidth limitation of the Wi-Fi-based communication link. Localisation is also robust against dynamic environments. Future work includes extending the system to include input from multiple Kinects so that universal coverage can be provided, and the use of multiple robots, including Kinect-carrying robots that can act as the external sensor. We also plan to explore the computational asymmetry further by investigating different divisions of labour between robots and external sensors.
References
[Betke and Gurvits, 1997] M. Betke and L. Gurvits. Mobile robot localization using landmarks. IEEE Transactions on Robotics and Automation, 13(2):251–263, 1997.
[Cox et al., 1990] I.J. Cox, G.T. Wilfong, and T.L. Pérez. Autonomous Robot Vehicles, volume 447. Springer-Verlag, 1990.
[Dellaert et al., 1999] F. Dellaert, W. Burgard, D. Fox, and S. Thrun. Using the condensation algorithm for robust, vision-based mobile robot localization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 1999.
[Inaba et al., 2000] Masayuki Inaba, Satoshi Kagami, Fumio Kanehiro, Yukiko Hoshino, and Hirochika Inoue. A platform for robotics research based on the remote-brained robot approach. The International Journal of Robotics Research, 19(10):933–954, 2000.
[OpenNI, 2010] OpenNI. Kinect drivers. http://www.openni.org/, 2010.
[Rawlinson et al., 2004] D. Rawlinson, P. Chakravarty, and R. Jarvis. Distributed visual servoing of a mobile robot for surveillance applications. In Australasian Conference on Robotics and Automation (ACRA), 2004.
[Rekleitis et al., 2006] Ioannis Rekleitis, David Meger, and Gregory Dudek. Simultaneous planning, localization, and mapping in a camera sensor network. Robotics and Autonomous Systems, 54(11):921–932, 2006. Planning Under Uncertainty in Robotics.
[Rosten and Drummond, 2006] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Computer Vision – ECCV 2006, pages 430–443, 2006.
[Se et al., 2001] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping using scale-invariant features. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation (ICRA), volume 2, pages 2051–2058. IEEE, 2001.
[Taylor and Drummond, 2009] S. Taylor and T. Drummond. Multiple target localisation at over 100 fps. In British Machine Vision Conference, 2009.
Figure 8: Initial movement of robot away from Kinect
Figure 9: No visual localisation as the robot and external sensor do not share a field of view
Figure 10: Example of localisation in dynamic scene with moving objects
Figure 11: Example of localisation in dynamic scene with occlusions
Figure 12: Example of localisation under significant scale and viewpoint change