A Low-Cost Stereo System For 3D Object Recognition
E-mail: {oleari,dlr,caselli}@ce.unipr.it
Abstract—In this paper, we present a low-cost stereo vision system designed for object recognition with FPFH point feature descriptors. Image acquisition is performed using a pair of consumer-market UVC cameras costing less than 80 Euros, lacking a synchronization signal and without customizable optics. Nonetheless, the acquired point clouds are sufficiently accurate to perform object recognition using FPFH features. The recognition algorithm compares the point cluster extracted from the current image pair with the models contained in a dataset. Experiments show that the recognition rate is above 80% even when the object is partially occluded.

I. INTRODUCTION

The diffusion of relatively accurate 3D sensors has popularized scene interpretation and point cloud processing. Motion planning, human-robot interaction, manipulation and grasping [1] have taken advantage of these advancements in perception. In particular, the identification of objects in a scene is a fundamental task when the robot operates in environments with human artifacts.

The complexity of object recognition depends on the accuracy of the sensors, on the availability of shape or color information, on specific prior knowledge of the object dataset, and on the setup or the operating context. Although 3D perception is not mandatory for object recognition, the availability of the object shape can improve recognition and allows the assessment of the object pose for further operations like manipulation. Low-cost 3D sensors broaden the application domains of shape processing and support the development of effective algorithms. Depth cameras and RGB-D sensors rely either on active stereo or on time-of-flight [2] and often provide an off-the-shelf solution for end users that does not require complex calibration operations. However, active stereo devices like the MS Kinect [3] or the Dinast Cyclope [4] are sensitive to environment lighting conditions, since the perception of patterns in the infrared or other domains may be noisy. A cheap stereo vision system can be constructed using a pair of low-cost cameras. Such a cost-effective solution requires manually building the setup, calibrating the complete system, and carefully tuning the parameters to achieve a sufficiently dense point cloud. Moreover, a stereo system can be designed according to the requirements of a specific application (e.g. by adapting the baseline). Since such a 3D sensor is not an active sensor, it can be used in outdoor environments.

A common requirement for a recognition algorithm is the identification of a region of interest (ROI) corresponding to a candidate object. This operation can be simplified by exploiting specific knowledge about the setup, e.g. that all the candidate objects lie on a table. Object recognition is commonly achieved by extracting features that represent a signature for a point neighborhood. Several 3D features to be extracted from point clouds or other representations have been proposed over the years. Spherical harmonic invariants [5] are computed on parametrized surfaces as values invariant to translation and rotation of such surfaces. Spin images [6] are obtained by projecting and binning the object surface vertices on the frame defined by an oriented point on the surface. The curvature map method [7] computes a signature based on the curvature in the neighborhood of each vertex. The Scale Invariant Feature Transform (SIFT) [8], which extracts points and a signature vector of descriptors characterizing their neighborhood, has established a standard model for several point feature descriptors and has popularized the feature constellation method to recognize objects. According to this approach, the signature of an object consists of a collection of features extracted from the observation. Object recognition between the current observation and an object model is performed by matching each descriptor extracted from the observation with its closest descriptor in the model. If many pairs of similar points have consistent relative positions, the comparison outcome is positive. The feature constellation method thus exploits both feature similarity, which is measured by a metric in descriptor space, and feature proximity. More recently, point feature descriptors designed according to the point descriptor paradigm, like the Normal Aligned Radial Feature (NARF) [9], the Point Feature Histogram (PFH) and the Fast Point Feature Histogram (FPFH) [10], have been proposed for 3D points. FPFH descriptors are computed as histograms of the angles between the normal of a point and the normals of the points in its neighborhood. Several such features have been proposed and implemented in the Point Cloud Library (PCL) [11]. These methods usually provide a parameter vector that describes the local shape. Such descriptors allow the recognition of known objects by matching a model and the observed point cloud.

In this paper, we present a low-cost stereo vision system designed for object recognition with FPFH point feature descriptors. We show its effectiveness even in the presence of occlusions. This work demonstrates that this fundamental task can be performed using state-of-the-art algorithms on 3D sensor data acquired with generic consumer hardware costing less than 80 Euros, taking a different approach from RGB-D cameras. The stereo system has been built by mounting two Logitech C270 UVC (USB Video Class) cameras on a rigid bar. The main limitations of such sensors lie in the lack of hardware synchronization trigger signals and of customizable optics. A flaw in frame synchronization may affect the accuracy of the disparity map. However, the overall image quality and resolution and the approximate software synchronization allow the computation of a sufficiently dense and accurate disparity
image to perform object recognition. The calibration (intrinsic
and extrinsic) and the computation of the disparity image have
been performed using the packages and libraries provided
by the ROS (Robot Operating System) framework. Since scene
segmentation is not the aim of this work, the system works
under the assumption that all the objects to be recognized lie
on a planar surface and inside a given bounded region. The
object recognition algorithm is based on comparison of FPFH
feature collections. In particular, the FPFH points extracted
from the current point cloud are compared with the FPFH point
models contained in an object dataset. The dataset consists
of 8 objects observed from about 6 viewpoints. Tests have
been performed to evaluate the recognition performance of
the stereo system. The recognition algorithm has shown good
performance with true positive rate above 80%. The effects of
occlusion on recognition rate have been assessed by showing
that the recognition performance is only slightly affected when
the occluded part of the object is less than 40% of its visible
surface.
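To make the descriptor concrete, the following minimal sketch shows how FPFH signatures of the kind used in this work can be computed with PCL [11]. It is an illustration rather than the authors' code; the file name and the search radii are placeholder assumptions.

    #include <pcl/features/fpfh.h>
    #include <pcl/features/normal_3d.h>
    #include <pcl/io/pcd_io.h>
    #include <pcl/point_types.h>
    #include <pcl/search/kdtree.h>

    int main() {
      // Load a point cluster (file name is a placeholder).
      pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
      pcl::io::loadPCDFile("cluster.pcd", *cloud);
      pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);

      // Surface normals, required as input by FPFH.
      pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
      ne.setInputCloud(cloud);
      ne.setSearchMethod(tree);
      ne.setRadiusSearch(0.01);  // placeholder radius in meters
      pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
      ne.compute(*normals);

      // 33-bin FPFH signature per point, histogramming the angles between
      // each point normal and the normals in its neighborhood.
      pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::FPFHSignature33> fpfh;
      fpfh.setInputCloud(cloud);
      fpfh.setInputNormals(normals);
      fpfh.setSearchMethod(tree);
      fpfh.setRadiusSearch(0.02);  // must exceed the normal-estimation radius
      pcl::PointCloud<pcl::FPFHSignature33>::Ptr descriptors(
          new pcl::PointCloud<pcl::FPFHSignature33>);
      fpfh.compute(*descriptors);
      return 0;
    }

Note that the FPFH search radius has to be larger than the radius used for normal estimation, otherwise the histograms are built on unreliable normals.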
The paper is organized as follows. Section II illustrates the
low-cost stereo system. Section III presents the algorithm for
the identification of the region of interest where the object
lies. Section IV presents the object recognition algorithms.
Section V presents the experiments performed to assess per-
formance and Section VI discusses the results.
II. THE LOW-COST STEREO SYSTEM

… sensors that could compromise camera calibration. On the bottom and on the back of the enclosure there are two 1/4" UNC nuts, fully compatible with photographic supports. The overall final dimensions are 205×44×40 mm and the total cost, webcams included, does not exceed 80 Euros.

Fig. 2. Acquisition time in milliseconds for both left and right frames.
III. OBJECT CLUSTER EXTRACTION

The processing pipeline starts from the acquisition of the left and right frames. Then the standard ROS package stereo_image_proc computes the disparity image and the resulting point cloud. Parameters of the stereo reconstruction algorithm are shown in Table I and example results in different scenarios are displayed in Fig. 3.

Fig. 3. Example results of stereo reconstruction in different scenarios: (a) working environment; (b) interior; (c) exterior/automotive.

The 3D representation of the scene is then segmented and filtered to preserve only the information on the region of interest. In this work we did not focus on detection, so the extraction
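A typical realization of this step, under the stated hypothesis that objects lie on a planar surface inside a bounded region, is the standard PCL tabletop pipeline: crop the cloud to the working volume, remove the dominant plane with RANSAC, and keep the remainder as the candidate cluster. The sketch below illustrates that approach; it is not necessarily the authors' implementation, and the crop bounds and distance threshold are placeholder assumptions.

    #include <pcl/ModelCoefficients.h>
    #include <pcl/filters/extract_indices.h>
    #include <pcl/filters/passthrough.h>
    #include <pcl/point_types.h>
    #include <pcl/segmentation/sac_segmentation.h>

    pcl::PointCloud<pcl::PointXYZ>::Ptr
    extractObjectCluster(const pcl::PointCloud<pcl::PointXYZ>::Ptr& scene) {
      // Crop to the bounded working region along depth (bounds are placeholders).
      pcl::PassThrough<pcl::PointXYZ> pass;
      pass.setInputCloud(scene);
      pass.setFilterFieldName("z");
      pass.setFilterLimits(0.3f, 1.5f);
      pcl::PointCloud<pcl::PointXYZ>::Ptr roi(new pcl::PointCloud<pcl::PointXYZ>);
      pass.filter(*roi);

      // Fit the dominant plane (the supporting surface) with RANSAC.
      pcl::SACSegmentation<pcl::PointXYZ> seg;
      seg.setModelType(pcl::SACMODEL_PLANE);
      seg.setMethodType(pcl::SAC_RANSAC);
      seg.setDistanceThreshold(0.01);  // placeholder threshold in meters
      seg.setInputCloud(roi);
      pcl::ModelCoefficients coeffs;
      pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
      seg.segment(*inliers, coeffs);

      // Remove the plane inliers; what remains is the candidate object cluster.
      pcl::ExtractIndices<pcl::PointXYZ> extract;
      extract.setInputCloud(roi);
      extract.setIndices(inliers);
      extract.setNegative(true);
      pcl::PointCloud<pcl::PointXYZ>::Ptr object(new pcl::PointCloud<pcl::PointXYZ>);
      extract.filter(*object);
      return object;
    }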
IV. RECOGNITION

Cluster recognition is the last step in the described pipeline and aims at matching a selected cluster with an entry in a dataset of known models. The dataset consists of a variable number of views for each object, taken from pseudo-random points of view, as shown in Figure 5. Each model is obtained by accumulating points from multiple frames in order to fill the gaps in the cloud produced by stereo vision. Then a voxel grid filter is applied to achieve a uniformly-sampled point cloud.

The recognition algorithm is based on point cloud alignment. The two clouds, the $i$-th model $P_i^{mod}$ and the current object $P^{obj}$, need to be registered, i.e. aligned in 3D space, in order to be compared. The registration procedure computes the rigid geometric transformation that should be applied to $P_i^{mod}$ to align it to $P^{obj}$. Registration is performed in three different steps:

• Remove the dependency on an external reference frame: $P_i^{mod}$ and $P^{obj}$ are initially expressed in the reference frames of their respective centroids.
• Perform an initial alignment: the algorithm estimates an initial, sub-optimal alignment between the point clouds. This step is performed with the assistance of a RANSAC method that
Algorithm 2: Overall recognition procedure

Data:
    $P^{mod}[\cdot]$: list of point cloud models;
    $P^{obj}$: point cloud of the object to be recognized;
Result:
    name: name of the recognized object;

$F_{max} \leftarrow 0$;
foreach $P_i^{mod} \in P^{mod}[\cdot]$ do
    $P_{i,aligned}^{mod} \leftarrow$ performRegistration($P_i^{mod}$, $P^{obj}$);
    $F_i \leftarrow$ getFitness($P^{obj}$, $P_{i,aligned}^{mod}$);
    if $F_i > F_{max}$ then
        $F_{max} \leftarrow F_i$;
        name $\leftarrow$ name of $P_i^{mod}$;
    end
end
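A plausible PCL realization of Algorithm 2 is sketched below, assuming SAC-IA for the RANSAC-based initial alignment and ICP for the refinement. The Model structure and the demean helper are hypothetical, and PCL's fitness score is a mean squared distance (lower is better), i.e. the opposite convention of the fitness $F_i$ maximized above.

    #include <limits>
    #include <string>
    #include <vector>

    #include <pcl/common/centroid.h>
    #include <pcl/point_types.h>
    #include <pcl/registration/ia_ransac.h>
    #include <pcl/registration/icp.h>

    // Hypothetical container pairing each model view with its FPFH descriptors.
    struct Model {
      std::string name;
      pcl::PointCloud<pcl::PointXYZ>::Ptr cloud;       // P_i^mod
      pcl::PointCloud<pcl::FPFHSignature33>::Ptr fpfh;
    };

    // Step 1: express a cloud in the reference frame of its own centroid.
    static pcl::PointCloud<pcl::PointXYZ>::Ptr demean(
        const pcl::PointCloud<pcl::PointXYZ>::Ptr& in) {
      Eigen::Vector4f centroid;
      pcl::compute3DCentroid(*in, centroid);
      pcl::PointCloud<pcl::PointXYZ>::Ptr out(new pcl::PointCloud<pcl::PointXYZ>);
      pcl::demeanPointCloud(*in, centroid, *out);
      return out;
    }

    std::string recognize(const std::vector<Model>& models,
                          const pcl::PointCloud<pcl::PointXYZ>::Ptr& object,
                          const pcl::PointCloud<pcl::FPFHSignature33>::Ptr& objectFpfh) {
      const pcl::PointCloud<pcl::PointXYZ>::Ptr obj = demean(object);
      std::string name;
      double best = std::numeric_limits<double>::max();
      for (const auto& m : models) {
        // Step 2: RANSAC-based initial alignment over FPFH correspondences.
        pcl::SampleConsensusInitialAlignment<pcl::PointXYZ, pcl::PointXYZ,
                                             pcl::FPFHSignature33> sacia;
        sacia.setInputSource(demean(m.cloud));
        sacia.setSourceFeatures(m.fpfh);
        sacia.setInputTarget(obj);
        sacia.setTargetFeatures(objectFpfh);
        pcl::PointCloud<pcl::PointXYZ>::Ptr coarse(new pcl::PointCloud<pcl::PointXYZ>);
        sacia.align(*coarse);
        // Step 3 (assumed): ICP refinement; the fitness is a mean squared
        // distance, so the best model is the one with the lowest score.
        pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
        icp.setInputSource(coarse);
        icp.setInputTarget(obj);
        pcl::PointCloud<pcl::PointXYZ> refined;
        icp.align(refined);
        if (icp.hasConverged() && icp.getFitnessScore() < best) {
          best = icp.getFitnessScore();
          name = m.name;
        }
      }
      return name;
    }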
Fig. 5. Multiple models obtained from different PoV for an example object.
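The model-building step described above (accumulating frames, then voxel-grid resampling) can be sketched as follows; the 5 mm leaf size is an assumption for illustration, as the paper does not report the value it used.

    #include <vector>

    #include <pcl/filters/voxel_grid.h>
    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>

    // Build a model by accumulating several frames of the same view and
    // resampling the result on a uniform grid.
    pcl::PointCloud<pcl::PointXYZ>::Ptr buildModel(
        const std::vector<pcl::PointCloud<pcl::PointXYZ>::Ptr>& frames) {
      pcl::PointCloud<pcl::PointXYZ>::Ptr accumulated(new pcl::PointCloud<pcl::PointXYZ>);
      for (const auto& f : frames)
        *accumulated += *f;  // concatenation fills gaps left by stereo dropouts

      pcl::VoxelGrid<pcl::PointXYZ> grid;
      grid.setInputCloud(accumulated);
      grid.setLeafSize(0.005f, 0.005f, 0.005f);  // placeholder 5 mm resolution
      pcl::PointCloud<pcl::PointXYZ>::Ptr model(new pcl::PointCloud<pcl::PointXYZ>);
      grid.filter(*model);
      return model;
    }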
TABLE II. CONFUSION MATRIX FOR EACH CATEGORY WITHOUT OCCLUSION.

                 horse starlet  horse  baby  big detergent  fire  woolite  chocolate  hammer
horse starlet         144         2     0          0          0      0         0        0
horse                   0       111     2          0          0      0         0        0
baby                    1         0   128          2          0      0         0        0
big detergent           0         1     1         60          0      0         0       14
fire                    5         3     3          0        111      1         0        0
woolite                 2         0     6          8          0    116         0        0
chocolate               3         4     9          3          0     23        67        0
hammer                 14         2     3         10          0      6         2      132
TABLE III.
test01   61   3,5,10,20,30,50
test02   32   3,5,10,20,30,50
test03   61   3,10,30
test04   24   3,5,10,20,30,50
test05   61   5,15
test06   32   5,15
test07   32   3,10,30

Fig. 8. Recognition performance: true and false positive rates (TruePos, FalsePos, in percentage) for the tests in Table III.
from 10% to 70% of the object surface perceived from the current viewpoint. Recognition results are shown in Table IV. The recognition algorithm still exhibits good performance with occlusions up to 30%, with true positive rates above 70%. Performance rapidly decreases with occlusions up to 40% and then collapses with increasing percentages of occluded points. Figure 10 shows ROC curves for all tests with occlusions and a reference test without them. Performance with occlusions up to 30% is consistent with the reference test.

Fig. 10. ROC curves (true positive rate vs. false positive rate) for tests with occlusions: no occlusions, occl10to20, occl20to30, occl30to40.

VI. CONCLUSION

In this paper, we have illustrated a low-cost stereo vision system and its application to object recognition. The hardware consists of a pair of consumer market cameras mounted on a rigid bar and costs less than 80 Euros. These cameras lack hardware-synchronized trigger signals and do not allow optics customization. In spite of such limitations, the point cloud obtained using the ROS packages for acquisition, calibration and disparity image computation is sufficiently accurate for the given task. The point cloud cluster containing the object to be recognized is identified under the hypothesis that such object lies on a planar surface and inside a given bounded region. The recognition algorithm is based on the extraction and comparison of FPFH features and is robust to partial views and to occlusions. Each candidate object is compared with the models contained in a dataset defined a priori. Experiments have been performed to assess the performance of the algorithm and have shown an overall recognition rate above 80%. The effect of occlusion on the recognition rate has been assessed by showing that recognition performance is only slightly affected even when occlusion removes up to 30% of the object surface perceived from the current viewpoint.

In the system described in this work, the ROI is fixed and a single object is assumed to lie in the scene. We are currently

REFERENCES

[1] J. Aleotti, D. Lodi Rizzini, and S. Caselli, "Object Categorization and Grasping by Parts from Range Scan Data," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2012, pp. 4190–4196.
[2] K. Konolige, "Projected texture stereo," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2010.
[3] M. Andersen, T. Jensen, P. Lisouski, A. Mortensen, M. Hansen, T. Gregersen, and P. Ahrendt, "Kinect depth sensor evaluation for computer vision applications," Department of Engineering, Aarhus University (Denmark), Tech. Rep. ECE-TR-6, 2012.
[4] D. Um, D. Ryu, and M. Kal, "Multiple intensity differentiation for 3-D surface reconstruction with mono-vision infrared proximity array sensor," IEEE Sensors Journal, vol. 11, no. 12, pp. 3352–3358, 2011.
[5] G. Burel and H. Hénocq, "Three-dimensional invariants and their application to object recognition," Signal Process., vol. 45, no. 1, pp. 1–22, 1995.
[6] A. Johnson, "Spin-images: A representation for 3-D surface matching," Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, August 1997.
[7] T. Gatzke, C. Grimm, M. Garland, and S. Zelinka, "Curvature maps for local shape comparison," in Proc. of Int. Conf. on Shape Modeling and Applications (SMI), 2005, pp. 246–255.
[8] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[9] B. Steder, R. B. Rusu, K. Konolige, and W. Burgard, "Point feature extraction on 3D range scans taking into account object boundaries," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2011.
[10] R. Rusu, N. Blodow, and M. Beetz, "Fast Point Feature Histograms (FPFH) for 3D registration," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2009, pp. 3212–3217.
[11] A. Aldoma, Z. Marton, F. Tombari, W. Wohlkinger, C. Potthast, B. Zeisl, R. Rusu, S. Gedikli, and M. Vincze, "Tutorial: Point Cloud Library: Three-dimensional object recognition and 6 DOF pose estimation," IEEE Robotics & Automation Magazine, vol. 19, no. 3, pp. 80–91, Sept. 2012.
[12] H. Kato and M. Billinghurst, "Marker tracking and HMD calibration for a video-based augmented reality conferencing system," in Proc. of the Int. Workshop on Augmented Reality, 1999.
[13] R. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), Shanghai, China, May 9–13, 2011.