A Low-Cost Stereo System For 3D Object Recognition

The document presents a low-cost stereo vision system designed for 3D object recognition using Fast Point Feature Histograms (FPFH). Utilizing two consumer-grade UVC cameras, the system achieves an object recognition rate above 80%, even with partial occlusions. The paper details the hardware setup, object extraction, and recognition algorithms, demonstrating the effectiveness of the system in various environments.

A low-cost stereo system for 3D object recognition
Conference Paper · September 2013 · DOI: 10.1109/ICCP.2013.6646095


A Low-Cost Stereo System for 3D Object
Recognition

Fabio Oleari∗† , Dario Lodi Rizzini∗ , Stefano Caselli∗


∗ RIMLab - Robotics and Intelligent Machines Laboratory
Dipartimento di Ingegneria dell’Informazione, University of Parma, Italy
† Elettric80 S.p.a. - Via G. Marconi, 23 42030 Viano (RE), Italy

E-mail {oleari,dlr,caselli}@ce.unipr.it

Abstract—In this paper, we present a low-cost stereo vision system designed for object recognition with FPFH point feature descriptors. Image acquisition is performed using a pair of consumer-market UVC cameras costing less than 80 Euros, lacking a synchronization signal and without customizable optics. Nonetheless, the acquired point clouds are sufficiently accurate to perform object recognition using FPFH features. The recognition algorithm compares the point cluster extracted from the current image pair with the models contained in a dataset. Experiments show that the recognition rate is above 80% even when the object is partially occluded.

I. INTRODUCTION

The diffusion of relatively accurate 3D sensors has popularized scene interpretation and point cloud processing. Motion planning, human-robot interaction, and manipulation and grasping [1] have taken advantage of these advancements in perception. In particular, the identification of objects in a scene is a fundamental task when the robot operates in environments with human artifacts.

The complexity of object recognition depends on the accuracy of the sensors, on the availability of shape or color information, on specific prior knowledge of the object dataset, and on the setup or operating context. Although 3D perception is not mandatory for object recognition, the availability of the object shape can improve recognition and allows the assessment of the object pose for further operations like manipulation. Low-cost 3D sensors broaden the application domains of shape processing and support the development of effective algorithms. Depth cameras and RGB-D sensors rely either on active stereo or on time-of-flight [2] and often provide an off-the-shelf solution for end-users that does not require complex calibration operations. However, active stereo devices like the MS Kinect [3] or the Dinast Cyclope [4] are sensitive to environment lighting conditions, since the perception of patterns in infrared or other domains may be noisy. A cheap stereo vision system can be constructed using a pair of low-cost cameras. Such a cost-effective solution requires manually building the setup, calibrating the complete system, and carefully tuning the parameters to achieve a sufficiently dense point cloud. Moreover, a stereo system can be designed according to the requirements of a specific application (e.g. by adapting the baseline). Since such a 3D sensor is not an active sensor, it can be used in outdoor environments.

A common requirement for a recognition algorithm is the identification of a region of interest (ROI) corresponding to a candidate object. This operation can be simplified by exploiting specific knowledge about the setup, e.g. that all the candidate objects lie on a table.

Object recognition is commonly achieved by extracting features that represent a signature for a point neighborhood. Several 3D features to be extracted from point clouds or other representations have been proposed over the years. Spherical harmonic invariants [5] are computed on parametrized surfaces as values invariant to translation and rotation of such surfaces. Spin images [6] are obtained by projecting and binning the object surface vertices on the frame defined by an oriented point on the surface. The curvature map method [7] computes a signature based on curvature in the neighborhood of each vertex. The Scale Invariant Feature Transform (SIFT) [8], which extracts points and a signature vector of descriptors characterizing the neighborhood, has established a standard model for several point feature descriptors and has popularized the feature constellation method for recognizing objects. According to this approach, the signature of an object consists of a collection of features extracted from the observation. Object recognition between the current observation and an object model is performed by matching each descriptor extracted from the observation with its closest descriptor in the model. If many pairs of similar points have consistent relative positions, the comparison outcome is positive. The feature constellation method thus exploits both feature similarity, which is measured by a metric in descriptor space, and feature proximity. More recently, point feature descriptors designed according to the point descriptor paradigm, like the Normal Aligned Radial Feature (NARF) [9], the Point Feature Histogram (PFH) and the Fast Point Feature Histogram (FPFH) [10], have been proposed for 3D points. FPFH descriptors are computed as histograms of the angles between the normal of a point and the normals of the points in its neighborhood. Several such features have been proposed and implemented in the Point Cloud Library (PCL) [11]. These methods usually provide a parameter vector that describes the local shape. Such descriptors allow the recognition of known objects by matching a model and the observed point cloud.
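As an illustrative aside, the core idea behind PFH/FPFH-style descriptors can be sketched in a few lines: a histogram of the angles between a point's normal and the normals of its neighbors. The sketch below is a deliberately simplified stand-in (the real FPFH in PCL bins three Darboux-frame angles into 33 bins and re-weights neighbor histograms), assuming numpy arrays of points and unit normals:

```python
import numpy as np

def simplified_normal_angle_histogram(points, normals, query_idx, radius, bins=8):
    """Histogram of angles between the query point's normal and the normals
    of its neighbors within `radius`. A simplified stand-in for the PFH/FPFH
    idea; the real FPFH uses three Darboux-frame angles and 33 bins."""
    p = points[query_idx]
    n = normals[query_idx]
    dists = np.linalg.norm(points - p, axis=1)   # brute-force radius search
    mask = (dists <= radius) & (dists > 0.0)     # exclude the query point itself
    neighbor_normals = normals[mask]
    if len(neighbor_normals) == 0:
        return np.zeros(bins)
    cosines = np.clip(neighbor_normals @ n, -1.0, 1.0)  # clip for numerical safety
    angles = np.arccos(cosines)                  # angles in [0, pi]
    hist, _ = np.histogram(angles, bins=bins, range=(0.0, np.pi))
    return hist / hist.sum()                     # normalized histogram
```

On a locally flat patch all neighbor normals agree with the query normal, so the mass concentrates in the first bin; curvature and edges spread the histogram out, which is what makes such histograms discriminative local signatures.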
In this paper, we present a low-cost stereo vision system designed for object recognition with FPFH point feature descriptors. We show its effectiveness even in the presence of occlusions. This work demonstrates that this fundamental task can be performed using state-of-the-art algorithms on 3D sensor data acquired with generic consumer hardware costing less than 80 Euros, taking a different approach from RGB-D cameras. The stereo system has been built by mounting two Logitech C270 UVC (USB Video Class) cameras on a rigid bar. The main limitations of such sensors lie in the lack of hardware synchronization trigger signals and of customizable optics. A flaw in frame synchronization may affect the accuracy of the disparity map. However, the overall image quality and resolution and the approximate software synchronization allow the computation of a sufficiently dense and accurate disparity image to perform object recognition. The calibration (intrinsic and extrinsic) and the computation of the disparity image have been performed using the packages and libraries provided by the ROS (Robot Operating System) framework. Since scene segmentation is not the aim of this work, the system works under the assumption that all the objects to be recognized lie on a planar surface and inside a given bounded region. The object recognition algorithm is based on the comparison of FPFH feature collections. In particular, the FPFH points extracted from the current point cloud are compared with the FPFH point models contained in an object dataset. The dataset consists of 8 objects observed from about 6 viewpoints. Tests have been performed to evaluate the recognition performance of the stereo system. The recognition algorithm has shown good performance, with a true positive rate above 80%. The effects of occlusion on the recognition rate have been assessed by showing that the recognition performance is only slightly affected when the occluded part of the object is less than 40% of its visible surface.

The paper is organized as follows. Section II illustrates the low-cost stereo system. Section III presents the algorithm for the identification of the region of interest where the object lies. Section IV presents the object recognition algorithm. Section V presents the experiments performed to assess performance and Section VI discusses the results.

II. HARDWARE SETUP

The stereo camera developed in this work (Figure 1) has been designed to be as general purpose as possible, so that object recognition tasks can be performed in different scenarios. The stereo system exploits a pair of Logitech C270 webcams, which offer relatively good flexibility and quality compared to other low-cost consumer cameras. The Logitech C270 provides a high-definition image sensor with a maximum available resolution of 1280x720 pixels. At the maximum resolution, the camera can grab frames at a frequency of 10 Hz.

The image sensor is fully compliant with the UVC standard and allows the setting of image parameters like brightness, contrast, white balance, exposure and gain. Moreover, each webcam exposes a unique serial number that can be used to deploy a dedicated UDEV rule; in this way the system uniquely distinguishes between the left and right cameras.

The case has been built in aluminium to give the stereo camera good mechanical strength. Thus, the sensor can be mounted on mobile robots, on manipulators, and in other scenarios where it could be mechanically stressed by vibrations or collisions. The internal structure of the enclosure is realized in 5 mm thick aluminium and all parts are mounted with M2.0 precision screws. The webcam PCBs are fixed to the internal structure with M1.6 precision screws that use the existing holes in the boards. Moreover, the use of spring washers between nuts guarantees a robust fastening, without movements of the sensors that could compromise camera calibration. On the bottom and on the back of the enclosure there are two 1/4" UNC nuts fully compatible with photographic supports. The overall final dimensions are 205x44x40 mm and the total cost, webcams included, does not exceed 80 Euros.

Fig. 1. Stereo camera realized with two Logitech C270 webcams.

When using non-professional devices for stereo vision, the key problem is the impossibility of synchronizing the cameras with an external trigger signal. The timing incoherence of left and right frames may generate gross errors when there is relative movement between the scene and the camera (moving scene and/or moving camera). To reduce this issue, the webcams have been driven at the maximum frame rate available for the selected resolution that does not saturate the USB 2.0 bandwidth. In this way the inter-frame time is reduced to the minimum allowed by the constraints imposed by the overall system architecture. The Logitech C270 webcams can provide images in SVGA resolution (800x600) at 15 frames per second without fully occupying the USB 2.0 bandwidth.

The frame grabbing task is assigned to the ROS package uvc_camera, and in particular to a slightly modified version of its stereo node, which acquires a pair of roughly software-synchronized frames from the two devices. Figure 2 shows the time in milliseconds needed by the driver to read both the left and right frames. The mean value is 66.01 ms, which corresponds to a frequency of 15 Hz, and the standard deviation is 3.73 ms. In the plotted sequence of 5000 samples, only 11 acquisitions were found to be poorly synchronized because grabbing both frames took approximately twice the mean time. In the end, only 184 frames (corresponding to 3.68%) were grabbed in a time higher than the mean + 1σ.

Fig. 2. Acquisition time in milliseconds for both left and right frames.
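The timing statistics reported above (mean, standard deviation, and the count of grabs slower than mean + 1σ) can be reproduced from a log of per-pair acquisition times. A minimal sketch with made-up values; the function and log are ours, not part of the ROS uvc_camera node:

```python
import statistics

def grab_time_report(grab_times_ms):
    """Summarize per-pair frame acquisition times and count grabs slower
    than mean + 1 standard deviation, mirroring the analysis above."""
    mean = statistics.mean(grab_times_ms)
    stdev = statistics.stdev(grab_times_ms)
    slow = [t for t in grab_times_ms if t > mean + stdev]
    return {
        "mean_ms": mean,
        "stdev_ms": stdev,
        "rate_hz": 1000.0 / mean,
        "slow_grabs": len(slow),
        "slow_fraction": len(slow) / len(grab_times_ms),
    }

# Illustrative log: mostly ~66 ms pairs, a couple of mildly late grabs,
# and two pairs that took roughly twice the typical time.
times = [66.0] * 96 + [70.5] * 2 + [132.0] * 2
report = grab_time_report(times)
```

With such a log, a pair can be flagged as poorly synchronized when its grab time exceeds mean + 1σ, which is how the outlier acquisitions were counted.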
III. OBJECT CLUSTER EXTRACTION

The processing pipeline starts from the acquisition of the left and right frames. Then the standard ROS package stereo_image_proc computes the disparity and the resulting point cloud. The parameters of the stereo reconstruction algorithm are shown in Table I, and example results in different scenarios are displayed in Fig. 3.

TABLE I. PARAMETERS OF THE STEREO RECONSTRUCTION ALGORITHM.

    Parameter                  Value
    prefilter size             9
    prefilter cap              31
    correlation window size    15
    min disparity              0
    disparity range            128
    uniqueness ratio           15
    texture threshold          9
    speckle size               90
    speckle range              4

Fig. 3. Example results of stereo reconstruction in different scenarios: (a) working environment; (b) interior and (c) exterior/automotive.

The 3D representation of the scene is then segmented and filtered to preserve only the information in the region of interest. Since this work does not focus on detection, the extraction of the object cluster from the overall scene point cloud relies on the assumption that the ROI is fixed. The working environment consists of a planar surface with a known bounded region and an ARToolKit [12] marker that identifies the global world frame. The first step of object extraction is the geometric transformation of the point cloud, required to express the points with respect to the world origin. The rototranslation matrices are obtained from the position and orientation of the ARToolKit marker. The resulting point cloud is then segmented with a bounding box that discards all points of the table and background using the Point Cloud Library (PCL) [13]. Finally, a statistical outlier removal filter is applied to discard the remaining isolated points. An example of a cluster of points resulting from the extraction process is shown in Fig. 4.

Fig. 4. Final result of the scene segmentation and filtering.

IV. RECOGNITION

Cluster recognition is the last step of the described pipeline and aims at matching a selected cluster with an entry in a dataset of known models. The dataset consists of a variable number of views of each object, taken from pseudo-random points of view, as shown in Figure 5. Each model is obtained by accumulating points from multiple frames in order to fill the gaps in the cloud produced by stereo vision. Then a voxel grid filter is applied to achieve a uniformly-sampled point cloud.

Fig. 5. Multiple models obtained from different points of view for an example object.

The recognition algorithm is based on point cloud alignment. The point cloud of the i-th model, P_i^mod, and that of the current object, P^obj, need to be registered, i.e. aligned in 3D space, in order to be compared. The registration procedure computes the rigid geometric transformation that should be applied to P_i^mod to align it to P^obj. Registration is performed in three steps:

• Remove the dependency on an external reference frame: P_i^mod and P^obj are initially expressed in the reference frames of their respective centroids.

• Perform an initial alignment: the algorithm estimates an initial, sub-optimal alignment between the point clouds. This step is performed with the assistance of a RANSAC method that uses FPFH descriptors as parameters for the consensus function. The FPFH computation is performed using different search radii.

• Refine the alignment: the initial alignment is then refined with an ICP algorithm that minimizes the mean square distance between points.

The procedure is detailed in Algorithm 1 and an example result is shown in Figure 6.

Algorithm 1: Registration procedure
  Data:
    P_i^mod: point cloud of the i-th model;
    P^obj: point cloud of the object to be recognized;
    R: set of search radii for the FPFH feature computation;
  Result:
    P_i,aligned^mod: aligned point cloud of the model;
  1  P_c^obj ← shiftToCentroid(P^obj);
  2  P_i,c^mod ← shiftToCentroid(P_i^mod);
  3  P_i,sac^mod ← ∅;
  4  foreach r ∈ R do
  5      F_o ← computeFPFH(P_c^obj, r);
  6      F_m ← computeFPFH(P_i,c^mod, r);
  7      P_i,sac^mod,r ← getRANSACAlignment(P_c^obj, F_o, P_i,c^mod, F_m);
  8      if getFitness(P_i,sac^mod,r) > getFitness(P_i,sac^mod) then
  9          P_i,sac^mod ← P_i,sac^mod,r;
  10     end
  11 end
  12 P_i,aligned^mod ← getICPAlignment(P_i,sac^mod, P^obj);

Fig. 6. Alignment of a model after RANSAC (blue) and after ICP (green) to the object (red).

Recognition is then performed by computing a fitness value that evaluates the overall quality of the alignment between P_i,aligned^mod and P^obj. For each point of P^obj, the algorithm computes the squared distance to the nearest point of P_i,aligned^mod and retrieves the percentage of points whose distance is below a fixed threshold δ_th:

    Q = { p_i ∈ P^obj : ||p_j − p_i||² ≤ δ_th, p_j ∈ P_i,aligned^mod }

    fitness(P^obj, P_i,aligned^mod) = |Q| / |P^obj|                    (1)

The maximum fitness, equal to 100%, is obtained when all points of P^obj have a neighbour in P_i,aligned^mod within δ_th. The algorithm is iterated over each model in the dataset and returns the model with the highest fitness, as shown in Algorithm 2.

Algorithm 2: Overall recognition procedure
  Data:
    P^mod[·]: list of point cloud models;
    P^obj: point cloud of the object to be recognized;
  Result:
    name: name of the recognized object;
  1  F_max ← 0;
  2  foreach P_i^mod ∈ P^mod[·] do
  3      P_i,aligned^mod ← performRegistration(P_i^mod, P^obj);
  4      F_i ← getFitness(P^obj, P_i,aligned^mod);
  5      if F_i > F_max then
  6          F_max ← F_i;
  7          name ← name of P_i^mod;
  8      end
  9  end
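The fitness of Eq. (1) reduces to a nearest-neighbour test per object point. A brute-force sketch assuming numpy (N, 3) arrays; PCL would use a k-d tree for the search, and note that the threshold δ_th is compared against the squared distance, as in the equation:

```python
import numpy as np

def fitness(obj_points, model_points, delta_th):
    """Fraction of object points whose squared distance to the nearest
    aligned-model point is at most delta_th (Eq. 1); returns a value in [0, 1]."""
    # Pairwise squared distances: |obj| x |model| matrix (brute force;
    # a k-d tree would be preferable for large clouds).
    diff = obj_points[:, None, :] - model_points[None, :, :]
    sq_dists = np.einsum('ijk,ijk->ij', diff, diff)
    nearest_sq = sq_dists.min(axis=1)       # nearest model point per object point
    return float(np.mean(nearest_sq <= delta_th))
```

A perfectly overlapping pair of clouds yields fitness 1.0, while points of the object with no aligned-model point nearby lower the score proportionally.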
V. RESULTS

This section presents the experiments performed to assess the performance of the recognition algorithm illustrated in the previous section. The results show the performance afforded by a low-cost stereo system in 3D object recognition. The object extraction and recognition modules have been implemented as separate components using the ROS framework. The experimental setup consists of the stereo vision system described in this paper with one of the candidate objects placed in front of the sensor. The experiments are designed to assess the object recognition algorithm, in particular when only a partial point cloud of the object is available due to noisy segmentation and occlusions. The test set consists of a fixed sequence of 2241 object point clouds taken from random viewpoints. The dataset consists of 61 models representing the 8 objects in Figure 7 (8 views per object on average). Table II shows the confusion matrix obtained without imposing a threshold on fitness.
TABLE II. CONFUSION MATRIX FOR EACH CATEGORY WITHOUT OCCLUSION (rows: true object; columns: recognition result).

                    horse starlet  horse  baby  big detergent  fire  woolite  chocolate  hammer
    horse starlet   144            2      0     0              0     0        0          0
    horse           0              111    2     0              0     0        0          0
    baby            1              0      128   2              0     0        0          0
    big detergent   0              1      1     60             0     0        0          14
    fire            5              3      3     0              111   1        0          0
    woolite         2              0      6     8              0     116      0          0
    chocolate       3              4      9     3              0     23       67         0
    hammer          14             2      3     10             0     6        2          132
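Per-class true positive rates and the overall recognition rate follow directly from Table II, reading rows as the true object and columns as the recognition result:

```python
labels = ["horse starlet", "horse", "baby", "big detergent",
          "fire", "woolite", "chocolate", "hammer"]
# Confusion matrix from Table II (row = true object, column = result).
confusion = [
    [144, 2, 0, 0, 0, 0, 0, 0],
    [0, 111, 2, 0, 0, 0, 0, 0],
    [1, 0, 128, 2, 0, 0, 0, 0],
    [0, 1, 1, 60, 0, 0, 0, 14],
    [5, 3, 3, 0, 111, 1, 0, 0],
    [2, 0, 6, 8, 0, 116, 0, 0],
    [3, 4, 9, 3, 0, 23, 67, 0],
    [14, 2, 3, 10, 0, 6, 2, 132],
]

# Per-class true positive rate: diagonal entry over row sum.
tpr = {lab: row[i] / sum(row)
       for i, (lab, row) in enumerate(zip(labels, confusion))}
# Overall recognition rate: trace over total count.
overall = sum(row[i] for i, row in enumerate(confusion)) / sum(map(sum, confusion))
```

The diagonal dominates every row, and the trace-over-total ratio lands above 0.8, consistent with the recognition rate reported in the text.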

The classification results show that, even without a threshold on fitness to detect true negatives, correct matches largely prevail. The articulated objects (horse, horse starlet, baby and fire) are recognized best.

Fig. 7. Set of objects for the recognition experiment.

The next test series takes into account parameters of the algorithm such as the number of model views in the dataset and the search radii used to compute the FPFH features (the trials are called test01, test02, etc. in Table III). The true positive and false positive rates for the different trials are shown in Figure 8. The experimental results show the importance of including dataset models taken from multiple viewpoints. Keeping all other parameters fixed while decreasing the dataset size, the percentage of true positives decreases (see test01, test02 and test04). On the other hand, the recognition rate is only slightly affected by restricting the set of search radii used to compute the FPFH features when the full dataset of model views is available (compare test01, test03 and test05 in Figure 8). The Receiver Operating Characteristic (ROC) curves in Figure 9 depict the performance of the classifier as its discrimination threshold is varied. To summarize, the webcam-based stereo system together with the recognition algorithm has shown good performance, with a true positive rate above 80% provided that sufficient viewpoint models are available.

TABLE III. DIFFERENT TUNINGS OF THE ALGORITHM PARAMETERS.

    Test      Model views in dataset    FPFH search radii [mm]
    test01    61                        3, 5, 10, 20, 30, 50
    test02    32                        3, 5, 10, 20, 30, 50
    test03    61                        3, 10, 30
    test04    24                        3, 5, 10, 20, 30, 50
    test05    61                        5, 15
    test06    32                        5, 15
    test07    32                        3, 10, 30

Fig. 8. True and false positive rates for the tests in Table III.

Fig. 9. ROC curves for tests test01 to test07.

We have then evaluated the recognition algorithm with partial and occluded objects. In order to obtain comparable results, occlusions have been generated artificially with a random procedure. The occlusion generator processes the original test set and, for each view, chooses a random point in the cloud and removes all points within a random radius. In this way it generates a new synthetically occluded test set, with occlusion measured as the percentage of removed points. Six different tests have been performed with increasing occlusions, from 10% to 70% of the object surface perceived from the current viewpoint.
TABLE IV. EXPERIMENTAL RESULTS FOR THE TEST SETS WITH OCCLUSIONS, WITH RECOGNITION PARAMETERS AS IN test05 (TABLE III).

                      occl10to20    occl20to30    occl30to40
    Occlusions [%]    10-20         20-30         30-40
    True Pos. [%]     76            70            60
    False Pos. [%]    24            30            40

The recognition results are shown in Table IV. The recognition algorithm still exhibits good performance with occlusions up to 30%, with true positive rates above 70%. Performance decreases rapidly with occlusions up to 40% and then collapses as the percentage of occluded points increases further. Figure 10 shows the ROC curves for all tests with occlusions and for a reference test without them. Performance with occlusions up to 30% is consistent with the reference test.

Fig. 10. ROC curves for tests with occlusions.

VI. CONCLUSION

In this paper, we have illustrated a low-cost stereo vision system and its application to object recognition. The hardware consists of a pair of consumer-market cameras mounted on a rigid bar and costs less than 80 Euros. These cameras lack hardware-synchronized trigger signals and do not allow optics customization. In spite of such limitations, the point cloud obtained using the ROS packages for acquisition, calibration and disparity image computation is sufficiently accurate for the given task. The point cloud cluster containing the object to be recognized is identified under the hypothesis that the object lies on a planar surface and inside a given bounded region. The recognition algorithm is based on the extraction and comparison of FPFH features and is robust to partial views and to occlusions. Each candidate object is compared with the models contained in a dataset defined a priori. Experiments have been performed to assess the performance of the algorithm and have shown an overall recognition rate above 80%. The effect of occlusion on the recognition rate has been assessed by showing that recognition performance is only slightly affected even when occlusion removes up to 30% of the object surface perceived from the current viewpoint.

In the system described in this work, the ROI is fixed and a single object is assumed to lie in the scene. We are currently working on an object detection algorithm dealing with less restrictive assumptions about the objects in the scene and the region of interest.

VII. ACKNOWLEDGEMENTS

We thank Elettric80 S.p.a. - Viano (Italy), for supporting this work.

REFERENCES

[1] J. Aleotti, D. Lodi Rizzini, and S. Caselli, "Object Categorization and Grasping by Parts from Range Scan Data," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2012, pp. 4190-4196.
[2] K. Konolige, "Projected texture stereo," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2010.
[3] M. Andersen, T. Jensen, P. Lisouski, A. Mortensen, M. Hansen, T. Gregersen, and P. Ahrendt, "Kinect depth sensor evaluation for computer vision applications," Technical report ECETR-6, Department of Engineering, Aarhus University (Denmark), 2012.
[4] D. Um, D. Ryu, and M. Kal, "Multiple intensity differentiation for 3-D surface reconstruction with mono-vision infrared proximity array sensor," IEEE Sensors Journal, vol. 11, no. 12, pp. 3352-3358, 2011.
[5] G. Burel and H. Hénocq, "Three-dimensional invariants and their application to object recognition," Signal Process., vol. 45, no. 1, pp. 1-22, 1995.
[6] A. Johnson, "Spin-images: A representation for 3-D surface matching," Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, August 1997.
[7] T. Gatzke, C. Grimm, M. Garland, and S. Zelinka, "Curvature maps for local shape comparison," in Proc. of Int. Conf. on Shape Modeling and Applications (SMI), 2005, pp. 246-255.
[8] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[9] B. Steder, R. B. Rusu, K. Konolige, and W. Burgard, "Point feature extraction on 3D range scans taking into account object boundaries," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2011.
[10] R. Rusu, N. Blodow, and M. Beetz, "Fast Point Feature Histograms (FPFH) for 3D registration," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2009, pp. 3212-3217.
[11] A. Aldoma, Z. Marton, F. Tombari, W. Wohlkinger, C. Potthast, B. Zeisl, R. Rusu, S. Gedikli, and M. Vincze, "Tutorial: Point cloud library: Three-dimensional object recognition and 6 DOF pose estimation," IEEE Robotics & Automation Magazine, vol. 19, no. 3, pp. 80-91, Sept. 2012.
[12] H. Kato and M. Billinghurst, "Marker tracking and HMD calibration for a video-based augmented reality conferencing system," in Proc. of the Int. Workshop on Augmented Reality, 1999.
[13] R. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), Shanghai, China, May 9-13 2011.