
2018 24th International Conference on Pattern Recognition (ICPR)

Beijing, China, August 20-24, 2018

Fast and Robust Pose Estimation Algorithm for Bin Picking Using Point Pair Feature
Mingyu Li and Koichi Hashimoto
Graduate School of Information Sciences
Tohoku University
6-6-01, Aobaku, Sendai 980-8579, Japan
Email: [email protected], [email protected]

Abstract—Bin picking refers to picking up objects randomly piled in a container (bin), and robotic bin picking is widely used to improve industrial production efficiency. A pose estimation algorithm is necessary to tell the robot the poses of the objects. This paper proposes a pose estimation algorithm for bin picking using 3D point cloud data. The Point Pair Feature algorithm is performed in a fast way to propose possible poses, the proposed poses are verified by a voxel-based verification method, and Iterative Closest Point is used to refine the result poses. Our algorithm proves more accurate and faster than the Curve Set Feature algorithm and the Point Pair Feature algorithm, is robust to occlusion, and can detect multiple poses in one scene.
I. INTRODUCTION
A bin picking system typically consists of three components: a sensor above the objects, a processor, and a robot arm. The sensor takes 2D or 3D data of the objects and sends it to the processor. The processor computes the position and orientation (pose) of the objects and decides the picking point and picking path. Finally, the robot arm picks up the object and places it at the specified position. As 3D sensors become cost effective, pose estimation algorithms using 3D point cloud data have been developed in recent years[1][2].

3D keypoint descriptors are capable of estimating 6-degree-of-freedom poses, for example CVFH (Clustered Viewpoint Feature Histogram)[3] and OUR-CVFH (Oriented, Unique and Repeatable Clustered Viewpoint Feature Histogram)[2]. Compared with some template matching algorithms[4][5], which need both color and depth information, these algorithms can perform estimation with point cloud data alone. But they assume that the objects can be segmented from the background. In bin picking tasks, many objects are piled together and it is difficult to segment every object perfectly; as shown in [6], the recognition rate is poor when the segmentation fails.

Drost et al. proposed the Point Pair Feature (PPF) algorithm in [1]. The algorithm computes a four-dimensional feature of two points and matches the features with an efficient voting scheme. It has proved valid for many objects, does not depend on segmentation, and can handle sparse point cloud data. Many algorithms have been proposed based on PPF. Birdal et al.[8] and Hinterstoisser et al.[9] improved the PPF algorithm for daily objects in cluttered scenes. Wu et al.[7] performed robotic bin picking using the PPF algorithm and achieved a recognition rate of 93.9%. Choi et al. introduced boundary points with directions and boundary line segments into the algorithm for estimating planar industrial objects[10], and their algorithm achieved a higher recognition rate and faster speed than PPF. However, PPF is also criticized as inefficient for objects that lack curvature changes.

Li et al. proposed the Curve Set Feature algorithm for bin picking[6]. The algorithm computes the surface fluctuation of the objects, matches scene features with model features using nearest neighbor search[11], and selects poses with a fast pose verification method. The algorithm was proved to be accurate and efficient for bin picking tasks, but not robust enough against occlusion.

In this paper, we propose a point pair feature-based pose estimation algorithm for bin picking. Normals are computed in a more precise way to improve the accuracy of the algorithm. Then the point pair features of scene points are computed and matched with model points to compute pose candidates. Different from the original PPF algorithm of [1], only a small part of the point pairs is computed, to improve efficiency. A pose verification method is performed to verify all the poses, and a multiple selection method is used to select result poses without repetition. Finally, the Iterative Closest Point (ICP)[12] algorithm is applied to improve precision.

The rest of the paper is organized as follows: Section II describes the original PPF algorithm[1], since our algorithm is based on it. Section III introduces the pipeline of our pose estimation algorithm. Section IV presents the experiments that examine the algorithm, and Section V gives the conclusion.

II. ORIGINAL PPF ALGORITHM

In this paper, we denote si ∈ {S} for points in the scene cloud, mi ∈ {M} for points in the model cloud, n(mi) for the normal of mi, diam(M) for the model diameter, and Nm for the number of model points.

To reduce the feature number, the input clouds (scene cloud and model cloud) are subsampled with the same spacing d_dist = τd · diam(M), where τd is the subsampling rate. To avoid confusion, in this paper we call the clouds before subsampling the original clouds.

A. Model Description

The point pair features between every two model points are computed by Equation 1:



F(mi, mj) = (f1, f2, f3, f4)
          = ( ||d||_2, ∠(n(mi), d), ∠(n(mj), d), ∠(n(mi), n(mj)) )     (1)

where d represents the vector from mi to mj and ∠(a, b) represents the angle between the vectors a and b. In the point pair, the first point mi is called the reference point while mj is called the referred point.

After subsampling and normal estimation, the point pair features of every two model points are discretized and stored in a hash table for fast lookup later.
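For concreteness, the sketch below shows one way to compute the feature of Equation 1 and to discretize it into a hash key. The bin widths dist_step and angle_step, the Vec3 helpers, and the key-packing scheme are our own assumptions; the paper does not prescribe these details.

```cpp
// Point pair feature of Equation 1 and its discretized hash key (a sketch).
#include <cmath>
#include <cstdint>

struct Vec3 { double x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static double norm(const Vec3& a) { return std::sqrt(dot(a, a)); }

// Angle between two vectors, in [0, pi].
static double angle(const Vec3& a, const Vec3& b) {
    double c = dot(a, b) / (norm(a) * norm(b));
    c = std::fmax(-1.0, std::fmin(1.0, c));        // clamp against rounding
    return std::acos(c);
}

struct PPF { double f1, f2, f3, f4; };             // ||d||_2 and the three angles

PPF computeFeature(const Vec3& mi, const Vec3& ni, const Vec3& mj, const Vec3& nj) {
    Vec3 d = sub(mj, mi);                          // vector from reference point mi to mj
    return { norm(d), angle(ni, d), angle(nj, d), angle(ni, nj) };
}

// Discretize the feature so that similar point pairs land in the same hash bin.
uint64_t hashKey(const PPF& f, double dist_step, double angle_step) {
    uint64_t k1 = static_cast<uint64_t>(f.f1 / dist_step);
    uint64_t k2 = static_cast<uint64_t>(f.f2 / angle_step);
    uint64_t k3 = static_cast<uint64_t>(f.f3 / angle_step);
    uint64_t k4 = static_cast<uint64_t>(f.f4 / angle_step);
    return (k1 << 48) | (k2 << 32) | (k3 << 16) | k4;   // pack the four bin indices
}
```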
B. Voting Scheme

Consider a scene point pair (si, sj) and its point pair feature F(si, sj). F(si, sj) is searched in the hash table, and a model point pair (mi, mj) sharing a similar feature is found. In order to match these two point pairs, they are transformed so that the reference points si and mi are at the origin and their normals lie along the x axis; the transformation matrices are Tm and Ts. Then the model point pair is rotated by an angle α around the x axis to match the scene pair; the rotation matrix is Rx(α). The pair (mi, α) is called a local coordinate. The transformation from the model pair to the scene pair can be described by:

si = Ts^(-1) Rx(α) Tm mi     (2)
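As an illustration of Equation 2, the sketch below builds the aligning transforms with the Eigen library (a choice we assume; the paper does not name one). alignToX produces a transform of the kind used for Tm and Ts, and mapModelToScene applies Ts^(-1) Rx(α) Tm to a model point.

```cpp
// Alignment transforms of Equation 2 (a sketch using Eigen).
#include <Eigen/Geometry>

// Rigid transform that moves point p to the origin with its normal n along +x.
Eigen::Isometry3d alignToX(const Eigen::Vector3d& p, const Eigen::Vector3d& n) {
    // Rotation mapping n onto the x axis (unique up to a roll about x).
    Eigen::Quaterniond q = Eigen::Quaterniond::FromTwoVectors(n, Eigen::Vector3d::UnitX());
    Eigen::Isometry3d t = Eigen::Isometry3d::Identity();
    t.linear() = q.toRotationMatrix();
    t.translation() = -(t.linear() * p);           // rotate first, then shift to the origin
    return t;
}

// si = Ts^(-1) Rx(alpha) Tm mi, as in Equation 2.
Eigen::Vector3d mapModelToScene(const Eigen::Isometry3d& Ts, const Eigen::Isometry3d& Tm,
                                double alpha, const Eigen::Vector3d& mi) {
    Eigen::Isometry3d rx = Eigen::Isometry3d::Identity();
    rx.rotate(Eigen::AngleAxisd(alpha, Eigen::Vector3d::UnitX()));
    return Ts.inverse() * rx * Tm * mi;
}
```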
Usually, 20% of the scene points are used as reference points. For a reference scene point si, the above matching process is performed with all other scene points. Every time a local coordinate is computed, a vote is cast for it, and a 2D accumulator counts the votes of every local coordinate. The local coordinates that receive a certain number of votes relative to the one with the most votes are saved.
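A minimal sketch of such a 2D accumulator is given below; the number of angle bins and the helper names are our assumptions rather than values from the paper.

```cpp
// 2D voting accumulator: rows index model reference points, columns index
// the discretized rotation angle alpha (a sketch).
#include <utility>
#include <vector>

constexpr int kNumAngleBins = 30;                  // assumed discretization of alpha
constexpr double kPi = 3.14159265358979323846;

struct Accumulator {
    std::vector<std::vector<int>> votes;           // votes[model point][alpha bin]
    explicit Accumulator(int numModelPoints)
        : votes(numModelPoints, std::vector<int>(kNumAngleBins, 0)) {}

    void castVote(int modelIndex, double alpha) {  // alpha assumed in [-pi, pi)
        int bin = static_cast<int>((alpha + kPi) / (2.0 * kPi) * kNumAngleBins);
        if (bin >= kNumAngleBins) bin = kNumAngleBins - 1;
        ++votes[modelIndex][bin];
    }

    // Local coordinate (model point index, alpha bin) with the most votes.
    std::pair<int, int> best() const {
        int bi = 0, bj = 0, bv = -1;
        for (std::size_t i = 0; i < votes.size(); ++i)
            for (int j = 0; j < kNumAngleBins; ++j)
                if (votes[i][j] > bv) { bv = votes[i][j]; bi = static_cast<int>(i); bj = j; }
        return {bi, bj};
    }
};
```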
This voting scheme has proved efficient for many objects. If the object has sufficient curvature change, the model features are distributed across the hash table and the number of features in each hash bin is not very large. However, industrial objects usually consist of planes and cylinders, so the features of many model pairs are similar and the numbers of point pairs in some hash bins are much larger. When scene point pairs are searched in these bins, there are more model pairs to match with the scene pair. Therefore, it usually takes more time to estimate industrial objects with the PPF algorithm than non-industrial objects.

C. Pose Clustering

The local coordinates from the last step are converted into poses according to Equation 2. The poses are clustered so that similar poses are in the same cluster. The score of a cluster is the sum of the votes of the poses in it. The average pose of the cluster with the highest score is selected as the final result.

III. OUR ALGORITHM

Our algorithm consists of four steps: (i) build the model point pair feature hash table, as introduced in Section II-A; (ii) perform the voting scheme using fewer scene features to propose possible poses (Section III-B); (iii) verify the poses with the voxel-based verification method (Section III-C); and (iv) select multiple nonredundant poses (Section III-D).

A. Normal Estimation

Normals are very important in the algorithm. The normal of a point is computed by fitting a plane to some neighbouring points. Since the clouds (model cloud and scene cloud) are subsampled, the normals computed from them are not accurate enough. Therefore, when computing n(mi) and n(si), we use the neighbouring points of mi and si in the original clouds instead of the subsampled clouds.

We show the necessity of computing normals in this way in Figure 1. For a synthetic scene, we subsampled the model cloud and scene cloud with different subsampling rates τd and estimated the normals using the subsampled clouds. We then transformed the model cloud into scene space according to the transformation matrices and computed the angles between scene normals and corresponding model normals. The average angle error against the subsampling rate τd is shown in the figure: it increases from 9° to 22° as τd grows, and this error lowers the accuracy of the algorithm.

Fig. 1. A synthetic scene cloud and model cloud were subsampled with different subsampling rates and their normals were estimated. The average angle error between the normals of corresponding points is plotted against the subsampling rate.
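One common way to implement this plane fit, consistent with the description above, is a principal component analysis of the neighbourhood: the normal is the eigenvector for the smallest eigenvalue of the covariance of the neighbouring points, here taken from the original (dense) cloud. The Eigen-based sketch below assumes the neighbour search has already been performed.

```cpp
// Normal of a subsampled point from its neighbours in the ORIGINAL cloud (a sketch).
#include <Eigen/Dense>
#include <vector>

Eigen::Vector3d estimateNormal(const std::vector<Eigen::Vector3d>& denseNeighbours) {
    Eigen::Vector3d mean = Eigen::Vector3d::Zero();
    for (const auto& p : denseNeighbours) mean += p;
    mean /= static_cast<double>(denseNeighbours.size());

    Eigen::Matrix3d cov = Eigen::Matrix3d::Zero();
    for (const auto& p : denseNeighbours) {
        Eigen::Vector3d d = p - mean;
        cov += d * d.transpose();                  // accumulate the scatter matrix
    }
    // The plane normal is the eigenvector of the smallest eigenvalue;
    // SelfAdjointEigenSolver returns eigenvalues in ascending order.
    Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> es(cov);
    return es.eigenvectors().col(0);
}
```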
B. Compute Pose Candidates

We use the framework of the original PPF algorithm to propose possible poses. As in Section II-A, the hash table is built with every two model points. During the voting scheme, different from the original PPF algorithm, for every reference scene point si we only compute features with a part of the scene points (usually 20%) to improve efficiency. The top Np local coordinates (poses) are selected for the next stage. We do not perform pose clustering.
C. Pose Verification

We improved the voxel-based pose verification method of [6] to make it more efficient for bin picking tasks. Before verification, the scene space is divided into many small cubic voxels with edge length L_voxel. We denote Voxel(si) as the voxel that si lies in. The key to access the voxel is computed by Equation 3:

x_int = floor( (x_si − min_x) / L_voxel )
y_int = floor( (y_si − min_y) / L_voxel )
z_int = floor( (z_si − min_z) / L_voxel )
Voxel(si) = voxel[x_int][y_int][z_int]     (3)

where (x_si, y_si, z_si) is the 3D coordinate of si in scene space and min_x, min_y, min_z are the minimum coordinate components of the scene cloud.
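A minimal sketch of this voxel lookup, assuming a hash map with packed integer keys rather than a dense 3D array (the paper does not specify the container), is given below.

```cpp
// Voxel lookup of Equation 3 (a sketch; the container choice is an assumption).
#include <cmath>
#include <cstdint>
#include <unordered_map>

struct Point { double x, y, z; };

struct VoxelGrid {
    double lVoxel;                                 // voxel edge length L_voxel
    double minX, minY, minZ;                       // minimum scene coordinates
    std::unordered_map<uint64_t, int> cells;       // voxel key -> scene point index

    uint64_t key(const Point& p) const {
        auto xi = static_cast<uint64_t>(std::floor((p.x - minX) / lVoxel));
        auto yi = static_cast<uint64_t>(std::floor((p.y - minY) / lVoxel));
        auto zi = static_cast<uint64_t>(std::floor((p.z - minZ) / lVoxel));
        return (xi << 42) | (yi << 21) | zi;       // pack three 21-bit indices
    }

    // Plays the role of Voxel(si): -1 means no scene point in this voxel.
    int lookup(const Point& p) const {
        auto it = cells.find(key(p));
        return it == cells.end() ? -1 : it->second;
    }
};
```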
At first, every voxel is initialized to -1, meaning that no scene point lies within it. Then, for every scene point si, Voxel(si) is found by Equation 3, and the values of this voxel and its neighbouring voxels are set to i, the index of the scene point; this guarantees that the distance between si and any point within these voxels is less than a constant.

To verify a pose Pk, the model points are transformed into scene space according to Pk. For every transformed model point mi, if Voxel(mi) is not -1, there is a scene point near mi and mi is a fitted model point, as shown in Figure 2.

Fig. 2. An example of searching corresponding scene points[6]. (a) The scene space is divided into cubic voxels. The values of the white voxels are -1, which means there are no scene points near them; the values of the yellow voxels are p and those of the green voxels are q. (b) Transformed model points mi and mj are located in different voxels. Voxel(mj) = -1, so mj does not have a corresponding scene point; Voxel(mi) = p, so the corresponding scene point of mi is sp.
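Under the same assumptions, verifying one pose reduces to a single grid lookup per transformed model point; the Pose type and transformPoint helper below are illustrative.

```cpp
// Verification of one pose candidate (a sketch; Point and VoxelGrid as above).
#include <vector>

struct Pose { double R[3][3]; double t[3]; };      // rotation and translation

Point transformPoint(const Pose& p, const Point& m) {
    return { p.R[0][0] * m.x + p.R[0][1] * m.y + p.R[0][2] * m.z + p.t[0],
             p.R[1][0] * m.x + p.R[1][1] * m.y + p.R[1][2] * m.z + p.t[1],
             p.R[2][0] * m.x + p.R[2][1] * m.y + p.R[2][2] * m.z + p.t[2] };
}

// Transforms the model by Pk, records corresponding scene points in SS(Pk),
// and returns the initial score (all Exist(si) still equal 1).
int verifyPose(const Pose& pk, const std::vector<Point>& modelPoints,
               const VoxelGrid& grid, std::vector<int>& ssPk) {
    ssPk.clear();
    for (const Point& m : modelPoints) {
        int sceneIdx = grid.lookup(transformPoint(pk, m));
        if (sceneIdx != -1) ssPk.push_back(sceneIdx);   // m is a fitted model point
    }
    return static_cast<int>(ssPk.size());
}
```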
In [6], the score of the pose, Score(Pk), is the number of fitted model points. This method was proved to be fast and robust, but it is not efficient enough when multiple detections are necessary, as will be shown in Section IV-A. In our algorithm, a binary value Exist(si) = 1 is initialized for every scene point si, and for every pose Pk a scene point set SS(Pk) is initialized to store the corresponding scene point of every fitted model point. Score(Pk) is decided by Equation 4:

Score(Pk) = Σ_{n=1}^{size(SS(Pk))} Exist(SS(Pk)_n)     (4)

which is the number of scene points in SS(Pk) whose binary values are 1.

After all pose candidates are verified, the first result pose is the one with the highest score.

In our method, the pose verification takes the place of the pose clustering of the original PPF method. The advantage is that as long as a good pose is proposed by the voting scheme, the verification gives it a high score no matter how many votes it received. Therefore, by evaluating a large number of pose candidates, we will always obtain some good results.

D. Multiple Selection

In bin picking, it is necessary to detect multiple result poses. One method is to set a distance threshold d_thres: suppose a result pose P0 has been selected; the pose P1 with the highest score among the poses whose model centers are farther than d_thres from that of P0 becomes the next result pose. But it is difficult to find a d_thres that removes repetitive poses and, at the same time, preserves all correct poses.

In order to select another pose P1, Li et al.[6] deleted the scene points belonging to P0, verified the poses again from top to bottom, and selected the top pose. This method avoids selecting repetitive poses. However, if the detection number is large, the poses must be verified many times, which dramatically increases computation time.

In our method, in order to delete the scene points belonging to the old pose P0 and reverify the poses efficiently, the binary value Exist(si) of every scene point is utilized. The score of every pose Pk before reverification is denoted as Score_Old(Pk). We first search the scene points belonging to P0 and change their binary values to 0. For a pose Pk to reverify, we check the binary values of its stored corresponding scene points; the new score of Pk, denoted as Score_New(Pk), is the number of corresponding scene points whose binary values are still 1, as in Equation 4. With this method we do not need to transform the model points and search the scene points again; we just check some binary values, which improves the speed of reverification. We call this refreshing the poses.
However, it still takes some time to refresh all the poses to select a new pose. Therefore, every time a new pose is selected, the other poses are ranked based on their scores, and the poses are refreshed from the top P1 downward. After Pi is refreshed, suppose that among the i poses P1, P2, ..., Pi the pose with the highest new score is Pj (1 ≤ j ≤ i). If Score_New(Pj) ≥ Score_Old(Pi+1), the remaining poses cannot possess a higher new score than Score_New(Pj), because the new score of a pose cannot be larger than its old score. In this case, Pj is selected as the next result pose. The selection process is presented in Algorithm 1.

Algorithm 1: Multiple selection
Data: all pose candidates P, first result pose P0
Result: next result pose Pnew
rank P based on their old scores;
for k = 1 to Nm do
    mtk = P0 · mk;
    if Voxel(mtk) != -1 then
        Exist(Voxel(mtk)) = 0;
    end
end
Max_New_Score = 0;
for i = 1 to size(P) do
    Score_New(Pi) = Σ_{n=1}^{size(SS(Pi))} Exist(SS(Pi)_n);
    if Score_New(Pi) > Max_New_Score then
        Max_New_Score = Score_New(Pi);
        Best_Pose = Pi;
    end
    if Max_New_Score ≥ Score_Old(Pi+1) then
        Pnew = Best_Pose;
        break;
    end
end

Fig. 3. Models used in the experiment. (a) Gear; (b) L1 part; (c) Magnet; (d) L2 part; (e) Switch; (f) Bulge.

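A C++ rendering of Algorithm 1, under the data layout assumed in the earlier sketches (per-point Exist flags, SS(Pk) stored as index lists, poses pre-sorted by Score_Old in descending order, and the scene points claimed by P0 precomputed), might look like this.

```cpp
// Multiple selection with lazy refreshing (a sketch of Algorithm 1).
#include <vector>

int selectNextPose(const std::vector<std::vector<int>>& ss,   // SS(Pk) for every pose
                   const std::vector<int>& scoreOld,          // Score_Old, sorted descending
                   std::vector<char>& exist,                  // Exist(si) flags
                   const std::vector<int>& pointsOfP0) {      // scene points claimed by P0
    for (int idx : pointsOfP0) exist[idx] = 0;                // delete the points of P0

    int bestPose = -1, maxNewScore = 0;
    for (std::size_t i = 0; i < ss.size(); ++i) {
        int scoreNew = 0;                                     // Equation 4 on refreshed flags
        for (int idx : ss[i]) scoreNew += exist[idx];
        if (scoreNew > maxNewScore) { maxNewScore = scoreNew; bestPose = static_cast<int>(i); }
        // Early exit: a new score never exceeds the old score, and old scores
        // are sorted, so later poses cannot beat maxNewScore.
        if (i + 1 >= ss.size() || maxNewScore >= scoreOld[i + 1]) break;
    }
    return bestPose;                                          // next result pose, or -1
}
```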
The selected result poses are refined by ICP. Instead of using the whole model cloud as the source cloud, we compute the visible points based on the camera viewpoint and the poses. This improves the speed of the algorithm, because fewer points are computed, and at the same time improves the precision.
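The paper does not detail how the visible points are computed; one simple approximation consistent with the description is to keep the model points whose normals face the camera, as in the hedged sketch below.

```cpp
// Back-face culling as a crude visibility test (our assumption, not the
// paper's stated method).
#include <Eigen/Dense>
#include <vector>

std::vector<int> visiblePointIndices(const std::vector<Eigen::Vector3d>& points,
                                     const std::vector<Eigen::Vector3d>& normals,
                                     const Eigen::Vector3d& cameraPos) {
    std::vector<int> visible;
    for (std::size_t i = 0; i < points.size(); ++i) {
        Eigen::Vector3d toCamera = (cameraPos - points[i]).normalized();
        if (normals[i].dot(toCamera) > 0.0)        // normal faces the camera
            visible.push_back(static_cast<int>(i));
    }
    return visible;
}
```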

IV. EXPERIMENT

We evaluated our algorithm on synthetic and real scenes of six industrial objects, shown in Figure 3. We compared our algorithm, denoted as Proposed, with Curve Set Feature[6], denoted as CSF, and Drost PPF[1]. For Drost PPF, two sets of parameters were used, denoted as PPF1 and PPF5, respectively. The parameters are presented in Table I.

We used a structured-light projector-camera system as our 3D sensor. The resulting poses of all four algorithms were refined by ICP, and all given timings contain the whole process, including normal estimation, matching, and ICP refinement. The algorithms were implemented in C++ and run on an Intel Core i7-7820HQ CPU at 2.90 GHz with 32 GB RAM. A result pose was considered correct if its error was less than the specified threshold; in our experiment, the threshold was set to diam(M)/10 for the translation and 10° for the rotation. Repetitive poses were regarded as wrong poses.

TABLE I
PARAMETERS FOR DIFFERENT ALGORITHMS

Parameters            Proposed  PPF1  PPF5
subsampling rate τd   0.04      0.04  0.04
reference points      10%       100%  20%
referred points       20%       100%  100%

Fig. 4. Our detection results on synthetic scenes of (a) Gear and (b) L1 part. The gray part is the scene cloud with the triangle mesh, and the green points show the contour of the detection results. All of the resulting poses shown are correct. Our algorithm is able to detect multiple objects accurately in bin picking scenes.

A. Synthetic Scenes

Synthetic scenes, each containing multiple instances of the same object, were generated using the simulator in [13]. 50 synthetic scenes were generated for every model, with 20 objects in every scene. In the experiments, we tried to detect all the objects in every scene. The resulting recognition rates and speeds of the algorithms are presented in Table II and Table III: our method outperformed Curve Set Feature and Drost PPF in both recognition rate and speed. Some detection results are shown in Figure 4.

Experiments on how the multiple selection time varies with the detection number were conducted on our method and Curve Set Feature, and the result is presented in Figure 5. The detection time of Curve Set Feature increased quadratically, while that of our algorithm was always less than 1 ms.
TABLE II
RECOGNITION RATE OF THE ALGORITHMS ON SYNTHETIC SCENES

Models   Proposed  CSF    PPF1   PPF5
Gear     82.2%     85.1%  47.5%  38.9%
L1       98.3%     83.9%  94.5%  94.2%
L2       78.5%     68.0%  63.2%  60.2%
Magnet   85.7%     74.6%  57.0%  51.9%
Switch   84.3%     87.7%  61.5%  60.0%
Bulge    75.5%     70.6%  41.2%  41.2%
Average  84.1%     78.3%  60.8%  57.7%

TABLE III
SPEED OF THE ALGORITHMS ON SYNTHETIC SCENES (MS/OBJ)

Models         Proposed  CSF   PPF1   PPF5
Gear           47        191   1266   221
L1             200       184   4727   873
L2             86        142   1751   328
Magnet         83        249   2887   532
Switch         169       403   3461   655
Bulge          65        174   1144   245
Average        108       224   3047   476
Relative time  1.00      2.07  28.21  4.41

TABLE IV
RECOGNITION RATE OF THE ALGORITHMS ON REAL SCENES

Models (object number)  Proposed  CSF    PPF1   PPF5
Gear (250)              94.8%     89.2%  92.0%  85.6%
L1 (375)                79.7%     65.9%  29.1%  28.5%
L2 (355)                80.6%     47.9%  59.2%  59.4%
Magnet (250)            96.0%     78.4%  95.2%  94.4%
Switch (333)            83.8%     63.1%  70.9%  69.1%
Bulge (361)             74.5%     60.7%  41.8%  41.8%
Average                 84.9%     67.5%  64.7%  63.1%

TABLE V
SPEED OF THE ALGORITHMS ON REAL SCENES (MS/OBJ)

Models         Proposed  CSF   PPF1   PPF5
Gear           70        234   1818   376
L1             211       195   4271   851
L2             230       164   3858   788
Magnet         73        334   1834   393
Switch         168       169   3756   762
Bulge          133       157   1834   430
Average        156       201   3013   624
Relative time  1.00      1.29  19.31  4.00


B. Real Scenes

We then tested the algorithms on real scene data from our 3D sensor. For every model, 25 scenes were taken, with 10-15 objects in every scene. The ground truth poses of all the objects were annotated manually. The performance of the algorithms is presented in Table IV and Table V, and some results of our algorithm are shown in Figure 6. The object number of every model is given in Table IV, and the recognition rate with respect to the detection number in every scene is presented in Figure 7. As with the synthetic scenes, our algorithm achieved the best recognition rate and speed among the algorithms.

We are also interested in the robustness of our algorithm against occlusion. Following the occlusion definition of [14]:

Occlusion = 1 − (model surface area in the scene) / (total model surface area)     (5)

we computed how the recognition rate varies with occlusion; the result is shown in Figure 8. Clearly, our algorithm is more robust to occlusion than the other two algorithms.

Fig. 5. Multiple selection time against detection number in every scene for our algorithm and CSF. The time of CSF increases quickly with the detection number, while that of our algorithm is always less than 1 ms.

Fig. 6. Our detection results on real scenes of (a) Gear, (b) L2 part and (c) Magnet. The green points show the contour of the detection results. We projected the detection results in the point cloud onto the 2D image using the intrinsic matrix of the sensor to make the results easier to recognize; the 2D image information was not used in the algorithm.

Fig. 7. Recognition rate against detection number in every scene for the algorithms.

Fig. 8. Recognition rate against occlusion for real scenes. Our algorithm proved more robust to occlusion. The result is averaged over different objects; therefore, the recognition rate does not decrease monotonically as the occlusion increases.

V. CONCLUSION

This paper proposes a 6D pose estimation algorithm for bin picking tasks. We have shown that by estimating the normals precisely and combining the PPF algorithm with the improved pose verification method, our algorithm is robust to occlusion and is able to estimate the poses of multiple objects quickly and accurately.

ACKNOWLEDGEMENT

This work is partially supported by JSPS Grant-in-Aid 16H06536.

REFERENCES

[1] Drost, B.; Ulrich, M.; Navab, N.; Ilic, S. Model globally, match locally: Efficient and robust 3D object recognition. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13-18 June 2010; pp. 998-1005.
[2] Aldoma, A.; Tombari, F.; Rusu, R.B.; Vincze, M. OUR-CVFH: Oriented, Unique and Repeatable Clustered Viewpoint Feature Histogram for Object Recognition and 6DOF Pose Estimation. Pattern Recognition, 2012; pp. 113-122.
[3] Aldoma, A.; Vincze, M.; Blodow, N.; Gossow, D.; Gedikli, S.; Rusu, R.B.; Bradski, G. CAD-model recognition and 6DOF pose estimation using 3D cues. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6-13 November 2011; pp. 585-592.
[4] Hinterstoisser, S.; Holzer, S.; Cagniart, C.; Ilic, S.; Konolige, K.; Navab, N.; Lepetit, V. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6-13 November 2011; pp. 858-865.
[5] Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; Navab, N. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Proceedings of the Asian Conference on Computer Vision (ACCV 2012), Daejeon, Korea, 5-9 November 2012; pp. 548-562.
[6] Li, M.; Hashimoto, K. Curve Set Feature-Based Robust and Fast Pose Estimation Algorithm. Sensors 2017, 17, 1782.
[7] Wu, C.H.; Jiang, S.Y.; Song, K.T. CAD-based pose estimation for random bin-picking of multiple objects using a RGB-D camera. In Proceedings of the 2015 15th International Conference on Control, Automation and Systems (ICCAS), Busan, South Korea, 13-16 October 2015; pp. 1645-1649.
[8] Birdal, T.; Ilic, S. Point pair features based object detection and pose estimation revisited. In Proceedings of the 2015 International Conference on 3D Vision (3DV), Lyon, France, 19-22 October 2015; pp. 527-535.
[9] Hinterstoisser, S.; Lepetit, V.; Rajkumar, N.; Konolige, K. Going further with point pair features. In Proceedings of the European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 8-16 October 2016; pp. 834-848.
[10] Choi, C.; Taguchi, Y.; Tuzel, O.; Liu, M.Y. Voting-based pose estimation for robotic assembly using a 3D sensor. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation (ICRA), Saint Paul, MN, USA, 14-18 May 2012; pp. 1724-1731.
[11] Muja, M.; Lowe, D.G. Fast approximate nearest neighbors with automatic algorithm configuration. In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP'09), Lisboa, Portugal, 5-8 February 2009; pp. 331-340.
[12] Zhang, Z. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision 1994, 13(2), 119-152.
[13] Naoya, C.; Hashimoto, K. Development of Program for Generating Pointcloud of Bin Scene Using Physical Simulation and Perspective Camera Model. In Proceedings of the Robotics and Mechatronics Conference 2017 (ROBOMECH 2017), Fukushima, Japan, 10-12 May 2017; 2A2-O09.
[14] Johnson, A.E.; Hebert, M. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 1999, 21(5), 433-449.

