PhoCaL - A Multi-Modal Dataset For Category-Level Object Pose Estimation With Photometrically Challenging Objects
Pengyuan Wang∗1, HyunJun Jung∗1, Yitong Li1, Siyuan Shen1, Rahul Parthasarathy Srikanth1,
Lorenzo Garattoni2, Sven Meier2, Nassir Navab1, Benjamin Busam1
∗ Equal Contribution   1 Technical University of Munich   2 Toyota Motor Europe
[email protected]   [email protected]   [email protected]
Figure 1. PhoCaL comprises 60 high quality 3D models of household objects in 8 categories with different photometric complexity. The selected objects include challenging texture-less, occluded, symmetric, reflective and transparent objects. Our robot-supported pose annotation pipeline provides highly accurate 6D pose labels even for objects that are hard to capture with modern RGBD sensors. The figure shows RGB images, 3D bounding boxes and rendered Normalized Object Coordinate Space (NOCS) maps for 4 example scenes.
Abstract

Object pose estimation is crucial for robotic applications and augmented reality. Beyond instance-level 6D object pose estimation methods, estimating category-level pose and shape has become a promising trend. As such, a new research field needs to be supported by well-designed datasets. To provide a benchmark with high-quality ground truth annotations to the community, we introduce a multimodal dataset for category-level object pose estimation with photometrically challenging objects termed PhoCaL. PhoCaL comprises 60 high quality 3D models of household objects over 8 categories including highly reflective, transparent and symmetric objects. We developed a novel robot-supported multi-modal (RGB, depth, polarisation) data acquisition and annotation process. It ensures sub-millimeter accuracy of the pose for opaque textured, shiny and transparent objects, no motion blur and perfect camera synchronisation. To set a benchmark for our dataset, state-of-the-art RGB-D and monocular RGB methods are evaluated on the challenging scenes of PhoCaL.

1. Introduction

Vision systems interacting with their environment need to estimate the position and orientation of objects in space, which highlights why 6D object pose estimation is an important task for robotic applications. Even though there have been great advances in the field [6, 42], instance-level 6D pose methods require pre-scanned object models and support only a limited number of objects. Category-level object pose estimation [40] scales better to the needs of real operating environments. However, photometrically challenging objects such as shiny, e.g. metallic, and transparent, e.g. glass, objects are very common in our daily life and little work has been done to estimate their 6D poses with practical accuracy on a category level. The difficulty arises from two aspects: first, it is difficult to annotate 6D pose ground truth for photometrically challenging objects since no texture can be used to determine key points; second, commonly used depth sensors fail to return the correct depth information, as structured light and stereo methods often fail to correctly interpret reflection and refraction artefacts. As a consequence, RGB-D methods [25, 40] do not work reliably with photometrically challenging objects. We introduce PhoCaL, a class-level dataset of photometrically challenging objects with high-quality ground-truth annotations. The dataset provides multi-modal data such as RGB, depth and polarization, which enables investigation into object surface reflectance properties.

Figure 2. Our dataset comprises 60 household objects among 8 object categories. The training and test split is depicted here.

We obtain highly accurate ground truth poses with a novel method using a collaborative robot arm in gravity compensated mode and a calibrated mechanical tip. In order to annotate the 6D pose of transparent and non-textured objects, a specially designed tip is mounted on the robot arm. With the calibrated tip, the positions of pre-defined points on the object surface are acquired on the real object and matched to a scan thereof. Using this method, the object pose can be determined with an order of magnitude more accuracy than previous methods. For transparent and textureless objects, topographic key points are used instead of textural ones. The points gathered in this way are then matched to the object model in a final ICP [2] step to yield an accurate fit.

The camera to robot end-effector transformation is needed to obtain the object poses in camera coordinates. Typically, hand-eye calibration approaches solve this by visually estimating the marker position and optimizing for the transformation between camera and end-effector. To minimize the error propagation and obtain highly accurate ground truth labels, we instead used the end-effector tip of the arm in gravity-compensated mode to measure the position of 12 points on a ChArUco [1] board. This allows us to use the robot's accurate positioning system to obtain both object poses and camera poses for image sequences.

Beyond photometrically challenging categories and high-quality annotations, multi-modal input is another highlight of PhoCaL. As active depth sensors fail on metallic and transparent surfaces, we include an additional passive sensor modality in the form of a polarization camera. It provides valuable information on object surfaces [22]. In our setup, we designed and 3D printed a rig that holds multiple cameras, each mounted on it and carefully calibrated. During recording, a pre-defined trajectory is repeated by the robot arm. The robot arm stops when capturing images from all cameras, which avoids motion blur and diminishes effects of imperfect synchronization.

In summary, our main contributions are:

1. We propose PhoCaL, a multi-modal (RGBD + RGBP) dataset for category-level object pose estimation. The dataset comprises 60 high-quality 3D models of household objects including symmetric, transparent and reflective objects in 8 categories with 24 sequences featuring occlusion, partial visibility and clutter.

2. We introduce a new and highly accurate pose annotation method using a robotic manipulator that allows for sub-millimeter precision 6D pose annotations of photometrically challenging objects even with reflective or transparent surfaces.

2. Related Work & Current Challenges

Standardized datasets are used in the field of object pose and shape estimation to quantify and compare contributions and advances in the field. These datasets generally fall into two domains: instance-level datasets, where the 3D model of the object is known a priori, and category-level datasets, where the exact CAD model is unknown. Tab. 1 provides an overview of related datasets in both domains.

2.1. Instance-level 6D Object Pose Dataset

One of the earliest and most widely used publicly available datasets for instance-level pose estimation is LineMOD [19] and its occlusion extension LM-Occlusion [5].
Table 1. Overview of pose estimation datasets (columns: Dataset, Real, RGB, Depth, Polarisation, Robotic GT, Multi-View, Reflective, Transparent, Symmetry, Occlusion, Categories, Objects, Sequences, License). The upper part shows instance-level datasets while the lower part includes category-level setups. PhoCaL is the only dataset that includes both photometrically challenging objects with high quality (robotic) pose annotations and all three modalities: RGB, depth, and polarisation.
Their data was acquired using a PrimeSense RGB-D Carmine sensor, and a marker board was used to keep track of the relative sensor pose. While undoubtedly pioneering this field, the 3D model quality is now outdated and the leaderboards on these datasets have become saturated. HomebrewedDB [23] accounts for the latter shortcoming by providing high quality 3D models scanned with a structured light sensor. Including three models from LineMOD, it adds 30 more toy, household and industrial objects. Different illumination conditions and occlusions make the scenes more challenging. Other datasets also include household objects [13, 21, 34, 37] or focus on industrial parts [14, 20] with low texture, for which it is also possible to manually design or retrieve accurate CAD models [20]. The BOP 6D pose benchmark [21] includes a summary of these datasets with standardized metrics in a common format.

While the datasets mentioned so far provide individual frames, the YCB-Video dataset [41] also includes video sequences of 21 household objects. While YCB uses LabelFusion [31] for semi-manual frame annotation and pose propagation through the sequence, Garon et al. [16] leverage tiny markers on the object to estimate the poses in their videos directly at the cost of synthetic data cleaning afterwards. The advent of photorealistic rendering further enables a branch of works that leverages training on purely synthetic data [12, 38]. Although this circumvents the cumbersome pose labelling process, it introduces a domain gap between synthetic data for evaluations and real-world appearances faced in the final applications.

2.2. Category-level Object Poses and Datasets

In real-world applications, a 3D model is not always available, but pose information is still required. Detection of such objects under these conditions has classically been tackled using 3D geometric primitives [3, 4, 9].

While these methods consider outdoor scenes, for which KITTI [18] provides 3D bounding box annotations, they lack object shape comparison and the information is often too inaccurate for robotic grasping tasks. The pioneering work of NOCS [40] was the first category-level method that could detect object pose and shape in indoor environments. Further investigations consider correspondence-free methods [10] where a deep generative model learns a canonical shape space from RGBD, and a method to estimate pose and shape for fully unseen objects is also proposed [32], albeit this method requires a reference image for latent code generation. CPS [28] demonstrates how to estimate pose and metric shape at category level using only a monocular view. The extension CPS++ [29] further utilizes synthetic data and a domain transfer approach using self-supervised refinement with a differentiable renderer from RGBD data without annotations. SGPA [11] explores shape priors to estimate the object pose. DualPoseNet [25] leverages spherical fusion for better encoding of the object information.
Figure 3. Limitations of RGBD sensors. The depth for photometrically challenging objects is difficult to measure with a commodity depth sensor. The Intel RealSense L515 LiDAR ToF sensor used here is affected by reflections that lead to invalid (1) or incorrect (2) distance estimates. Moreover, the glassware becomes invisible to the sensor (3) and causes noise (4).

Figure 4. Annotation quality for poses in the datasets LineMOD [19] (projected green silhouette, left) and YCB [8] (rendered overlay, right) together with its correction [7] (right).

We leverage the standard RGBD method NOCS and the strong state-of-the-art RGB method CPS to set the baselines on our new dataset. While task-specific datasets for general object detection exist for robot grasping [15, 30], methods for category-level pose estimation are typically tested on NOCS [40] data. The NOCS objects comprise various categories, but do not contain photometric challenges often present in everyday objects such as reflectance and transparency.

2.3. Photometric Challenges and Multimodalities

While texture-less objects [20] were initially challenging for pose estimation, transparency presents an even bigger hurdle. While the problem is not new, previous methods have addressed it using RGB stereo without a 3D model to identify grasping points only [36]. Rotational object symmetry can be leveraged by contour fitting for transparent object reconstruction [33] using template matching. ClearGrasp [35] proposes a method for geometry estimation of transparent objects based on RGBD. However, this method passes over the transparent regions from the depth map and predicts depth from RGB in these areas instead. Liu et al. [27] investigate instance- and category-level pose estimation from stereo imagery. Since their depth sensing fails on transparent objects, they use an opaque object twin as proxy to establish ground truth depth. More recently, StereOBJ-1M [26] proposed a large dataset including transparent and translucent objects with specular reflections and symmetry. However, at the time of this writing it is not yet available for download.

For 2D object detection, information from multiple orthogonal sensor modalities such as polarisation (RGBP) can help for transparent object segmentation [22]. This modality can provide information in regions where depth sensors fail. Its inherent connection with surface normals [43] can also make it attractive for pose estimation of photometrically challenging objects.

2.4. Ground Truth Pose Annotation

Manual annotation of 6D pose is difficult and extremely time-consuming. Therefore, most datasets rely on semi-manual processes for ground truth annotation. The data from a depth sensor, if available, is often used to register the 3D model, and manual adjustments are applied to visually refine the pose for a single frame. Relative camera motion is typically calculated using visual markers [19, 23] to propagate the pose information through a sequence of images. The use of depth sensors for ICP-based alignment of pose labels reduces labour and improves fully-manual annotation quality. However, depth maps from RGBD sensors are erroneous or invalid for photometrically challenging objects with high reflectance and translucent or transparent surfaces [26]. An example is shown in Fig. 3.

Ensuring high quality of pose labels over a series of images is difficult, and errors accumulate, as the examples in Fig. 4 show. This equally affects depth-based refinement strategies of 6D pose pipelines [21, 24]. We propose a mechanical measurement process using a robotic manipulator to circumvent this issue and allow for high precision labels that omits the error propagation of relative camera pose retrieval from images.

3. Dataset Acquisition Pipeline

Our dataset features multiple object classes, including photometrically challenging classes such as objects with reflective surfaces or transparent material. It also provides multi-modal sensor data with highly accurate 6D pose annotation. This section describes our dataset acquisition pipeline as shown in Fig. 5.

3.1. Object Model Acquisition

To represent a cross section of common household objects, we selected eight common categories for our category-level 6D object pose dataset: bottle, box, can, cup, remote, teapot, cutlery, glassware. All object models are scanned using an EinScan-SP 3D Scanner (SHINING 3D Tech. Co., Ltd., Hangzhou, China). The scanner is a structured light stereo system with a single shot accuracy of ≤ 0.05 mm in a scanning volume of 1200×1200×1200 mm³.
Figure 5. Overview of the dataset acquisition pipeline. (a): 3D models are extracted with a structured light scanner. (b): Pivot calibration calibrates a tipping tool to robot coordinates. (c): 6D poses are annotated using the tool and manual movements of the robot. (d): The camera trajectory is saved. (e): The dataset is recorded automatically following the planned trajectory.
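Because the object poses are annotated once in the robot base frame and every camera pose comes from the replayed robot trajectory plus the hand-eye calibration, per-frame labels are obtained by pure transform composition. The snippet below is a minimal sketch of this composition, assuming 4x4 homogeneous matrices and the frame names given in the docstring; the function name is illustrative and not part of any released PhoCaL tooling.

```python
import numpy as np

def object_pose_in_camera(T_base_ee, T_ee_cam, T_base_obj):
    """Per-frame ground-truth pose by transform composition.

    T_base_ee  : 4x4 end-effector pose in the robot base frame (forward kinematics)
    T_ee_cam   : 4x4 camera pose in the end-effector frame (hand-eye calibration)
    T_base_obj : 4x4 annotated object pose in the robot base frame
    Returns the 4x4 object pose expressed in the camera frame.
    """
    T_base_cam = T_base_ee @ T_ee_cam             # camera pose in the base frame
    return np.linalg.inv(T_base_cam) @ T_base_obj
```

No image-based tracking enters this chain, which is why the label quality is bounded by the robot repeatability and the two calibrations rather than by visual marker detection.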
Figure 6. Overview of the hand-eye calibration and its evaluation. (a) shows the marker-to-robot calibration, (b) illustrates the camera-to-robot hand-eye calibration and (c) depicts our accuracy evaluation.
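The marker-to-robot step in Fig. 6 (a) amounts to fitting a rigid transform between the board points measured with the calibrated tip (in robot base coordinates) and their known positions in the ChArUco board frame. A standard closed-form solution is the SVD-based Kabsch alignment sketched below; this is a generic least-squares fit under those assumptions, not the authors' calibration code, and fit_rigid_transform is a hypothetical helper name.

```python
import numpy as np

def fit_rigid_transform(points_src, points_dst):
    """Least-squares rigid transform (R, t) with R @ p_src + t ≈ p_dst.

    points_src, points_dst : (N, 3) arrays of corresponding 3D points,
    e.g. board corners probed with the robot tip (base frame) and the
    same corners in the board coordinate frame.
    """
    src_c = points_src.mean(axis=0)
    dst_c = points_dst.mean(axis=0)
    H = (points_src - src_c).T @ (points_dst - dst_c)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])       # guard against reflections
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t
```

With correspondences measured mechanically, the residual of this fit directly reflects the tip and robot accuracy instead of image-detection noise.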
The models from the first six categories are provided as textured obj files. Since the cutlery and glassware objects are photometrically challenging with their highly reflective and transparent surfaces, we apply a self-vanishing 3D scanning spray (AESUB Blue, Aesub, Recklinghausen, Germany) to make the objects temporarily opaque for scanning. We scan the object and provide an obj file without texture. The spray sublimes after approx. 4 h.

3.2. Scene Acquisition Setup

For each scene, 5-8 objects are placed on the table with a random background. We use a KUKA LBR iiwa 7 R800 (KUKA Roboter GmbH, Augsburg, Germany) 7 DoF robotic arm that guarantees a positional reproducibility of ±0.1 mm. The vision system comprises a Phoenix 5.0 MP polarization camera (IMX264MZR/MYR) with a Sony IMX264MYR CMOS (Color) Polarsens sensor (i.e. PHX050S1-QC; LUCID Vision Labs, Inc., Richmond B.C., Canada) and a Universe Compact C-Mount 5 MP 2/3" 6 mm f/2.0 lens (Universe, New York, USA). As depth camera, the Time-of-Flight (ToF) sensor Intel® RealSense™ LiDAR L515 is used, which acquires depth images at a resolution of 1024x768 pixels in an operating range between 25 cm and 9 m with a field-of-view of 70° x 55° and an accuracy of 5 ± 2.5 mm at 1 m distance up to 14 ± 15.5 mm at 9 m distance.

3.3. Tip Calibration

We use a rigid, pointy metallic tip to obtain the coordinate position of selected points on the object. Tip calibration is therefore essential to ensure the accuracy of the system. The rig attached to the robot's end-effector consists of a custom 3D printed mount which holds the tool-tip rigidly. The pivot calibration is performed as shown in Fig. 8 (left), where the tip point is placed in a fixed position while only the robot end-effector position is changed. We collect data from N such tip positions with corresponding end-effector poses ${}^{i}T_e^b$, which contain the rotation ${}^{i}R_e^b$ and translation ${}^{i}t_e^b$. Since the tip rests at the same pivot point for every pose, each pair of poses satisfies $({}^{i}R_e^b - {}^{j}R_e^b)\, t_t^e = {}^{j}t_e^b - {}^{i}t_e^b$. Stacking these constraints, the final translation $t_t^e$ of the tip with respect to the end-effector is calculated as follows:

\[
t_t^e =
\begin{bmatrix}
{}^{1}R_e^b - {}^{2}R_e^b \\
{}^{2}R_e^b - {}^{3}R_e^b \\
\vdots \\
{}^{n}R_e^b - {}^{1}R_e^b
\end{bmatrix}^{\dagger}
\cdot
\begin{bmatrix}
{}^{2}t_e^b - {}^{1}t_e^b \\
{}^{3}t_e^b - {}^{2}t_e^b \\
\vdots \\
{}^{1}t_e^b - {}^{n}t_e^b
\end{bmatrix}
\qquad (1)
\]

where † denotes the pseudo-inverse. We evaluate the tip calibration by calculating the variance of each tip location at the pivot point. The variance of the tip location in our setup is ε = 0.057 mm.

3.4. 6D Pose Annotation

Annotating the precise 6D pose of the objects is a challenging task as mentioned in Sec. 2.4. Here, we utilize
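Returning to the tip calibration of Sec. 3.3: the stacked least-squares problem of Eq. (1) is small enough to solve directly with a pseudo-inverse. The following is a minimal numpy sketch under the stated conventions (4x4 homogeneous end-effector poses ${}^{i}T_e^b$); pivot_calibration is an illustrative name, not released PhoCaL code.

```python
import numpy as np

def pivot_calibration(T_base_ee_list):
    """Tip offset t_t^e from end-effector poses that keep the tip at one fixed pivot.

    T_base_ee_list : sequence of 4x4 homogeneous end-effector poses ^iT_e^b.
    Returns the 3-vector translation of the tip in the end-effector frame.
    """
    R = np.stack([T[:3, :3] for T in T_base_ee_list])    # (N, 3, 3) rotations ^iR_e^b
    t = np.stack([T[:3, 3] for T in T_base_ee_list])     # (N, 3)    translations ^it_e^b
    # Pair pose i with pose i+1 (wrapping n back to 1), exactly as stacked in Eq. (1).
    A = np.concatenate(R - np.roll(R, -1, axis=0))       # rows (^iR_e^b - ^jR_e^b), shape (3N, 3)
    b = np.concatenate(np.roll(t, -1, axis=0) - t)       # rows (^jt_e^b - ^it_e^b), shape (3N,)
    tip, *_ = np.linalg.lstsq(A, b, rcond=None)          # pseudo-inverse solution
    return tip
```

The spread of the recovered pivot positions ${}^{i}R_e^b\, t_t^e + {}^{i}t_e^b$ over the N poses then serves as a consistency check in the spirit of the ε = 0.057 mm reported above.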
Figure 7. Example of annotation quality before and after ICP based refinement on the textureless object. (a) Initial pose of mesh overlaid
with measured surface points (red dots) shows error in initial pose (red arrow). (b) After the ICP, refined pose matches with the surface
points properly (blue arrow). (c) Shows improvement in 6D pose annotation. Rendering of the mesh with initial pose (d) and refined pose
(e) shows a significant difference in quality.
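The refinement illustrated in Fig. 7 matches the tip-measured surface points to the scanned model with an ICP step in the spirit of [2]. Below is a minimal point-to-point variant, assuming the model is available as a point sample; it is a sketch of the technique, not the annotation tool itself.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_refine(measured_pts, model_pts, R_init, t_init, iters=50):
    """Refine an object pose so that R @ model + t matches the measured tip points.

    measured_pts : (M, 3) surface points probed with the calibrated tip (scene frame)
    model_pts    : (K, 3) points sampled from the scanned 3D model (object frame)
    R_init, t_init : initial rotation (3x3) and translation (3,) of the model in the scene
    """
    R, t = R_init.copy(), t_init.copy()
    for _ in range(iters):
        tree = cKDTree(model_pts @ R.T + t)              # model under the current pose
        _, idx = tree.query(measured_pts)                # closest model point per measurement
        matched = model_pts[idx]
        # Closed-form update: rigid transform mapping matched model points to measurements.
        mc, pc = matched.mean(axis=0), measured_pts.mean(axis=0)
        H = (matched - mc).T @ (measured_pts - pc)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = pc - R @ mc
    return R, t
```

Because the measured points come from the mechanical tip rather than from a depth map, this refinement also works for the transparent and reflective objects shown in Fig. 7.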
Figure 9. Measuring the marker points for the calibration on the scene (left) and the detected marker from one of the cameras (right).

The trajectory of all joints of the robot is recorded by manually moving the end-effector while the robot arm is in gravity compensated mode. Thereafter, we record the images of the scene by replaying the joint trajectory while stopping the robot every 5-7 joint positions to capture the images and the robot pose (approx. 10-15 fps). This ensures that no motion blur and no camera synchronization artefacts are recorded while reproducing the original hand-held camera trajectory.

3.7. Evaluation of Overall Annotation Quality

We evaluate the overall annotation quality of our dataset by running a simulated data acquisition with two measured error statistics: the ICP error (Sec. 3.4) and the hand-eye calibration error (Sec. 3.5). For both the RGBD and the polarization camera, the setup from one of the scenes is used, including the objects and the trajectories. The acquisition is simulated twice, with and without the aforementioned errors. In the end, the RMSE is calculated pointwise in mm between the two acquisitions. We average the error per object and per frame along the trajectories.

The RMSE for the RGBD camera is 0.84 mm and for the polarization camera 0.76 mm. A detailed description of this procedure is attached in the supplementary material. The annotation quality in comparison with other dataset acquisition principles is shown in Tab. 2.

Dataset: RGBD Dataset | TOD [27] | StereOBJ [26] | Ours
3D Labeling: Depth Map | Multi-View | Multi-View | Robot
Point RMSE: ≥ 17 mm | 3.4 mm | 2.3 mm | 0.80 mm

Table 2. Comparison of pose annotation quality for different dataset setups. The error for RGBD is exemplified with the standard deviation of the Microsoft Azure Kinect [26].

4. Benchmarks and Experiments

Both monocular (CPS) and RGB-D based (NOCS) category-level methods are considered for the baseline evaluation on the PhoCaL dataset. For the evaluation of NOCS, the normalized object coordinate space maps are rendered for each training image and will be published together with the dataset. With the predicted normalized object shape from the NOCS map, the depth information is used to lift the 2D detection to 3D space using ICP. Considering the artifacts in the depth data from metallic and transparent objects in the dataset, along with the occlusion, the test sequences are very challenging for RGBD methods.

Similar to NOCS, CPS first detects 2D bounding boxes. Then lifting modules for each class transform 2D image features to 6D pose and scales. Simultaneously, the method also estimates the point cloud shape for the respective object class. CPS is trained on approximately 1000 object instance models for each category to learn a deep point cloud encoding of each class. The 2D detection and lifting modules are trained together for 100k steps with a learning rate of 1e-4, decaying to 1e-5 at 60k steps.

4.1. Evaluation Pipeline

Our dataset consists of 24 image sequences in total with a training and testing split in each sequence. In our evaluation pipeline, the training split of the first 12 sequences is used to train the network. To have an evaluation on both known and novel objects in each category, two experiments are designed. To evaluate on seen objects first, the network is trained on the training split of the first 12 sequences and tested on the testing split of the same sequences. To further evaluate the generalization ability of NOCS and CPS to novel objects in the same category, the same training split of the first 12 sequences is used, but we evaluate the result on the testing split of the latter 12 sequences, where objects are mostly unseen. In this way, the generalization ability of the methods to novel objects in the category is emphasized, which is a common issue in real operating environments. The evaluation metric is the intersection over union (IoU) with thresholds of 25% and 50%.

4.2. Evaluation Result

The 3D IoU evaluations of NOCS at 25% and 50% for the first experiment setup are shown in Tab. 3. The mean average precision (mAP) for 3D IoU at 25% is 43.34%. It is observed in the experiment that even if the segmentation and normalized object coordinate map predictions are accurate, the lifting from the NOCS map to 6D space is sensitive to artifacts in the depth maps. Since the objects are highly occluded in the PhoCaL dataset, and depth measurements are inaccurate for the cutlery and glassware categories, the method does not perform well on the dataset, which indicates the drawbacks of RGBD methods in these photometrically challenging cases. The average precision of each category with respect to the 3D IoU threshold is plotted in Fig. 10a. Note that the results for the cutlery and glassware categories are among the worst three categories.

For comparison, the result of CPS is also listed in Tab. 3.
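For context on the NOCS baseline above: lifting a predicted NOCS map plus back-projected depth to a metric pose and scale can be posed as a similarity alignment between two corresponding point sets. The sketch below uses the standard closed-form Umeyama estimate under that assumption; it is illustrative and not the original NOCS implementation.

```python
import numpy as np

def lift_nocs_to_pose(nocs_pts, cam_pts):
    """Estimate scale s, rotation R and translation t with  s * R @ nocs + t ≈ cam.

    nocs_pts : (N, 3) predicted NOCS coordinates (inside the unit cube) for one detection
    cam_pts  : (N, 3) corresponding 3D points back-projected from the depth map
    """
    mu_n, mu_c = nocs_pts.mean(axis=0), cam_pts.mean(axis=0)
    xn, xc = nocs_pts - mu_n, cam_pts - mu_c
    H = xn.T @ xc / len(nocs_pts)                         # cross-covariance
    U, S, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    var_n = (xn ** 2).sum() / len(nocs_pts)
    s = np.trace(np.diag(S) @ D) / var_n                  # Umeyama scale estimate
    t = mu_c - s * R @ mu_n
    return s, R, t
```

When the depth values for transparent or metallic surfaces are wrong, cam_pts is corrupted and the recovered scale and pose degrade, which is consistent with the NOCS results discussed above.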
3D25 / 3D50 Bottle Box Can Cup Remote Teapot Cutlery Glassware Mean
NOCS [40] 91.17 / 0.65 16.10 / 0.01 85.44 / 23.01 51.83 / 1.48 93.26 / 86.05 0.00 / 0.00 4.89 / 0.01 4.00 / 0.06 43.34 / 13.91
CPS [28] 80.08 / 40.30 31.68 / 28.18 68.96 / 6.69 81.60 / 70.24 86.30 / 37.08 67.43 / 4.31 44.00 / 24.95 30.33 / 17.74 61.30 / 28.69
Table 3. Class-wise evaluation of 3D IoU for NOCS [40] and CPS [28] on test split of known objects.
3D25 / 3D50 Bottle Box Can Cup Remote Teapot Cutlery Glassware Mean
Experiment 1 91.17 / 0.65 16.10 / 0.01 85.44 / 23.01 51.83 / 1.48 93.26 / 86.05 0.00 / 0.00 4.89 / 0.01 4.00 / 0.06 43.3 / 13.91
Experiment 2 13.70 / 1.28 27.74 / 0.00 48.17 / 0.00 61.77 / 0.00 8.35 / 0.00 4.90 / 0.00 16.10 / 0.00 0.83 / 0.00 22.70 / 0.17
Table 4. Class-wise evaluation of 3D IoU for NOCS [40] on seen (Experiment 1) and mostly unseen (Experiment 2) objects.
(a) NOCS result in the first experiment (b) CPS result in the first experiment (c) NOCS result in the second experiment
Figure 10. Plots of average precision (AP) with respect to 3D IoU thresholds for each category.
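For reference, the 3D25/3D50 numbers in Tab. 3 and Tab. 4 threshold a 3D IoU at 0.25 and 0.5, and the AP curves in Fig. 10 sweep this threshold. Below is a simplified sketch restricted to axis-aligned boxes; the benchmark itself compares oriented 3D bounding boxes, so this only illustrates the thresholding.

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU of two axis-aligned boxes, each given as a (min_xyz, max_xyz) pair."""
    min_a, max_a = box_a
    min_b, max_b = box_b
    inter = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter_vol = inter.prod()
    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    return inter_vol / (vol_a + vol_b - inter_vol)

# A prediction counts as correct at the 3D25 (3D50) level when IoU >= 0.25 (0.5).
```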
As can be seen from Tab. 3, CPS has a higher precision for the cutlery and glassware categories. Monocular methods are not affected by artifacts in depth images, which explains this result. CPS also reaches a higher mAP of 61.30%, which means RGB has an advantage in dealing with photometrically challenging objects. The detailed APs for each category are plotted in Fig. 10b.

In addition, the NOCS evaluations on both experiments are compared in Tab. 4. The evaluation result for the second experiment has a lower mAP for 3D IoU at 25% and 50% as expected, as most of the test objects are novel in the second experiment. Fig. 10c plots the NOCS APs in the second experiment. In comparison to NOCS, the CPS result drops significantly in the second experiment and the 3D IoU at 25% is 4.3%. The result shows that pretraining with a large amount of synthetic images is necessary for monocular methods to learn the correct lifting from 2D detection to 3D space without the help of depth images.

4.3. Limitations

Even though the proposed pipeline for annotating the 6D pose ground truth is accurate, annotating objects with a deformable surface, such as empty boxes, poses a challenge during the surface measurement step in the workflow: their slight deformation could deteriorate the quality of both the initial pose and the ICP based refinement. Moreover, the limited workspace of the robot constrains the view angles in the image sequences, which is an issue PhoCaL shares with other robotic acquisition setups. The hand-eye calibration of the camera plays a key role for the annotation quality. If the camera resolution is low, a good calibration result requires significantly more input images from different angles.

5. Conclusion

In this paper we introduce the PhoCaL dataset, which contains photometrically challenging categories. High-quality 6D pose annotations are provided for all categories and multiple camera modalities, namely RGBD and RGBP. With our manipulator-driven annotation pipeline, we reach pose accuracy levels that are one order of magnitude more precise than previous vision-sensor-only pipelines, even for photometrically complex objects. Moreover, baselines are provided for future work on category-level 6D pose on our dataset by evaluating both monocular and RGB-D methods. The evaluation shows the difficulty level of the dataset, in particular for objects with reflective and transparent surfaces. PhoCaL therefore constitutes a challenging dataset with accurate ground truth that can pave the way for future pose pipelines that are applicable to more realistic scenarios with everyday objects.
References

[1] Gwon Hwan An, Siyeong Lee, Min-Woo Seo, Kugjin Yun, Won-Sik Cheong, and Suk-Ju Kang. Charuco board-based omnidirectional camera calibration method. Electronics, 7(12):421, 2018. 2
[2] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–606. International Society for Optics and Photonics, 1992. 2
[3] Tolga Birdal, Benjamin Busam, Nassir Navab, Slobodan Ilic, and Peter Sturm. A minimalist approach to type-agnostic detection of quadrics in point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3530–3540, 2018. 3
[4] Tolga Birdal, Benjamin Busam, Nassir Navab, Slobodan Ilic, and Peter Sturm. Generic primitive detection in point clouds using novel minimal quadric fits. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1333–1347, 2019. 3
[5] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6d object pose estimation using 3d object coordinates. In Proceedings of the European Conference on Computer Vision, pages 536–551. Springer, 2014. 2, 3
[6] Yannick Bukschat and Marcus Vetter. Efficientpose: An efficient, accurate and scalable end-to-end 6d multi object pose estimation approach. arXiv preprint arXiv:2011.04307, 2020. 1
[7] Benjamin Busam, Hyun Jun Jung, and Nassir Navab. I like to move it: 6d pose estimation as an action decision process. arXiv preprint arXiv:2009.12678, 2020. 4
[8] Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. Benchmarking in manipulation research: The ycb object and model set and benchmarking protocols. arXiv preprint arXiv:1502.03143, 2015. 3, 4
[9] Peter Carr, Yaser Sheikh, and Iain Matthews. Monocular object detection using 3d geometric primitives. In Proceedings of the European Conference on Computer Vision, pages 864–878. Springer, 2012. 3
[10] Dengsheng Chen, Jun Li, Zheng Wang, and Kai Xu. Learning canonical shape space for category-level 6d object pose and size estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11973–11982, 2020. 3
[11] Kai Chen and Qi Dou. Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2773–2782, 2021. 3
[12] Maximilian Denninger, Martin Sundermeyer, Dominik Winkelbauer, Youssef Zidan, Dmitry Olefir, Mohamad Elbadrawy, Ahsan Lodhi, and Harinandan Katam. Blenderproc. arXiv preprint arXiv:1911.01911, 2019. 3
[13] Andreas Doumanoglou, Rigas Kouskouridas, Sotiris Malassiotis, and Tae-Kyun Kim. Recovering 6d object pose and predicting next-best-view in the crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3583–3592, 2016. 3
[14] Bertram Drost, Markus Ulrich, Paul Bergmann, Philipp Hartinger, and Carsten Steger. Introducing mvtec itodd - a dataset for 3d object recognition in industry. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Oct 2017. 3
[15] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11444–11453, 2020. 3, 4
[16] Mathieu Garon, Denis Laurendeau, and Jean-François Lalonde. A framework for evaluating 6-dof object trackers. In Proceedings of the European Conference on Computer Vision, pages 582–597, 2018. 3
[17] Mathieu Garon, Denis Laurendeau, and Jean-François Lalonde. A framework for evaluating 6-DOF object trackers. In Proceedings of the European Conference on Computer Vision, 2018. 6
[18] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012. 3
[19] Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, and Vincent Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 858–865. IEEE, 2011. 2, 3, 4
[20] Tomáš Hodan, Pavel Haluza, Štěpán Obdržálek, Jiri Matas, Manolis Lourakis, and Xenophon Zabulis. T-less: An rgb-d dataset for 6d pose estimation of texture-less objects. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 880–888. IEEE, 2017. 3, 4
[21] Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders GlentBuch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, et al. Bop: Benchmark for 6d object pose estimation. In Proceedings of the European Conference on Computer Vision, pages 19–34, 2018. 3, 4
[22] Agastya Kalra, Vage Taamazyan, Supreeth Krishna Rao, Kartik Venkataraman, Ramesh Raskar, and Achuta Kadambi. Deep polarization cues for transparent object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8602–8611, 2020. 2, 4
[23] Roman Kaskman, Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects. Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019. 3, 4
[24] Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, pages 1521–1529, 2017. 4
[25] Jiehong Lin, Zewei Wei, Zhihao Li, Songcen Xu, Kui Jia, and Yuanqing Li. Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency. arXiv preprint arXiv:2103.06526, 2021. 1, 3
[26] Xingyu Liu, Shun Iwase, and Kris M Kitani. Stereobj-1m: Large-scale stereo image dataset for 6d object pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 10870–10879, 2021. 3, 4, 7
[27] Xingyu Liu, Rico Jonschkowski, Anelia Angelova, and Kurt Konolige. Keypose: Multi-view 3d labeling and keypoint estimation for transparent objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11602–11610, 2020. 3, 4, 7
[28] Fabian Manhardt, Manuel Nickel, Sven Meier, Luca Minciullo, and Nassir Navab. Cps: Class-level 6d pose and shape estimation from monocular images. arXiv preprint arXiv:2003.05848, 2020. 3, 8
[29] Fabian Manhardt, Gu Wang, Benjamin Busam, Manuel Nickel, Sven Meier, Luca Minciullo, Xiangyang Ji, and Nassir Navab. Cps++: Improving class-level 6d pose and shape estimation from monocular images with self-supervised learning. arXiv preprint arXiv:2003.05848, 2020. 3
[30] Lucas Manuelli, Wei Gao, Peter Florence, and Russ Tedrake. kpam: Keypoint affordances for category-level robotic manipulation. arXiv preprint arXiv:1903.06684, 2019. 3, 4
[31] Pat Marion, Peter R Florence, Lucas Manuelli, and Russ Tedrake. Label fusion: A pipeline for generating ground truth labels for real rgbd data of cluttered scenes. In IEEE International Conference on Robotics and Automation, pages 3235–3242. IEEE, 2018. 3
[32] Keunhong Park, Arsalan Mousavian, Yu Xiang, and Dieter Fox. Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10710–10719, 2020. 3
[33] Cody J Phillips, Matthieu Lecce, and Kostas Daniilidis. Seeing glassware: from edge detection to pose estimation and shape recovery. In Robotics: Science and Systems, volume 3, 2016. 4
[34] Colin Rennie, Rahul Shome, Kostas E Bekris, and Alberto F De Souza. A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters, 1(2):1179–1185, 2016. 3
[35] Shreeyak Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, and Shuran Song. Clear grasp: 3d shape estimation of transparent objects for manipulation. In IEEE International Conference on Robotics and Automation, pages 3634–3642. IEEE, 2020. 4
[36] Ashutosh Saxena, Justin Driemeyer, and Andrew Y Ng. Robotic grasping of novel objects using vision. The International Journal of Robotics Research, 27(2):157–173, 2008. 4
[37] Alykhan Tejani, Danhang Tang, Rigas Kouskouridas, and Tae-Kyun Kim. Latent-class hough forests for 3D object detection and pose estimation. In Proceedings of the European Conference on Computer Vision, pages 462–477. Springer, 2014. 3
[38] Jonathan Tremblay, Thang To, and Stan Birchfield. Falling things: A synthetic dataset for 3d object detection and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2038–2041, 2018. 3
[39] Roger Y Tsai, Reimar K Lenz, et al. A new technique for fully autonomous and efficient 3d robotics hand/eye calibration. IEEE Transactions on Robotics and Automation, 5(3):345–358, 1989. 6
[40] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2642–2651, 2019. 1, 3, 4, 8
[41] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. Robotics: Science and Systems, 2018. 3
[42] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. Dpod: 6d pose object detector and refiner. In Proceedings of the IEEE International Conference on Computer Vision, pages 1941–1950, 2019. 1
[43] Shihao Zou, Xinxin Zuo, Yiming Qian, Sen Wang, Chi Xu, Minglun Gong, and Li Cheng. 3d human shape reconstruction from a polarization image. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 351–368. Springer, 2020. 4