
GAPartManip: A Large-scale Part-centric Dataset for

Material-Agnostic Articulated Object Manipulation


Wenbo Cui*1,2, Chengyang Zhao*3,4, Songlin Wei*3,6, Jiazhao Zhang3,6, Haoran Geng3,5, Yaran Chen1, He Wang†2,3,6

arXiv:2411.18276v1 [cs.RO] 27 Nov 2024

Abstract— Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often face challenges due to imperfect depth perception, such as with transparent lids and reflective handles. Moreover, they generally lack the diversity in part-based interactions required for flexible and adaptable manipulation. To address these challenges, we introduce a large-scale part-centric dataset for articulated object manipulation that features both photo-realistic material randomization and detailed annotations of part-oriented, scene-level actionable interaction poses. We evaluate the effectiveness of our dataset by integrating it with several state-of-the-art methods for depth estimation and interaction pose prediction. Additionally, we propose a novel modular framework that delivers superior and robust performance for generalizable articulated object manipulation. Our extensive experiments demonstrate that our dataset significantly improves the performance of depth perception and actionable interaction pose prediction in both simulation and real-world scenarios.

I. INTRODUCTION

Fig. 1. GAPartManip. We introduce a large-scale part-centric dataset for material-agnostic articulated object manipulation. It encompasses 19 common household articulated categories, totaling 918 object instances, 240k photo-realistic rendering images, and 8 billion scene-level actionable interaction poses. GAPartManip enables robust zero-shot sim-to-real transfer for accomplishing articulated object manipulation tasks.

Articulated objects are ubiquitous in people's daily lives, ranging from tabletop items like microwaves and kitchen pots to larger items like cabinets and washing machines. Unlike simple, single-function rigid objects, articulated objects consist of multiple parts with different functions, featuring varied geometric shapes and kinematic structures, which makes generalizable perception and manipulation of them highly non-trivial [1]. Some existing works try to simplify this problem by developing intermediate representations that implicitly encode the similarities across different objects, such as affordance [2]–[5] and motion flow [6]–[8], thereby achieving generalization across objects. Another series of works [9]–[11] tackles articulated object perception and manipulation based on a more explicit and fundamental concept called Generalizable and Actionable Part (GAPart), demonstrating greater manipulation capability attributable to its 7-DoF pose representation compared to the value-map representation of visual affordance. However, we observe that two critical limitations impede their real-world performance.

First, the material of articulated objects significantly impacts the quality of point cloud data. Most existing work relies on point clouds, and these methods struggle due to the sim-to-real gap of depth estimation [9], [10], [12], [13]. Some neural-based stereo-matching depth reconstruction methods have been proposed and show some success on rigid objects [14], [15]. These methods use neural networks to encode the disparity in stereo infrared (IR) patterns projected by structured-light cameras. However, due to the limited diversity of stereo IR datasets, these methods are constrained to small rigid objects and perform poorly on large articulated objects.

Second, no existing method can predict stable and actionable interactive poses across categories for articulated objects. Some works employ heuristic-based methods [9] to interact with articulated objects, but they are limited in diversity and fail to account for the geometric details necessary for robust interactions in real-world settings [3]. Some grasping pose prediction methods for rigid objects can generate stable poses.

*Equal Contribution. 1 Institute of Automation, Chinese Academy of Sciences; 2 Beijing Academy of Artificial Intelligence; 3 CFCS, School of Computer Science, Peking University; 4 Carnegie Mellon University; 5 University of California, Berkeley; 6 Galbot. †Corresponding to [email protected].
However, due to the lack of data on articulated objects, it is challenging for these methods to discern whether each link can be interacted with independently, resulting in poses that are mostly non-actionable [16]. Affordance-based methods [2], [13], [17] have received widespread attention for interacting with articulated objects by generating heatmaps. However, these heatmaps are ambiguous, hard to annotate, and struggle to produce stable grasping poses [12].

In this paper, we address these limitations from a data-centric perspective. We introduce GAPartManip, a novel large-scale synthetic dataset that features two important aspects: (1) realistic, physics-based IR image rendering of various parts in diverse scenes, and (2) part-oriented actionable interaction pose annotations for a wide range of articulated objects. GAPartManip inherits 918 object instances across 19 categories from the previous GAPartNet dataset [9]. Leveraging these assets, we developed a data generation pipeline for part manipulation, producing the synthetic data needed to address the aforementioned limitations. To improve generalizability and mitigate the sim-to-real gap, we incorporate domain randomization techniques [15] during data generation, ensuring a diverse range of outputs. In total, our dataset contains approximately 14,000 scene-level samples with 8 billion part-oriented actionable pose annotations, encompassing a wide array of physical materials, object states, and camera perspectives.

Trained on the proposed dataset, we obtain a depth reconstruction network and an actionable pose prediction network that separately address the two limitations mentioned above. Moreover, we compose these two networks as modules of a novel articulated object manipulation framework. Through extensive experiments in both synthetic and real-world settings, our method achieves state-of-the-art (SOTA) performance in both individual module experiments and part manipulation experiments.

To summarize, our main contributions are as follows:
• We introduce GAPartManip, a novel large-scale dataset of various articulated objects featuring realistic, physics-based rendering and diverse scene-level, part-oriented actionable interaction pose annotations.
• We propose a novel framework for articulated object manipulation and evaluate each module separately, demonstrating superior effectiveness and robustness compared to baseline methods.
• We conduct comprehensive experiments in the real world and achieve SOTA performance on articulated object manipulation tasks.

II. RELATED WORK

A. Articulated object dataset

Articulated object datasets and modeling constitute a crucial and longstanding research field in 3D vision and robotics, encompassing a wide range of work in perception [9], [18]–[23], generation [24]–[28], and manipulation [9]–[11], [29]–[33]. As for manipulation datasets, GAPartNet [9] annotates 6-DoF part poses to manipulate parts. GraspNet [34] and Contact-GraspNet [35] build several datasets, but these all focus on rigid objects, neglecting the kinematic semantics specific to articulated objects. Where2Act [2] first introduces a data generation pipeline for articulated objects, generating data by sampling successful poses in a simulator. AO-Grasp [16] leverages a curvature-based sampling method to accelerate data collection and proposes an 87k dataset of actionable poses. RPMArt [12] manually annotates affordance maps for articulated objects and provides data rendered in SAPIEN [1]. None of the current datasets provide sufficient photo-realistic rendering data to improve algorithms' perception of articulated objects during sim-to-real transfer, limiting real-world performance, especially with imperfect point clouds [10], [12]. Additionally, their data collection processes are inefficient and result in small datasets, hindering generalization to unknown objects. This work aims to create a large-scale dataset with diverse photo-realistic and actionable pose data covering all types of GAParts.

B. Articulated object manipulation

Due to their unique kinematic structures and geometric shapes, articulated objects present significant manipulation challenges. Current methods can be broadly categorized into learning-based methods and prediction-planning methods. Learning-based methods, such as reinforcement learning [10], [31] and imitation learning [32], [36], require a large amount of high-quality robot demonstrations. However, collecting such data is both impractical and time-consuming, and their sim-to-real performance relies heavily on the simulator. Current prediction-planning methods [2], [9], [11], [37]–[39] focus on visual affordance but offer ambiguous interactive poses and struggle to generalize due to limited data. They rely on 3D point clouds, ignoring the impact of object materials. In the real world, depth cameras often miss critical parts like handles and lids, reducing sim-to-real performance.

III. GAPARTMANIP DATASET

A. Overview

We construct a large-scale dataset, GAPartManip, to address, from a data-centric perspective, both the depth estimation and the actionable interaction pose prediction challenges in real-world articulated object manipulation. It contains 19 common household articulated categories from GAPartNet, including Box, Bucket, CoffeeMachine, Dishwasher, Door, KitchenPot, Laptop, Microwave, Oven, Printer, Refrigerator, Safe, StorageFurniture, Suitcase, Table, Toaster, Toilet, TrashCan, and WashingMachine, comprising a total of 918 object instances after removing problematic assets. We build a photo-realistic rendering pipeline for each asset in indoor scenes, rendering RGB images, IR images, depth maps, and part-level segmentations. Additionally, we create high-quality, physically plausible interaction pose annotations for each part of every articulated object, and then leverage our GPU-accelerated scene-level pose annotation pipeline to generate dense, part-oriented actionable interaction pose annotations for each data sample. Our dataset contains over 8 billion actionable poses across 241,680 data samples.
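Concretely, a single scene-level sample bundles the modalities listed above. The record layout below is our own illustrative assumption for exposition, not the released file format; the D415-like resolution is also assumed.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneSample:
    """One rendered scene-level sample (hypothetical layout).

    Mirrors the modalities listed in Sec. III-A: an RGB image, a stereo
    IR pair, a depth map, part-level segmentation, and actionable poses.
    """
    rgb: np.ndarray        # (H, W, 3) uint8 color image
    ir_left: np.ndarray    # (H, W) uint8 left IR image
    ir_right: np.ndarray   # (H, W) uint8 right IR image
    depth: np.ndarray      # (H, W) float32 depth map in meters
    part_mask: np.ndarray  # (H, W) int32 part-level segmentation labels
    poses: np.ndarray      # (K, 4, 4) actionable gripper poses in SE(3)

H, W = 480, 848  # assumed sensor-like resolution
sample = SceneSample(
    rgb=np.zeros((H, W, 3), np.uint8),
    ir_left=np.zeros((H, W), np.uint8),
    ir_right=np.zeros((H, W), np.uint8),
    depth=np.zeros((H, W), np.float32),
    part_mask=np.zeros((H, W), np.int32),
    poses=np.eye(4, dtype=np.float32)[None].repeat(16, axis=0),
)
```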
Fig. 2. Data Examples in GAPartManip. GAPartManip is a novel large-scale synthetic dataset for articulated objects, featuring two important aspects: (1) realistic, physics-based IR rendering for various object materials in diverse scenes, and (2) part-oriented actionable interaction pose annotations for a wide range of articulated objects. Each column shows a data sample; from top to bottom, each column displays the RGB image, the IR image (only the left IR image is shown here), and the scene-level actionable interaction pose annotations.

Fig. 2 shows examples of data samples from our dataset. Our whole data generation pipeline is illustrated in Fig. 3.

Fig. 3. Dataset Generation Pipeline. For scene-level data sample rendering, we input the object asset into our photo-realistic rendering pipeline, generating one RGB image and two IR images (left and right) for each camera perspective. For pose annotation, we begin by performing mesh fusion on each GAPart of the object to establish a one-to-one correspondence between GAParts and meshes. Then, we use FPS to obtain the point cloud for each GAPart, enabling part-level stable interaction pose annotation. These poses are further utilized for scene-level actionable interaction pose annotation for each rendered data sample.

B. Photo-realistic Scene-level Rendering

Our photo-realistic rendering pipeline is built upon NVIDIA Isaac Sim [40]. Specifically, we simulate the RGB and IR imaging process of the Intel RealSense D415, a structured-light camera widely used for real-world depth estimation in previous research. We replicate the layout of the D415 imaging system, which consists of four hardware modules: an IR projector, an RGB camera, and two infrared (IR) cameras. We also project a pattern similar to that of the D415 onto the scenes.

Inspired by previous works [14], [41], we incorporate domain randomization techniques into our rendering pipeline to mimic IR imaging under the various lighting conditions and material properties encountered in the real world. We render each object in 20 different scenes with various domain randomization settings. Concretely, we randomly vary the ambient lighting, background, and object material properties in each scene, generating more diverse data that covers a wider range of real-world imaging conditions. We further randomize the ambient light positions and intensities within each scene. More importantly, we randomize the parameters of all diffuse, transparent, specular, and metal materials of each part according to their semantics. Finally, we uniformly randomize the joint poses of the object within its joint limits in each scene during the rendering process.

We render the objects and parts from different distances: each scene is rendered with 5 object-centric camera perspectives for the whole object and 5 part-centric camera perspectives for each part. For the object-centric perspective, which places the whole object within the camera view, the camera is positioned at a latitude in [10°, 60°] and a longitude in [-60°, 60°] relative to the target object. For the part-centric perspective, which captures the more fine-grained parts, we leverage the part pose annotations from GAPartNet and the current joint poses to determine the position and orientation of each part in the scene. The camera is then randomly positioned around each part, aiming directly toward the part center, so that the target part occupies the primary area of the image. During this process, camera viewpoints are randomly sampled within a latitude range of [0°, 60°] and a longitude range of [-75°, 75°].

C. GPU-accelerated Scene-level Pose Annotation

a) Part-level Stable Pose Annotation: We employ a pose sampling strategy similar to GraspNet [34] to annotate dense and diverse stable interaction poses for each GAPart, based on the original semantic annotations in GAPartNet [9]. First, we perform mesh fusion for each part, merging the meshes corresponding to the same part to establish a one-to-one correspondence between parts and meshes. Then, we apply Farthest Point Sampling (FPS) to downsample the mesh of each part, resulting in N candidate points for pose sampling. For each candidate point, we uniformly generate V × A × D candidate poses, where V is the number of gripper views distributed uniformly over a spherical surface, A is the number of in-plane gripper rotations, and D is the number of gripper depths. In our case, N = 512, V = 64, A = 12, and D = 4. We follow GraspNet to calculate the pose score based on antipodal analysis.

b) Scene-level Actionable Pose Annotation: To obtain part-centric grasping poses, we first project the part-level grasping poses into the scene using the part pose annotations,
and then filter out unreasonable and unreachable poses. More concretely, we classify grasping poses that do not align with the single-view partial point cloud as unreasonable, and we consider poses that cause collisions between the gripper and other objects as unreachable.

However, such a filtering process is computationally demanding due to the large number of points in the scene. To accelerate the pose annotation process, we implemented a CUDA-based optimization of the filtering step. Our optimization reduces the processing time from 5 minutes to less than 2 seconds per part, a nearly 150-fold speed-up. As a result, the originally year-long pose annotation process can now be completed within 3 days.

IV. FRAMEWORK

We propose a novel framework to address cross-category articulated object manipulation in real-world settings. As illustrated in Fig. 4, the framework primarily consists of three modules: a depth reconstruction module, a pose prediction module, and a local planner module.

A. Depth reconstruction module

The input to our system is a single-view RGB-D observation comprising a raw depth map I_d, a left IR image I_ir^l, a right IR image I_ir^r, and an RGB image I_c. However, raw sensor depth is often incomplete and even incorrect, because transparent and reflective surfaces are inherently ambiguous for structured-light and Time-of-Flight depth cameras. Therefore, we leverage diffusion model-based approaches to estimate and restore the incomplete depth of raw sensor outputs. We use D3RoMa [14] as our depth predictor and fine-tune it on our dataset.

B. Pose prediction module

Unlike 6-DoF grasping pose prediction for rigid object manipulation, we need to predict both the 6-DoF part grasping pose and the 2-DoF movement direction after grasping. We adapt the SOTA method EconomicGrasp [42] as our actionable pose estimator, dubbed Part-aware EcoGrasp, and use the pretrained GAPartNet [9] to predict the part movement direction.

To precisely annotate the part-centric interaction pose, we propose actionness in place of the graspness used in EconomicGrasp. To annotate actionness, we first denote the scene as a point cloud P = {p_i | i = 1, ..., N} with N points. Then, for each point p_i, we uniformly discretize its sphere space into V = {v_j | j = 1, ..., V} approaching directions. For each view v_j of point p_i, we generate L actionable pose candidates A_k^{i,j} ∈ SE(3), indexed by k ∈ [1, L], by grid sampling along gripper depths and in-plane rotation angles, respectively. We employ antipodal analysis [34] to calculate the grasping quality score q_k^{i,j} ∈ [0, 1.2]. Next, we define an actionable label c_a^i ∈ {0, 1} for each point, indicating whether the point lies on an interactable part. We also define a scene-level collision label c_k^{i,j} ∈ {0, 1} for each pose, indicating whether the pose causes a collision. Finally, the point-wise actionness score s_i^P and the view-wise actionness score s_i^V are defined as:

    s_i^P = (1 / Σ_{j,k} A_k^{i,j}) Σ_{j,k} c_a^i 1(q_k^{i,j} > T) c_k^{i,j},    (1)

    s_i^V = (1 / Σ_k A_k^{i,j}) Σ_k c_a^i 1(q_k^{i,j} > T) c_k^{i,j},    (2)

where T is a predefined threshold to filter out poses of inferior quality. We then train Part-aware EcoGrasp following [42]. Additionally, we utilize the pre-trained GAPartNet [9] to predict the motion direction, which specifies the part movement direction after grasping the actionable part.

C. Local planner module

We use cuRobo [43] as our motion planner. It optimizes motion trajectories based on the actionable poses given by the pose prediction module, computes joint angles through inverse kinematics, and drives the robot to execute trajectory actions through joint control. Subsequently, the robot executes actions along the motion direction r⃗_p.

V. EXPERIMENTS

We conduct experiments for each module. The depth estimation and actionable pose prediction experiments illustrate the significance of our dataset for articulated object manipulation tasks, while real-world experiments compare the performance of our framework with existing methods. We also perform ablation studies for each module.

A. Depth Estimation Experiments

In this section, we evaluate different depth estimation methods with GAPartManip to demonstrate the effectiveness of our dataset for improving articulated object depth estimation in both simulation and the real world.

Data Preparation. We split the dataset into training and test sets using an approximate 8:2 ratio. To maintain comprehensive coverage, each object category is split carefully, ensuring that both the training and test sets include samples from all categories. Additionally, we make sure that samples rendered from the same object instance are assigned exclusively to either the training or the test set. We compare our method with the following baselines:
• SGM [44] is one of the most widely used traditional algorithms for dense binocular stereo matching.
• RAFT-Stereo (RS) [45] is a learning-based binocular stereo matching architecture built upon the dense optical flow estimation framework RAFT [46], using an iterative update strategy to recursively refine the disparity map.
• D3RoMa (DR) [14] is a SOTA learning-based stereo depth estimation framework based on the diffusion model. It excels at restoring noisy depth maps, especially for transparent and specular surfaces.
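All three baselines estimate disparity, which is converted to metric depth via the standard stereo relation z = f·B/d. A minimal sketch of this conversion follows; the focal length and baseline are placeholder values, not the actual D415 calibration. It also illustrates why a small disparity error (EPE) can become a large depth error (MAE) at range.

```python
def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Convert a stereo disparity (in pixels) to metric depth via z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Placeholder intrinsics: 640 px focal length, 55 mm stereo baseline.
f_px, b_m = 640.0, 0.055
z = disparity_to_depth(16.0, f_px, b_m)  # 640 * 0.055 / 16 = 2.2 m
# A 1 px disparity error at this range shifts the depth by roughly 0.13 m,
# which is why small EPE differences translate into noticeable MAE gaps.
z_err = abs(disparity_to_depth(17.0, f_px, b_m) - z)
```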
Fig. 4. Framework Overview. Given IR images and raw depth, the depth reconstruction module first performs depth recovery. Subsequently, the pose prediction module generates a 7-DoF actionable pose and a motion direction based on the reconstructed depth. Finally, the local planner module carries out the action execution.
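The modular composition in Fig. 4 can be sketched as a simple sequential pipeline. The sketch below shows the control flow only; the module interfaces and names are our own placeholders, not the paper's released code.

```python
from typing import Callable

def run_pipeline(
    observation,                    # raw sensor observation (IR pair + raw depth)
    reconstruct_depth: Callable,    # depth reconstruction module
    predict_poses: Callable,        # actionable pose + motion direction prediction
    plan_and_execute: Callable,     # local planner (e.g., a cuRobo-style planner)
):
    """Sequential flow: depth recovery -> pose prediction -> motion planning."""
    depth = reconstruct_depth(observation)
    poses, motion_dir = predict_poses(depth)
    best = max(poses, key=lambda p: p["score"])  # pick the top-scored pose
    return plan_and_execute(best, motion_dir)

# Tiny stand-in modules to exercise the control flow.
result = run_pipeline(
    observation={"raw_depth": None},
    reconstruct_depth=lambda obs: "reconstructed-depth",
    predict_poses=lambda d: ([{"pose": "T1", "score": 0.4},
                              {"pose": "T2", "score": 0.9}], "pull-open"),
    plan_and_execute=lambda pose, direction: (pose["pose"], direction),
)
# result == ("T2", "pull-open")
```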

Evaluation Metrics. We evaluate the estimated disparity and depth using the following metrics:
• EPE: mean absolute difference between the ground-truth and estimated disparity maps across all pixels.
• RMSE: root mean square of depth errors across all pixels.
• MAE: mean absolute depth error across all pixels.
• REL: mean relative depth error across all pixels.
• δ_i: percentage of pixels satisfying max(d/d̂, d̂/d) < δ_i, where d denotes the estimated depth and d̂ denotes the ground truth.

TABLE I
QUANTITATIVE RESULTS FOR DEPTH ESTIMATION IN SIMULATION

Methods    EPE ↓  RMSE ↓  REL ↓  MAE ↓  δ1.05 ↑  δ1.10 ↑  δ1.25 ↑
SGM [44]   6.82   1.623   0.561  0.794  34.71    38.94    46.27
RS [45]    5.28   1.497   0.506  0.618  36.82    41.05    49.92
DR [14]    2.82   0.732   0.268  0.317  46.22    67.62    83.09
RS* [45]   2.79   0.798   0.247  0.309  52.83    68.30    80.15
Ours*      0.69   0.225   0.041  0.050  86.22    93.45    97.41

* indicates that the method is trained on the GAPartManip dataset.

Fig. 5. Qualitative Results for Depth Estimation in the Real World. Our refined depth is more robust for transparent and translucent lids and small handles compared to RAFT-Stereo. Zoom in to better observe small parts like handles and knobs.

Results and Analysis. The quantitative results in simulation are presented in Tab. I. They indicate that the traditional stereo matching algorithm, SGM, struggles in scenes containing articulated objects with challenging material characteristics; the same observation applies to the pre-trained RAFT-Stereo. Meanwhile, the pre-trained D3RoMa models demonstrate reasonably good stereo depth estimation capabilities in the experiments. However, both RAFT-Stereo and D3RoMa are significantly enhanced when fine-tuned on GAPartManip. Specifically, RAFT-Stereo achieves a 150% improvement in MAE compared to its pre-trained version, while our model exhibits a 600% improvement, achieving the best performance in simulation. As illustrated in Fig. 5, the fine-tuned models also demonstrate strong depth estimation performance in real-world scenarios. In particular, in real-world environments with challenging materials, as shown in the first three rows of the figure, our model significantly outperforms the fine-tuned RAFT-Stereo and the raw depth, exhibiting noticeably better robustness. Both the simulation and real-world experiments demonstrate the effectiveness of our proposed GAPartManip in substantially improving depth estimation for articulated objects with challenging materials.

B. Actionable Pose Prediction Experiments

In this section, we evaluate the impact of our dataset on improving methods for articulated object actionable pose estimation.

Data Preparation. We split the dataset into training and testing sets using an approximate 7:3 ratio. Specifically, we further divide the test sets into 3 categories: seen instances, unseen but similar instances, and novel instances. We compare our methods with the following baselines:
• GSNet (GS) [47] is a grasping pose prediction model trained on the GraspNet-1Billion [34] dataset for rigid objects. We evaluate both the pre-trained model and the fine-tuned model separately.
• Where2Act (WA) [2] is an affordance-based method for interacting with articulated objects. Unlike the original approach, we do not train a separate network for each task. As Where2Act cannot generate stable grasping poses, we integrate GSNet, as referenced in [12], to enhance Where2Act's capabilities to match our experimental setting.
• EconomicGrasp (EG) [42] is also a pose prediction method for rigid objects, which includes an interactive grasp head and composite score estimation to enhance the precision of specific grasps.
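The split protocol used in both experiments, where every category appears on both sides but each rendered object is confined to exactly one side, can be sketched as follows; the category names, instance IDs, and ratio here are illustrative only.

```python
import random
from collections import defaultdict

def split_by_instance(instances, train_ratio=0.7, seed=0):
    """Split (category, instance_id) pairs so every instance lands in exactly
    one side, while every category appears on both sides when possible."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for cat, inst in instances:
        by_cat[cat].append(inst)
    train, test = [], []
    for cat, insts in by_cat.items():
        rng.shuffle(insts)
        k = max(1, int(len(insts) * train_ratio))
        if k == len(insts) and len(insts) > 1:
            k -= 1  # keep at least one test instance per category
        train += [(cat, i) for i in insts[:k]]
        test += [(cat, i) for i in insts[k:]]
    return train, test

# Illustrative instance lists for two categories.
items = [("Microwave", f"m{i}") for i in range(10)] + [("Box", f"b{i}") for i in range(4)]
train, test = split_by_instance(items)
overlap = set(train) & set(test)  # empty: no instance appears on both sides
```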
Fig. 6. Qualitative comparison of actionable pose prediction on synthetic data. Each sample shows the RGB input alongside predictions from the pre-trained EconomicGrasp and from our method.

Fig. 7. Qualitative Results for Real-world Manipulation. The top-15 scored actionable poses are displayed, with the red gripper representing the top-1 pose.

Evaluation Metrics. Following [34], we utilize precision to evaluate the performance of actionable pose estimation:

    Precision_µ = n_suc_µ / n_grasp,    (3)

where Precision_µ represents the success rate (SR) of the predicted interaction poses at friction coefficient µ, n_grasp denotes the number of predicted poses, and n_suc_µ denotes the number of successful grasps predicted under µ.
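Eq. (3) is a simple ratio over the predicted poses; a minimal sketch follows, with illustrative success labels rather than real evaluation results.

```python
def precision_at_mu(successes, n_grasp):
    """Eq. (3): fraction of the n_grasp predicted poses that succeed at a given
    friction coefficient mu; `successes` holds per-pose 0/1 success labels."""
    if n_grasp == 0:
        return 0.0
    return sum(successes) / n_grasp

# 31 predicted poses, of which 19 succeed (illustrative numbers).
labels = [1] * 19 + [0] * 12
p = precision_at_mu(labels, len(labels))  # 19 / 31
```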
Results and Analysis. Our quantitative results in simulation are presented in Tab. II. Once trained on our data, both GSNet and our Part-aware EcoGrasp outperform Where2Act, possibly because Where2Act struggles with cross-category and cross-action reasoning. Part-aware EcoGrasp and the fine-tuned GSNet show a substantial improvement in precision compared to the pretrained models. It is evident that our data significantly enhances the capability of existing methods for actionable pose estimation on articulated objects. Specifically, our dataset offers strong geometric priors for parts, enabling networks to focus on actionable parts rather than non-actionable links. For instance, although the pre-trained EconomicGrasp in Fig. 6 generates a set of stable grasping poses, it cannot differentiate whether these poses act on actionable parts, meaning they may fail when interacting with articulated objects.

TABLE II
QUANTITATIVE RESULTS FOR ACTIONABLE POSE PREDICTION IN SIMULATION

            Seen                 Unseen               Novel
Method      P     P0.8  P0.4     P     P0.8  P0.4     P     P0.8  P0.4
GS [47]     13.28 11.55 6.70     17.36 15.57 9.19     9.76  8.43  5.25
EG [42]     24.72 19.65 9.97     23.91 20.29 9.90     14.56 12.02 9.23
GS* [47]    25.70 20.26 9.00     25.45 20.28 9.67     23.99 20.55 11.20
WA* [2]     14.43 12.44 6.53     11.04 7.41  2.52     4.17  1.85  0.47
Ours*       55.33 51.19 30.25    56.26 53.02 32.91    41.65 39.06 23.25

* indicates that the method is trained on the GAPartManip dataset.

C. Real-World Experiments

To validate the sim-to-real generalizability of GAPartManip, we conduct real-world experiments. We use a Franka robot arm with an Intel RealSense camera to capture depth and IR images. We compare our method with three baselines: Where2Act, AO-Grasp, and GSNet; as in Sec. V-B, we modify the Where2Act interaction pipeline to perform our tasks. The experiment covers 7 distinct instances, including StorageFurniture, Box, and Microwave, and evaluates the success rate of the top-1 interactive pose for each method across open (n=14) and close (n=17) tasks. As shown in Tab. III, the overall success rate of our method is 61.29%, showcasing not only a successful transfer to the real world but also a significant performance boost compared to the other methods.

Additionally, we perform ablation studies to assess how the different modules affect the overall pipeline performance. As shown in Fig. 7, depth cameras yield poor depth data when faced with certain materials, significantly impacting subsequent manipulation; our depth reconstruction module effectively addresses this issue by repairing the 2D depth map, thereby enhancing the performance of subsequent modules. Similarly, as shown in Fig. 7, our method tends to prioritize interactable GAParts. This part-aware capability may explain the significant performance disparities seen in Tab. III.

TABLE III
REAL-WORLD ARTICULATED OBJECT MANIPULATION RESULTS

                                Success Rate (%) ↑
Method                          Open   Close  Overall
AO-Grasp                        28.57  29.41  29.03
Where2Act                       21.42  17.64  19.35
GSNet                           42.85  23.53  32.25
Ours w/o Part-aware EcoGrasp    64.28  41.17  51.61
Ours w/o Depth Reconstruction   50.00  29.41  38.70
Ours                            64.28  58.82  61.29

VI. CONCLUSIONS

In this paper, we build a large-scale synthetic dataset for generalizable and actionable part manipulation of material-agnostic articulated objects. Our dataset is the first large-scale articulated object dataset that is diverse in instances, categories, scenes, and materials. Meanwhile, we propose an articulated object manipulation framework capable of zero-shot transfer to the real world.
We conduct experiments on the individual modules and on the overall real-world pipeline, with results indicating the competitiveness of our approach. Our dataset will be released.

REFERENCES

[1] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su, "SAPIEN: A simulated part-based interactive environment," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[2] K. Mo, L. J. Guibas, M. Mukadam, A. Gupta, and S. Tulsiani, "Where2act: From pixels to actions for articulated 3d objects," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6813–6823.
[3] R. Wu, Y. Zhao, K. Mo, Z. Guo, Y. Wang, T. Wu, Q. Fan, X. Chen, L. Guibas, and H. Dong, "Vat-mart: Learning visual action trajectory proposals for manipulating 3d articulated objects," arXiv preprint arXiv:2106.14440, 2021.
[4] Y. Wang, R. Wu, K. Mo, J. Ke, Q. Fan, L. J. Guibas, and H. Dong, "Adaafford: Learning to adapt manipulation affordance for 3d articulated objects via few-shot interactions," in European Conference on Computer Vision. Springer, 2022, pp. 90–107.
[5] Y. Zhao, R. Wu, Z. Chen, Y. Zhang, Q. Fan, K. Mo, and H. Dong, "Dualafford: Learning collaborative visual affordance for dual-gripper manipulation," arXiv preprint arXiv:2207.01971, 2022.
[6] B. Eisner, H. Zhang, and D. Held, "Flowbot3d: Learning 3d articulation flow to manipulate articulated objects," arXiv preprint arXiv:2205.04382, 2022.
[7] H. Zhang, B. Eisner, and D. Held, "Flowbot++: Learning generalized articulated objects manipulation via articulation projection," arXiv
[18] L. Yi, H. Huang, D. Liu, E. Kalogerakis, H. Su, and L. Guibas, "Deep part induction from articulated object pairs," arXiv preprint arXiv:1809.07417, 2018.
[19] C. Deng, J. Lei, W. B. Shen, K. Daniilidis, and L. J. Guibas, "Banana: Banach fixed-point network for pointcloud segmentation with inter-part equivariance," in NeurIPS, 2024.
[20] X. Li, H. Wang, L. Yi, L. J. Guibas, A. L. Abbott, and S. Song, "Category-level articulated object pose estimation," in CVPR, 2020.
[21] G. Liu, Q. Sun, H. Huang, C. Ma, Y. Guo, L. Yi, H. Huang, and R. Hu, "Semi-weakly supervised object kinematic motion prediction," in CVPR, 2023.
[22] J. Lyu, Y. Chen, T. Du, F. Zhu, H. Liu, Y. Wang, and H. Wang, "Scissorbot: Learning generalizable scissor skill for paper cutting via simulation, imitation, and sim2real," in 8th Annual Conference on Robot Learning, 2024. [Online]. Available: https://openreview.net/forum?id=PAtsxVz0ND
[23] J. Zhang, N. Gireesh, J. Wang, X. Fang, C. Xu, W. Chen, L. Dai, and H. Wang, "Gamma: Graspability-aware mobile manipulation policy learning based on online grasping pose fusion," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 1399–1405.
[24] Q. Chen, M. Memmel, A. Fang, A. Walsman, D. Fox, and A. Gupta, "Urdformer: Constructing interactive realistic scenes from real images via simulation and generative modeling," in Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023, 2023.
[25] J. Mu, W. Qiu, A. Kortylewski, A. Yuille, N. Vasconcelos, and X. Wang, "A-sdf: Learning disentangled signed distance functions for articulated shape representation," in ICCV, 2021.
[26] Z. Jiang, C.-C. Hsu, and Y. Zhu, "Ditto: Building digital twins of articulated objects from interaction," in CVPR, 2022.
[27] W.-C. Tseng, H.-J. Liao, L. Yen-Chen, and M. Sun, "Cla-nerf:
preprint arXiv:2306.12893, 2023. 1 Category-level articulated neural radiance field,” in ICRA, 2022. 2
[8] C. Zhong, Y. Zheng, Y. Zheng, H. Zhao, L. Yi, X. Mu, L. Wang, [28] R. Luo, H. Geng, C. Deng, P. Li, Z. Wang, B. Jia, L. Guibas,
P. Li, G. Zhou, C. Yang, et al., “3d implicit transporter for temporally and S. Huang, “Physpart: Physically plausible part completion for
consistent keypoint discovery,” in Proceedings of the IEEE/CVF interactable objects,” 2024. [Online]. Available: https://arxiv.org/abs/
International Conference on Computer Vision, 2023, pp. 3869–3880. 2408.13724 2
1 [29] J. Lei, C. Deng, B. Shen, L. Guibas, and K. Daniilidis, “Nap: Neural
[9] H. Geng, H. Xu, C. Zhao, C. Xu, L. Yi, S. Huang, and H. Wang, 3d articulation prior,” arXiv preprint arXiv:2305.16315, 2023. 2
“Gapartnet: Cross-category domain-generalizable object perception [30] J. Liu, H. I. I. Tam, A. Mahdavi-Amiri, and M. Savva, “Cage:
and manipulation via generalizable and actionable parts,” in Proceed- Controllable articulation generation,” in CVPR, 2024. 2
ings of the IEEE/CVF Conference on Computer Vision and Pattern [31] Y. Geng, B. An, H. Geng, Y. Chen, Y. Yang, and H. Dong, “End-
Recognition, 2023, pp. 7081–7091. 1, 2, 3, 4 to-end affordance learning for robotic manipulation,” in ICRA, 2023.
[10] H. Geng, Z. Li, Y. Geng, J. Chen, H. Dong, and H. Wang, “Partmanip: 2
Learning cross-category generalizable part manipulation policy from [32] R. Gong, J. Huang, Y. Zhao, H. Geng, X. Gao, Q. Wu, W. Ai, Z. Zhou,
point cloud observations,” in Proceedings of the IEEE/CVF Conference D. Terzopoulos, S.-C. Zhu, et al., “Arnold: A benchmark for language-
on Computer Vision and Pattern Recognition, 2023, pp. 2978–2988. grounded task learning with continuous states in realistic 3d scenes,”
1, 2 in ICCV, 2023. 2
[11] H. Geng, S. Wei, C. Deng, B. Shen, H. Wang, and L. Guibas, “Sage: [33] Y. Kuang, J. Ye, H. Geng, J. Mao, C. Deng, L. Guibas,
Bridging semantic and actionable parts for generalizable manipulation H. Wang, and Y. Wang, “Ram: Retrieval-based affordance transfer
of articulated objects,” 2024. 1, 2 for generalizable zero-shot robotic manipulation,” 2024. [Online].
[12] J. Wang, W. Liu, Q. Yu, Y. You, L. Liu, W. Wang, and C. Lu, “Rpmart: Available: https://arxiv.org/abs/2407.04689 2
Towards robust perception and manipulation for articulated objects,” [34] H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: A large-
arXiv preprint arXiv:2403.16023, 2024. 1, 2, 5 scale benchmark for general object grasping,” in Proceedings of the
[13] Y. Geng, B. An, H. Geng, Y. Chen, Y. Yang, and H. Dong, “Rlafford: IEEE/CVF Conference on Computer Vision and Pattern Recognition,
End-to-end affordance learning for robotic manipulation,” in 2023 2020, pp. 11 444–11 453. 2, 3, 4, 5, 6
IEEE International Conference on Robotics and Automation (ICRA), [35] M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox, “Contact-
2023, pp. 5880–5886. 1, 2 graspnet: Efficient 6-dof grasp generation in cluttered scenes,” 2021.
[14] S. Wei, H. Geng, J. Chen, C. Deng, W. Cui, C. Zhao, X. Fang, [Online]. Available: https://arxiv.org/abs/2103.14127 2
L. Guibas, and H. Wang, “D3roma: Disparity diffusion-based depth [36] P.-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and
sensing for material-agnostic robotic manipulation,” in 8th Annual C. Schmid, “Instruction-driven history-aware policies for robotic ma-
Conference on Robot Learning (CoRL), 2024. 1, 3, 4, 5 nipulations,” in Conference on Robot Learning. PMLR, 2023, pp.
[15] J. Shi, A. Yong, Y. Jin, D. Li, H. Niu, Z. Jin, and H. Wang, 175–187. 2
“Asgrasp: Generalizable transparent object reconstruction and 6-dof [37] W. Liu, J. Mao, J. Hsu, T. Hermans, A. Garg, and J. Wu,
grasp detection from rgb-d active stereo camera,” in 2024 IEEE “Composable part-based manipulation,” 2024. [Online]. Available:
International Conference on Robotics and Automation (ICRA). IEEE, https://arxiv.org/abs/2405.05876 2
2024, pp. 5441–5447. 1, 2 [38] S. Ling, Y. Wang, S. Wu, Y. Zhuang, T. Xu, Y. Li, C. Liu,
[16] C. P. Morlans, C. Chen, Y. Weng, M. Yi, Y. Huang, N. Heppert, and H. Dong, “Articulated object manipulation with coarse-to-fine
L. Zhou, L. Guibas, and J. Bohg, “Ao-grasp: Articulated object grasp affordance for mitigating the effect of point cloud noise,” 2024.
generation,” arXiv preprint arXiv:2310.15928, 2023. 2 [Online]. Available: https://arxiv.org/abs/2402.18699 2
[17] B. An, Y. Geng, K. Chen, X. Li, Q. Dou, and H. Dong, “Rgbmanip: [39] Y. Kuang, J. Ye, H. Geng, J. Mao, C. Deng, L. Guibas, H. Wang, and
Monocular image-based robotic manipulation through active object Y. Wang, “Ram: Retrieval-based affordance transfer for generalizable
pose estimation,” in 2024 IEEE International Conference on Robotics zero-shot robotic manipulation,” in 8th Annual Conference on Robot
and Automation (ICRA). IEEE, 2024, pp. 7748–7755. 2 Learning. 2
[40] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and
D. Fox, “Gpu-accelerated robotic simulation for distributed reinforce-
ment learning,” 2018. 3
[41] Q. Dai, J. Zhang, Q. Li, T. Wu, H. Dong, Z. Liu, P. Tan, and H. Wang,
“Domain randomization-enhanced depth simulation and restoration for
perceiving and grasping specular and transparent objects,” in European
Conference on Computer Vision (ECCV), 2022. 3
[42] X.-M. Wu, J.-F. Cai, J.-J. Jiang, D. Zheng, Y.-L. Wei, and W.-S.
Zheng, “An economic framework for 6-dof grasp detection,” 2024.
[Online]. Available: https://arxiv.org/abs/2407.08366 4, 5, 6
[43] B. Sundaralingam, S. K. S. Hari, A. Fishman, C. Garrett, K. Van Wyk,
V. Blukis, A. Millane, H. Oleynikova, A. Handa, F. Ramos, N. Ratliff,
and D. Fox, “Curobo: Parallelized collision-free robot motion gen-
eration,” in 2023 IEEE International Conference on Robotics and
Automation (ICRA), 2023, pp. 8112–8119. 4
[44] H. Hirschmuller, “Stereo processing by semiglobal matching and
mutual information,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2008. 4, 5
[45] L. Lipson, Z. Teed, and J. Deng, “Raft-stereo: Multilevel recurrent field
transforms for stereo matching,” in 2021 International Conference on
3D Vision (3DV). IEEE, 2021, pp. 218–227. 4, 5
[46] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms
for optical flow,” in Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II
16. Springer, 2020, pp. 402–419. 4
[47] C. Wang, H.-S. Fang, M. Gou, H. Fang, J. Gao, and C. Lu, “Grasp-
ness discovery in clutters for fast and accurate grasp detection,” in
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2021, pp. 15 964–15 973. 5, 6