A Large-Scale Part-Centric Dataset For Material-Agnostic Articulated Object Manipulation
…
based interactions required for flexible and adaptable manipulation. To address these challenges, we introduce a large-scale part-centric dataset for articulated object manipulation that features both photo-realistic material randomization and detailed annotations of part-oriented, scene-level actionable interaction poses. We evaluate the effectiveness of our dataset by integrating it with several state-of-the-art methods for depth estimation and interaction pose prediction. Additionally, we propose a novel modular framework that delivers superior and robust performance for generalizable articulated object manipulation. Our extensive experiments demonstrate that our dataset significantly improves the performance of depth perception and actionable interaction pose prediction in both simulation and real-world scenarios.
*Equal Contribution.
1 Institute of Automation, Chinese Academy of Sciences, 2 Beijing Academy of Artificial Intelligence, 3 CFCS, School of Computer Science, Peking University, 4 Carnegie Mellon University, 5 University of California, Berkeley, 6 Galbot.
†Corresponding to [email protected].

Fig. 1. GAPartManip. We introduce a large-scale part-centric dataset for material-agnostic articulated object manipulation. It encompasses 19 common household articulated categories, totaling 918 object instances, 240k photo-realistic rendering images, and 8 billion scene-level actionable interaction poses. GAPartManip enables robust zero-shot sim-to-real transfer for accomplishing articulated object manipulation tasks.

I. INTRODUCTION

Articulated objects are ubiquitous in daily life, ranging from tabletop items like microwaves and kitchen pots to larger items like cabinets and washing machines. Unlike simple, single-function rigid objects, articulated objects consist of multiple parts with different functions and feature varied geometric shapes and kinematic structures, making generalizable perception and manipulation of them highly non-trivial [1]. Some existing works simplify this problem by developing intermediate representations that implicitly encode similarities across different objects, such as affordance [2]–[5] and motion flow [6]–[8], thereby achieving cross-object generalization. Another line of work [9]–[11] tackles articulated object perception and manipulation with a more explicit and fundamental concept, the Generalizable and Actionable Part (GAPart), whose 7-DoF pose representation enables stronger manipulation capabilities than the value-map representation of visual affordance. However, we observe two critical limitations that impede real-world performance.

Firstly, the material of articulated objects significantly impacts the quality of point cloud data. Most existing work relies on point clouds, and these methods struggle due to the sim-to-real gap in depth estimation [9], [10], [12], [13]. Some neural stereo-matching depth reconstruction methods have been proposed and show some success on rigid objects [14], [15]. These methods use neural networks to encode the disparity between stereo infrared (IR) patterns projected by structured-light cameras. However, due to the limited diversity of existing stereo IR datasets, these methods are constrained to small rigid objects and perform poorly on large articulated objects.

Secondly, no existing method can predict stable and actionable interaction poses across categories for articulated objects. Some works employ heuristic-based strategies [9] to interact with articulated objects, but these are limited in diversity and fail to account for the geometric details necessary for robust interactions in real-world settings [3]. Grasping pose prediction methods for rigid objects can generate stable poses; however, due to the lack of data on articulated objects, it is challenging to discern whether each link can be interacted with independently, resulting in poses that are mostly non-actionable [16]. Affordance-based methods [2], [13], [17] receive widespread attention for interacting with articulated objects by generating heatmaps. However, these heatmaps are ambiguous, hard to annotate, and struggle to produce stable grasping poses [12].

In this paper, we address these limitations from a data-centric perspective. We introduce GAPartManip, a novel large-scale synthetic dataset that features two important aspects: (1) realistic, physics-based IR image rendering of various parts in diverse scenes, and (2) part-oriented actionable interaction pose annotations for a wide range of articulated objects. GAPartManip inherits 918 object instances across 19 categories from the previous GAPartNet dataset [9]. By leveraging these assets, we developed a data generation pipeline for part manipulation, producing the synthetic data needed to address the aforementioned limitations. To improve generalizability and mitigate the sim-to-real gap, we incorporate domain randomization techniques [15] during data generation, ensuring a diverse range of outputs. In total, our dataset contains approximately 14,000 scene-level samples with 8 billion part-oriented actionable pose annotations, encompassing a wide array of physical materials, object states, and camera perspectives.

Trained on the proposed dataset, we obtain a depth reconstruction network and an actionable pose prediction network that separately address the two limitations above. Moreover, we compose these two networks as modules of a novel articulated object manipulation framework. Through extensive experiments in both simulation and the real world, our method achieves state-of-the-art (SOTA) performance in both individual-module and part manipulation experiments.

To summarize, our main contributions are as follows:
• We introduce GAPartManip, a novel large-scale dataset of various articulated objects featuring realistic, physics-based rendering and diverse scene-level, part-oriented actionable interaction pose annotations.
• We propose a novel framework for articulated object manipulation and evaluate each module separately, demonstrating superior effectiveness and robustness compared to baseline methods.
• We conduct comprehensive experiments in the real world and achieve SOTA performance on articulated object manipulation tasks.

II. RELATED WORK

A. Articulated object dataset

Articulated object datasets and modeling form a crucial and longstanding research field in 3D vision and robotics, encompassing a wide range of work in perception [9], [18]–[23], generation [24]–[28], and manipulation [9]–[11], [29]–[33]. As for manipulation datasets, GAPartNet [9] annotates 6-DoF part poses to manipulate parts. GraspNet [34] and Contact-GraspNet [35] build several datasets, but these datasets all focus on rigid objects, neglecting the kinematic semantics specific to articulated objects. Where2act [2] first introduces a data generation pipeline for articulated objects, generating data by sampling successful poses in the simulator. AO-Grasp [16] leverages a curvature-based sampling method to accelerate data collection and proposes a dataset of 87k actionable poses. RPMArt [12] manually annotates affordance maps for articulated objects and provides rendered data in SAPIEN [1]. None of the current datasets provide sufficient photo-realistic rendering data to improve algorithms' perception of articulated objects across the sim-to-real gap, limiting real-world performance, especially with imperfect point clouds [10], [12]. Additionally, their data collection processes are inefficient and result in small datasets, hindering generalization to unknown objects. This work aims to create a large-scale dataset with diverse photo-realistic and actionable pose data covering all types of GAParts.

B. Articulated object manipulation

Due to their unique kinematic structures and geometric shapes, articulated objects present significant challenges in manipulation. Current methods can be broadly categorized into learning-based methods and prediction-planning methods. Learning-based methods, such as reinforcement learning [10], [31] and imitation learning [32], [36], require a large amount of high-quality robot demonstrations. However, collecting such data is both impractical and time-consuming, and their sim-to-real performance heavily relies on the simulator. Current prediction-planning methods [2], [9], [11], [37]–[39] focus on visual affordance but offer ambiguous interaction poses and struggle to generalize due to limited data. They rely on 3D point clouds, ignoring the impact of object materials; in the real world, depth cameras often miss critical parts like handles and lids, reducing sim-to-real performance.

III. GAPARTMANIP DATASET

A. Overview

We construct a large-scale dataset, GAPartManip, to address both the depth estimation and the actionable interaction pose prediction challenges in real-world articulated object manipulation from a data-centric perspective. It contains 19 common household articulated categories from GAPartNet, including Box, Bucket, CoffeeMachine, Dishwasher, Door, KitchenPot, Laptop, Microwave, Oven, Printer, Refrigerator, Safe, StorageFurniture, Suitcase, Table, Toaster, Toilet, TrashCan, and WashingMachine, comprising a total of 918 object instances after removing problematic assets. We build a photo-realistic rendering pipeline for each asset in indoor scenes, rendering RGB images, IR images, depth maps, and part-level segmentations. Additionally, we create high-quality, physically plausible interaction pose annotations for each part of the articulated objects. We then leverage our GPU-accelerated scene-level pose annotation pipeline to generate dense, part-oriented actionable interaction pose annotations for each data sample. Our dataset contains over 8 billion actionable poses across 241,680 data samples. Fig. 2
Fig. 2. Data Examples in GAPartManip. GAPartManip is a novel large-scale synthetic dataset for articulated objects, featuring two important aspects:
(1) realistic, physics-based IR rendering for various object materials in diverse scenes, and (2) part-oriented actionable interaction pose annotations for a
wide range of articulated objects. Each column shows a data sample. From top to bottom, each column displays the RGB image, the IR image (only the
left IR image is shown here), and the scene-level actionable interaction pose annotations.
shows examples of data samples from our dataset. Our whole data generation pipeline is illustrated in Fig. 3.

Fig. 3. Dataset Generation Pipeline. For scene-level data sample rendering, we input the object asset into our photo-realistic rendering pipeline, generating one RGB image and two IR images (left and right) for each camera perspective. For pose annotation, we begin by performing mesh fusion on each GAPart of the object to establish a one-to-one correspondence between GAParts and meshes. Then, we use FPS to obtain the point cloud for each GAPart, enabling part-level stable interaction pose annotation. These poses are further utilized for scene-level actionable interaction pose annotation for each rendered data sample.

B. Photo-realistic Scene-level Rendering

Our photo-realistic rendering pipeline is built upon NVIDIA Isaac Sim [40]. Specifically, we simulate the RGB and IR imaging process of the Intel RealSense D415, a structured-light camera widely used for real-world depth estimation in previous research. We replicate the layout of the D415 imaging system, which consists of four hardware modules: an IR projector, an RGB camera, and two infrared (IR) cameras. We also project a pattern similar to the D415's onto the scenes.

Inspired by previous works [14], [41], we incorporate domain randomization techniques into our rendering pipeline to mimic IR rendering under the various lighting conditions and material properties encountered in the real world. We render each object in 20 different scenes with various domain randomization settings. Concretely, we randomly vary the ambient lighting, background, and object material properties in the scene, generating more diverse data that covers a wider range of real-world imaging conditions. We further randomize the ambient light positions and intensities within each scene. More importantly, we randomize the parameters of all diffuse, transparent, specular, and metal materials of each part according to its semantics. Finally, we uniformly randomize the joint poses of the object within its joint limits in each scene during the rendering process.

We render the objects and parts from different distances: each scene is rendered with 5 object-centric camera perspectives for the whole object and 5 part-centric camera perspectives for each part. For the object-centric perspective, which places the whole object within the camera view, the camera is positioned at a latitude in [10°, 60°] and a longitude in [-60°, 60°] relative to the target object. For the part-centric perspective, which captures the more fine-grained parts, we leverage part pose annotations from GAPartNet and the current joint poses to determine the position and orientation of each part in the scene. The camera is then randomly positioned around each part, aiming directly at the part center, so that the target part occupies the primary area of the image. During this process, camera viewpoints are randomly sampled within a latitude range of [0°, 60°] and a longitude range of [-75°, 75°].

C. GPU-accelerated Scene-level Pose Annotation

a) Part-level Stable Pose Annotation: We employ a pose sampling strategy similar to GraspNet [34] to annotate dense and diverse stable interaction poses for each GAPart, based on the original semantic annotations in GAPartNet [9]. First, we perform mesh fusion for each part, merging the meshes corresponding to the same part to establish a one-to-one correspondence between parts and meshes. Then, we apply Farthest Point Sampling (FPS) to downsample the mesh of each part, resulting in N candidate points for pose sampling. For each candidate point, we uniformly generate V × A × D candidate poses, where V is the number of gripper views distributed uniformly over a spherical surface, A is the number of in-plane gripper rotations, and D is the number of gripper depths. In our case, N = 512, V = 64, A = 12, and D = 4. We follow GraspNet to calculate the pose score based on antipodal analysis.

b) Scene-level Actionable Pose Annotation: To obtain part-centric grasping poses, we first project the part-level grasping poses into the scene using the part pose annotations,
and then filter out unreasonable and unreachable poses. More concretely, we classify grasping poses that do not align with single-view partial point clouds as unreasonable, and we consider poses that cause collisions between the gripper and other objects as unreachable. We implement a CUDA-based optimization for the filtering process, which reduces the processing time from 5 minutes to less than 2 seconds per part, a nearly 150-times speed-up. As a result, the originally year-long pose annotation process can now be completed within 3 days. The actionness score $s_i^P$ and the view-wise actionness score $s_i^V$ are defined as:

$$s_i^P = \frac{1}{\sum_{j,k} A_k^{i,j}} \sum_{j,k} c_a^{i,j}\,\mathbb{1}\!\left[q_k^{i,j} > T\right] c_k^{i,j}, \tag{1}$$

where $T$ is a predefined threshold to filter out inferior-quality poses. We then train Part-aware EcoGrasp [42]. Additionally, we utilize the pre-trained GAPartNet [9] to predict the motion direction, which specifies the part movement direction after grasping the actionable part.
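The V × A × D candidate grid from the part-level annotation step above can be sketched as follows. The Fibonacci-lattice view construction and the concrete depth range are assumptions for illustration; the text specifies only the grid sizes (N = 512, V = 64, A = 12, D = 4) and that views are distributed uniformly over a sphere.

```python
import numpy as np

def sphere_views(v: int) -> np.ndarray:
    """V approach directions spread roughly uniformly over the unit sphere
    (Fibonacci lattice; the paper only states the views are uniform)."""
    i = np.arange(v)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i        # golden-angle longitude
    z = 1.0 - 2.0 * (i + 0.5) / v                 # uniform latitude bands
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def candidate_pose_grid(points: np.ndarray, v: int = 64, a: int = 12, d: int = 4):
    """Enumerate the V x A x D (view, in-plane rotation, depth) combinations
    for each of the N candidate points, mirroring the grid in Sec. III-C.
    The 1-4 cm depth values are assumed, not from the paper."""
    views = sphere_views(v)                                    # (V, 3)
    angles = np.linspace(0.0, 2.0 * np.pi, a, endpoint=False)  # (A,)
    depths = np.linspace(0.01, 0.04, d)                        # (D,) meters
    n = len(points)
    # Cartesian product via index grids -> flat arrays of pose parameters.
    pi, vi, ai, di = np.meshgrid(np.arange(n), np.arange(v),
                                 np.arange(a), np.arange(d), indexing="ij")
    return {
        "point": points[pi.ravel()],   # grasp center candidate (from FPS)
        "view": views[vi.ravel()],     # gripper approach direction
        "angle": angles[ai.ravel()],   # in-plane gripper rotation
        "depth": depths[di.ravel()],   # gripper depth along the view
    }

pts = np.random.default_rng(0).normal(size=(512, 3))  # stand-in for N = 512 FPS points
grid = candidate_pose_grid(pts)
print(len(grid["view"]))  # 512 * 64 * 12 * 4 = 1572864 candidate poses
```

Each candidate would then be scored by antipodal analysis and filtered at scene level, as described above.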
Fig. 6. Qualitative comparison of actionable pose prediction on
synthetic data.
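The introduction describes composing the depth reconstruction network and the actionable pose prediction network as modules of one manipulation framework, with cuRobo used for motion generation. A minimal sketch of that composition follows; all interfaces and field names here are hypothetical stand-ins, not the paper's actual API.

```python
# High-level sketch of the modular pipeline (assumed interfaces):
# raw sensor data -> depth reconstruction -> 7-DoF actionable pose
# prediction + motion direction -> collision-free motion planning.

def manipulate(rgb, ir_left, ir_right, raw_depth,
               depth_estimator, pose_predictor, part_detector, planner):
    """Compose the two learned modules with a motion planner.

    depth_estimator: stereo-IR depth reconstruction network;
    pose_predictor: scene-level 7-DoF actionable pose network;
    part_detector: supplies the part motion direction (the paper uses
    pre-trained GAPartNet); planner: collision-free motion planner
    (the paper uses cuRobo). All signatures are illustrative."""
    depth = depth_estimator(ir_left, ir_right, raw_depth)  # restored depth map
    poses = pose_predictor(rgb, depth)                     # scored 7-DoF grasps
    direction = part_detector(rgb, depth)                  # post-grasp motion dir
    best = max(poses, key=lambda p: p["score"])            # top-scoring pose
    return planner(best["pose"], direction)                # joint trajectory

# Demo with stand-in modules:
poses = [{"score": 0.3, "pose": "p_low"}, {"score": 0.8, "pose": "p_high"}]
traj = manipulate(
    rgb=None, ir_left=None, ir_right=None, raw_depth=None,
    depth_estimator=lambda l, r, d: "depth",
    pose_predictor=lambda rgb, d: poses,
    part_detector=lambda rgb, d: "pull-outward",
    planner=lambda pose, direction: (pose, direction),
)
print(traj)  # ('p_high', 'pull-outward')
```

The modular structure is what enables evaluating each component separately, as the experiments do.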
$$\mathrm{Precision}_{\mu} = n_{\mathrm{suc}}^{\mu} / n_{\mathrm{grasp}} \tag{3}$$
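Eq. (3) is a simple success ratio; a minimal sketch follows, treating $\mu$ as a score threshold a successful grasp must clear (interpreting $\mu$ in the GraspNet style is an assumption, since the surrounding metric definition is not preserved in this excerpt).

```python
def precision_mu(scores, successes, mu):
    """Precision@mu: among all executed grasps (n_grasp), the fraction
    that succeed under threshold mu (n_suc_mu / n_grasp). The exact
    meaning of mu in Eq. (3) is assumed, not taken from the paper."""
    n_grasp = len(scores)
    n_suc = sum(1 for s, ok in zip(scores, successes) if ok and s >= mu)
    return n_suc / n_grasp if n_grasp else 0.0

# 4 grasps, 3 succeed, but only 2 of the successes clear mu = 0.8:
print(precision_mu([0.9, 0.85, 0.5, 0.95], [True, True, True, False], 0.8))  # 0.5
```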
C. Real-World Experiment

To validate the sim-to-real generalizability of GAPartManip, we conducted real-world experiments. We use a Franka robot arm with an Intel RealSense camera to capture depth and IR images. We compare our method with three baselines: Where2act, AO-Grasp, and GSNet; as in Sec. V-B, we modified the Where2act interaction pipeline to accomplish our tasks.

VI. CONCLUSIONS

In this paper, we build a large-scale synthetic dataset for generalizable and actionable part manipulation of material-agnostic articulated objects. Our dataset is the first large-scale articulated object dataset that is diverse in instances, categories, scenes, and materials. Meanwhile, we propose an articulated object manipulation framework capable of zero-shot transfer to the real world. We conduct experiments on individual modules and overall real-world experiments, with results indicating the competitiveness of our approach. Our dataset will be released.

REFERENCES

[1] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su, "SAPIEN: A simulated part-based interactive environment," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[2] K. Mo, L. J. Guibas, M. Mukadam, A. Gupta, and S. Tulsiani, "Where2act: From pixels to actions for articulated 3d objects," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6813–6823.
[3] R. Wu, Y. Zhao, K. Mo, Z. Guo, Y. Wang, T. Wu, Q. Fan, X. Chen, L. Guibas, and H. Dong, "Vat-mart: Learning visual action trajectory proposals for manipulating 3d articulated objects," arXiv preprint arXiv:2106.14440, 2021.
[4] Y. Wang, R. Wu, K. Mo, J. Ke, Q. Fan, L. J. Guibas, and H. Dong, "Adaafford: Learning to adapt manipulation affordance for 3d articulated objects via few-shot interactions," in European Conference on Computer Vision. Springer, 2022, pp. 90–107.
[5] Y. Zhao, R. Wu, Z. Chen, Y. Zhang, Q. Fan, K. Mo, and H. Dong, "Dualafford: Learning collaborative visual affordance for dual-gripper manipulation," arXiv preprint arXiv:2207.01971, 2022.
[6] B. Eisner, H. Zhang, and D. Held, "Flowbot3d: Learning 3d articulation flow to manipulate articulated objects," arXiv preprint arXiv:2205.04382, 2022.
[7] H. Zhang, B. Eisner, and D. Held, "Flowbot++: Learning generalized articulated objects manipulation via articulation projection," arXiv preprint arXiv:2306.12893, 2023.
[8] C. Zhong, Y. Zheng, Y. Zheng, H. Zhao, L. Yi, X. Mu, L. Wang, P. Li, G. Zhou, C. Yang, et al., "3d implicit transporter for temporally consistent keypoint discovery," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3869–3880.
[9] H. Geng, H. Xu, C. Zhao, C. Xu, L. Yi, S. Huang, and H. Wang, "Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7081–7091.
[10] H. Geng, Z. Li, Y. Geng, J. Chen, H. Dong, and H. Wang, "Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2978–2988.
[11] H. Geng, S. Wei, C. Deng, B. Shen, H. Wang, and L. Guibas, "Sage: Bridging semantic and actionable parts for generalizable manipulation of articulated objects," 2024.
[12] J. Wang, W. Liu, Q. Yu, Y. You, L. Liu, W. Wang, and C. Lu, "Rpmart: Towards robust perception and manipulation for articulated objects," arXiv preprint arXiv:2403.16023, 2024.
[13] Y. Geng, B. An, H. Geng, Y. Chen, Y. Yang, and H. Dong, "Rlafford: End-to-end affordance learning for robotic manipulation," in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 5880–5886.
[14] S. Wei, H. Geng, J. Chen, C. Deng, W. Cui, C. Zhao, X. Fang, L. Guibas, and H. Wang, "D3roma: Disparity diffusion-based depth sensing for material-agnostic robotic manipulation," in 8th Annual Conference on Robot Learning (CoRL), 2024.
[15] J. Shi, A. Yong, Y. Jin, D. Li, H. Niu, Z. Jin, and H. Wang, "Asgrasp: Generalizable transparent object reconstruction and 6-dof grasp detection from rgb-d active stereo camera," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 5441–5447.
[16] C. P. Morlans, C. Chen, Y. Weng, M. Yi, Y. Huang, N. Heppert, L. Zhou, L. Guibas, and J. Bohg, "Ao-grasp: Articulated object grasp generation," arXiv preprint arXiv:2310.15928, 2023.
[17] B. An, Y. Geng, K. Chen, X. Li, Q. Dou, and H. Dong, "Rgbmanip: Monocular image-based robotic manipulation through active object pose estimation," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 7748–7755.
[18] L. Yi, H. Huang, D. Liu, E. Kalogerakis, H. Su, and L. Guibas, "Deep part induction from articulated object pairs," arXiv preprint arXiv:1809.07417, 2018.
[19] C. Deng, J. Lei, W. B. Shen, K. Daniilidis, and L. J. Guibas, "Banana: Banach fixed-point network for pointcloud segmentation with inter-part equivariance," in NeurIPS, 2024.
[20] X. Li, H. Wang, L. Yi, L. J. Guibas, A. L. Abbott, and S. Song, "Category-level articulated object pose estimation," in CVPR, 2020.
[21] G. Liu, Q. Sun, H. Huang, C. Ma, Y. Guo, L. Yi, H. Huang, and R. Hu, "Semi-weakly supervised object kinematic motion prediction," in CVPR, 2023.
[22] J. Lyu, Y. Chen, T. Du, F. Zhu, H. Liu, Y. Wang, and H. Wang, "Scissorbot: Learning generalizable scissor skill for paper cutting via simulation, imitation, and sim2real," in 8th Annual Conference on Robot Learning, 2024. [Online]. Available: https://openreview.net/forum?id=PAtsxVz0ND
[23] J. Zhang, N. Gireesh, J. Wang, X. Fang, C. Xu, W. Chen, L. Dai, and H. Wang, "Gamma: Graspability-aware mobile manipulation policy learning based on online grasping pose fusion," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 1399–1405.
[24] Q. Chen, M. Memmel, A. Fang, A. Walsman, D. Fox, and A. Gupta, "Urdformer: Constructing interactive realistic scenes from real images via simulation and generative modeling," in Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023, 2023.
[25] J. Mu, W. Qiu, A. Kortylewski, A. Yuille, N. Vasconcelos, and X. Wang, "A-sdf: Learning disentangled signed distance functions for articulated shape representation," in ICCV, 2021.
[26] Z. Jiang, C.-C. Hsu, and Y. Zhu, "Ditto: Building digital twins of articulated objects from interaction," in CVPR, 2022.
[27] W.-C. Tseng, H.-J. Liao, L. Yen-Chen, and M. Sun, "Cla-nerf: Category-level articulated neural radiance field," in ICRA, 2022.
[28] R. Luo, H. Geng, C. Deng, P. Li, Z. Wang, B. Jia, L. Guibas, and S. Huang, "Physpart: Physically plausible part completion for interactable objects," 2024. [Online]. Available: https://arxiv.org/abs/2408.13724
[29] J. Lei, C. Deng, B. Shen, L. Guibas, and K. Daniilidis, "Nap: Neural 3d articulation prior," arXiv preprint arXiv:2305.16315, 2023.
[30] J. Liu, H. I. I. Tam, A. Mahdavi-Amiri, and M. Savva, "Cage: Controllable articulation generation," in CVPR, 2024.
[31] Y. Geng, B. An, H. Geng, Y. Chen, Y. Yang, and H. Dong, "End-to-end affordance learning for robotic manipulation," in ICRA, 2023.
[32] R. Gong, J. Huang, Y. Zhao, H. Geng, X. Gao, Q. Wu, W. Ai, Z. Zhou, D. Terzopoulos, S.-C. Zhu, et al., "Arnold: A benchmark for language-grounded task learning with continuous states in realistic 3d scenes," in ICCV, 2023.
[33] Y. Kuang, J. Ye, H. Geng, J. Mao, C. Deng, L. Guibas, H. Wang, and Y. Wang, "Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation," 2024. [Online]. Available: https://arxiv.org/abs/2407.04689
[34] H.-S. Fang, C. Wang, M. Gou, and C. Lu, "Graspnet-1billion: A large-scale benchmark for general object grasping," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11444–11453.
[35] M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox, "Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes," 2021. [Online]. Available: https://arxiv.org/abs/2103.14127
[36] P.-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and C. Schmid, "Instruction-driven history-aware policies for robotic manipulations," in Conference on Robot Learning. PMLR, 2023, pp. 175–187.
[37] W. Liu, J. Mao, J. Hsu, T. Hermans, A. Garg, and J. Wu, "Composable part-based manipulation," 2024. [Online]. Available: https://arxiv.org/abs/2405.05876
[38] S. Ling, Y. Wang, S. Wu, Y. Zhuang, T. Xu, Y. Li, C. Liu, and H. Dong, "Articulated object manipulation with coarse-to-fine affordance for mitigating the effect of point cloud noise," 2024. [Online]. Available: https://arxiv.org/abs/2402.18699
[39] Y. Kuang, J. Ye, H. Geng, J. Mao, C. Deng, L. Guibas, H. Wang, and Y. Wang, "Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation," in 8th Annual Conference on Robot Learning.
[40] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox, "Gpu-accelerated robotic simulation for distributed reinforcement learning," 2018.
[41] Q. Dai, J. Zhang, Q. Li, T. Wu, H. Dong, Z. Liu, P. Tan, and H. Wang, "Domain randomization-enhanced depth simulation and restoration for perceiving and grasping specular and transparent objects," in European Conference on Computer Vision (ECCV), 2022.
[42] X.-M. Wu, J.-F. Cai, J.-J. Jiang, D. Zheng, Y.-L. Wei, and W.-S. Zheng, "An economic framework for 6-dof grasp detection," 2024. [Online]. Available: https://arxiv.org/abs/2407.08366
[43] B. Sundaralingam, S. K. S. Hari, A. Fishman, C. Garrett, K. Van Wyk, V. Blukis, A. Millane, H. Oleynikova, A. Handa, F. Ramos, N. Ratliff, and D. Fox, "Curobo: Parallelized collision-free robot motion generation," in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 8112–8119.
[44] H. Hirschmuller, "Stereo processing by semiglobal matching and mutual information," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2008.
[45] L. Lipson, Z. Teed, and J. Deng, "Raft-stereo: Multilevel recurrent field transforms for stereo matching," in 2021 International Conference on 3D Vision (3DV). IEEE, 2021, pp. 218–227.
[46] Z. Teed and J. Deng, "Raft: Recurrent all-pairs field transforms for optical flow," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 2020, pp. 402–419.
[47] C. Wang, H.-S. Fang, M. Gou, H. Fang, J. Gao, and C. Lu, "Graspness discovery in clutters for fast and accurate grasp detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15964–15973.