Robotic Grasping From Classical To Modern: A Survey
Hanbo Zhang, Jian Tang, Shiguang Sun and Xuguang Lan
Abstract: Robotic grasping has always been an active topic in robotics, since grasping is one of the most fundamental yet most challenging skills of robots. It demands the coordination of robotic perception, planning, and control for robustness and intelligence. However, current
solutions are still far behind humans, especially when confronting unstructured scenarios.
In this paper, we survey the advances of robotic grasping, starting from the classical formulations and solutions and moving to the modern ones. By reviewing the history of robotic grasping, we aim to provide a complete view of this community and perhaps inspire the combination and fusion of different ideas, which we believe would help explore the essence of robotic grasping problems. In detail, we first give an overview of analytic methods for robotic grasping. After that, we discuss the state-of-the-art data-driven grasping approaches that have risen in recent years. With the development of computer vision, semantic grasping is being widely investigated and can serve as the basis of intelligent manipulation and skill learning for autonomous robotic systems in the future. Therefore, we also briefly review the recent progress on this topic. Finally, we discuss the open problems and the future research directions that may be important for human-level robustness, autonomy, and intelligence of robots.
© 2022 The Author(s)
1. Introduction
With the development of robotics, robots are gradually entering our homes. Before being capable of sophisticated daily tasks, robots must first master basic skills, among which grasping is arguably the most important one. To perform robust grasping, perception, planning, and control are required simultaneously. Therefore, robotic grasping is a fundamental yet highly challenging area in robotics. Though actively investigated for several decades, robotic grasping is far from being solved, especially when the robot confronts complex and unstructured environments or is required to perform tasks with high-level semantics, which is exactly what we hope an intelligent robot helper to be capable of. Fortunately, with the development of deep learning [131], there has been remarkable progress in the last decade in learning representations of high-level semantics, such as object recognition and relationship understanding [143], natural language understanding [178], and robotic skill learning [124]. By taking advantage of this recent progress in learning, it is promising to build an intelligent robot that can perceive and understand the world as humans do, interact with humans using natural language, and accomplish grasping tasks autonomously and robustly with the abstracted semantics, serving as the basis for achieving more complicated tasks.
Therefore, in this paper, we will review the recent advances achieved in robotic grasping in the past several years, starting from the classical formulations and moving to the modern ones.1 Through this survey, we want to answer the questions listed later in this section.
1 This draft will be continuously updated. Therefore, if you find any problems with this draft, please do not hesitate to contact the first author for updates, including but not limited to: 1) other interesting works or ideas not included in this draft; 2) problems with the statements in this draft; 3) further discussion about the included works; 4) other useful suggestions.
Obviously, it is impossible to include all works related to robotic grasping in one paper. Therefore, guided by these questions, we hope to select a representative subset of this field and provide the readers with a comprehensive and well-organized overview from formulation to solutions.
In brief, researchers have explored feasible solutions to these problems for decades, resulting in an extensive set of excellent works. In the early stage, most works focused on the analytic form of grasp synthesis based on mechanics, e.g., force-closure and form-closure [19] grasp synthesis. However, such methods always rely on simplified physical models and the assumption of a fully observable environment, which can hardly be satisfied in real-world scenarios. With the rapid development of learning approaches, data-driven approaches gradually came to dominate the community, since they are simple, efficient, and free of the strong assumptions made by analytic approaches [24]. Nevertheless, data-driven methods are data-intensive, meaning that they usually require much more data for training the grasping policy, which is labor-intensive to collect. To ease the data problem, self-supervised learning and unsupervised learning have been extensively explored in recent years [15, 111], including some excellent works in robotic grasping [17, 18, 154, 188, 265]. It is also possible to train grasping policies in physical simulators and then transfer them to the real world [103, 105, 257, 258, 273, 278]. Given enough data, data-driven approaches substantially outperform the classical methods. Based on the current progress, there are several interesting questions:
• Grasping is essentially a physical action, and hence, the classical analytic methods are well motivated. There-
fore, could we take the best of both analytic and data-driven to develop robust and scalable grasping meth-
ods?
• Rapid development in computer vision reveals that large-scale learning could potentially abstract the internal
structure of complex data. Could we take advantage of the vision techniques for developing robust grasping
skills in semantic scenarios?
• The real world is full of uncertainty. Failing to handle uncertainty will severely affect the reliability and
robustness, limiting the practicality. Could we model the uncertainty from the learned models when planning
grasps for robustness?
The rest of this paper is organized as follows. In Section 2, we will formulate the grasp synthesis problem and introduce the grasp representations used by each kind of approach. In Section 3, we will review a series of representative and impactful works related to analytic grasp synthesis, along with the mechanics-based grasp quality evaluation methods, which inspired later works and formed a basis for the robotic grasping community. In Section 4, we will discuss the data-driven grasping approaches, which aim at synthesizing grasps from experiences, usually represented by a dataset. In Section 5, we will survey the object-centric grasping approaches, which usually target a certain object specified by humans, possibly through different interfaces such as a class name or a natural language command. Object-centric grasp planning in dense clutter also involves the understanding of object relationships, which is discussed in this section as well. Finally, in Section 6, we will discuss the open problems of robotic grasping that are important but still remain unsolved, and the future trends in these areas.
2. Problem Formulation
Grasp synthesis in robotics means finding the proper configuration of the robot's actuator relative to the state of the target for stable grasping. It can be formalized in different ways. Concretely, the classical formulation focuses more on mechanical properties, while modern approaches show more interest in visual properties. In this section, we will review these different formulations of grasp synthesis.
2.1. Overview
Basically, given an object represented in a 2-D or 3-D format, it is a challenging problem to find an optimal, or at least stable, grasp from infinite candidates based on geometric or physical analysis. Therefore, to develop robust grasping approaches, several questions will guide the discussion in this section:
• How can we efficiently sample high-quality grasp candidates from an infinite set?
In this section, we will discuss the first two questions, and leave the answer to the third question to the main body of this paper. Noticeably, there is another important issue: how to plan the execution of the optimized grasp given the kinematics of the actuator without collision with the environment, which is also crucial for a successful grasping trial. However, it mostly relates to motion planning algorithms [129] and goes beyond the scope of this paper.
Formally, given a set of contact points $p = \{p_i\}_{i=1}^{N}$, one wrench $\omega_i = (f_i, \tau_i)$ is imposed on each contact point accordingly, where $f_i$ is the force exerted on the object at point $p_i$ and $\tau_i$ is the torque around the surface normal. A grasp $g_j$, $j \in \{1, 2, \ldots, M\}$, is defined as $g_j = (\omega_1, \omega_2, \ldots, \omega_N)$, where all points, wrenches, and grasps are defined in the object reference frame. Obviously, if an external wrench $\omega_e$ is imposed on the object, the object can be in equilibrium only when $\omega_{g,j} = \sum_{i=1}^{N} \omega_i = -\omega_e$. This representation is widely used in the analytic methods to be introduced in Section 3 and in early data-driven approaches (e.g., [5, 60, 114]). It is scalable to grippers with different numbers of fingers, and is therefore still preferred today for grasping with dexterous hands.
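To make the contact-wrench representation concrete, the following is a minimal sketch (our own illustration, not taken from any cited work) of how a grasp defined by per-contact wrenches can be checked for equilibrium against an external wrench; the array layout and numerical tolerance are assumptions.

```python
# Minimal sketch of the contact-wrench grasp representation: a grasp is a
# tuple of wrenches, one per contact point, expressed in the object frame.
import numpy as np

def grasp_wrench(contact_wrenches):
    """Sum the per-contact wrenches w_i = (f_i, tau_i) of one grasp.
    contact_wrenches: (N, 6) array, rows are [fx, fy, fz, tx, ty, tz]."""
    return np.sum(np.asarray(contact_wrenches), axis=0)

def is_in_equilibrium(contact_wrenches, external_wrench, tol=1e-6):
    """Check omega_g = sum_i omega_i = -omega_e up to a numerical tolerance."""
    residual = grasp_wrench(contact_wrenches) + np.asarray(external_wrench)
    return np.linalg.norm(residual) < tol

# Example: two antipodal squeezing contacts whose upward force components
# cancel a gravity wrench on the object.
w1 = [ 1.0, 0.0, 4.9, 0.0, 0.0, 0.0]
w2 = [-1.0, 0.0, 4.9, 0.0, 0.0, 0.0]
gravity = [0.0, 0.0, -9.8, 0.0, 0.0, 0.0]
print(is_in_equilibrium([w1, w2], gravity))  # True
```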
Noticeably, the contact-based grasp representation relies on ideal contact models, i.e., it assumes the contact points can be exactly positioned by the robot. However, due to inherent systematic or random errors, the planned grasps can rarely be executed exactly, and such errors should certainly be taken into consideration when synthesizing robust grasps. Therefore, a more practical representation, named Independent Contact Regions (ICRs) [175], was introduced, at the cost of possibly more computation. It is defined as a set of independent regions on the object boundary such that placing one finger anywhere inside each ICR results in a force-closure grasp (please refer to Section 3), regardless of the exact position of each finger.
With the prevalence of parallel-jaw grippers, it is more convenient to use simplified representations at the cost of some scalability. Since the kinematics of a parallel-jaw gripper is simple, the contact points on a specific object are completely determined by the gripper's 6-D pose $g = (x, y, z, r_x, r_y, r_z) \in SE(3)$, including the 3-D position $(x, y, z)$ and the 3-D orientation $(r_x, r_y, r_z)$, which is a widely used grasp representation based on 3-D perception [83, 137, 162, 235]. For the convenience of computation, the SE(3) grasp representation may take different specific but equivalent forms in practice.
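As an illustration of such equivalent forms, the sketch below (an assumed example, not a method from the survey) converts between the 6-D tuple $(x, y, z, r_x, r_y, r_z)$ with XYZ Euler angles and a 4x4 homogeneous transform using SciPy; the Euler convention is an assumption, since different systems use different ones.

```python
# Two common, equivalent encodings of an SE(3) grasp:
# (position, roll-pitch-yaw) and a 4x4 homogeneous transform.
import numpy as np
from scipy.spatial.transform import Rotation as R

def grasp_to_matrix(x, y, z, rx, ry, rz):
    """Convert g = (x, y, z, rx, ry, rz) (XYZ Euler angles, radians) to a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R.from_euler("xyz", [rx, ry, rz]).as_matrix()
    T[:3, 3] = [x, y, z]
    return T

def matrix_to_grasp(T):
    """Inverse conversion back to the 6-D tuple."""
    rx, ry, rz = R.from_matrix(T[:3, :3]).as_euler("xyz")
    x, y, z = T[:3, 3]
    return x, y, z, rx, ry, rz

g = (0.1, 0.0, 0.3, 0.0, np.pi / 4, 0.0)
assert np.allclose(matrix_to_grasp(grasp_to_matrix(*g)), g)
```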
Table 2: Summary of Selected Analytic Grasp Synthesis
Recently, as 2-D vision develops rapidly, it has become feasible to directly synthesize grasps on RGB images instead of 3-D point clouds, and hence the grasp representation is further simplified. [214, 215] detected grasp points on multi-view observations and projected them back into a single 3-D grasp point. [198] and [11] used segmented grasp affordances on 2-D images to represent grasps for parallel-jaw grippers. Such a single-point representation is also widely used for suction grasps [28, 109, 151]. Later, orientation was introduced to specify the pose of the gripper [149, 244].
The point-based grasp representation cannot model the size of the gripper. Moreover, according to [110], it lacks a bounded feature area to map grasp points to robot configurations. Therefore, they presented oriented rectangles for grasps on 2-D images. The oriented rectangle has 5 dimensions, $g = (x, y, w, h, \theta)$, with $(x, y)$ denoting the center, $(w, h)$ denoting the distance between the two jaws and the size of the gripper, and $\theta$ denoting the orientation of the gripper. It is now widely used in grasp detection with image inputs.
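For illustration, a minimal sketch (not from [110]; the local axis convention is an assumption) that recovers the four image-plane corners of an oriented rectangle, e.g., for visualization or rectangle-overlap evaluation:

```python
# Recover the four corners of a 5-D oriented-rectangle grasp g = (x, y, w, h, theta).
import numpy as np

def rectangle_corners(x, y, w, h, theta):
    """Return the 4 corners of an oriented grasp rectangle in image coordinates."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Local corners: w along the closing direction, h along the jaw length (assumed convention).
    local = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                      [w / 2, h / 2], [-w / 2, h / 2]])
    return local @ rot.T + np.array([x, y])

print(rectangle_corners(100, 120, 40, 10, np.pi / 6))
```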
The possible grasps on one object or one image are infinite. Therefore, based on oriented-rectangle grasps, the pixel-level dense grasp representation was proposed [10, 74, 240, 242]. Typically, it models grasp synthesis as a segmentation problem: taking images as input, it outputs a segmented image of grasp affordance and possibly the corresponding gripper parameters. Formally, the grasp map is $G = (Q, W, H, \Theta)$, where $Q$, $W$, $H$, and $\Theta$ are all single-channel images of the same size as the input, representing the pixel-wise graspability and the corresponding gripper parameters, including width, height, and orientation. Note that not all elements in $G$ are mandatory. For example, the height map $H$ is not included in the output of the method in [74].
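A minimal sketch of how such a grasp map could be decoded at test time is given below: it simply takes the pixel with the highest graspability and reads off the gripper parameters at that pixel. This is a generic illustration, not the decoding rule of any specific cited method.

```python
# Decode the best grasp from a pixel-wise grasp map G = (Q, W, Theta).
import numpy as np

def decode_grasp_map(Q, W, Theta):
    """Q, W, Theta: HxW arrays of graspability, gripper width, and orientation."""
    v, u = np.unravel_index(np.argmax(Q), Q.shape)   # best pixel (row, col)
    return {"pixel": (u, v), "quality": Q[v, u],
            "width": W[v, u], "theta": Theta[v, u]}

H_, W_ = 8, 8
Q = np.random.rand(H_, W_)
Wmap = np.full((H_, W_), 30.0)
Theta = np.zeros((H_, W_))
print(decode_grasp_map(Q, Wmap, Theta))
```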
3. Analytic Grasp Synthesis
In this section, we will review the mainstream approaches related to analytic grasp synthesis. By reviewing both classical and modern methods, we hope to inspire researchers to harness the best of both worlds for developing advanced grasping approaches.
3.1. Overview
Analytic grasp synthesis is mostly based on mechanics. Under this setting, a grasp is generally represented by a set
of contact points and the corresponding wrenches imposed on each point [20]. Typically, there are three different
contact models:
• Frictionless point contact, meaning that only a force along the surface normal can be exerted at the contact point.
• Point contact with friction (hard-finger contact), meaning that, besides the normal force, tangential friction forces within the friction cone can also be exerted.
• Soft-finger contact, meaning that the contact part is deformable and will be an area instead of a point, and thus allows an additional torque around the surface normal.
We focus on the first two types of contact models, which are the most widely explored in robotics. For soft-finger contact models, we refer to [63] for more detailed discussions.
In detail, we first review the methods to analytically evaluate the quality of a given grasp. Even with a given evaluation metric, it is non-trivial to synthesize the optimal grasps. Typically, either heuristic or analytic methods can be applied to the computation and optimization of grasps. A summary of the included analytic grasp synthesis methods is given in Table 2.
Definition 1. (Grasp-closure) is a property of a given grasp, including form-closure for grasps with frictionless
contact points and force-closure for grasps with frictional contact points. It occurs only when the grasp could
resist any possible external disturbing wrenches.
Research on grasp-closure can be traced back to the 19th century [202], where it was proved that for a 2-D polygon, at least 4 frictionless wrenches are required for a form-closure grasp. Much later, [127] conjectured that at least 7 contact points are needed for 3-D polyhedra. Based on this analysis, [156] proved these conjectures. In particular, they showed that form-closure for any bounded 3-D object with a piecewise smooth boundary can be achieved with 12 fingers if and only if the object has no rotational symmetry, and in most cases 7 fingers are enough. Moreover, they also demonstrated that with Coulomb friction, the number of fingers required to achieve force-closure reduces to 3 and 4 for 2-D and 3-D objects respectively under certain circumstances. Later, the definitions of form-closure and force-closure were formally completed by [19]. Following that, [203] argued that the previous definitions of form-closure and force-closure (referred to as first-order closure) are not adequate and should be considered together with the mobility of the fingers. They proposed the definitions of second-order form-closure and force-closure to fix this deficiency.
Though the grasp-closure property has been well investigated, one may notice that the definition in Def. 1 is unduly strict. In practice, the forces that the actuator can impose on the object are usually limited. Therefore, a more practical metric is needed to evaluate a given grasp. One natural way to evaluate a grasp is the minimum force needed to achieve equilibrium [140, 157, 190], where the directions of the imposed forces should also be close to the surface normals for stability [89, 140]. However, such methods usually assume access to the wrenches externally exerted on the object.
To address this deficiency, [69] proposed to utilize the Grasp Wrench Space (GWS) to evaluate the quality of grasps:
Definition 2. (Grasp Wrench Space) of grasp $g_i$ is defined as the convex hull of all possible wrenches that could be imposed through the contact points $\{p_i\}_{i=1}^{N}$ of grasp $g_i$.
In particular, the minimum distance between the origin and the boundary of the GWS, called the Largest-minimum Resisted Wrench (LRW), represents the minimum external wrench that could break the stability of the object. It quickly became one of the most well-known metrics for grasp quality evaluation. Based on the GWS and LRW, [164] proposed that different distances, such as $L_1$ and $L_\infty$, could be applied to the measurement of the LRW. [163] decoupled the forces and torques of wrenches in the wrench space to avoid the balancing factor between them. [234] and [160] proposed to use the volume of the GWS instead of the distance to get rid of the dependence on the predefined reference frame on the object.
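To make the GWS/LRW idea concrete, the sketch below computes an approximate LRW (the well-known "epsilon" quality) by discretizing each friction cone into a few edge forces, building the convex hull of all resulting contact wrenches (an $L_1$-type GWS), and measuring the distance from the origin to the nearest hull facet. The cone discretization, friction coefficient, and torque scaling are assumptions, and this is an illustrative approximation rather than the exact formulation of [69] or [164].

```python
# Approximate GWS/LRW ("epsilon") grasp quality via a 6-D convex hull.
import numpy as np
from scipy.spatial import ConvexHull

def cone_edge_wrenches(p, n, mu=0.5, num_edges=8, torque_scale=1.0):
    """Unit-normal-force wrenches at contact p with inward normal n."""
    n = n / np.linalg.norm(n)
    t1 = np.cross(n, [0.0, 0.0, 1.0])
    if np.linalg.norm(t1) < 1e-8:               # normal parallel to z: pick another axis
        t1 = np.cross(n, [0.0, 1.0, 0.0])
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    wrenches = []
    for a in np.linspace(0.0, 2 * np.pi, num_edges, endpoint=False):
        f = n + mu * (np.cos(a) * t1 + np.sin(a) * t2)
        wrenches.append(np.hstack([f, torque_scale * np.cross(p, f)]))
    return wrenches

def epsilon_quality(contacts):
    """contacts: list of (position, inward normal) pairs."""
    points = np.vstack([w for p, n in contacts for w in cone_edge_wrenches(p, n)])
    hull = ConvexHull(points)                   # 6-D hull of the wrench set
    # Facet equations are [unit normal | offset] with A x + b <= 0 inside,
    # so the origin-to-facet distance is -b; the minimum over facets is the LRW.
    return np.min(-hull.equations[:, -1])

# Three equally spaced frictional contacts on the equator of a unit sphere.
angles = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
contacts = [(np.array([np.cos(a), np.sin(a), 0.0]),
             -np.array([np.cos(a), np.sin(a), 0.0])) for a in angles]
print(epsilon_quality(contacts))  # > 0 indicates (approximate) force closure
```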
There are also other metrics used for quality evaluation in analytic grasp synthesis, such as the shape [118, 183]
and volume [36,163,232] of the grasp polygon formed through all contact points, the distance between the centers
of the object and the grasp polygon [37, 54, 192], and the size or radius of ICRs [42, 191, 228]. We refer to the
review by [207] for more detailed discussions.
4. Data-Driven Grasp Synthesis
As machine learning develops rapidly, it has become promising to learn robot skills by training on large amounts of data instead of planning with object models [24, 124]. Such methods are called data-driven, since the quality and quantity of data are as essential to a good policy as the methods themselves. In this section, we will review a series of data-driven grasp synthesis approaches.
4.1. Overview
In most cases, modern grasp synthesis is based on perception, especially visual observations of the workspace. However, different from traditional visual tasks, grasp synthesis usually involves the precise perception and analysis of geometric information, and sometimes intuitive physics, especially when facing unknown objects. Compared to analytic methods, data-driven approaches substantially loosen the assumption of accessible object models: inspired by neuropsychology [107], it has been widely found that a heuristic abstraction of knowledge is enough to derive reliable robotic control signals [6, 102, 113, 141, 174, 187, 213, 229].
As its name implies, the provided data plays the role of “experiences”, driving the robot to abstract “knowledge” adaptively for skill learning. Different methods implement this abstraction in different ways. Regarding robotic grasping, there are mainly three ways:
• Imitation-based methods: Given a dataset including stable grasps (e.g., force-closure grasps) and the corresponding objects, the grasps can be transferred to similar objects by imitation. The imitative policy can be formulated through the similarity between the target and the object templates in the given dataset, or between the real robot configuration and the given grasp templates. Early works focused more on this type of method since it is more data-efficient.
• Sampling-based methods: Another way to generate grasps on objects is to sample a set of possible candidates, among which a discriminator is used to find the best one. Benefiting from the decoupling of grasp sampling and classification, this paradigm has better interpretability and scalability. Nevertheless, it relies heavily on a good grasp sampler in terms of both performance and running speed.
• End-to-end Learning: With the development of deep learning, it is possible to embed everything into one neural network model and train it end-to-end. The input could be raw observations such as information from tactile sensors or cameras, and the output is a proper grasp configuration. All steps, including grasp sampling and quality evaluation, can be adaptively tuned through updates of the trainable parameters. Such methods usually run faster than the above two types and benefit the most from large datasets.
In the following subsections, we will discuss them respectively. A selected set of imitation-based grasp synthesis methods is also summarized in Table 3.
4.2. Imitation-based Methods
4.2.1. Programming by Demonstration (PbD)
In Programming by Demonstration (PbD), successful grasping trajectories are recorded first. At test time, the robot adjusts and replays the trajectory to grasp objects. Grasp recognition is one crucial component in PbD-based grasp synthesis and has been widely investigated [5, 50, 60, 114, 283]. It assigns a specific category to a given grasp configuration from a predefined taxonomy [46, 68, 283]. Based on the recognized grasp type and the demonstration, the planner can synthesize the grasp for the robot using kinematic mapping [5], or search efficiently in a constrained grasp space [138, 139]. The demonstration can also be combined with reinforcement learning to incrementally improve the performance, achieving better adaptation to the robot [79].
Generative models are also feasible for PbD-based grasp synthesis: demonstrations are used to train a generator, and at test time the trained generator takes object features as input and outputs a distribution from which grasps can be sampled [8, 217, 233].
As machine learning develops, behavior cloning [276] and inverse reinforcement learning [1, 95, 254] have also been explored in the context of robotic grasping. Behavior cloning transforms imitation learning into supervised learning, in which the demonstrations are regarded as a labeled dataset, based on which a model is trained to map from inputs to actions. Inverse reinforcement learning is used to infer an explanatory reward from the given demonstrations; the inferred reward is then used to train a policy with reinforcement learning.
Recently, meta-learning [238] has enabled few-shot and even one-shot imitation directly from raw visual observations such as videos or images. Meta-learning is the basis of one-shot imitation learning [58, 70, 153], in which a model is first trained on some off-the-shelf data to obtain a meta policy, and during imitation learning, it is fine-tuned with the demonstration to obtain the final policy. It is even possible to transfer the given demonstration across agents with different kinds of morphology [25, 48, 263, 264]. Besides, data augmentation is also an interesting idea for achieving few-shot imitation learning [49]: the provided demonstrations are augmented by a delicately designed pipeline and used to train a neural network. However, the achieved generality is limited compared to meta-learning methods.
Table 4: Summary of Selected Sampling-based Grasp Synthesis
4.2.2. Matching of Templates (MoT)
Matching of Templates (MoT) can be classified into two subclasses: 1) Matching of Object Templates (MOOT) and 2) Matching of Shape Templates (MOST).
In MOOT, a set of object templates along with their corresponding grasps is usually predefined. When the robot meets a novel object, it looks up the template set and finds the most similar template so as to map the predefined grasps onto the target. One straightforward way of doing so is manipulation-oriented pose estimation [40, 59, 121, 155, 237]. Concretely, objects are first recognized and localized, and then predefined grasps from demonstrations can be directly projected into the reference frame of the objects for grasping [55, 167], or simulation-based grasp planners can be introduced for online grasp planning [16, 122, 161]. However, such methods can only be used for known objects, i.e., it is usually assumed that 3-D geometric models of the objects (e.g., meshes or point clouds) are available for 6-D pose estimation.
To grasp unknown objects, shape primitives were proposed, resulting in MOST. Instead of full object models, a set of primitive shapes is defined, on which grasps are predefined. The primitive set could even be infinite [189]. When grasping a novel object, a matching process is conducted between the predefined primitives and the target. Then the demonstrated grasps can be mapped and executed [45, 61, 92, 93, 96, 162], or a grasp planner can be incorporated based on the matched primitives [62, 76, 101]. Such methods can grasp objects with appearance similar to the primitives, but cannot handle objects with unknown geometric structures.
Table 5: Summary of Robotic Grasp Datasets

| Dataset | Repr. | Modality | Source | Size | Objects/Scene | Grasps/Scene |
|---|---|---|---|---|---|---|
| Cornell [110] | Rect | RGB-D | Real | 1035 | 1 | ~8 |
| Dex-Net 2.0 [150] | Rect | Depth | Sim | 6.7M | 1 | 1 |
| Dex-Net 3.0 [151] | Point | Depth | Sim | 2.8M | 1 | 1 |
| Jacquard [52] | Rect | RGB-D | Sim | 54K | 1 | ~20 |
| VMRD [270] | Rect | RGB | Real | 4.7K | ~3 | ~20 |
| [133] | - | RGB-D | Real | 800K | - | 1 |
| [243] | - | RGB-D-T | Real | 2.55K | 1 | 1 |
| GraspNet-1billion [66] | Rect + SE(3) | RGB-D | Real | 97K | ~10 | 3-9M |
| ACRONYM [64] | SE(3) | Depth | Sim | 8.8K | 1 | 2K |
| SuctionNet-1billion [28] | Point | RGB-D | Real | 97K | ~10 | 3-8M |
| REGRAD [273] | Rect + SE(3) | RGB-D | Sim | 900K | 1-20 | 1.02K |
4.3. Sampling-based Methods
Sampling-based methods decouple grasp synthesis into a grasp sampler and a quality discriminator, and benefit from the use of large-scale datasets and learning techniques. The included algorithms are summarized in Table 4.
4.3.1. Discriminator
To learn the discriminator, supervised learning is usually applied. Noticeably, learning a discriminator is similar to grasp recognition in PbD (Section 4.2.1): both are trained using labeled data and supervised learning. The essential difference is that in grasp recognition for PbD, the focus is the grasp category from a predefined taxonomy, which is important for deciding a certain grasp pattern for the robot. By contrast, the discriminator here is used to evaluate whether a grasp is good or not. Possible quality metrics can either derive from analytic methods [207] or be a simple indicator of success or failure [214, 215].
Before the prevalence of deep learning, Support Vector Machines (SVMs) [44] or probabilistic models [120] were widely used to train such a discriminator [14, 22, 23, 47, 110, 130, 186, 214–216]. Training data are either collected in the real world and manually labeled [110, 216] or automatically synthesized using physical simulators [14, 77, 186, 214, 215]. However, if synthetic data are used, there might be a reality gap when the trained models are applied in real-world scenarios due to domain shift [241]. The sizes of datasets in this period were always limited, since such models are quite data-efficient and can achieve commendable performance with a few (usually hundreds of) data points.
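As a rough illustration of this paradigm, the sketch below trains an SVM on hypothetical hand-crafted grasp features and uses it to rank sampled candidates; the features and labels are synthetic stand-ins, not the descriptors used in the cited works.

```python
# SVM-based grasp discriminator on hypothetical hand-crafted features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical training set: each row is a feature vector describing one grasp
# candidate (e.g., local depth statistics, gripper width, approach angle), and
# the label marks whether the grasp succeeded.
X_train = rng.normal(size=(200, 16))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 3] > 0).astype(int)

clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# At test time, sampled candidates are ranked by the predicted success probability.
candidates = rng.normal(size=(50, 16))
scores = clf.predict_proba(candidates)[:, 1]
best = candidates[np.argmax(scores)]
```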
As deep learning shows categorical advantages over other methods, it has recently come to dominate the learning of grasp discriminators. Nevertheless, compared to SVMs, it needs much more data to train a good model. Therefore, datasets including more and more data have been proposed to meet the demands of deep networks [26–28, 52, 64, 66, 151, 152, 243, 270, 273]. A summary of robotic grasp datasets is shown in Table 5. One could also refer to [100] for a comprehensive summary of large-scale robotic manipulation datasets. Generally speaking, deep-learning-based grasp discriminators are in essence the same as SVM-based discriminators, despite much stronger representability and performance. The main difference is that deep networks support much more complex input data modalities, such as raw 2-D images [132] and point clouds [83, 135, 137, 235].
There are also methods that do not rely on learned discriminators to evaluate the quality of grasps. In this case, a model of the target usually needs to be estimated first, including, e.g., the 3-D shape, friction, and center of mass. Such a model is not necessarily accurate in most cases. For example, PROMPT [31] only builds a 3-D particle-based model of the target from multi-view images and applies NVIDIA Flex with predefined friction to evaluate whether a grasp will succeed or not; it then chooses the best sample for real-world execution. By comparing the difference between the simulator and reality, PROMPT can update the parameters of the object models in an online and closed-loop way. Many grasp planners based on physical properties, such as [122], [161], and [16], can be introduced here to replace learning-based discriminators given the estimated object models. For example, GraspIt! [161] was widely used in early works for grasp quality evaluation given a reasonable set of grasp candidates from learned samplers.
4.3.2. Sampler
A sampler could be either data-driven or heuristic. Different data modalities and grasp representations usually
correspond to different sampling methods.
For image inputs, the sampler is used to sample points for the point-based grasp representation (Section 2.2.4), or image patches for the oriented-rectangle grasp representation (Section 2.2.5). One naive way to sample points is pixel-wise random sampling. However, it is inefficient and sometimes intractable because 1) the sample space is too large, and 2) a single point is not representative and does not include enough features to indicate the quality of a grasp. Therefore, learning is used to obtain a prior for sampling. In this case, the output of a learning-based sampler is usually a grasp affordance map, with higher values denoting more graspable areas [22, 23, 78, 135, 198, 214, 215]. Based on the affordance map, all points can be ranked and tested one by one to find the best grasp configuration. To solve the problem of representability, the sampled point can be mapped to 3-D space with camera models [22, 23, 78, 214, 215], or extended to a full grasp configuration heuristically [198]. To sample image patches for oriented-rectangle grasp synthesis, exhaustive methods such as the sliding window (SW) approach are intractable due to unacceptable time complexity. A more efficient way is to learn a patch sampler that takes the image as input, as long as the inference speed of the learned sampler is much faster than that of the discriminator [110, 132, 246]. Moreover, some heuristics can be used to further reduce the search space. For example, given a predefined patch size and the assumption that the background is a flat table, one can uniformly sample from surface normals computed from depth gradients [150, 151].
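The sketch below illustrates one simple way to obtain such normals from depth gradients; it uses the common unit-pixel-step approximation rather than the exact camera model of [150, 151].

```python
# Estimate per-pixel surface normals from depth-image gradients.
import numpy as np

def surface_normals_from_depth(depth):
    """depth: HxW array; returns per-pixel unit normals in a camera-aligned frame."""
    dz_dv, dz_du = np.gradient(depth)                  # derivatives along rows / cols
    # Common approximation with unit pixel steps: n ~ (-dz/du, -dz/dv, 1), normalized.
    normals = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    return normals / np.linalg.norm(normals, axis=2, keepdims=True)

depth = np.fromfunction(lambda v, u: 1.0 + 0.001 * u, (120, 160))  # a tilted plane
n = surface_normals_from_depth(depth)
print(n[60, 80])   # roughly (-0.001, 0, 1) after normalization
```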
For point cloud inputs, the sampler is used to sample SE(3) grasp poses (Section 2.2.3) in most cases. Different from 2-D points, 3-D points include much richer geometric information, which helps to efficiently filter out undesired regions. For example, [83] and [235] voxelized and uniformly sampled points in regions of interest, and performed a local grid search to generate a set of SE(3) grasp candidates. Finally, they filtered out the candidates causing collisions between the gripper and the scene or including no object points within the closing region of the gripper. [236] improved the grid search in this sampling method for higher efficiency, which was applied and further modified by [137]. An alternative way is to apply the Cross Entropy Method (CEM) [211], starting from a randomly sampled grasp set and finally converging to an optimal distribution over graspable poses [150, 151, 258]. Learning-based samplers are also feasible and show higher efficiency, especially in terms of speed [171, 259]. For antipodal grasps [30], a mapping between single points and grasps can be built on object meshes, which simplifies grasp sampling to point sampling [150, 198].
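For illustration, a generic CEM loop for grasp sampling might look like the sketch below, where the scoring function stands in for a learned discriminator; the Gaussian parameterization, population size, and elite fraction are assumptions, not those of the cited works.

```python
# Generic Cross Entropy Method (CEM) loop for grasp-parameter sampling.
import numpy as np

def cem_sample_grasp(score_fn, dim=6, iters=5, pop=64, elite_frac=0.1, seed=0):
    """score_fn: maps a (pop, dim) array of grasp parameters to quality scores."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        elites = samples[np.argsort(score_fn(samples))[-n_elite:]]   # top-scoring samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # refocus the Gaussian
    return mean            # converged grasp parameters

# Toy stand-in for a learned discriminator: prefers grasps near a fixed pose.
target = np.array([0.2, -0.1, 0.3, 0.0, 1.2, 0.5])
best = cem_sample_grasp(lambda g: -np.linalg.norm(g - target, axis=1))
print(best)                # close to `target`
```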
4.4. End-to-end Methods
Deep convolutional networks enable end-to-end visual perception from image inputs [91, 123, 225]. To detect grasps on images, the most straightforward way is to directly transfer object detection algorithms to the domain of grasp detection, since the two tasks share a similar basis: both are essentially a classification problem over a set of extracted proposals. From this view, end-to-end learning also shares a similar idea with sampling-based methods. The difference is that end-to-end learning integrates the sampler and the discriminator into one single model and trains them end-to-end. For example, [38, 86] transferred Faster R-CNN [201] to a two-stage grasp detection algorithm. And vice versa, grasp detection works like [199] have sometimes also inspired object detection research [200].
Table 6: Summary of Selected End-to-end Grasp Synthesis
Nevertheless, grasp detection is essentially different from object detection because 1) grasp detection relies heavily on the local geometry of grasps; 2) grasp quality is sensitive to orientations; and 3) grasping should be a closed-loop process, meaning that failures should be handled during grasping. For 1), image-based grasp detection algorithms usually take a combination of color and geometric channels as input, such as depth images [38, 126, 199, 227, 250, 274] and surface normals [180]. For 2), the orientation dimension can be discretized so that the prediction of orientation is simplified into a classification problem [38, 250]. However, discretization suffers from performance loss, especially for orientation-sensitive grasps. Therefore, oriented anchors were introduced to handle this problem [274, 280]. Besides, the Spatial Transformer Network (STN) [104] can also be used for more accurate classification of oriented grasp candidates [72, 180]. Recently, [181] proposed a rotation ensemble module to handle rotation invariance in grasp detection. For 3), reactive policies can be trained for grasping by taking raw images as inputs [12, 128, 133, 149, 265].
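A minimal sketch of the discretization idea for 2) is given below, mapping a continuous grasp angle to one of K bins and back to the bin center; the number of bins and the $[0, \pi)$ range are assumptions and not tied to any specific cited network.

```python
# Orientation discretization: treat the grasp angle as a K-way classification target.
import numpy as np

K = 18                                        # bins over [0, pi), ~10 degrees each (assumed)

def angle_to_class(theta):
    """Map a continuous angle to its bin index (the classification label)."""
    return int(np.floor((theta % np.pi) / (np.pi / K)))

def class_to_angle(k):
    """Recover an angle from a predicted bin: use the bin center."""
    return (k + 0.5) * np.pi / K

theta = 1.23
k = angle_to_class(theta)
print(k, class_to_angle(k))                   # reconstruction error <= pi / (2K)
```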
Recent developments in 3-D vision enable end-to-end learning with point clouds as inputs [134, 159, 193, 194, 279, 281]. Such methods have also been explored for grasp synthesis. In this case, a backbone is usually used to preprocess the input point clouds, including subsampling, grouping, de-noising, etc., after which a feature extractor designed with strong inductive biases is used to extract point features. The extracted features are then fed into a grasp detector to regress SE(3) grasps as well as confidence scores indicating the grasp quality at each point. The simplest framework is one-stage anchor-free grasp detection [176, 195, 231], which directly outputs results right after the feature extraction stage. To improve performance, SE(3) grasp anchors and sphere-region features instead of point features were introduced in [278]. Their method also includes a fine-tuning stage, which further improves robustness. A similar idea has also been explored by [247, 249]. Another problem is that when synthesizing grasps in point clouds, gripper models are crucial for evaluating grasp stability. Most works are designed for a specific type of gripper and can hardly generalize to other grippers. [218] and [256] proposed to encode gripper-specific features in the inputs and train gripper-specific grasp detectors. They showed that by doing so, the model can learn to adapt to different types of grippers.
Different from grasp detection, grasp map synthesis is similar to image segmentation, where the output is usually represented by a set of heat maps indicating where and how to grasp. One thing to clarify is that there is no clear boundary between grasp detection and grasp map synthesis. One can imagine that in grasp detection, the dense estimation of grasp quality (e.g., in [38, 274, 278]) for each pixel of the image features is also a kind of grasp map synthesis, which is even more representative, but with a smaller output size compared to the input. Such methods can be seen as a transition between grasp detection and grasp map synthesis.
For pixel-level grasp map synthesis, transferring segmentation algorithms is also widely explored. For example, U-Net [209] has been widely used for grasp map synthesis [29, 142, 219, 220]. Such encoder-decoder architectures are widely used to synthesize pixel-wise grasps [11, 28, 116, 239, 255]. Another similar formulation for pixel-wise grasp synthesis is called grasp manifolds, proposed by [88], defined as a closed set of points on the object representing graspable areas. Since the grasp map is more informative and provides a global grasp affordance indicating the grasp quality of the current viewpoint, it enables selection of the best view [116, 239], under the assumption that the camera is not fixed, which holds in most cases for robots. Besides, with the mobility of robots, interactions can be imposed on the workspace to actively clear the area around graspable regions [51, 142, 265] when no grasps are available. Also, as mentioned above, reactivity is needed to recover from failures, and some works have explored reactive grasping policy learning based on pixel-level grasp maps [168–170].
5. Object-Centric Grasp Synthesis
In real applications, grasping usually serves more complicated tasks requiring object-centric perception. Developments in learning methods make it possible to integrate the understanding of high-level concepts while executing grasping. In this section, we are going to review recent algorithms based on object-centric semantics.
5.1. Overview
Different object-centric semantics should be considered under different situations. In this paper, we will discuss
three types of them:
• Object-specific Grasp Synthesis: Object-specific grasp synthesis aims to retrieve and grasp objects belonging to a specific class in cluttered scenes. The target is usually specified by a class name as the condition of grasping.
• Interactive Grasp Synthesis: Interactive grasp synthesis means specifying targets using natural language, which conveys richer information about objects, including attributes and relationships with other objects. It is worth noting that interactive grasping is different from interactive perception in robotics, which in most cases means perception based on interaction with the environment [21]. We also include some works related to grasp synthesis based on interactive perception in Section 4.4.3.
• Relational Grasp Synthesis: Relational grasp synthesis is needed when grasping may have a negative effect on other objects. Planning algorithms [129] are feasible for handling such situations given environment models. With deep learning, model-free understanding of object relations has also been explored in recent years.
All types of object-centric grasping methods are built on top of robust grasp synthesis algorithms, and the difference lies in the introduction of semantics, which, to some extent, is parallel to grasp synthesis. The motivation behind this is that we want robots to understand the world as a human does, interact with humans in a natural way, and finish complicated tasks autonomously and robustly, which has been pursued for decades by almost all roboticists.
6. Open Problems
From the discussions above, it is obvious that the development of grasping is from structural to intelligent. Early
works mostly focused on mechanical analysis and optimality, which requires full models of objects, grasps, and
environments. Later, learning is applicable to relax assumptions on environments, which enables the deployment
of grasping algorithms in partially unstructured scenarios. Recently, most works are exploring grasping with high-
level visual concepts, aiming for adaptation to daily home scenarios, though currently, it is still far behind this
final goal.
There are also some interesting points which are worth noting.
Firstly, there is no doubt that mechanics is directly responsible for the stability of grasping. However, most current works focus on heuristic methods based on learning and show surprisingly good performance on grasp synthesis for unknown objects. Though already investigated by some works (e.g., [137, 150, 176]), there is still a gap between these two domains, i.e., analytic grasp synthesis and data-driven grasp synthesis. So the question is: could we take the best of both analytic and data-driven approaches to develop robust and scalable grasping methods? One way we believe promising is to learn intuitive physics [125, 204, 261] for stable grasping. It comes from the observation of how we humans grasp objects. In most cases, we implicitly and roughly infer some physical properties of objects before grasping so as to avoid failures, e.g., the friction coefficient, the center of mass, and the 3-D geometry of unseen parts, based on our knowledge base. With these rough models, we then implicitly plan a reasonable (though possibly not optimal) grasp. Such a pipeline is similar to the sampling-based methods introduced in Section 4.3, most of which only focus on local geometric information instead of object-level physical properties. Another alternative way to involve the consideration of physics is to use physical simulators [13], especially with the help of recent advances in differentiable simulators [98, 99, 248].
Secondly, scene understanding with high-level semantics is also closely related to robotic manipulation tasks; it has been actively explored in grasp synthesis and is critical for robot intelligence. Currently, the main difficulty is generalization to open-set objects and worlds. Recent progress in unsupervised representation learning shows that it is promising to learn structured representations for unknown objects and concepts [65, 196]. Therefore, the question is: could we take advantage of large-scale unsupervised representation learning for developing robust grasping skills in semantic scenarios? To grasp open-set targets, one straightforward way is to train grasp policies directly on top of large-scale pre-trained models. Alternatively, one may consider a composable approach, in which semantics is analyzed first and grasps are then synthesized in an object-centric way. To do so, a robust object-centric grasp detector is needed. Another important issue is relationship understanding with open-set objects. Humans can retrieve and grasp targets efficiently in daily scenes even with unrecognized distractors. This ability is built on top of a hierarchical understanding of semantics. For example, a command “fetch me the red bottle on the dinner table” involves a two-layer relationship “dining room - dinner table - red bottle”, where the first relationship “dining room - dinner table” is implicit and based on prior semantic knowledge. Such tasks are still challenging for robots to complete.
Finally, uncertainty is everywhere in practice. In traditional robotics, it is critical to handle uncertainty for planning and control. However, most learning-based grasping approaches simply use one-shot greedy inference models. Thus, it is meaningful to ask: could we model the uncertainty of the learned models when planning grasps for robustness? To some extent, neural networks are like sensors, providing high-level noisy semantic information for decision making. We believe that one-shot greedy policies are not the optimal way to use these observations. To consider model uncertainty, the first thing to handle is how to model reasonable uncertainty over the outputs of neural networks. Model calibration [84] is a useful tool to calibrate output uncertainties. With calibrated uncertainties, better decisions can be made to optimize final goals under given constraints (safety, success rate, etc.). Since decision making in robotics usually involves a sequential decision-making problem with partial observations, historical information is also helpful for optimizing actions. The partially observable Markov decision process (POMDP) [166] is a natural framework to consider uncertainty, long-term decision making, and partial observability in a principled way. Recent advances have also demonstrated promising results in solving large-scale POMDP problems [71, 260].
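As one concrete (and assumed, not grasp-specific) example of such calibration, the sketch below applies standard temperature scaling in the spirit of [84]: a single temperature is fitted on held-out logits of a success/failure classifier to minimize the negative log-likelihood, and the scaled probabilities can then feed a downstream planner.

```python
# Temperature scaling: calibrate a classifier's confidences on held-out logits.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature T minimizing negative log-likelihood on a held-out set."""
    nll = [-np.mean(np.log(softmax(logits / T)[np.arange(len(labels)), labels] + 1e-12))
           for T in grid]
    return grid[int(np.argmin(nll))]

# Hypothetical validation logits of a success/failure grasp classifier.
rng = np.random.default_rng(0)
logits = rng.normal(scale=4.0, size=(500, 2))          # over-confident raw logits
labels = (logits[:, 1] + rng.normal(size=500) > logits[:, 0]).astype(int)
T = fit_temperature(logits, labels)
calibrated = softmax(logits / T)                       # calibrated success probabilities
print(T)
```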
7. Conclusions
In this paper, we reviewed the history of robotic grasp synthesis approaches, including analytic methods, data-driven methods, and recent object-centric methods. Analytic methods are usually built on top of known object models and mechanical analysis, which can theoretically ensure stability, but their strong assumptions and simplifications limit their application in practical scenarios. Data-driven methods are inspired by neuropsychology and are mostly heuristic. However, they relax the assumptions made by analytic methods and hence are widely used in real-world scenarios. In particular, benefiting from recent progress in learning techniques, they achieve commendable performance in grasping tasks and are promising to play important roles in robotic autonomy. Recently, with developments in semantic vision, object-centric methods have been more and more actively investigated. Object-centric methods combine the understanding of object semantics with grasp synthesis for semantic grasping, which is closer to daily-life scenarios than to industrial applications. We believe that vision-based intuitive physics, open-set grasping with semantic representations, and planning under partial observability and uncertainty will be the future trends of robotic grasping.
Author Contributions
Hanbo Zhang finished most parts of this manuscript. Jian Tang and Shiguang Sun helped to collect and organize the literature. Xuguang Lan is the supervisor of Hanbo Zhang, Jian Tang, and Shiguang Sun; he is also the corresponding author and is responsible for all contents.
References
1. Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the
twenty-first international conference on Machine learning, page 1, 2004.
2. William Agnew, Christopher Xie, Aaron Walsman, Octavian Murad, Yubo Wang, Pedro Domingos, and Siddhartha
Srinivasa. Amodal 3d reconstruction for robotic manipulation via stability and connectivity. In Conference on Robot
Learning, pages 1498–1508. PMLR, 2021.
3. Stefan Ainetter, Christoph Böhm, Rohit Dhakate, Stephan Weiss, and Friedrich Fraundorfer. Depth-aware object seg-
mentation and grasp detection for robotic picking tasks. In The British Machine Vision Conference (BMVC), 2021.
4. Stefan Ainetter and Friedrich Fraundorfer. End-to-end trainable deep neural network for robotic grasp detection and
semantic segmentation from rgb. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages
13452–13458. IEEE, 2021.
5. Jacopo Aleotti and Stefano Caselli. Grasp recognition in virtual reality for robot pregrasp planning by demonstration. In
Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pages 2801–2806.
IEEE, 2006.
6. Peter K Allen and Ruzena Bajcsy. Object recognition using vision and touch. In Proceedings. International Joint
Conference on Artificial Intelligence, 1985.
7. Muhannad Alomari, Paul Duckworth, Majd Hawasly, David C Hogg, and Anthony G Cohn. Natural language ground-
ing and grammar induction for robotic manipulation commands. In Proceedings of the First Workshop on Language
Grounding for Robotics, pages 35–43, 2017.
8. Ermano Arruda, Claudio Zito, Mohan Sridharan, Marek Kopicki, and Jeremy L Wyatt. Generative grasp synthesis from
demonstration using parametric mixtures. arXiv preprint arXiv:1906.11548, 2019.
9. Umar Asif, Mohammed Bennamoun, and Ferdous A Sohel. Rgb-d object recognition and grasp detection using hierar-
chical cascaded forests. IEEE Transactions on Robotics, 33(3):547–564, 2017.
10. Umar Asif, Jianbin Tang, and Stefan Harrer. Graspnet: An efficient convolutional neural network for real-time grasp
detection for low-powered devices. In IJCAI, volume 7, pages 4875–4882, 2018.
11. Umar Asif, Jianbin Tang, and Stefan Harrer. Densely supervised grasp detector (dsgd). In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 33, pages 8085–8093, 2019.
12. Tim Baier-Lowenstein and Jianwei Zhang. Learning to grasp everyday objects using reinforcement-learning with
automatic value cut-off. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1551–
1556. IEEE, 2007.
13. Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understand-
ing. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.
14. Yasemin Bekiroglu, Janne Laaksonen, Jimmy Alison Jorgensen, Ville Kyrki, and Danica Kragic. Assessing grasp
stability based on learning and haptic data. IEEE Transactions on Robotics, 27(3):616–629, 2011.
15. Yoshua Bengio, Aaron C Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review
and new perspectives. CoRR, abs/1206.5538, 1:2012, 2012.
16. Dmitry Berenson, Siddhartha S Srinivasa, Dave Ferguson, Alvaro Collet, and James J Kuffner. Manipulation planning
with workspace goal regions. In 2009 IEEE International Conference on Robotics and Automation, pages 618–624.
IEEE, 2009.
17. Lars Berscheid, Pascal Meißner, and Torsten Kröger. Self-supervised learning for precise pick-and-place without object
model. IEEE Robotics and Automation Letters, 5(3):4828–4835, 2020.
18. Lars Berscheid, Thomas Rühr, and Torsten Kröger. Improving data efficiency of self-supervised learning for robotic
grasping. In 2019 International Conference on Robotics and Automation (ICRA), pages 2125–2131. IEEE, 2019.
19. Antonio Bicchi. On the closure properties of robotic grasping. The International Journal of Robotics Research,
14(4):319–334, 1995.
20. Antonio Bicchi and Vijay Kumar. Robotic grasping and contact: A review. In Proceedings 2000 ICRA. Millennium Con-
ference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065),
volume 1, pages 348–353. IEEE, 2000.
21. Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, and Gaurav S
Sukhatme. Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on
Robotics, 33(6):1273–1291, 2017.
22. Jeannette Bohg and Danica Kragic. Grasping familiar objects using shape context. In 2009 International Conference
on Advanced Robotics, pages 1–6. IEEE, 2009.
23. Jeannette Bohg and Danica Kragic. Learning grasping points with shape context. Robotics and Autonomous Systems,
58(4):362–377, 2010.
24. Jeannette Bohg, Antonio Morales, Tamim Asfour, and Danica Kragic. Data-driven grasp synthesis—a survey. IEEE
Transactions on Robotics, 30(2):289–309, 2013.
25. Alessandro Bonardi, Stephen James, and Andrew J Davison. Learning one-shot imitation from humans without humans.
IEEE Robotics and Automation Letters, 5(2):3533–3539, 2020.
26. Ian M Bullock, Thomas Feix, and Aaron M Dollar. The yale human grasping dataset: Grasp, object, and task data in
household and machine shop environments. The International Journal of Robotics Research, 34(3):251–255, 2015.
27. Berk Calli, Arjun Singh, James Bruce, Aaron Walsman, Kurt Konolige, Siddhartha Srinivasa, Pieter Abbeel, and
Aaron M Dollar. Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics
Research, 36(3):261–268, 2017.
28. Hanwen Cao, Hao-Shu Fang, Wenhai Liu, and Cewu Lu. Suctionnet-1billion: A large-scale benchmark for suction
grasping. arXiv preprint arXiv:2103.12311, 2021.
29. Georgia Chalvatzaki, Nikolaos Gkanatsios, Petros Maragos, and Jan Peters. Orientation attentive robotic grasp synthesis
with augmented grasp map representation. arXiv preprint arXiv:2006.05123, 2020.
30. I-Ming Chen and Joel W Burdick. Finding antipodal point grasps on irregularly shaped objects. IEEE transactions on
Robotics and Automation, 9(4):507–512, 1993.
31. Siwei Chen, Xiao Ma, Yunfan Lu, and David Hsu. Ab initio particle-based object manipulation. In Robotics: Science
and Systems, 2021.
32. Xiangyu Chen, Zelin Ye, Jiankai Sun, Yuda Fan, Fang Hu, Chenxi Wang, and Cewu Lu. Transferable active grasping
and real embodied dataset. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3611–
3618. IEEE, 2020.
33. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter:
Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer,
2020.
34. Yiye Chen, Ruinian Xu, Yunzhi Lin, and Patricio A Vela. A joint network for grasp detection conditioned on natural
language commands. arXiv preprint arXiv:2104.00492, 2021.
35. Zhixin Chen, Mengxiang Lin, Zhixin Jia, and Shibo Jian. Towards generalization and data efficient learning of deep
robotic grasping. arXiv preprint arXiv:2007.00982, 2020.
36. Eris Chinellato, Robert B Fisher, Antonio Morales, and Angel P Del Pobil. Ranking planar grasp configurations for
a three-finger hand. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422),
volume 1, pages 1133–1138. IEEE, 2003.
37. Eris Chinellato, Antonio Morales, Robert B Fisher, and Angel P del Pobil. Visual quality measures for characteriz-
ing planar robot grasps. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
35(1):30–41, 2005.
38. Fu-Jen Chu, Ruinian Xu, and Patricio A Vela. Real-world multiobject, multigrasp detection. IEEE Robotics and
Automation Letters, 3(4):3355–3362, 2018.
39. Vanya Cohen, Benjamin Burchfiel, Thao Nguyen, Nakul Gopalan, Stefanie Tellex, and George Konidaris. Grounding
language attributes to objects using bayesian eigenobjects. In 2019 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 1187–1194. IEEE, 2019.
40. Alvaro Collet, Dmitry Berenson, Siddhartha S Srinivasa, and Dave Ferguson. Object recognition and full pose registra-
tion from a single image for robotic manipulation. In 2009 IEEE International Conference on Robotics and Automation,
pages 48–55. IEEE, 2009.
41. Yang Cong, Ronghan Chen, Bingtao Ma, Hongsen Liu, Dongdong Hou, and Chenguang Yang. A comprehensive study
of 3-d vision-based robot manipulation. IEEE Transactions on Cybernetics, 2021.
42. Jordi Cornella and Raúl Suárez. Determining independent grasp regions on 2d discrete objects. In 2005 IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 2941–2946. IEEE, 2005.
43. Jordi Cornella and Raúl Suárez. Fast and flexible determination of force-closure independent regions to grasp polygonal
objects. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pages 766–771. IEEE,
2005.
44. Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
45. Noel Curtis and Jing Xiao. Efficient and effective grasping of novel objects through learning and adapting a knowledge
base. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2252–2257. IEEE, 2008.
46. Mark R Cutkosky et al. On grasp choice, grasp models, and the design of hands for manufacturing tasks. IEEE
Transactions on robotics and automation, 5(3):269–279, 1989.
47. Hao Dang and Peter K Allen. Learning grasp stability. In 2012 IEEE International Conference on Robotics and
Automation, pages 2392–2397. IEEE, 2012.
48. Sudeep Dasari and Abhinav Gupta. Transformers for one-shot visual imitation. arXiv preprint arXiv:2011.05970, 2020.
49. Elias De Coninck, Tim Verbelen, Pieter Van Molle, Pieter Simoens, and Bart Dhoedt IDLab. Learning to grasp arbitrary
household objects from a single demonstration. In 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 2372–2377. IEEE, 2019.
50. Charles de Granville, Joshua Southerland, and Andrew H Fagg. Learning grasp affordances through human demonstra-
tion. In Proceedings of the International Conference on Development and Learning (ICDL’06), 2006.
51. Yuhong Deng, Xiaofeng Guo, Yixuan Wei, Kai Lu, Bin Fang, Di Guo, Huaping Liu, and Fuchun Sun. Deep reinforce-
ment learning for robotic pushing and picking in cluttered environment. In 2019 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pages 619–626. IEEE, 2019.
52. Amaury Depierre, Emmanuel Dellandréa, and Liming Chen. Jacquard: A large scale dataset for robotic grasp detection.
In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3511–3516. IEEE, 2018.
53. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
54. Dan Ding, Yun-Hui Lee, and Shuguo Wang. Computation of 3-d form-closure grasps. IEEE Transactions on Robotics
and Automation, 17(4):515–522, 2001.
55. Martin Do, Javier Romero, Hedvig Kjellström, Pedram Azad, Tamim Asfour, Danica Kragic, and Rüdiger Dillmann.
Grasp recognition and mapping on humanoid robots. In 2009 9th IEEE-RAS International Conference on Humanoid
Robots, pages 465–471. IEEE, 2009.
56. Mingshuai Dong, Shimin Wei, Jianqin Yin, and Xiuli Yu. Real-world semantic grasping detection. arXiv preprint
arXiv:2111.10522, 2021.
57. Mingshuai Dong, Shimin Wei, Xiuli Yu, and Jianqin Yin. Mask-gd segmentation based robotic grasp detection. arXiv
preprint arXiv:2101.08183, 2021.
58. Yan Duan, Marcin Andrychowicz, Bradly C Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and
Wojciech Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.
59. Staffan Ekvall, Frank Hoffmann, and Danica Kragic. Object recognition and pose estimation for robotic manipulation
using color cooccurrence histograms. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS 2003)(Cat. No. 03CH37453), volume 2, pages 1284–1289. IEEE, 2003.
60. Staffan Ekvall and Danica Kragic. Grasp recognition for programming by demonstration. In Proceedings of the 2005
IEEE International Conference on Robotics and Automation, pages 748–753. IEEE, 2005.
61. Staffan Ekvall and Danica Kragic. Learning and evaluation of the approach vector for automatic grasp generation and
planning. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 4715–4720. IEEE,
2007.
62. Sahar El-Khoury and Anis Sahbani. Handling objects by their handles. In IEEE/RSJ International Conference on
Intelligent Robots and Systems Workshop, 2008.
63. N Elango and AAM Faudzi. A review article: investigations on soft materials for soft robot manipulations. The
International Journal of Advanced Manufacturing Technology, 80(5):1027–1037, 2015.
64. Clemens Eppner, Arsalan Mousavian, and Dieter Fox. Acronym: A large-scale grasp dataset based on simulation. In
2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6222–6227. IEEE, 2021.
65. Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. Why does unsupervised pre-training help deep
learning? In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 201–
208. JMLR Workshop and Conference Proceedings, 2010.
66. Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general
object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–
11453, 2020.
67. Bernard Faverjon and Jean Ponce. On computing two-finger force-closure grasps of curved 2d objects. In Proceedings.
1991 IEEE International Conference on Robotics and Automation, pages 424–429. IEEE, 1991.
68. Thomas Feix, Roland Pawlik, Heinz-Bodo Schmiedmayer, Javier Romero, and Danica Kragic. A comprehensive grasp
taxonomy. In Robotics, science and systems: workshop on understanding the human hand for advancing robotic
manipulation, volume 2, pages 2–3. Seattle, WA, USA, 2009.
69. C Ferrari and J Canny. Planning optimal grasps. In Proceedings 1992 IEEE International Conference on Robotics and
Automation, pages 2290–2295. IEEE, 1992.
70. Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via
meta-learning. In Conference on Robot Learning, pages 357–368. PMLR, 2017.
71. Neha P Garg, David Hsu, and Wee Sun Lee. Learning to grasp under uncertainty using pomdps. In 2019 International
Conference on Robotics and Automation (ICRA), pages 2751–2757. IEEE, 2019.
72. Alexandre Gariépy, Jean-Christophe Ruel, Brahim Chaib-Draa, and Philippe Giguere. Gq-stn: Optimizing one-shot
grasp detection based on robustness classifier. In 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 3996–4003. IEEE, 2019.
73. Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás
Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems,
4:265–293, 2021.
74. Nikolaos Gkanatsios, Georgia Chalvatzaki, Petros Maragos, and Jan Peters. Orientation attentive robot grasp synthesis.
arXiv preprint arXiv:2006, 2020.
75. Jared Glover, Daniela Rus, and Nicholas Roy. Probabilistic models of object geometry for grasp planning. Proceedings
of Robotics: Science and Systems IV, Zurich, Switzerland, pages 278–285, 2008.
76. Corey Goldfeder, Peter K Allen, Claire Lackner, and Raphael Pelossof. Grasp planning via decomposition trees. In
Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 4679–4684. IEEE, 2007.
77. Corey Goldfeder, Matei Ciocarlie, Hao Dang, and Peter K Allen. The columbia grasp database. In 2009 IEEE interna-
tional conference on robotics and automation, pages 1710–1716. IEEE, 2009.
78. Minghao Gou, Hao-Shu Fang, Zhanda Zhu, Sheng Xu, Chenxi Wang, and Cewu Lu. Rgb matters: Learning 7-dof grasp
poses on monocular rgbd images. arXiv preprint arXiv:2103.02184, 2021.
79. Kathrin Gräve, Jörg Stückler, and Sven Behnke. Improving imitated grasping motions through interactive expected
deviation learning. In 2010 10th IEEE-RAS International Conference on Humanoid Robots, pages 397–404. IEEE,
2010.
80. Markus Grotz, David Sippel, and Tamim Asfour. Active vision for extraction of physically plausible support relations.
In 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), pages 439–445. IEEE, 2019.
81. Sergio Guadarrama, Lorenzo Riano, Dave Golland, Daniel Göhring, Yangqing Jia, Dan Klein, Pieter Abbeel, Trevor Dar-
rell, et al. Grounding spatial relations for human-robot interaction. In 2013 IEEE/RSJ International Conference on
Intelligent Robots and Systems, pages 1640–1647. IEEE, 2013.
82. Sergio Guadarrama, Erik Rodner, Kate Saenko, Ning Zhang, Ryan Farrell, Jeff Donahue, and Trevor Darrell. Open-
vocabulary object retrieval. In Robotics: science and systems, 2014.
83. Marcus Gualtieri, Andreas Ten Pas, Kate Saenko, and Robert Platt. High precision grasp pose detection in dense clutter.
In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 598–605. IEEE, 2016.
84. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International
Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
85. Di Guo, Tao Kong, Fuchun Sun, and Huaping Liu. Object discovery and grasp detection with a shared convolutional
neural network. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 2038–2043. IEEE,
2016.
86. Di Guo, Fuchun Sun, Tao Kong, and Huaping Liu. Deep vision networks for real-time robotic grasp detection. Inter-
national Journal of Advanced Robotic Systems, 14(1):1729881416682706, 2016.
87. Di Guo, Fuchun Sun, Huaping Liu, Tao Kong, Bin Fang, and Ning Xi. A hybrid deep architecture for robotic grasp
detection. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1609–1614. IEEE, 2017.
88. Janik Hager, Ruben Bauer, Marc Toussaint, and Jim Mainprice. Graspme-grasp manifold estimator. In 2021 30th IEEE
International Conference on Robot & Human Interactive Communication (RO-MAN), pages 626–632. IEEE, 2021.
89. Li Han, Jeffrey C Trinkle, and Zexiang X Li. Grasp analysis as linear matrix inequality problems. IEEE Transactions
on Robotics and Automation, 16(6):663–674, 2000.
90. Jun Hatori, Yuta Kikuchi, Sosuke Kobayashi, Kuniyuki Takahashi, Yuta Tsuboi, Yuya Unno, Wilson Ko, and Jethro Tan.
Interactively picking real-world objects with unconstrained spoken language instructions. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pages 3774–3781. IEEE, 2018.
91. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
92. Alexander Herzog, Peter Pastor, Mrinal Kalakrishnan, Ludovic Righetti, Tamim Asfour, and Stefan Schaal. Template-
based learning of grasp selection. In 2012 IEEE International Conference on Robotics and Automation, pages 2379–
2384. IEEE, 2012.
93. Alexander Herzog, Peter Pastor, Mrinal Kalakrishnan, Ludovic Righetti, Jeannette Bohg, Tamim Asfour, and Stefan
Schaal. Learning of grasp selection based on shape-templates. Autonomous Robots, 36(1):51–65, 2014.
94. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
95. Matthew William Horn. Quantifying grasp quality using an inverse reinforcement learning algorithm. PhD thesis,
2017.
96. Kaijen Hsiao and Tomas Lozano-Perez. Imitation learning of whole-body grasps. In 2006 IEEE/RSJ international
conference on intelligent robots and systems, pages 5657–5662. IEEE, 2006.
97. Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referen-
tial expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1115–1124, 2017.
98. Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. Diff-
taichi: Differentiable programming for physical simulation. arXiv preprint arXiv:1910.00935, 2019.
99. Yuanming Hu, Jiancheng Liu, Andrew Spielberg, Joshua B Tenenbaum, William T Freeman, Jiajun Wu, Daniela Rus,
and Wojciech Matusik. Chainqueen: A real-time differentiable physical simulator for soft robotics. In 2019 Interna-
tional conference on robotics and automation (ICRA), pages 6265–6271. IEEE, 2019.
100. Yongqiang Huang, Matteo Bianchi, Minas Liarokapis, and Yu Sun. Recent data sets on object manipulation: A survey.
Big data, 4(4):197–216, 2016.
101. Kai Huebner, Steffen Ruthotto, and Danica Kragic. Minimum volume bounding box decomposition for shape approx-
imation in robot grasping. In 2008 IEEE International Conference on Robotics and Automation, pages 1628–1633.
IEEE, 2008.
102. Thea Iberall, Joe Jackson, Liz Labbe, and Ralph Zampano. Knowledge-based prehension: Capturing human dexterity.
In Proceedings. 1988 IEEE International Conference on Robotics and Automation, pages 82–87. IEEE, 1988.
103. Shariq Iqbal, Jonathan Tremblay, Andy Campbell, Kirby Leung, Thang To, Jia Cheng, Erik Leitch, Duncan McKay,
and Stan Birchfield. Toward sim-to-real directional semantic grasping. In 2020 IEEE International Conference on
Robotics and Automation (ICRA), pages 7247–7253. IEEE, 2020.
104. Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in neural informa-
tion processing systems, 28:2017–2025, 2015.
105. Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine,
Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-
to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 12627–12637, 2019.
106. Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. End-to-end learning of seman-
tic grasping. arXiv preprint arXiv:1707.01932, 2017.
107. Marc Jeannerod. The neural and behavioural organization of goal-directed movements. Clarendon Press/Oxford
University Press, 1988.
108. Yan-Bin Jia. Computation on parametric curves with an application in grasping. The International Journal of Robotics
Research, 23(7-8):827–857, 2004.
109. Ping Jiang, Junji Oaki, Yoshiyuki Ishihara, Junichiro Ooga, Haifeng Han, Atsushi Sugahara, Seiji Tokura, Haruna Eto,
Kazuma Komoda, and Akihito Ogawa. Learning suction graspability considering grasp quality and robot reachability
for bin-picking. arXiv preprint arXiv:2111.02571, 2021.
110. Yun Jiang, Stephen Moseson, and Ashutosh Saxena. Efficient grasping from rgbd images: Learning using a new
rectangle representation. In 2011 IEEE International conference on robotics and automation, pages 3304–3311. IEEE,
2011.
111. Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE
transactions on pattern analysis and machine intelligence, 2020.
112. Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense cap-
tioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4565–4574, 2016.
113. Ishay Kamon, Tamar Flash, and Shimon Edelman. Learning to grasp using visual information. In Proceedings of IEEE
International Conference on Robotics and Automation, volume 3, pages 2470–2476. IEEE, 1996.
114. Sing Bing Kang and Katsushi Ikeuchi. Toward automatic robot instruction from perception-recognizing a grasp from
observation. IEEE Transactions on Robotics and Automation, 9(4):432–443, 1993.
115. Rainer Kartmann, Fabian Paus, Markus Grotz, and Tamim Asfour. Extraction of physically plausible support relations
to predict and validate manipulation action effects. IEEE Robotics and Automation Letters, 3(4):3991–3998, 2018.
116. Hamidreza Kasaei and Mohammadreza Kasaei. Mvgrasp: Real-time multi-view 3d object grasping in highly cluttered
environments. arXiv preprint arXiv:2103.10997, 2021.
117. Hamidreza Kasaei, Sha Luo, Remo Sasso, and Mohammadreza Kasaei. Simultaneous multi-view object recognition
and grasping in open-ended domains. arXiv preprint arXiv:2106.01866, 2021.
118. Byoung-Ho Kim, Sang-Rok Oh, Byung-Ju Yi, and Il Hong Suh. Optimal grasping based on non-dimensionalized
performance indices. In Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems.
Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No. 01CH37180), volume 2, pages 949–956.
IEEE, 2001.
119. Kilian Kleeberger, Richard Bormann, Werner Kraus, and Marco F Huber. A survey on learning-based robotic grasping.
Current Robotics Reports, pages 1–11, 2020.
120. Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
121. Danica Kragic and Henrik I Christensen. Model based techniques for robotic servoing and grasping. In IEEE/RSJ
international conference on intelligent robots and systems, volume 1, pages 299–304. IEEE, 2002.
122. Danica Kragic, Andrew T Miller, and Peter K Allen. Real-time tracking meets online grasp planning. In Proceedings
2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164), volume 3, pages
2460–2465. IEEE, 2001.
123. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural net-
works. Advances in neural information processing systems, 25:1097–1105, 2012.
124. Oliver Kroemer, Scott Niekum, and George Konidaris. A review of robot learning for manipulation: Challenges, repre-
sentations, and algorithms. Journal of Machine Learning Research, 22:30–1, 2021.
125. James R Kubricht, Keith J Holyoak, and Hongjing Lu. Intuitive physics: Current research and controversies. Trends in
cognitive sciences, 21(10):749–759, 2017.
126. Sulabh Kumra and Christopher Kanan. Robotic grasp detection using deep convolutional neural networks. In 2017
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 769–776. IEEE, 2017.
127. K Lakshminarayana. Mechanics of form closure. ASME paper, 78-DET-32, 1978.
128. Thomas Lampe and Martin Riedmiller. Acquiring visual servoing reaching and grasping skills using neural reinforce-
ment learning. In The 2013 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2013.
129. Steven M LaValle. Planning algorithms. Cambridge university press, 2006.
130. Quoc V Le, David Kamm, Arda F Kara, and Andrew Y Ng. Learning to grasp objects with multiple contact points. In
2010 IEEE International Conference on Robotics and Automation, pages 5062–5069. IEEE, 2010.
131. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
132. Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal
of Robotics Research, 34(4-5):705–724, 2015.
133. Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for
robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research,
37(4-5):421–436, 2018.
134. Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed
points. Advances in neural information processing systems, 31:820–830, 2018.
135. Yikun Li, Lambert Schomaker, and S Hamidreza Kasaei. Learning to grasp 3d objects using deep residual u-nets. In
2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pages 781–787.
IEEE, 2020.
136. Yiming Li, Tao Kong, Ruihang Chu, Yifeng Li, Peng Wang, and Lei Li. Simultaneous semantic and collision learning
for 6-dof grasp pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pages 3571–3578. IEEE, 2021.
137. Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, and Jianwei Zhang.
Pointnetgpd: Detecting grasp configurations from point sets. In 2019 International Conference on Robotics and Au-
tomation (ICRA), pages 3629–3635. IEEE, 2019.
138. Yun Lin and Yu Sun. Grasp planning based on strategy extracted from demonstration. In 2014 IEEE/RSJ International
Conference on Intelligent Robots and Systems, pages 4458–4463. IEEE, 2014.
139. Yun Lin and Yu Sun. Robot grasp planning based on demonstrated grasp strategies. The International Journal of
Robotics Research, 34(1):26–42, 2015.
140. Guanfeng Liu, Jijie Xu, Xin Wang, and Zexiang Li. On quality functions for grasp synthesis, fixture planning, and
coordinated manipulation. IEEE Transactions on Automation Science and Engineering, 1(2):146–162, 2004.
141. Huan Liu, Thea Iberall, and George A Bekey. The multi-dimensional quality of task requirements for dextrous robot
hand control. In 1989 IEEE International Conference on Robotics and Automation, pages 452–453. IEEE Computer
Society, 1989.
142. Huaping Liu, Yuan Yuan, Yuhong Deng, Xiaofeng Guo, Yixuan Wei, Kai Lu, Bin Fang, Di Guo, and Fuchun Sun.
Active affordance exploration for robot grasping. In International Conference on Intelligent Robotics and Applications,
pages 426–438. Springer, 2019.
143. Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning
for generic object detection: A survey. International journal of computer vision, 128(2):261–318, 2020.
144. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg.
Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
145. Yun-Hui Liu. Computing n-finger form-closure grasps on polygonal objects. The International journal of robotics
research, 19(2):149–158, 2000.
146. Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In
European conference on computer vision, pages 852–869. Springer, 2016.
147. Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations
for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
148. Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. In Proceedings of
Robotics: Science and Systems, 2021.
149. Jeffrey Mahler and Ken Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences.
In Conference on robot learning, pages 515–524. PMLR, 2017.
150. Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken
Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics.
arXiv preprint arXiv:1703.09312, 2017.
151. Jeffrey Mahler, Matthew Matl, Xinyu Liu, Albert Li, David Gealy, and Ken Goldberg. Dex-net 3.0: Computing robust
vacuum suction grasp targets in point clouds using a new analytic model and deep learning. In 2018 IEEE International
Conference on robotics and automation (ICRA), pages 5620–5627. IEEE, 2018.
152. Jeffrey Mahler, Florian T Pokorny, Brian Hou, Melrose Roderick, Michael Laskey, Mathieu Aubry, Kai Kohlhoff,
Torsten Kröger, James Kuffner, and Ken Goldberg. Dex-net 1.0: A cloud-based network of 3d objects for robust grasp
planning using a multi-armed bandit model with correlated rewards. In 2016 IEEE international conference on robotics
and automation (ICRA), pages 1957–1964. IEEE, 2016.
153. Zhao Mandi, Fangchen Liu, Kimin Lee, and Pieter Abbeel. Towards more generalizable one-shot visual imitation
learning. arXiv preprint arXiv:2110.13423, 2021.
154. Tanis Mar, Vadim Tikhanoff, Giorgio Metta, and Lorenzo Natale. Self-supervised learning of grasp dependent tool
affordances on the icub humanoid robot. In 2015 IEEE International Conference on Robotics and Automation (ICRA),
pages 3200–3206. IEEE, 2015.
155. Eric Marchand, Hideaki Uchiyama, and Fabien Spindler. Pose estimation for augmented reality: a hands-on survey.
IEEE transactions on visualization and computer graphics, 22(12):2633–2651, 2015.
156. Xanthippi Markenscoff, Luqun Ni, and Christos H Papadimitriou. The geometry of grasping. The International Journal
of Robotics Research, 9(1):61–74, 1990.
157. Xanthippi Markenscoff and Christos H Papadimitriou. Optimum grip of a polygon. The International Journal of
Robotics Research, 8(2):17–29, 1989.
158. Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. Learning to parse natural language commands to
a robot control system. In Experimental robotics, pages 403–415. Springer, 2013.
159. Kirill Mazur and Victor Lempitsky. Cloud transformers: A universal approach to point cloud processing tasks. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10715–10724, 2021.
160. Andrew T Miller and Peter K Allen. Examples of 3d grasp quality computations. In Proceedings 1999 IEEE Interna-
tional Conference on Robotics and Automation (Cat. No. 99CH36288C), volume 2, pages 1240–1246. IEEE, 1999.
161. Andrew T Miller and Peter K Allen. Graspit! a versatile simulator for robotic grasping. IEEE Robotics & Automation
Magazine, 11(4):110–122, 2004.
162. Andrew T Miller, Steffen Knoop, Henrik I Christensen, and Peter K Allen. Automatic grasp planning using shape
primitives. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), volume 2,
pages 1824–1829. IEEE, 2003.
163. Brian Mirtich and John Canny. Easily computable optimum grasps in 2-d and 3-d. In Proceedings of the 1994 IEEE
International Conference on Robotics and Automation, pages 739–747. IEEE, 1994.
164. Bhubaneswar Mishra. Grasp metrics: Optimality and complexity. In Algorithmic Foundations of Robotics, pages
137–166. AK Peters, 1995.
165. Rasoul Mojtahedzadeh, Abdelbaki Bouguerra, Erik Schaffernicht, and Achim J Lilienthal. Support relation analysis
and decision making for safe robotic manipulation tasks. Robotics and Autonomous Systems, 71:99–117, 2015.
166. George E Monahan. State of the art—a survey of partially observable markov decision processes: theory, models, and
algorithms. Management science, 28(1):1–16, 1982.
167. A Morales, P Azad, T Asfour, D Kraft, S Knoop, R Dillmann, A Kargov, CH Pylatiuk, and S Schulz. An anthropomor-
phic grasping approach for an assistant humanoid robot. In International Symposium on Robotics (ISR), 2006.
168. Douglas Morrison, Peter Corke, and Jürgen Leitner. Closing the loop for robotic grasping: A real-time, generative grasp
synthesis approach. arXiv preprint arXiv:1804.05172, 2018.
169. Douglas Morrison, Peter Corke, and Jürgen Leitner. Multi-view picking: Next-best-view reaching for improved grasp-
ing in clutter. In 2019 International Conference on Robotics and Automation (ICRA), pages 8762–8768. IEEE, 2019.
170. Douglas Morrison, Peter Corke, and Jürgen Leitner. Learning robust, real-time, reactive robotic grasping. The Interna-
tional journal of robotics research, 39(2-3):183–201, 2020.
171. Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipu-
lation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2901–2910, 2019.
172. Adithyavairavan Murali, Arsalan Mousavian, Clemens Eppner, Chris Paxton, and Dieter Fox. 6-dof grasping for target-
driven object manipulation in clutter. In 2020 IEEE International Conference on Robotics and Automation (ICRA),
pages 6232–6238. IEEE, 2020.
173. Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression
understanding. In European Conference on Computer Vision, pages 792–807. Springer, 2016.
174. Shree K Nayar, Hiroshi Murase, and Sameer A Nene. Learning, positioning, and tracking visual appearance. In
Proceedings of the 1994 IEEE International Conference on Robotics and Automation, pages 3237–3244. IEEE, 1994.
175. Van-Duc Nguyen. Constructing force-closure grasps. The International Journal of Robotics Research, 7(3):3–16, 1988.
176. Peiyuan Ni, Wenguang Zhang, Xiaoxiao Zhu, and Qixin Cao. Pointnet++ grasping: Learning an end-to-end spatial
grasp generation algorithm from sparse point clouds. In 2020 IEEE International Conference on Robotics and Automa-
tion (ICRA), pages 3619–3625. IEEE, 2020.
177. Nattee Niparnan and Attawith Sudsang. Computing all force-closure grasps of 2d objects from contact point set. In
2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1599–1604. IEEE, 2006.
178. Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language
processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2):604–624, 2020.
179. Swagatika Panda, AH Abdul Hafez, and CV Jawahar. Learning support order for manipulation in clutter. In 2013
IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 809–815. IEEE, 2013.
180. Dongwon Park and Se Young Chun. Classification based grasp detection using spatial transformer network. arXiv
preprint arXiv:1803.01356, 2018.
181. Dongwon Park, Yonghyeok Seo, and Se Young Chun. Real-time, highly accurate robotic grasp detection using fully
convolutional neural network with rotation ensemble module. In 2020 IEEE International Conference on Robotics and
Automation (ICRA), pages 9397–9403. IEEE, 2020.
182. Dongwon Park, Yonghyeok Seo, Dongju Shin, Jaesik Choi, and Se Young Chun. A single multi-task deep neural net-
work with post-processing for object detection with reasoning and robotic grasp detection. In 2020 IEEE International
Conference on Robotics and Automation (ICRA), pages 7300–7306. IEEE, 2020.
183. Young C Park and Gregory P Starr. Grasp synthesis of polygonal objects using a three-fingered robot hand. The
International journal of robotics research, 11(3):163–184, 1992.
184. Rohan Paul, Jacob Arkin, Derya Aksaray, Nicholas Roy, and Thomas M Howard. Efficient grounding of abstract
spatial concepts for natural language interaction with robot platforms. The International Journal of Robotics Research,
37(10):1269–1299, 2018.
185. Fabian Paus and Tamim Asfour. Probabilistic representation of objects and their support relations. In International
Symposium on Experimental Robotics, pages 510–519, 2021.
186. Raphael Pelossof, Andrew Miller, Peter Allen, and Tony Jebara. An svm learning approach to robotic grasping. In
IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, volume 4, pages
3512–3518. IEEE, 2004.
187. Justus H Piater and Roderic A Grupen. Learning appearance features to support robotic manipulation. Cognitive Vision
Workshop, pages 19–20, 2001.
188. Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours.
In 2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016.
189. Florian T Pokorny, Kaiyu Hang, and Danica Kragic. Grasp moduli spaces. In Robotics: Science and Systems, 2013.
190. Nancy S Pollard. Closure and quality equivalence for efficient synthesis of grasps from examples. The International
Journal of Robotics Research, 23(6):595–613, 2004.
191. Jean Ponce and Bernard Faverjon. On computing three-finger force-closure grasps of polygonal objects. IEEE Trans-
actions on robotics and automation, 11(6):868–881, 1995.
192. Jean Ponce, Steve Sullivan, Attawith Sudsang, Jean-Daniel Boissonnat, and Jean-Pierre Merlet. On computing four-
finger equilibrium and force-closure grasps of polyhedral objects. The International Journal of Robotics Research,
16(1):11–35, 1997.
193. Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification
and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,
2017.
194. Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a
metric space. arXiv preprint arXiv:1706.02413, 2017.
195. Yuzhe Qin, Rui Chen, Hao Zhu, Meng Song, Jing Xu, and Hao Su. S4g: Amodal single-view single-shot se (3) grasp
detection in cluttered scenes. In Conference on robot learning, pages 53–65. PMLR, 2020.
196. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda
Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv
preprint arXiv:2103.00020, 2021.
197. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
198. Deepak Rao, Quoc V Le, Thanathorn Phoka, Morgan Quigley, Attawith Sudsang, and Andrew Y Ng. Grasping novel
objects with depth segmentation. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages
2578–2585. IEEE, 2010.
199. Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolutional neural networks. In 2015 IEEE
International Conference on Robotics and Automation (ICRA), pages 1316–1322. IEEE, 2015.
200. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
201. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
202. F Reuleaux. The kinematics of machinery. Macmillan and Company, 1876. Republished by Dover.
203. Elon Rimon and Joel Burdick. On force and form closure for multiple finger grasps. In Proceedings of IEEE Interna-
tional Conference on Robotics and Automation, volume 2, pages 1795–1800. IEEE, 1996.
204. Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel
Dupoux. Intphys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616,
2018.
205. Máximo A Roa and Raúl Suárez. Independent contact regions for frictional grasps on 3d objects. In 2008 IEEE
International Conference on Robotics and Automation, pages 1622–1627. IEEE, 2008.
206. Máximo A Roa and Raúl Suárez. Computation of independent contact regions for grasping 3-d objects. IEEE Transac-
tions on Robotics, 25(4):839–850, 2009.
207. Máximo A Roa and Raúl Suárez. Grasp quality measures: review and performance. Autonomous robots, 38(1):65–88,
2015.
208. Alberto Rodriguez, Matthew T Mason, and Steve Ferry. From caging to grasping. The International Journal of Robotics
Research, 31(7):886–900, 2012.
209. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmen-
tation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241.
Springer, 2015.
210. Carlos Rosales, Raúl Suárez, Marco Gabiccini, and Antonio Bicchi. On the synthesis of feasible and prehensile robotic
grasps. In 2012 IEEE International Conference on Robotics and Automation, pages 550–556. IEEE, 2012.
211. Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization,
Monte-Carlo simulation, and machine learning, volume 133. Springer, 2004.
212. Anis Sahbani, Sahar El-Khoury, and Philippe Bidaud. An overview of 3d object grasp synthesis algorithms. Robotics
and Autonomous Systems, 60(3):326–336, 2012.
213. Marcos Salganicoff, Lyle H Ungar, and Ruzena Bajcsy. Active learning for vision-based robot grasping. Machine
Learning, 23(2):251–278, 1996.
214. Ashutosh Saxena, Justin Driemeyer, Justin Kearns, and Andrew Y Ng. Robotic grasping of novel objects. In Proceed-
ings of the 19th International Conference on Neural Information Processing Systems, pages 1209–1216, 2006.
215. Ashutosh Saxena, Justin Driemeyer, and Andrew Y Ng. Robotic grasping of novel objects using vision. The Interna-
tional Journal of Robotics Research, 27(2):157–173, 2008.
216. J Schill, J Laaksonen, M Przybylski, V Kyrki, T Asfour, and R Dillmann. Learning continuous grasp stability for a
humanoid robot hand based on tactile sensing. In 2012 4th IEEE RAS & EMBS International Conference on Biomedical
Robotics and Biomechatronics (BioRob), pages 1901–1906. IEEE, 2012.
217. Alexander M Schmidts, Dongheui Lee, and Angelika Peer. Imitation learning of human grasping skills from motion
and force data. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1002–1007. IEEE,
2011.
218. Lin Shao, Fabio Ferreira, Mikael Jorda, Varun Nambiar, Jianlan Luo, Eugen Solowjow, Juan Aparicio Ojea, Oussama
Khatib, and Jeannette Bohg. Unigrasp: Learning a unified model to grasp with multifingered robotic hands. IEEE
Robotics and Automation Letters, 5(2):2286–2293, 2020.
219. Quanquan Shao and Jie Hu. Combining rgb and points to predict grasping region for robotic bin-picking. arXiv preprint
arXiv:1904.07394, 2019.
220. Quanquan Shao, Jie Hu, Weiming Wang, Yi Fang, Wenhai Liu, Jin Qi, and Jin Ma. Suction grasp region prediction
using self-supervised learning for object picking in dense clutter. In 2019 IEEE 5th International Conference on
Mechatronics System and Robots (ICMSR), pages 7–12. IEEE, 2019.
221. Karun B Shimoga. Robot grasp synthesis algorithms: A survey. The International Journal of Robotics Research,
15(3):230–266, 1996.
222. Mohit Shridhar and David Hsu. Interactive visual grounding of referring expressions for human-robot interaction. arXiv
preprint arXiv:1806.03831, 2018.
223. Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. arXiv
preprint arXiv:2109.12098, 2021.
224. Mohit Shridhar, Dixant Mittal, and David Hsu. Ingress: Interactive visual grounding of referring expressions. The
International Journal of Robotics Research, 39(2-3):217–232, 2020.
225. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014.
226. Gordon Smith, Eric Lee, Ken Goldberg, Karl Bohringer, and John Craig. Computing parallel-jaw grips. In Proceedings
1999 IEEE International Conference on Robotics and Automation (Cat. No. 99CH36288C), volume 3, pages 1897–
1903. IEEE, 1999.
227. Yanan Song, Liang Gao, Xinyu Li, and Weiming Shen. A novel robotic grasp detection method based on region
proposal networks. Robotics and Computer-Integrated Manufacturing, 65:101963, 2020.
228. Darrell Stam, Jean Ponce, and Bernard Faverjon. A system for planning and executing two-finger force-closure grasps
of curved 2d objects. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems,
volume 1, pages 210–217. IEEE, 1992.
229. S Stansfield. Visually-aided tactile exploration. In Proceedings. 1987 IEEE International Conference on Robotics and
Automation, volume 4, pages 1487–1492. IEEE, 1987.
230. Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-
linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
231. Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, and Dieter Fox. Contact-graspnet: Efficient 6-dof grasp
generation in cluttered scenes. arXiv preprint arXiv:2103.14127, 2021.
232. Tamara Supuk, Timotej Kodek, and Tadej Bajd. Estimation of hand preshaping during human grasping. Medical
engineering & physics, 27(9):790–797, 2005.
233. John D Sweeney and Rod Grupen. A model of shared grasp affordances from demonstration. In 2007 7th IEEE-RAS
International Conference on Humanoid Robots, pages 27–35. IEEE, 2007.
234. Marek Teichmann. A grasp metric invariant under rigid motions. In Proceedings of IEEE International Conference on
Robotics and Automation, volume 3, pages 2143–2148. IEEE, 1996.
235. Andreas ten Pas, Marcus Gualtieri, Kate Saenko, and Robert Platt. Grasp pose detection in point clouds. The Interna-
tional Journal of Robotics Research, 36(13-14):1455–1473, 2017.
236. Andreas Ten Pas and Robert Platt. Using geometry to detect grasp poses in 3d point clouds. In Robotics Research,
pages 307–324. Springer, 2018.
237. Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose
estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
238. Joaquin Vanschoren. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.
239. Chenxi Wang, Hao-Shu Fang, Minghao Gou, Hongjie Fang, Jin Gao, and Cewu Lu. Graspness discovery in clutters
for fast and accurate grasp detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 15964–15973, 2021.
240. Dexin Wang, Chunsheng Liu, Faliang Chang, Nanjun Li, and Guangxin Li. High-performance pixel-level grasp detec-
tion based on adaptive grasping and grasp-aware network. IEEE Transactions on Industrial Electronics, 2021.
241. Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
242. Shengfan Wang, Xin Jiang, Jie Zhao, Xiaoman Wang, Weiguo Zhou, and Yunhui Liu. Efficient fully convolution neural
network for generating pixel wise robotic grasps with high resolution images. In 2019 IEEE International Conference
on Robotics and Biomimetics (ROBIO), pages 474–480. IEEE, 2019.
243. Tao Wang, Chao Yang, Frank Kirchner, Peng Du, Fuchun Sun, and Bin Fang. Multimodal grasp data set: A novel visual–
tactile data set for robotic manipulation. International Journal of Advanced Robotic Systems, 16(1):1729881418821571,
2019.
244. Yao Wang, Yangtao Zheng, Boyang Gao, and Di Huang. Double-dot network for antipodal grasp detection. In 2021
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4654–4661. IEEE, 2021.
245. Yefei Wang, Kaili Wang, Yi Wang, Di Guo, Huaping Liu, and Fuchun Sun. Audio-visual grounding referring expression
for robotic manipulation. arXiv preprint arXiv:2109.10571, 2021.
246. Zhichao Wang, Zhiqi Li, Bin Wang, and Hong Liu. Robot grasp detection using multimodal deep convolutional neural
networks. Advances in Mechanical Engineering, 8(9):1687814016668077, 2016.
247. Wei Wei, Yongkang Luo, Fuyu Li, Guangyun Xu, Jun Zhong, Wanyi Li, and Peng Wang. Gpr: Grasp pose refinement
network for cluttered scenes. arXiv preprint arXiv:2105.08502, 2021.
248. Keenon Werling, Dalton Omens, Jeongseok Lee, Ioannis Exarchos, and C Karen Liu. Fast and feature-complete differ-
entiable physics for articulated rigid bodies with contact. arXiv preprint arXiv:2103.16021, 2021.
249. Chaozheng Wu, Jian Chen, Qiaoyu Cao, Jianchi Zhang, Yunxin Tai, Lin Sun, and Kui Jia. Grasp proposal networks:
An end-to-end solution for visual learning of robotic grasps. arXiv preprint arXiv:2009.12606, 2020.
250. Yongxiang Wu, Fuhai Zhang, and Yili Fu. Real-time robotic multi-grasp detection using anchor-free fully convolutional
grasp detector. IEEE Transactions on Industrial Electronics, 2021.
251. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey
on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
252. Yu Xiang, Christopher Xie, Arsalan Mousavian, and Dieter Fox. Learning rgb-d feature embeddings for unseen object
instance segmentation. arXiv preprint arXiv:2007.15157, 2020.
253. Christopher Xie, Yu Xiang, Arsalan Mousavian, and Dieter Fox. Unseen object instance segmentation for robotic
environments. IEEE Transactions on Robotics, 2021.
254. Xu Xie, Changyang Li, Chi Zhang, Yixin Zhu, and Song-Chun Zhu. Learning virtual grasp with failed demonstrations
via bayesian inverse reinforcement learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 1812–1817. IEEE, 2019.
255. Ruinian Xu, Fu-Jen Chu, and Patricio A Vela. Gknet: grasp keypoint network for grasp candidates detection. arXiv
preprint arXiv:2106.08497, 2021.
256. Zhenjia Xu, Beichun Qi, Shubham Agrawal, and Shuran Song. Adagrasp: Learning an adaptive gripper-aware grasping
policy. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4620–4626. IEEE, 2021.
257. Mengyuan Yan, Iuri Frosio, Stephen Tyree, and Jan Kautz. Sim-to-real transfer of accurate grasping with eye-in-hand
observations and continuous control. arXiv preprint arXiv:1712.03303, 2017.
258. Xinchen Yan, Mohi Khansari, Jasmine Hsu, Yuanzheng Gong, Yunfei Bai, Sören Pirk, and Honglak Lee. Data-efficient
learning for sim-to-real robotic grasping using deep point cloud prediction networks. arXiv preprint arXiv:1906.08989,
2019.
259. Daniel Yang, Tarik Tosun, Benjamin Eisner, Volkan Isler, and Daniel Lee. Robotic grasping through combined image-
based grasp proposal and 3d reconstruction. In 2021 IEEE International Conference on Robotics and Automation
(ICRA), pages 6350–6356. IEEE, 2021.
260. Nan Ye, Adhiraj Somani, David Hsu, and Wee Sun Lee. Despot: Online pomdp planning with regularization. Journal
of Artificial Intelligence Research, 58:231–266, 2017.
261. Tian Ye, Xiaolong Wang, James Davidson, and Abhinav Gupta. Interpretable intuitive physics model. In Proceedings
of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
262. Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention
network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 1307–1315, 2018.
263. Tianhe Yu, Pieter Abbeel, Sergey Levine, and Chelsea Finn. One-shot hierarchical imitation learning of compound
visuomotor tasks. arXiv preprint arXiv:1810.11043, 2018.
264. Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot
imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.
265. Andy Zeng, Shuran Song, Stefan Welker, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Learning synergies
between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 4238–4245. IEEE, 2018.
266. Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R Hogan, Maria Bauza, Daolin Ma, Orion Taylor,
Melody Liu, Eudald Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping
and cross-domain image matching. In 2018 IEEE international conference on robotics and automation (ICRA), pages
3750–3757. IEEE, 2018.
267. Hanbo Zhang, Xuguang Lan, Site Bai, Lipeng Wan, Chenjie Yang, and Nanning Zheng. A multi-task convolutional
neural network for autonomous robotic grasping in object stacking scenes. In 2019 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pages 6435–6442. IEEE, 2019.
268. Hanbo Zhang, Xuguang Lan, Site Bai, Xinwen Zhou, Zhiqiang Tian, and Nanning Zheng. Roi-based robotic grasp
detection for object overlapping scenes. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pages 4768–4775. IEEE, 2019.
269. Hanbo Zhang, Xuguang Lan, Xinwen Zhou, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Robotic grasping in
multi-object stacking scenes based on visual reasoning. Scientia Sinica Technologica, 48(12):1341–1356, 2018.
270. Hanbo Zhang, Xuguang Lan, Xinwen Zhou, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Visual manipulation
relationship network for autonomous robotics. In 2018 IEEE-RAS 18th International Conference on Humanoid Robots
(Humanoids), pages 118–125. IEEE, 2018.
271. Hanbo Zhang, Xuguang Lan, Xinwen Zhou, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Visual manipulation
relationship recognition in object-stacking scenes. Pattern Recognition Letters, 140:34–42, 2020.
272. Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, and Nanning Zheng. Invigorate: Interactive visual
grounding and grasping in clutter. arXiv preprint arXiv:2108.11092, 2021.
273. Hanbo Zhang, Deyu Yang, Han Wang, Binglei Zhao, Xuguang Lan, Jishiyu Ding, and Nanning Zheng. Regrad: A
large-scale relational grasp dataset for safe and object-specific robotic grasping in clutter. IEEE Robotics and Automation
Letters, 2022.
274. Hanbo Zhang, Xinwen Zhou, Xuguang Lan, Jin Li, Zhiqiang Tian, and Nanning Zheng. A real-time robotic grasping
approach with oriented anchor box. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
275. Li Zhang and Jeffrey C Trinkle. The application of particle filtering to grasping acquisition with visual occlusion and
tactile sensing. In 2012 IEEE International Conference on Robotics and Automation, pages 3805–3812. IEEE, 2012.
276. Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation
learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on
Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018.
277. Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and
Data Engineering, 2020.
278. Binglei Zhao, Hanbo Zhang, Xuguang Lan, Haoyu Wang, Zhiqiang Tian, and Nanning Zheng. Regnet: Region-based
grasp network for end-to-end grasp detection in point clouds. In 2021 IEEE International Conference on Robotics and
Automation (ICRA), pages 13474–13480. IEEE, 2021.
279. Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
280. Xinwen Zhou, Xuguang Lan, Hanbo Zhang, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Fully convolutional
grasp detection network with oriented anchor box. In 2018 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pages 7223–7230. IEEE, 2018.
281. Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018.
282. Xiangyang Zhu and Jun Wang. Synthesis of force-closure grasps on 3-d objects based on the q distance. IEEE
Transactions on robotics and Automation, 19(4):669–679, 2003.
283. R Zollner, O Rogalla, R Dillmann, and JM Zollner. Dynamic grasp recognition within the framework of programming
by demonstration. In Proceedings 10th IEEE International Workshop on Robot and Human Interactive Communication.
ROMAN 2001 (Cat. No. 01TH8591), pages 418–423. IEEE, 2001.
284. Guoyu Zuo, Jiayuan Tong, Hongxing Liu, Wenbai Chen, and Jianfeng Li. Graph-based visual manipulation relationship
reasoning network for robotic grasping. Frontiers in Neurorobotics, 15, 2021.