Robotic Grasping From Classical To Modern: A Survey
Hanbo Zhang, Jian Tang, Shiguang Sun and Xuguang Lan
Abstract: Robotic grasping has always been an active topic in robotics, since grasping is one of the most fundamental yet most challenging skills of robots. It demands the coordination of robotic perception, planning, and control for robustness and intelligence. However, current
solutions are still far behind humans, especially when confronting unstructured scenarios.
In this paper, we survey the advances of robotic grasping, starting from the classical formulations and solutions and moving to the modern ones. By reviewing the history of robotic grasping, we aim to provide a complete view of this community and perhaps inspire the combination and fusion of different ideas, which we believe would help explore the essence of robotic grasping problems. In detail, we first give an overview of analytic methods for robotic grasping. After that, we discuss the state-of-the-art data-driven grasping approaches that have risen in recent years. With the development of computer vision, semantic grasping is being widely investigated and can serve as the basis of intelligent manipulation and skill learning for autonomous robotic systems in the future. Therefore, we also briefly review the recent progress on this topic. Finally, we discuss the open problems and the future research directions that may be important for human-level robustness, autonomy, and intelligence of robots.
© 2022 The Author(s)
1. Introduction
With the development of robotics, robots are gradually entering our homes. Before being capable of sophisticated daily tasks, robots must first master basic skills, among which grasping is arguably the most important one. To perform robust grasping, perception, planning, and control are required simultaneously. Therefore, robotic grasping is a fundamental yet highly challenging area in robotics. Though actively investigated for several decades, robotic grasping is far from being solved, especially when the robot confronts complex and unstructured environments or is required to perform tasks with high-level semantics, which is exactly what we hope an intelligent robot helper to be capable of. Fortunately, with the development of deep learning [131], there has been remarkable progress in the last decade in learning representations of high-level semantics, such as object recognition and relationship understanding [143], natural language understanding [178], and robotic skill learning [124]. By taking advantage of this recent progress in learning, it is promising to build an intelligent robot that can perceive and understand the world as humans do, interact with humans using natural language, and accomplish grasping tasks autonomously and robustly with the abstracted semantics, serving as the basis for achieving more complicated tasks.
Therefore, in this paper, we will review the recent advances achieved in robotic grasping in the past several years, starting from the classical formulations and moving to the modern ones.1 Through this survey, we want to answer the questions listed later in this section.
1 This draft will be continuously updated. Therefore, if you find any problems with this draft, please do not hesitate to contact the first author for updates, including but not limited to: 1) other interesting works or ideas not included in this draft; 2) problems with the statements in this draft; 3) further discussion about the included works; 4) other useful suggestions.
Obviously, it is impossible to include all works related to robotic grasping in one paper. Therefore, guided by these questions, we hope to select a representative subset of this field and provide the readers with a comprehensive and well-organized overview from formulation to solutions.
In brief, researchers have explored feasible solutions to these problems for decades, resulting in an extensive set of excellent works. In the early stage, most works focused on the analytic form of grasp synthesis based on mechanics, e.g., force-closure and form-closure [19] grasp synthesis. However, such methods always rely on simplified physical models and the assumption of a fully observable environment, which can hardly be satisfied in real-world scenarios. With the rapid development of learning approaches, data-driven approaches gradually came to dominate the community, since they are simple, efficient, and free of the strong assumptions made by analytic approaches [24]. Nevertheless, data-driven methods are data-intensive, meaning that they usually require much more data for training the grasping policy, which is labor-intensive to collect. To ease the data problem, self-supervised learning and unsupervised learning have been extensively explored in recent years [15, 111], including some excellent works in robotic grasping [17, 18, 154, 188, 265]. It is also possible to train grasping policies in physical simulators and then transfer them to the real world [103, 105, 257, 258, 273, 278]. Given enough data, data-driven approaches substantially outperform the classical methods. Based on the current progress, there are several interesting questions:
• Grasping is essentially a physical action, and hence, the classical analytic methods are well motivated. There-
fore, could we take the best of both analytic and data-driven to develop robust and scalable grasping meth-
ods?
• Rapid development in computer vision reveals that large-scale learning could potentially abstract the internal
structure of complex data. Could we take advantage of the vision techniques for developing robust grasping
skills in semantic scenarios?
• The real world is full of uncertainty. Failing to handle uncertainty will severely affect the reliability and
robustness, limiting the practicality. Could we model the uncertainty from the learned models when planning
grasps for robustness?
The rest of this paper is organized as follows. In Section 2, we will formulate the grasp synthesis problem and introduce the grasp representations used by each kind of approach. In Section 3, we will review a series of representative and impactful works related to analytic grasp synthesis, along with the mechanics-based grasp quality evaluation methods, which inspired later works and formed a basis for the robotic grasping community. In Section 4, we will discuss the data-driven grasping approaches, which aim at synthesizing grasps from experiences, usually represented by a dataset. In Section 5, we will survey the object-centric grasping approaches, which usually target a certain object specified by humans, possibly through different interfaces such as a class name or a natural language command. Object-centric grasp planning in dense clutter also involves the understanding of object relationships, which is discussed in this section as well. Finally, in Section 6, we will discuss the open problems of robotic grasping that are important but still remain unsolved, and the future trends in these areas.
2. Problem Formulation
Grasp synthesis in robotics means finding the proper configuration of the robot's actuator relative to the state of the target for stable grasping. It can be formalized in different ways. Concretely, the classical formulation focuses more on mechanical properties, while modern approaches show more interest in visual properties. In this section, we will review these different formulations of grasp synthesis.
2.1. Overview
Basically, given an object represented in a 2-D or 3-D format, it is a challenging problem to find an optimal, or at least stable, grasp from infinite candidates based on geometric or physical analysis. Therefore, to develop robust grasping approaches, several questions will guide the discussion in this section:
• How can we efficiently sample high-quality grasp candidates from an infinite set?
In this section, we will discuss the first two questions, and leave the answer to the third question to the main body of this paper. Noticeably, there is another important issue: how to plan the execution of the optimized grasp given the kinematics of the actuator without collision with the environment, which is also crucial for a successful grasping trial. However, it mostly relates to motion planning algorithms [129] and goes beyond the scope of this paper.
Formally, given a set of contact points $p = \{p_i\}_{i=1}^{N}$, one wrench $\omega_i = (f_i, \tau_i)$ is imposed on each contact point accordingly, where $f_i$ is the force exerted on the object at point $p_i$ and $\tau_i$ is the torque around the surface normal. A grasp $g_j$, $j \in \{1, 2, \ldots, M\}$, is defined as $g_j = (\omega_1, \omega_2, \ldots, \omega_N)$, where all points, wrenches, and grasps are defined in the object reference frame. Obviously, if an external wrench $\omega_e$ is imposed on the object, the object can be in equilibrium only when $\omega_{g,j} = \sum_{i=1}^{N} \omega_i = -\omega_e$. This representation is widely used in the analytic methods to be introduced in Section 3 and in early data-driven approaches (e.g., [5, 60, 114]). It is scalable to grippers with different numbers of fingers, and is therefore still preferred today for grasping with dexterous hands.
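To make the contact-wrench representation concrete, the following is a minimal sketch (our own illustration, not taken from any cited work) of how a grasp defined by per-contact wrenches can be checked for equilibrium against an external wrench; the array layout and numerical tolerance are assumptions.

```python
# Minimal sketch of the contact-wrench grasp representation: a grasp is a
# tuple of wrenches, one per contact point, expressed in the object frame.
import numpy as np

def grasp_wrench(contact_wrenches):
    """Sum the per-contact wrenches w_i = (f_i, tau_i) of one grasp.
    contact_wrenches: (N, 6) array, rows are [fx, fy, fz, tx, ty, tz]."""
    return np.sum(np.asarray(contact_wrenches), axis=0)

def is_in_equilibrium(contact_wrenches, external_wrench, tol=1e-6):
    """Check omega_g = sum_i omega_i = -omega_e up to a numerical tolerance."""
    residual = grasp_wrench(contact_wrenches) + np.asarray(external_wrench)
    return np.linalg.norm(residual) < tol

# Example: two antipodal squeezing contacts whose upward force components
# cancel a gravity wrench on the object.
w1 = [ 1.0, 0.0, 4.9, 0.0, 0.0, 0.0]
w2 = [-1.0, 0.0, 4.9, 0.0, 0.0, 0.0]
gravity = [0.0, 0.0, -9.8, 0.0, 0.0, 0.0]
print(is_in_equilibrium([w1, w2], gravity))  # True
```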
Noticeably, the contact-based grasp representation relies on ideal contact models, i.e., it assumes the contact points can be exactly positioned by the robot. However, due to inherent systematic or random errors, the planned grasps can rarely be executed exactly, and such errors should certainly be taken into consideration when synthesizing robust grasps. Therefore, a more practical representation, named Independent Contact Regions (ICRs) [175], was introduced, at the cost of possibly more computation. It is defined as a set of independent regions on the object boundary such that placing one finger anywhere inside each ICR results in a force-closure grasp (please refer to Section 3), regardless of the exact position of each finger.
With the prevalence of parallel-jaw grippers, it is more convenient to use simplified representations at the cost of some scalability. Since the kinematics of a parallel-jaw gripper is simple, the contact points on a specific object are completely determined by the gripper's 6-D pose $g = (x, y, z, r_x, r_y, r_z) \in SE(3)$, including the 3-D position $(x, y, z)$ and the 3-D orientation $(r_x, r_y, r_z)$, which is a widely used grasp representation based on 3-D perception [83, 137, 162, 235]. For the convenience of computation, the SE(3) grasp representation may take different specific but equivalent forms in practice.
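As an illustration of such equivalent forms, the sketch below (an assumed example, not a method from the survey) converts between the 6-D tuple $(x, y, z, r_x, r_y, r_z)$ with XYZ Euler angles and a 4x4 homogeneous transform using SciPy; the Euler convention is an assumption, since different systems use different ones.

```python
# Two common, equivalent encodings of an SE(3) grasp:
# (position, roll-pitch-yaw) and a 4x4 homogeneous transform.
import numpy as np
from scipy.spatial.transform import Rotation as R

def grasp_to_matrix(x, y, z, rx, ry, rz):
    """Convert g = (x, y, z, rx, ry, rz) (XYZ Euler angles, radians) to a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R.from_euler("xyz", [rx, ry, rz]).as_matrix()
    T[:3, 3] = [x, y, z]
    return T

def matrix_to_grasp(T):
    """Inverse conversion back to the 6-D tuple."""
    rx, ry, rz = R.from_matrix(T[:3, :3]).as_euler("xyz")
    x, y, z = T[:3, 3]
    return x, y, z, rx, ry, rz

g = (0.1, 0.0, 0.3, 0.0, np.pi / 4, 0.0)
assert np.allclose(matrix_to_grasp(grasp_to_matrix(*g)), g)
```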
Table 2: Summary of Selected Analytic Grasp Synthesis
Recently, as 2-D vision develops rapidly, it has become feasible to directly synthesize grasps on RGB images instead of 3-D point clouds, and hence the grasp representation is further simplified. [214, 215] detected grasp points on multi-view observations and projected them back into a single 3-D grasp point. [198] and [11] used segmented grasp affordances on 2-D images to represent grasps for parallel-jaw grippers. Such a single-point representation is also widely used for suction grasps [28, 109, 151]. Later, orientation was introduced to specify the pose of the gripper [149, 244].
The point-based grasp representation cannot model the size of the gripper. Moreover, according to [110], it lacks a bounded feature area to map grasp points to robot configurations. Therefore, they presented oriented rectangles for grasps on 2-D images. The oriented rectangle has 5 dimensions, $g = (x, y, w, h, \theta)$, with $(x, y)$ denoting the center, $(w, h)$ denoting the distance between the two jaws and the size of the gripper, and $\theta$ denoting the orientation of the gripper. It is now widely used in grasp detection with image inputs.
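For illustration, a minimal sketch (not from [110]; the local axis convention is an assumption) that recovers the four image-plane corners of an oriented rectangle, e.g., for visualization or rectangle-overlap evaluation:

```python
# Recover the four corners of a 5-D oriented-rectangle grasp g = (x, y, w, h, theta).
import numpy as np

def rectangle_corners(x, y, w, h, theta):
    """Return the 4 corners of an oriented grasp rectangle in image coordinates."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Local corners: w along the closing direction, h along the jaw length (assumed convention).
    local = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                      [w / 2, h / 2], [-w / 2, h / 2]])
    return local @ rot.T + np.array([x, y])

print(rectangle_corners(100, 120, 40, 10, np.pi / 6))
```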
The possible grasps on one object or one image are infinite. Therefore, based on oriented-rectangle grasps, the pixel-level dense grasp representation was proposed [10, 74, 240, 242]. Typically, it models grasp synthesis as a segmentation problem: taking images as input, it outputs a segmented image of grasp affordance and possibly the corresponding gripper parameters. Formally, the grasp map is $G = (Q, W, H, \Theta)$, where $Q$, $W$, $H$, and $\Theta$ are all single-channel images of the same size as the input, representing the pixel-wise graspability and the corresponding gripper parameters, including width, height, and orientation. Note that not all elements in $G$ are mandatory. For example, the height map $H$ is not included in the output of the method in [74].
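A minimal sketch of how such a grasp map could be decoded at test time is given below: it simply takes the pixel with the highest graspability and reads off the gripper parameters at that pixel. This is a generic illustration, not the decoding rule of any specific cited method.

```python
# Decode the best grasp from a pixel-wise grasp map G = (Q, W, Theta).
import numpy as np

def decode_grasp_map(Q, W, Theta):
    """Q, W, Theta: HxW arrays of graspability, gripper width, and orientation."""
    v, u = np.unravel_index(np.argmax(Q), Q.shape)   # best pixel (row, col)
    return {"pixel": (u, v), "quality": Q[v, u],
            "width": W[v, u], "theta": Theta[v, u]}

H_, W_ = 8, 8
Q = np.random.rand(H_, W_)
Wmap = np.full((H_, W_), 30.0)
Theta = np.zeros((H_, W_))
print(decode_grasp_map(Q, Wmap, Theta))
```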
3. Analytic Grasp Synthesis
In this section, we will review the mainstream approaches related to analytic grasp synthesis. By reviewing both classical and modern methods, we hope to inspire researchers to harness the best of both worlds for developing advanced grasping approaches.
3.1. Overview
Analytic grasp synthesis is mostly based on mechanics. Under this setting, a grasp is generally represented by a set
of contact points and the corresponding wrenches imposed on each point [20]. Typically, there are three different
contact models:
• Frictionless point contact, meaning that only a force along the surface normal can be exerted at the contact point.
• Point contact with friction (hard-finger contact), meaning that, besides the normal force, tangential friction forces within the friction cone can also be exerted.
• Soft-finger contact, meaning that the contact part is deformable and will be an area instead of a point, and thus allows an additional torque around the surface normal.
We focus on the first two types of contact models, which are the most widely explored in robotics. For soft-finger contact models, we refer to [63] for more detailed discussions.
In detail, we first review the methods to analytically evaluate the quality of a given grasp. Even with a given evaluation metric, it is non-trivial to synthesize the optimal grasps. Typically, either heuristic or analytic methods can be applied to the computation and optimization of grasps. A summary of the included analytic grasp synthesis methods is given in Table 2.
Definition 1. (Grasp-closure) is a property of a given grasp, including form-closure for grasps with frictionless
contact points and force-closure for grasps with frictional contact points. It occurs only when the grasp could
resist any possible external disturbing wrenches.
Research on grasp-closure can be traced back to the 19th century [202], where it was proved that for a 2-D polygon, at least 4 frictionless wrenches are required for a form-closure grasp. Much later, [127] conjectured that at least 7 contact points are needed for 3-D polyhedra. Based on this analysis, [156] proved these conjectures. In particular, they showed that form-closure for any bounded 3-D object with a piecewise smooth boundary can be achieved with 12 fingers if and only if the object has no rotational symmetry, and in most cases 7 fingers are enough. Moreover, they also demonstrated that with Coulomb friction, the number of fingers required to achieve force-closure reduces to 3 and 4 for 2-D and 3-D objects respectively under certain circumstances. Later, the definitions of form-closure and force-closure were formally completed by [19]. Following that, [203] argued that the previous definitions of form-closure and force-closure (referred to as first-order closure) are not adequate and should be considered together with the mobility of the fingers. They proposed the definitions of second-order form-closure and force-closure to fix this deficiency.
Though the grasp-closure property has been well investigated, one may notice that the definition in Def. 1 is unduly strict. In practice, the forces that the actuator can impose on the object are usually limited. Therefore, a more practical metric is needed to evaluate a given grasp. One natural way to evaluate a grasp is the minimum force needed to achieve equilibrium [140, 157, 190], where the directions of the imposed forces should also be close to the surface normals for stability [89, 140]. However, such methods usually assume access to the wrenches externally exerted on the object.
To address this deficiency, [69] proposed to utilize the Grasp Wrench Space (GWS) to evaluate the quality of grasps:
Definition 2. (Grasp Wrench Space) of grasp $g_i$ is defined as the convex hull of all possible wrenches that could be imposed through the contact points $\{p_i\}_{i=1}^{N}$ of grasp $g_i$.
In particular, the minimum distance between the origin and the boundary of the GWS, called the Largest-minimum Resisted Wrench (LRW), represents the minimum external wrench that could break the stability of the object. It quickly became one of the most well-known metrics for grasp quality evaluation. Based on the GWS and LRW, [164] proposed that different distances, such as $L_1$ and $L_\infty$, could be applied to the measurement of the LRW. [163] decoupled the forces and torques of wrenches in the wrench space to avoid the balancing factor between them. [234] and [160] proposed to use the volume of the GWS instead of the distance to get rid of the dependence on the predefined reference frame on the object.
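To make the GWS/LRW idea concrete, the sketch below computes an approximate LRW (the well-known "epsilon" quality) by discretizing each friction cone into a few edge forces, building the convex hull of all resulting contact wrenches (an $L_1$-type GWS), and measuring the distance from the origin to the nearest hull facet. The cone discretization, friction coefficient, and torque scaling are assumptions, and this is an illustrative approximation rather than the exact formulation of [69] or [164].

```python
# Approximate GWS/LRW ("epsilon") grasp quality via a 6-D convex hull.
import numpy as np
from scipy.spatial import ConvexHull

def cone_edge_wrenches(p, n, mu=0.5, num_edges=8, torque_scale=1.0):
    """Unit-normal-force wrenches at contact p with inward normal n."""
    n = n / np.linalg.norm(n)
    t1 = np.cross(n, [0.0, 0.0, 1.0])
    if np.linalg.norm(t1) < 1e-8:               # normal parallel to z: pick another axis
        t1 = np.cross(n, [0.0, 1.0, 0.0])
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    wrenches = []
    for a in np.linspace(0.0, 2 * np.pi, num_edges, endpoint=False):
        f = n + mu * (np.cos(a) * t1 + np.sin(a) * t2)
        wrenches.append(np.hstack([f, torque_scale * np.cross(p, f)]))
    return wrenches

def epsilon_quality(contacts):
    """contacts: list of (position, inward normal) pairs."""
    points = np.vstack([w for p, n in contacts for w in cone_edge_wrenches(p, n)])
    hull = ConvexHull(points)                   # 6-D hull of the wrench set
    # Facet equations are [unit normal | offset] with A x + b <= 0 inside,
    # so the origin-to-facet distance is -b; the minimum over facets is the LRW.
    return np.min(-hull.equations[:, -1])

# Three equally spaced frictional contacts on the equator of a unit sphere.
angles = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
contacts = [(np.array([np.cos(a), np.sin(a), 0.0]),
             -np.array([np.cos(a), np.sin(a), 0.0])) for a in angles]
print(epsilon_quality(contacts))  # > 0 indicates (approximate) force closure
```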
There are also other metrics used for quality evaluation in analytic grasp synthesis, such as the shape [118, 183]
and volume [36,163,232] of the grasp polygon formed through all contact points, the distance between the centers
of the object and the grasp polygon [37, 54, 192], and the size or radius of ICRs [42, 191, 228]. We refer to the
review by [207] for more detailed discussions.
4. Data-Driven Grasp Synthesis
As machine learning develops rapidly, it has become promising to learn robot skills by training on large amounts of data instead of planning with object models [24, 124]. Such methods are called data-driven, since the quality and quantity of data are as essential to a good policy as the methods themselves. In this section, we will review a series of data-driven grasp synthesis approaches.
4.1. Overview
In most cases, modern grasp synthesis is based on perception, especially visual observations of the workspace. However, different from traditional visual tasks, grasp synthesis usually involves the precise perception and analysis of geometric information, and sometimes intuitive physics, especially when facing unknown objects. Compared to analytic methods, data-driven approaches substantially loosen the assumption of accessible object models: inspired by neuropsychology [107], it has been widely found that a heuristic abstraction of knowledge is enough to derive reliable robotic control signals [6, 102, 113, 141, 174, 187, 213, 229].
As its name implies, the provided data plays the role of “experiences”, driving the robot to abstract “knowledge” adaptively for skill learning. Different methods implement this abstraction in different ways. Regarding robotic grasping, there are mainly three ways:
• Imitation-based methods: Given a dataset including stable grasps (e.g., force-closure grasps) and the corresponding objects, the grasps can be transferred to similar objects by imitation. The imitative policy can be formulated through the similarity between the target and the object templates in the given dataset, or between the real robot configuration and the given grasp templates. Early works focused more on this type of method since it is more data-efficient.
• Sampling-based methods: Another way to generate grasps on objects is to sample a set of possible candidates, among which a discriminator is used to find the best one. Benefiting from the decoupling of grasp sampling and classification, this paradigm has better interpretability and scalability. Nevertheless, it relies heavily on a good grasp sampler in terms of both performance and running speed.
• End-to-end Learning: With the development of deep learning, it is possible to embed everything into one neural network model and train it end-to-end. The input could be raw observations such as information from tactile sensors or cameras, and the output is a proper grasp configuration. All steps, including grasp sampling and quality evaluation, can be adaptively tuned through updates of the trainable parameters. Such methods usually run faster than the above two types and benefit the most from large datasets.
In the following subsections, we will discuss them respectively. A selected set of imitation-based grasp synthesis methods is also summarized in Table 3.
4.2. Imitation-based Methods
4.2.1. Programming by Demonstration (PbD)
In Programming by Demonstration (PbD), successful grasping trajectories are recorded first. At test time, the robot adjusts and replays the trajectory to grasp objects. Grasp recognition is one crucial component in PbD-based grasp synthesis and has been widely investigated [5, 50, 60, 114, 283]. It assigns a specific category to a given grasp configuration from a predefined taxonomy [46, 68, 283]. Based on the recognized grasp type and the demonstration, the planner can synthesize the grasp for the robot using kinematic mapping [5], or search efficiently in a constrained grasp space [138, 139]. The demonstration can also be combined with reinforcement learning to incrementally improve the performance, achieving better adaptation to the robot [79].
Generative models are also feasible for PbD-based grasp synthesis: demonstrations are used to train a generator, and at test time the trained generator takes object features as input and outputs a distribution from which grasps can be sampled [8, 217, 233].
As machine learning develops, behavior cloning [276] and inverse reinforcement learning [1, 95, 254] have also been explored in the context of robotic grasping. Behavior cloning transforms imitation learning into supervised learning, in which the demonstrations are regarded as a labeled dataset, based on which a model is trained to map from inputs to actions. Inverse reinforcement learning is used to infer an explanatory reward from the given demonstrations; the inferred reward is then used to train a policy with reinforcement learning.
Recently, meta-learning [238] has enabled few-shot and even one-shot imitation directly from raw visual observations such as videos or images. Meta-learning is the basis of one-shot imitation learning [58, 70, 153], in which a model is first trained on some off-the-shelf data to obtain a meta policy, and during imitation learning, it is fine-tuned with the demonstration to obtain the final policy. It is even possible to transfer the given demonstration across agents with different kinds of morphology [25, 48, 263, 264]. Besides, data augmentation is also an interesting idea for achieving few-shot imitation learning [49]: the provided demonstrations are augmented by a delicately designed pipeline and used to train a neural network. However, the achieved generality is limited compared to meta-learning methods.
Table 4: Summary of Selected Sampling-based Grasp Synthesis
4.2.2. Matching of Templates (MoT)
Matching of Templates (MoT) can be classified into two subclasses: 1) Matching of Object Templates (MOOT) and 2) Matching of Shape Templates (MOST).
In MOOT, a set of object templates along with their corresponding grasps is usually predefined. When the robot meets a novel object, it looks up the template set and finds the most similar template so as to map the predefined grasps onto the target. One straightforward way of doing so is manipulation-oriented pose estimation [40, 59, 121, 155, 237]. Concretely, objects are first recognized and localized, and then predefined grasps from demonstrations can be directly projected into the reference frame of the objects for grasping [55, 167], or simulation-based grasp planners can be introduced for online grasp planning [16, 122, 161]. However, such methods can only be used for known objects, i.e., it is usually assumed that 3-D geometric models of the objects (e.g., meshes or point clouds) are available for 6-D pose estimation.
To grasp unknown objects, shape primitives were proposed, resulting in MOST. Instead of full object models, a set of primitive shapes is defined, on which grasps are predefined. The primitive set could even be infinite [189]. When grasping a novel object, a matching process is conducted between the predefined primitives and the target. Then the demonstrated grasps can be mapped and executed [45, 61, 92, 93, 96, 162], or a grasp planner can be incorporated based on the matched primitives [62, 76, 101]. Such methods can grasp objects with appearance similar to the primitives, but cannot handle objects with unknown geometric structures.
Table 5: Summary of Robotic Grasp Datasets

| Dataset | Repr. | Modality | Source | Size | Objects/Scene | Grasps/Scene |
|---|---|---|---|---|---|---|
| Cornell [110] | Rect | RGB-D | Real | 1035 | 1 | ~8 |
| Dex-Net 2.0 [150] | Rect | Depth | Sim | 6.7M | 1 | 1 |
| Dex-Net 3.0 [151] | Point | Depth | Sim | 2.8M | 1 | 1 |
| Jacquard [52] | Rect | RGB-D | Sim | 54K | 1 | ~20 |
| VMRD [270] | Rect | RGB | Real | 4.7K | ~3 | ~20 |
| [133] | - | RGB-D | Real | 800K | - | 1 |
| [243] | - | RGB-D-T | Real | 2.55K | 1 | 1 |
| GraspNet-1billion [66] | Rect + SE(3) | RGB-D | Real | 97K | ~10 | 3-9M |
| ACRONYM [64] | SE(3) | Depth | Sim | 8.8K | 1 | 2K |
| SuctionNet-1billion [28] | Point | RGB-D | Real | 97K | ~10 | 3-8M |
| REGRAD [273] | Rect + SE(3) | RGB-D | Sim | 900K | 1-20 | 1.02K |
4.3. Sampling-based Methods
Sampling-based methods decouple grasp synthesis into a grasp sampler and a quality discriminator, and benefit from the use of large-scale datasets and learning techniques. The included algorithms are summarized in Table 4.
4.3.1. Discriminator
To learn the discriminator, supervised learning is usually applied. Noticeably, learning a discriminator is similar to grasp recognition in PbD (Section 4.2.1): both are trained using labeled data and supervised learning. The essential difference is that in grasp recognition for PbD, the focus is the grasp category from a predefined taxonomy, which is important for deciding a certain grasp pattern for the robot. By contrast, the discriminator here is used to evaluate whether a grasp is good or not. Possible quality metrics can either derive from analytic methods [207] or be a simple indicator of success or failure [214, 215].
Before the prevalence of deep learning, Support Vector Machines (SVMs) [44] or probabilistic models [120] were widely used to train such a discriminator [14, 22, 23, 47, 110, 130, 186, 214–216]. Training data are either collected in the real world and manually labeled [110, 216] or automatically synthesized using physical simulators [14, 77, 186, 214, 215]. However, if synthetic data are used, there might be a reality gap when the trained models are applied in real-world scenarios due to domain shift [241]. The sizes of datasets in this period were always limited, since such models are quite data-efficient and can achieve commendable performance with a few (usually hundreds of) data points.
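As a rough illustration of this paradigm, the sketch below trains an SVM on hypothetical hand-crafted grasp features and uses it to rank sampled candidates; the features and labels are synthetic stand-ins, not the descriptors used in the cited works.

```python
# SVM-based grasp discriminator on hypothetical hand-crafted features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical training set: each row is a feature vector describing one grasp
# candidate (e.g., local depth statistics, gripper width, approach angle), and
# the label marks whether the grasp succeeded.
X_train = rng.normal(size=(200, 16))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 3] > 0).astype(int)

clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# At test time, sampled candidates are ranked by the predicted success probability.
candidates = rng.normal(size=(50, 16))
scores = clf.predict_proba(candidates)[:, 1]
best = candidates[np.argmax(scores)]
```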
As deep learning shows categorical advantages over other methods, it has recently come to dominate the learning of grasp discriminators. Nevertheless, compared to SVMs, it needs much more data to train a good model. Therefore, datasets including more and more data have been proposed to meet the demands of deep networks [26–28, 52, 64, 66, 151, 152, 243, 270, 273]. A summary of robotic grasp datasets is shown in Table 5. One could also refer to [100] for a comprehensive summary of large-scale robotic manipulation datasets. Generally speaking, deep-learning-based grasp discriminators are in essence the same as SVM-based discriminators, despite much stronger representability and performance. The main difference is that deep networks support much more complex input data modalities, such as raw 2-D images [132] and point clouds [83, 135, 137, 235].
There are also methods that do not rely on learned discriminators to evaluate the quality of grasps. In this case, a model of the target usually needs to be estimated first, including, e.g., the 3-D shape, friction, and center of mass. Such a model is not necessarily accurate in most cases. For example, PROMPT [31] only builds a 3-D particle-based model of the target from multi-view images and applies NVIDIA Flex with predefined friction to evaluate whether a grasp will succeed or not; it then chooses the best sample for real-world execution. By comparing the difference between the simulator and reality, PROMPT can update the parameters of the object models in an online and closed-loop way. Many grasp planners based on physical properties, such as [122], [161], and [16], can be introduced here to replace learning-based discriminators given the estimated object models. For example, GraspIt! [161] was widely used in early works for grasp quality evaluation given a reasonable set of grasp candidates from learned samplers.
4.3.2. Sampler
A sampler could be either data-driven or heuristic. Different data modalities and grasp representations usually
correspond to different sampling methods.
For image inputs, the sampler is used to sample points for the point-based grasp representation (Section 2.2.4), or image patches for the oriented-rectangle grasp representation (Section 2.2.5). One naive way to sample points is pixel-wise random sampling. However, it is inefficient and sometimes intractable because 1) the sample space is too large, and 2) a single point is not representative and does not include enough features to indicate the quality of a grasp. Therefore, learning is used to obtain a prior for sampling. In this case, the output of a learning-based sampler is usually a grasp affordance map, with higher values denoting more graspable areas [22, 23, 78, 135, 198, 214, 215]. Based on the affordance map, all points can be ranked and tested one by one to find the best grasp configuration. To solve the problem of representability, the sampled point can be mapped to 3-D space with camera models [22, 23, 78, 214, 215], or extended to a full grasp configuration heuristically [198]. To sample image patches for oriented-rectangle grasp synthesis, exhaustive methods such as the sliding window (SW) approach are intractable due to unacceptable time complexity. A more efficient way is to learn a patch sampler that takes the image as input, as long as the inference speed of the learned sampler is much faster than that of the discriminator [110, 132, 246]. Moreover, some heuristics can be used to further reduce the search space. For example, given a predefined patch size and the assumption that the background is a flat table, one can uniformly sample from surface normals computed from depth gradients [150, 151].
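The sketch below illustrates one simple way to obtain such normals from depth gradients; it uses the common unit-pixel-step approximation rather than the exact camera model of [150, 151].

```python
# Estimate per-pixel surface normals from depth-image gradients.
import numpy as np

def surface_normals_from_depth(depth):
    """depth: HxW array; returns per-pixel unit normals in a camera-aligned frame."""
    dz_dv, dz_du = np.gradient(depth)                  # derivatives along rows / cols
    # Common approximation with unit pixel steps: n ~ (-dz/du, -dz/dv, 1), normalized.
    normals = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    return normals / np.linalg.norm(normals, axis=2, keepdims=True)

depth = np.fromfunction(lambda v, u: 1.0 + 0.001 * u, (120, 160))  # a tilted plane
n = surface_normals_from_depth(depth)
print(n[60, 80])   # roughly (-0.001, 0, 1) after normalization
```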
For point cloud inputs, the sampler is used to sample SE(3) grasp poses (Section 2.2.3) in most cases. Different from 2-D points, 3-D points include much richer geometric information, which helps to efficiently filter out undesired regions. For example, [83] and [235] voxelized and uniformly sampled points in regions of interest, and performed a local grid search to generate a set of SE(3) grasp candidates. Finally, they filtered out the candidates causing collisions between the gripper and the scene or including no object points within the closing region of the gripper. [236] improved the grid search in this sampling method for higher efficiency, which was applied and further modified by [137]. An alternative way is to apply the Cross Entropy Method (CEM) [211], starting from a randomly sampled grasp set and finally converging to an optimal distribution over graspable poses [150, 151, 258]. Learning-based samplers are also feasible and show higher efficiency, especially in terms of speed [171, 259]. For antipodal grasps [30], a mapping between single points and grasps can be built on object meshes, which simplifies grasp sampling to point sampling [150, 198].
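For illustration, a generic CEM loop for grasp sampling might look like the sketch below, where the scoring function stands in for a learned discriminator; the Gaussian parameterization, population size, and elite fraction are assumptions, not those of the cited works.

```python
# Generic Cross Entropy Method (CEM) loop for grasp-parameter sampling.
import numpy as np

def cem_sample_grasp(score_fn, dim=6, iters=5, pop=64, elite_frac=0.1, seed=0):
    """score_fn: maps a (pop, dim) array of grasp parameters to quality scores."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        elites = samples[np.argsort(score_fn(samples))[-n_elite:]]   # top-scoring samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # refocus the Gaussian
    return mean            # converged grasp parameters

# Toy stand-in for a learned discriminator: prefers grasps near a fixed pose.
target = np.array([0.2, -0.1, 0.3, 0.0, 1.2, 0.5])
best = cem_sample_grasp(lambda g: -np.linalg.norm(g - target, axis=1))
print(best)                # close to `target`
```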
4.4. End-to-end Methods
Deep convolutional networks enable end-to-end visual perception from image inputs [91, 123, 225]. To detect grasps on images, the most straightforward way is to directly transfer object detection algorithms to the domain of grasp detection, since the two tasks share a similar basis: both are essentially a classification problem over a set of extracted proposals. From this view, end-to-end learning also shares a similar idea with sampling-based methods. The difference is that end-to-end learning integrates the sampler and the discriminator into one single model and trains them end-to-end. For example, [38, 86] transferred Faster R-CNN [201] to a two-stage grasp detection algorithm. And vice versa, grasp detection works like [199] have sometimes also inspired object detection research [200].
Table 6: Summary of Selected End-to-end Grasp Synthesis
Nevertheless, grasp detection is essentially different from object detection because 1) grasp detection relies heavily on the local geometry of grasps; 2) grasp quality is sensitive to orientations; and 3) grasping should be a closed-loop process, meaning that failures should be handled during grasping. For 1), image-based grasp detection algorithms usually take a combination of color and geometric channels as input, such as depth images [38, 126, 199, 227, 250, 274] and surface normals [180]. For 2), the orientation dimension can be discretized so that the prediction of orientation is simplified into a classification problem [38, 250]. However, discretization suffers from performance loss, especially for orientation-sensitive grasps. Therefore, oriented anchors were introduced to handle this problem [274, 280]. Besides, the Spatial Transformer Network (STN) [104] can also be used for more accurate classification of oriented grasp candidates [72, 180]. Recently, [181] proposed a rotation ensemble module to handle rotation invariance in grasp detection. For 3), reactive policies can be trained for grasping by taking raw images as inputs [12, 128, 133, 149, 265].
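A minimal sketch of the discretization idea for 2) is given below, mapping a continuous grasp angle to one of K bins and back to the bin center; the number of bins and the $[0, \pi)$ range are assumptions and not tied to any specific cited network.

```python
# Orientation discretization: treat the grasp angle as a K-way classification target.
import numpy as np

K = 18                                        # bins over [0, pi), ~10 degrees each (assumed)

def angle_to_class(theta):
    """Map a continuous angle to its bin index (the classification label)."""
    return int(np.floor((theta % np.pi) / (np.pi / K)))

def class_to_angle(k):
    """Recover an angle from a predicted bin: use the bin center."""
    return (k + 0.5) * np.pi / K

theta = 1.23
k = angle_to_class(theta)
print(k, class_to_angle(k))                   # reconstruction error <= pi / (2K)
```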
Recent developments in 3-D vision enable end-to-end learning with point clouds as inputs [134, 159, 193, 194, 279, 281]. Such methods have also been explored for grasp synthesis. In this case, a backbone is usually used to preprocess the input point clouds, including subsampling, grouping, de-noising, etc., after which a feature extractor designed with strong inductive biases is used to extract point features. The extracted features are then fed into a grasp detector to regress SE(3) grasps as well as confidence scores indicating the grasp quality at each point. The simplest framework is one-stage anchor-free grasp detection [176, 195, 231], which directly outputs results right after the feature extraction stage. To improve performance, SE(3) grasp anchors and sphere-region features instead of point features were introduced in [278]. Their method also includes a fine-tuning stage, which further improves robustness. A similar idea has also been explored by [247, 249]. Another problem is that when synthesizing grasps in point clouds, gripper models are crucial for evaluating grasp stability. Most works are designed for a specific type of gripper and can hardly generalize to other grippers. [218] and [256] proposed to encode gripper-specific features in the inputs and train gripper-specific grasp detectors. They showed that by doing so, the model can learn to adapt to different types of grippers.
Different from grasp detection, grasp map synthesis is similar to image segmentation, where the output is usually represented by a set of heat maps indicating where and how to grasp. One thing to clarify is that there is no clear boundary between grasp detection and grasp map synthesis. One can imagine that in grasp detection, the dense estimation of grasp quality (e.g., in [38, 274, 278]) for each pixel of the image features is also a kind of grasp map synthesis, which is even more representative, but with a smaller output size compared to the input. Such methods can be seen as a transition between grasp detection and grasp map synthesis.
For pixel-level grasp map synthesis, transferring segmentation algorithms is also widely explored. For example, U-Net [209] has been widely used for grasp map synthesis [29, 142, 219, 220]. Such encoder-decoder architectures are widely used to synthesize pixel-wise grasps [11, 28, 116, 239, 255]. Another similar formulation for pixel-wise grasp synthesis is called grasp manifolds, proposed by [88], defined as a closed set of points on the object representing graspable areas. Since the grasp map is more informative and provides a global grasp affordance indicating the grasp quality of the current viewpoint, it enables selection of the best view [116, 239], under the assumption that the camera is not fixed, which holds in most cases for robots. Besides, with the mobility of robots, interactions can be imposed on the workspace to actively clear the area around graspable regions [51, 142, 265] when no grasps are available. Also, as mentioned above, reactivity is needed to recover from failures, and some works have explored reactive grasping policy learning based on pixel-level grasp maps [168–170].
5. Object-Centric Grasp Synthesis
In real applications, grasping usually serves more complicated tasks requiring object-centric perception. Developments in learning methods make it possible to integrate the understanding of high-level concepts while executing grasping. In this section, we are going to review recent algorithms based on object-centric semantics.
5.1. Overview
Different object-centric semantics should be considered under different situations. In this paper, we will discuss
three types of them:
• Object-specific Grasp Synthesis: Object-specific grasp synthesis aims to retrieve and grasp objects belonging to a specific class in cluttered scenes. The target is usually specified by a class name as the condition of grasping.
• Interactive Grasp Synthesis: Interactive grasp synthesis means specifying targets using natural language, which conveys richer information about objects, including attributes and relationships with other objects. It is worth noting that interactive grasping is different from interactive perception in robotics, which in most cases means perception based on interaction with the environment [21]. We also include some works related to grasp synthesis based on interactive perception in Section 4.4.3.
• Relational Grasp Synthesis: Relational grasp synthesis is needed when grasping may have a negative effect on other objects. Planning algorithms [129] are feasible for handling such situations given environment models. With deep learning, model-free understanding of object relations has also been explored in recent years.
All types of object-centric grasping methods are built on top of robust grasp synthesis algorithms, and the difference lies in the introduction of semantics, which, to some extent, is parallel to grasp synthesis. The motivation behind this is that we want robots to understand the world as a human does, interact with humans in a natural way, and finish complicated tasks autonomously and robustly, which has been pursued for decades by almost all roboticists.
6. Open Problems
From the discussions above, it is obvious that the development of grasping is from structural to intelligent. Early
works mostly focused on mechanical analysis and optimality, which requires full models of objects, grasps, and
environments. Later, learning is applicable to relax assumptions on environments, which enables the deployment
of grasping algorithms in partially unstructured scenarios. Recently, most works are exploring grasping with high-
level visual concepts, aiming for adaptation to daily home scenarios, though currently, it is still far behind this
final goal.
There are also some interesting points which are worth noting.
Firstly, there is no doubt that mechanics is directly responsible for the stability of grasping. However, most current works focus on heuristic methods based on learning and show surprisingly good performance on grasp synthesis for unknown objects. Though already investigated by some works (e.g., [137, 150, 176]), there is still a gap between these two domains, i.e., analytic grasp synthesis and data-driven grasp synthesis. So the question is: could we take the best of both analytic and data-driven approaches to develop robust and scalable grasping methods? One way we believe promising is to learn intuitive physics [125, 204, 261] for stable grasping. It comes from the observation of how we humans grasp objects. In most cases, we implicitly and roughly infer some physical properties of objects before grasping so as to avoid failures, e.g., the friction coefficient, the center of mass, and the 3-D geometry of unseen parts, based on our knowledge base. With these rough models, we then implicitly plan a reasonable (though possibly not optimal) grasp. Such a pipeline is similar to the sampling-based methods introduced in Section 4.3, most of which only focus on local geometric information instead of object-level physical properties. Another alternative way to involve the consideration of physics is to use physical simulators [13], especially with the help of recent advances in differentiable simulators [98, 99, 248].
Secondly, scene understanding with high-level semantics is also closely related to robotic manipulation tasks; it has been actively explored in grasp synthesis and is critical for robot intelligence. Currently, the main difficulty is generalization to open-set objects and worlds. Recent progress in unsupervised representation learning shows that it is promising to learn structured representations for unknown objects and concepts [65, 196]. Therefore, the question is: could we take advantage of large-scale unsupervised representation learning for developing robust grasping skills in semantic scenarios? To grasp open-set targets, one straightforward way is to train grasp policies directly on top of large-scale pre-trained models. Alternatively, one may consider a composable approach, in which semantics is analyzed first and grasps are then synthesized in an object-centric way. To do so, a robust object-centric grasp detector is needed. Another important issue is relationship understanding with open-set objects. Humans can retrieve and grasp targets efficiently in daily scenes even with unrecognized distractors. This ability is built on top of a hierarchical understanding of semantics. For example, a command “fetch me the red bottle on the dinner table” involves a two-layer relationship “dining room - dinner table - red bottle”, where the first relationship “dining room - dinner table” is implicit and based on prior semantic knowledge. Such tasks are still challenging for robots to complete.
Finally, uncertainty is everywhere in practice. In traditional robotics, it is critical to handle uncertainty for planning and control. However, most learning-based grasping approaches simply use one-shot greedy inference models. Thus, it is meaningful to ask: could we model the uncertainty of the learned models when planning grasps for robustness? To some extent, neural networks are like sensors, providing high-level noisy semantic information for decision making. We believe that one-shot greedy policies are not the optimal way to use these observations. To consider model uncertainty, the first thing to handle is how to model reasonable uncertainty over the outputs of neural networks. Model calibration [84] is a useful tool to calibrate output uncertainties. With calibrated uncertainties, better decisions can be made to optimize final goals under given constraints (safety, success rate, etc.). Since decision making in robotics usually involves a sequential decision-making problem with partial observations, historical information is also helpful for optimizing actions. The partially observable Markov decision process (POMDP) [166] is a natural framework to consider uncertainty, long-term decision making, and partial observability in a principled way. Recent advances have also demonstrated promising results in solving large-scale POMDP problems [71, 260].
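As one concrete (and assumed, not grasp-specific) example of such calibration, the sketch below applies standard temperature scaling in the spirit of [84]: a single temperature is fitted on held-out logits of a success/failure classifier to minimize the negative log-likelihood, and the scaled probabilities can then feed a downstream planner.

```python
# Temperature scaling: calibrate a classifier's confidences on held-out logits.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature T minimizing negative log-likelihood on a held-out set."""
    nll = [-np.mean(np.log(softmax(logits / T)[np.arange(len(labels)), labels] + 1e-12))
           for T in grid]
    return grid[int(np.argmin(nll))]

# Hypothetical validation logits of a success/failure grasp classifier.
rng = np.random.default_rng(0)
logits = rng.normal(scale=4.0, size=(500, 2))          # over-confident raw logits
labels = (logits[:, 1] + rng.normal(size=500) > logits[:, 0]).astype(int)
T = fit_temperature(logits, labels)
calibrated = softmax(logits / T)                       # calibrated success probabilities
print(T)
```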
7. Conclusions
In this paper, we reviewed the history of robotic grasp synthesis approaches, including analytic methods, data-driven methods, and recent object-centric methods. Analytic methods are usually built on top of known object models and mechanical analysis, which can theoretically ensure stability, but their strong assumptions and simplifications limit their application in practical scenarios. Data-driven methods are inspired by neuropsychology and are mostly heuristic. However, they relax the assumptions made by analytic methods and hence are widely used in real-world scenarios. In particular, benefiting from recent progress in learning techniques, they achieve commendable performance in grasping tasks and are promising to play important roles in robotic autonomy. Recently, with developments in semantic vision, object-centric methods have been more and more actively investigated. Object-centric methods combine the understanding of object semantics with grasp synthesis for semantic grasping, which is closer to daily-life scenarios than to industrial applications. We believe that vision-based intuitive physics, open-set grasping with semantic representations, and planning under partial observability and uncertainty will be the future trends of robotic grasping.
Author Contributions
Hanbo Zhang finished most parts of this manuscript. Jian Tang and Shiguang Sun helped to collect and organize the literature. Xuguang Lan is the supervisor of Hanbo Zhang, Jian Tang, and Shiguang Sun; he is also the corresponding author and is responsible for all contents.
References
1. Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the
twenty-first international conference on Machine learning, page 1, 2004.
2. William Agnew, Christopher Xie, Aaron Walsman, Octavian Murad, Yubo Wang, Pedro Domingos, and Siddhartha
Srinivasa. Amodal 3d reconstruction for robotic manipulation via stability and connectivity. In Conference on Robot
Learning, pages 1498–1508. PMLR, 2021.
3. Stefan Ainetter, Christoph Böhm, Rohit Dhakate, Stephan Weiss, and Friedrich Fraundorfer. Depth-aware object seg-
mentation and grasp detection for robotic picking tasks. In The British Machine Vision Conference (BMVC), 2021.
4. Stefan Ainetter and Friedrich Fraundorfer. End-to-end trainable deep neural network for robotic grasp detection and
semantic segmentation from rgb. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages
13452–13458. IEEE, 2021.
5. Jacopo Aleotti and Stefano Caselli. Grasp recognition in virtual reality for robot pregrasp planning by demonstration. In
Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pages 2801–2806.
IEEE, 2006.
6. Peter K Allen and Ruzena Bajcsy. Object recognition using vision and touch. In Proceedings. International Joint
Conference on Artificial Intelligence, 1985.
7. Muhannad Alomari, Paul Duckworth, Majd Hawasly, David C Hogg, and Anthony G Cohn. Natural language ground-
ing and grammar induction for robotic manipulation commands. In Proceedings of the First Workshop on Language
Grounding for Robotics, pages 35–43, 2017.
8. Ermano Arruda, Claudio Zito, Mohan Sridharan, Marek Kopicki, and Jeremy L Wyatt. Generative grasp synthesis from
demonstration using parametric mixtures. arXiv preprint arXiv:1906.11548, 2019.
9. Umar Asif, Mohammed Bennamoun, and Ferdous A Sohel. Rgb-d object recognition and grasp detection using hierar-
chical cascaded forests. IEEE Transactions on Robotics, 33(3):547–564, 2017.
10. Umar Asif, Jianbin Tang, and Stefan Harrer. Graspnet: An efficient convolutional neural network for real-time grasp
detection for low-powered devices. In IJCAI, volume 7, pages 4875–4882, 2018.
11. Umar Asif, Jianbin Tang, and Stefan Harrer. Densely supervised grasp detector (dsgd). In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 33, pages 8085–8093, 2019.
12. Tim Baier-Lowenstein and Jianwei Zhang. Learning to grasp everyday objects using reinforcement-learning with
automatic value cut-off. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1551–
1556. IEEE, 2007.
13. Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understand-
ing. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.
14. Yasemin Bekiroglu, Janne Laaksonen, Jimmy Alison Jorgensen, Ville Kyrki, and Danica Kragic. Assessing grasp
stability based on learning and haptic data. IEEE Transactions on Robotics, 27(3):616–629, 2011.
15. Yoshua Bengio, Aaron C Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review
and new perspectives. CoRR, abs/1206.5538, 1:2012, 2012.
16. Dmitry Berenson, Siddhartha S Srinivasa, Dave Ferguson, Alvaro Collet, and James J Kuffner. Manipulation planning
with workspace goal regions. In 2009 IEEE International Conference on Robotics and Automation, pages 618–624.
IEEE, 2009.
17. Lars Berscheid, Pascal Meißner, and Torsten Kröger. Self-supervised learning for precise pick-and-place without object
model. IEEE Robotics and Automation Letters, 5(3):4828–4835, 2020.
18. Lars Berscheid, Thomas Rühr, and Torsten Kröger. Improving data efficiency of self-supervised learning for robotic
grasping. In 2019 International Conference on Robotics and Automation (ICRA), pages 2125–2131. IEEE, 2019.
19. Antonio Bicchi. On the closure properties of robotic grasping. The International Journal of Robotics Research,
14(4):319–334, 1995.
20. Antonio Bicchi and Vijay Kumar. Robotic grasping and contact: A review. In Proceedings 2000 ICRA. Millennium Con-
ference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065),
volume 1, pages 348–353. IEEE, 2000.
21. Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, and Gaurav S
Sukhatme. Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on
Robotics, 33(6):1273–1291, 2017.
22. Jeannette Bohg and Danica Kragic. Grasping familiar objects using shape context. In 2009 International Conference
on Advanced Robotics, pages 1–6. IEEE, 2009.
23. Jeannette Bohg and Danica Kragic. Learning grasping points with shape context. Robotics and Autonomous Systems,
58(4):362–377, 2010.
24. Jeannette Bohg, Antonio Morales, Tamim Asfour, and Danica Kragic. Data-driven grasp synthesis—a survey. IEEE
Transactions on Robotics, 30(2):289–309, 2013.
25. Alessandro Bonardi, Stephen James, and Andrew J Davison. Learning one-shot imitation from humans without humans.
IEEE Robotics and Automation Letters, 5(2):3533–3539, 2020.
26. Ian M Bullock, Thomas Feix, and Aaron M Dollar. The yale human grasping dataset: Grasp, object, and task data in
household and machine shop environments. The International Journal of Robotics Research, 34(3):251–255, 2015.
27. Berk Calli, Arjun Singh, James Bruce, Aaron Walsman, Kurt Konolige, Siddhartha Srinivasa, Pieter Abbeel, and
Aaron M Dollar. Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics
Research, 36(3):261–268, 2017.
28. Hanwen Cao, Hao-Shu Fang, Wenhai Liu, and Cewu Lu. Suctionnet-1billion: A large-scale benchmark for suction
grasping. arXiv preprint arXiv:2103.12311, 2021.
29. Georgia Chalvatzaki, Nikolaos Gkanatsios, Petros Maragos, and Jan Peters. Orientation attentive robotic grasp synthesis
with augmented grasp map representation. arXiv preprint arXiv:2006.05123, 2020.
30. I-Ming Chen and Joel W Burdick. Finding antipodal point grasps on irregularly shaped objects. IEEE transactions on
Robotics and Automation, 9(4):507–512, 1993.
31. Siwei Chen, Xiao Ma, Yunfan Lu, and David Hsu. Ab initio particle-based object manipulation. In Robotics: Science
and Systems, 2021.
32. Xiangyu Chen, Zelin Ye, Jiankai Sun, Yuda Fan, Fang Hu, Chenxi Wang, and Cewu Lu. Transferable active grasping
and real embodied dataset. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3611–
3618. IEEE, 2020.
33. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter:
Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer,
2020.
34. Yiye Chen, Ruinian Xu, Yunzhi Lin, and Patricio A Vela. A joint network for grasp detection conditioned on natural
language commands. arXiv preprint arXiv:2104.00492, 2021.
35. Zhixin Chen, Mengxiang Lin, Zhixin Jia, and Shibo Jian. Towards generalization and data efficient learning of deep
robotic grasping. arXiv preprint arXiv:2007.00982, 2020.
36. Eris Chinellato, Robert B Fisher, Antonio Morales, and Angel P Del Pobil. Ranking planar grasp configurations for
a three-finger hand. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422),
volume 1, pages 1133–1138. IEEE, 2003.
37. Eris Chinellato, Antonio Morales, Robert B Fisher, and Angel P del Pobil. Visual quality measures for characteriz-
ing planar robot grasps. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
35(1):30–41, 2005.
38. Fu-Jen Chu, Ruinian Xu, and Patricio A Vela. Real-world multiobject, multigrasp detection. IEEE Robotics and
Automation Letters, 3(4):3355–3362, 2018.
39. Vanya Cohen, Benjamin Burchfiel, Thao Nguyen, Nakul Gopalan, Stefanie Tellex, and George Konidaris. Grounding
language attributes to objects using bayesian eigenobjects. In 2019 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 1187–1194. IEEE, 2019.
40. Alvaro Collet, Dmitry Berenson, Siddhartha S Srinivasa, and Dave Ferguson. Object recognition and full pose registra-
tion from a single image for robotic manipulation. In 2009 IEEE International Conference on Robotics and Automation,
pages 48–55. IEEE, 2009.
41. Yang Cong, Ronghan Chen, Bingtao Ma, Hongsen Liu, Dongdong Hou, and Chenguang Yang. A comprehensive study
of 3-d vision-based robot manipulation. IEEE Transactions on Cybernetics, 2021.
42. Jordi Cornella and Raúl Suárez. Determining independent grasp regions on 2d discrete objects. In 2005 IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 2941–2946. IEEE, 2005.
43. Jordi Cornella and Raúl Suárez. Fast and flexible determination of force-closure independent regions to grasp polygonal
objects. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pages 766–771. IEEE,
2005.
44. Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
45. Noel Curtis and Jing Xiao. Efficient and effective grasping of novel objects through learning and adapting a knowledge
base. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2252–2257. IEEE, 2008.
46. Mark R Cutkosky et al. On grasp choice, grasp models, and the design of hands for manufacturing tasks. IEEE
Transactions on robotics and automation, 5(3):269–279, 1989.
47. Hao Dang and Peter K Allen. Learning grasp stability. In 2012 IEEE International Conference on Robotics and
Automation, pages 2392–2397. IEEE, 2012.
48. Sudeep Dasari and Abhinav Gupta. Transformers for one-shot visual imitation. arXiv preprint arXiv:2011.05970, 2020.
49. Elias De Coninck, Tim Verbelen, Pieter Van Molle, Pieter Simoens, and Bart Dhoedt IDLab. Learning to grasp arbitrary
household objects from a single demonstration. In 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 2372–2377. IEEE, 2019.
50. Charles de Granville, Joshua Southerland, and Andrew H Fagg. Learning grasp affordances through human demonstra-
tion. In Proceedings of the International Conference on Development and Learning (ICDL’06), 2006.
51. Yuhong Deng, Xiaofeng Guo, Yixuan Wei, Kai Lu, Bin Fang, Di Guo, Huaping Liu, and Fuchun Sun. Deep reinforce-
ment learning for robotic pushing and picking in cluttered environment. In 2019 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pages 619–626. IEEE, 2019.
52. Amaury Depierre, Emmanuel Dellandréa, and Liming Chen. Jacquard: A large scale dataset for robotic grasp detection.
In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3511–3516. IEEE, 2018.
53. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
54. Dan Ding, Yun-Hui Lee, and Shuguo Wang. Computation of 3-d form-closure grasps. IEEE Transactions on Robotics
and Automation, 17(4):515–522, 2001.
55. Martin Do, Javier Romero, Hedvig Kjellström, Pedram Azad, Tamim Asfour, Danica Kragic, and Rüdiger Dillmann.
Grasp recognition and mapping on humanoid robots. In 2009 9th IEEE-RAS International Conference on Humanoid
Robots, pages 465–471. IEEE, 2009.
56. Mingshuai Dong, Shimin Wei, Jianqin Yin, and Xiuli Yu. Real-world semantic grasping detection. arXiv preprint
arXiv:2111.10522, 2021.
57. Mingshuai Dong, Shimin Wei, Xiuli Yu, and Jianqin Yin. Mask-gd segmentation based robotic grasp detection. arXiv
preprint arXiv:2101.08183, 2021.
58. Yan Duan, Marcin Andrychowicz, Bradly C Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and
Wojciech Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.
59. Staffan Ekvall, Frank Hoffmann, and Danica Kragic. Object recognition and pose estimation for robotic manipulation
using color cooccurrence histograms. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS 2003)(Cat. No. 03CH37453), volume 2, pages 1284–1289. IEEE, 2003.
60. Staffan Ekvall and Danica Kragic. Grasp recognition for programming by demonstration. In Proceedings of the 2005
IEEE International Conference on Robotics and Automation, pages 748–753. IEEE, 2005.
61. Staffan Ekvall and Danica Kragic. Learning and evaluation of the approach vector for automatic grasp generation and
planning. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 4715–4720. IEEE,
2007.
62. Sahar El-Khoury and Anis Sahbani. Handling objects by their handles. In IEEE/RSJ International Conference on
Intelligent Robots and Systems Workshop, 2008.
63. N Elango and AAM Faudzi. A review article: investigations on soft materials for soft robot manipulations. The
International Journal of Advanced Manufacturing Technology, 80(5):1027–1037, 2015.
64. Clemens Eppner, Arsalan Mousavian, and Dieter Fox. Acronym: A large-scale grasp dataset based on simulation. In
2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6222–6227. IEEE, 2021.
65. Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. Why does unsupervised pre-training help deep
learning? In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 201–
208. JMLR Workshop and Conference Proceedings, 2010.
66. Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general
object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–
11453, 2020.
67. Bernard Faverjon and Jean Ponce. On computing two-finger force-closure grasps of curved 2d objects. In Proceedings.
1991 IEEE International Conference on Robotics and Automation, pages 424–429. IEEE, 1991.
68. Thomas Feix, Roland Pawlik, Heinz-Bodo Schmiedmayer, Javier Romero, and Danica Kragic. A comprehensive grasp
taxonomy. In Robotics, science and systems: workshop on understanding the human hand for advancing robotic
manipulation, volume 2, pages 2–3. Seattle, WA, USA, 2009.
69. C Ferrari and J Canny. Planning optimal grasps. In Proceedings 1992 IEEE International Conference on Robotics and
Automation, pages 2290–2295. IEEE, 1992.
70. Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via
meta-learning. In Conference on Robot Learning, pages 357–368. PMLR, 2017.
71. Neha P Garg, David Hsu, and Wee Sun Lee. Learning to grasp under uncertainty using pomdps. In 2019 International
Conference on Robotics and Automation (ICRA), pages 2751–2757. IEEE, 2019.
72. Alexandre Gariépy, Jean-Christophe Ruel, Brahim Chaib-Draa, and Philippe Giguere. Gq-stn: Optimizing one-shot
grasp detection based on robustness classifier. In 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 3996–4003. IEEE, 2019.
73. Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás
Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems,
4:265–293, 2021.
74. Nikolaos Gkanatsios, Georgia Chalvatzaki, Petros Maragos, and Jan Peters. Orientation attentive robot grasp synthesis.
arXiv preprint arXiv:2006, 2020.
75. Jared Glover, Daniela Rus, and Nicholas Roy. Probabilistic models of object geometry for grasp planning. Proceedings
of Robotics: Science and Systems IV, Zurich, Switzerland, pages 278–285, 2008.
76. Corey Goldfeder, Peter K Allen, Claire Lackner, and Raphael Pelossof. Grasp planning via decomposition trees. In
Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 4679–4684. IEEE, 2007.
77. Corey Goldfeder, Matei Ciocarlie, Hao Dang, and Peter K Allen. The columbia grasp database. In 2009 IEEE interna-
tional conference on robotics and automation, pages 1710–1716. IEEE, 2009.
78. Minghao Gou, Hao-Shu Fang, Zhanda Zhu, Sheng Xu, Chenxi Wang, and Cewu Lu. Rgb matters: Learning 7-dof grasp
poses on monocular rgbd images. arXiv preprint arXiv:2103.02184, 2021.
79. Kathrin Gräve, Jörg Stückler, and Sven Behnke. Improving imitated grasping motions through interactive expected
deviation learning. In 2010 10th IEEE-RAS International Conference on Humanoid Robots, pages 397–404. IEEE,
2010.
80. Markus Grotz, David Sippel, and Tamim Asfour. Active vision for extraction of physically plausible support relations.
In 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), pages 439–445. IEEE, 2019.
81. Sergio Guadarrama, Lorenzo Riano, Dave Golland, Daniel Göhring, Yangqing Jia, Dan Klein, Pieter Abbeel, Trevor Dar-
rell, et al. Grounding spatial relations for human-robot interaction. In 2013 IEEE/RSJ International Conference on
Intelligent Robots and Systems, pages 1640–1647. IEEE, 2013.
82. Sergio Guadarrama, Erik Rodner, Kate Saenko, Ning Zhang, Ryan Farrell, Jeff Donahue, and Trevor Darrell. Open-
vocabulary object retrieval. In Robotics: science and systems, 2014.
83. Marcus Gualtieri, Andreas Ten Pas, Kate Saenko, and Robert Platt. High precision grasp pose detection in dense clutter.
In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 598–605. IEEE, 2016.
84. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International
Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
85. Di Guo, Tao Kong, Fuchun Sun, and Huaping Liu. Object discovery and grasp detection with a shared convolutional
neural network. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 2038–2043. IEEE,
2016.
86. Di Guo, Fuchun Sun, Tao Kong, and Huaping Liu. Deep vision networks for real-time robotic grasp detection. Inter-
national Journal of Advanced Robotic Systems, 14(1):1729881416682706, 2016.
87. Di Guo, Fuchun Sun, Huaping Liu, Tao Kong, Bin Fang, and Ning Xi. A hybrid deep architecture for robotic grasp
detection. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1609–1614. IEEE, 2017.
88. Janik Hager, Ruben Bauer, Marc Toussaint, and Jim Mainprice. Graspme-grasp manifold estimator. In 2021 30th IEEE
International Conference on Robot & Human Interactive Communication (RO-MAN), pages 626–632. IEEE, 2021.
89. Li Han, Jeffrey C Trinkle, and Zexiang X Li. Grasp analysis as linear matrix inequality problems. IEEE Transactions
on Robotics and Automation, 16(6):663–674, 2000.
90. Jun Hatori, Yuta Kikuchi, Sosuke Kobayashi, Kuniyuki Takahashi, Yuta Tsuboi, Yuya Unno, Wilson Ko, and Jethro Tan.
Interactively picking real-world objects with unconstrained spoken language instructions. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pages 3774–3781. IEEE, 2018.
91. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
92. Alexander Herzog, Peter Pastor, Mrinal Kalakrishnan, Ludovic Righetti, Tamim Asfour, and Stefan Schaal. Template-
based learning of grasp selection. In 2012 IEEE International Conference on Robotics and Automation, pages 2379–
2384. IEEE, 2012.
93. Alexander Herzog, Peter Pastor, Mrinal Kalakrishnan, Ludovic Righetti, Jeannette Bohg, Tamim Asfour, and Stefan
Schaal. Learning of grasp selection based on shape-templates. Autonomous Robots, 36(1):51–65, 2014.
94. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
95. Matthew William Horn. Quantifying grasp quality using an inverse reinforcement learning algorithm. PhD thesis,
2017.
96. Kaijen Hsiao and Tomas Lozano-Perez. Imitation learning of whole-body grasps. In 2006 IEEE/RSJ international
conference on intelligent robots and systems, pages 5657–5662. IEEE, 2006.
97. Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referen-
tial expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1115–1124, 2017.
98. Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. Diff-
taichi: Differentiable programming for physical simulation. arXiv preprint arXiv:1910.00935, 2019.
99. Yuanming Hu, Jiancheng Liu, Andrew Spielberg, Joshua B Tenenbaum, William T Freeman, Jiajun Wu, Daniela Rus,
and Wojciech Matusik. Chainqueen: A real-time differentiable physical simulator for soft robotics. In 2019 Interna-
tional conference on robotics and automation (ICRA), pages 6265–6271. IEEE, 2019.
100. Yongqiang Huang, Matteo Bianchi, Minas Liarokapis, and Yu Sun. Recent data sets on object manipulation: A survey.
Big data, 4(4):197–216, 2016.
101. Kai Huebner, Steffen Ruthotto, and Danica Kragic. Minimum volume bounding box decomposition for shape approx-
imation in robot grasping. In 2008 IEEE International Conference on Robotics and Automation, pages 1628–1633.
IEEE, 2008.
102. Thea Iberall, Joe Jackson, Liz Labbe, and Ralph Zampano. Knowledge-based prehension: Capturing human dexterity.
In Proceedings. 1988 IEEE International Conference on Robotics and Automation, pages 82–87. IEEE, 1988.
103. Shariq Iqbal, Jonathan Tremblay, Andy Campbell, Kirby Leung, Thang To, Jia Cheng, Erik Leitch, Duncan McKay,
and Stan Birchfield. Toward sim-to-real directional semantic grasping. In 2020 IEEE International Conference on
Robotics and Automation (ICRA), pages 7247–7253. IEEE, 2020.
104. Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in neural informa-
tion processing systems, 28:2017–2025, 2015.
105. Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine,
Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-
to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 12627–12637, 2019.
106. Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. End-to-end learning of seman-
tic grasping. arXiv preprint arXiv:1707.01932, 2017.
107. Marc Jeannerod. The neural and behavioural organization of goal-directed movements. Clarendon Press/Oxford
University Press, 1988.
108. Yan-Bin Jia. Computation on parametric curves with an application in grasping. The International Journal of Robotics
Research, 23(7-8):827–857, 2004.
109. Ping Jiang, Junji Oaki, Yoshiyuki Ishihara, Junichiro Ooga, Haifeng Han, Atsushi Sugahara, Seiji Tokura, Haruna Eto,
Kazuma Komoda, and Akihito Ogawa. Learning suction graspability considering grasp quality and robot reachability
for bin-picking. arXiv preprint arXiv:2111.02571, 2021.
110. Yun Jiang, Stephen Moseson, and Ashutosh Saxena. Efficient grasping from rgbd images: Learning using a new
rectangle representation. In 2011 IEEE International conference on robotics and automation, pages 3304–3311. IEEE,
2011.
111. Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE
transactions on pattern analysis and machine intelligence, 2020.
112. Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense cap-
tioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4565–4574, 2016.
113. Ishay Kamon, Tamar Flash, and Shimon Edelman. Learning to grasp using visual information. In Proceedings of IEEE
International Conference on Robotics and Automation, volume 3, pages 2470–2476. IEEE, 1996.
114. Sing Bing Kang and Katsushi Ikeuchi. Toward automatic robot instruction from perception-recognizing a grasp from
observation. IEEE Transactions on Robotics and Automation, 9(4):432–443, 1993.
115. Rainer Kartmann, Fabian Paus, Markus Grotz, and Tamim Asfour. Extraction of physically plausible support relations
to predict and validate manipulation action effects. IEEE Robotics and Automation Letters, 3(4):3991–3998, 2018.
116. Hamidreza Kasaei and Mohammadreza Kasaei. Mvgrasp: Real-time multi-view 3d object grasping in highly cluttered
environments. arXiv preprint arXiv:2103.10997, 2021.
117. Hamidreza Kasaei, Sha Luo, Remo Sasso, and Mohammadreza Kasaei. Simultaneous multi-view object recognition
and grasping in open-ended domains. arXiv preprint arXiv:2106.01866, 2021.
118. Byoung-Ho Kim, Sang-Rok Oh, Byung-Ju Yi, and Il Hong Suh. Optimal grasping based on non-dimensionalized
performance indices. In Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems.
Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No. 01CH37180), volume 2, pages 949–956.
IEEE, 2001.
119. Kilian Kleeberger, Richard Bormann, Werner Kraus, and Marco F Huber. A survey on learning-based robotic grasping.
Current Robotics Reports, pages 1–11, 2020.
120. Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
121. Danica Kragic and Henrik I Christensen. Model based techniques for robotic servoing and grasping. In IEEE/RSJ
international conference on intelligent robots and systems, volume 1, pages 299–304. IEEE, 2002.
122. Danica Kragic, Andrew T Miller, and Peter K Allen. Real-time tracking meets online grasp planning. In Proceedings
2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164), volume 3, pages
2460–2465. IEEE, 2001.
123. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural net-
works. Advances in neural information processing systems, 25:1097–1105, 2012.
124. Oliver Kroemer, Scott Niekum, and George Konidaris. A review of robot learning for manipulation: Challenges, repre-
sentations, and algorithms. Journal of Machine Learning Research, 22:30–1, 2021.
125. James R Kubricht, Keith J Holyoak, and Hongjing Lu. Intuitive physics: Current research and controversies. Trends in
cognitive sciences, 21(10):749–759, 2017.
126. Sulabh Kumra and Christopher Kanan. Robotic grasp detection using deep convolutional neural networks. In 2017
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 769–776. IEEE, 2017.
127. K Lakshminarayana. Mechanics of form closure. ASME paper, 78-DET-32, 1978.
128. Thomas Lampe and Martin Riedmiller. Acquiring visual servoing reaching and grasping skills using neural reinforce-
ment learning. In The 2013 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2013.
129. Steven M LaValle. Planning algorithms. Cambridge university press, 2006.
130. Quoc V Le, David Kamm, Arda F Kara, and Andrew Y Ng. Learning to grasp objects with multiple contact points. In
2010 IEEE International Conference on Robotics and Automation, pages 5062–5069. IEEE, 2010.
131. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
132. Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal
of Robotics Research, 34(4-5):705–724, 2015.
133. Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for
robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research,
37(4-5):421–436, 2018.
134. Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed
points. Advances in neural information processing systems, 31:820–830, 2018.
135. Yikun Li, Lambert Schomaker, and S Hamidreza Kasaei. Learning to grasp 3d objects using deep residual u-nets. In
2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pages 781–787.
IEEE, 2020.
136. Yiming Li, Tao Kong, Ruihang Chu, Yifeng Li, Peng Wang, and Lei Li. Simultaneous semantic and collision learning
for 6-dof grasp pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pages 3571–3578. IEEE, 2021.
137. Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, and Jianwei Zhang.
Pointnetgpd: Detecting grasp configurations from point sets. In 2019 International Conference on Robotics and Au-
tomation (ICRA), pages 3629–3635. IEEE, 2019.
138. Yun Lin and Yu Sun. Grasp planning based on strategy extracted from demonstration. In 2014 IEEE/RSJ International
Conference on Intelligent Robots and Systems, pages 4458–4463. IEEE, 2014.
139. Yun Lin and Yu Sun. Robot grasp planning based on demonstrated grasp strategies. The International Journal of
Robotics Research, 34(1):26–42, 2015.
140. Guanfeng Liu, Jijie Xu, Xin Wang, and Zexiang Li. On quality functions for grasp synthesis, fixture planning, and
coordinated manipulation. IEEE Transactions on Automation Science and Engineering, 1(2):146–162, 2004.
141. Huan Liu, Thea Iberall, and George A Bekey. The multi-dimensional quality of task requirements for dextrous robot
hand control. In 1989 IEEE International Conference on Robotics and Automation, pages 452–453. IEEE Computer
Society, 1989.
142. Huaping Liu, Yuan Yuan, Yuhong Deng, Xiaofeng Guo, Yixuan Wei, Kai Lu, Bin Fang, Di Guo, and Fuchun Sun.
Active affordance exploration for robot grasping. In International Conference on Intelligent Robotics and Applications,
pages 426–438. Springer, 2019.
143. Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning
for generic object detection: A survey. International journal of computer vision, 128(2):261–318, 2020.
144. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg.
Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
145. Yun-Hui Liu. Computing n-finger form-closure grasps on polygonal objects. The International journal of robotics
research, 19(2):149–158, 2000.
146. Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In
European conference on computer vision, pages 852–869. Springer, 2016.
147. Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations
for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
148. Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. In Proceedings of
Robotics: Science and Systems, 2021.
149. Jeffrey Mahler and Ken Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences.
In Conference on robot learning, pages 515–524. PMLR, 2017.
150. Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken
Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics.
arXiv preprint arXiv:1703.09312, 2017.
151. Jeffrey Mahler, Matthew Matl, Xinyu Liu, Albert Li, David Gealy, and Ken Goldberg. Dex-net 3.0: Computing robust
vacuum suction grasp targets in point clouds using a new analytic model and deep learning. In 2018 IEEE International
Conference on robotics and automation (ICRA), pages 5620–5627. IEEE, 2018.
152. Jeffrey Mahler, Florian T Pokorny, Brian Hou, Melrose Roderick, Michael Laskey, Mathieu Aubry, Kai Kohlhoff,
Torsten Kröger, James Kuffner, and Ken Goldberg. Dex-net 1.0: A cloud-based network of 3d objects for robust grasp
planning using a multi-armed bandit model with correlated rewards. In 2016 IEEE international conference on robotics
and automation (ICRA), pages 1957–1964. IEEE, 2016.
153. Zhao Mandi, Fangchen Liu, Kimin Lee, and Pieter Abbeel. Towards more generalizable one-shot visual imitation
learning. arXiv preprint arXiv:2110.13423, 2021.
154. Tanis Mar, Vadim Tikhanoff, Giorgio Metta, and Lorenzo Natale. Self-supervised learning of grasp dependent tool
affordances on the icub humanoid robot. In 2015 IEEE International Conference on Robotics and Automation (ICRA),
pages 3200–3206. IEEE, 2015.
155. Eric Marchand, Hideaki Uchiyama, and Fabien Spindler. Pose estimation for augmented reality: a hands-on survey.
IEEE transactions on visualization and computer graphics, 22(12):2633–2651, 2015.
156. Xanthippi Markenscoff, Luqun Ni, and Christos H Papadimitriou. The geometry of grasping. The International Journal
of Robotics Research, 9(1):61–74, 1990.
157. Xanthippi Markenscoff and Christos H Papadimitriou. Optimum grip of a polygon. The International Journal of
Robotics Research, 8(2):17–29, 1989.
158. Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. Learning to parse natural language commands to
a robot control system. In Experimental robotics, pages 403–415. Springer, 2013.
159. Kirill Mazur and Victor Lempitsky. Cloud transformers: A universal approach to point cloud processing tasks. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10715–10724, 2021.
160. Andrew T Miller and Peter K Allen. Examples of 3d grasp quality computations. In Proceedings 1999 IEEE Interna-
tional Conference on Robotics and Automation (Cat. No. 99CH36288C), volume 2, pages 1240–1246. IEEE, 1999.
161. Andrew T Miller and Peter K Allen. Graspit! a versatile simulator for robotic grasping. IEEE Robotics & Automation
Magazine, 11(4):110–122, 2004.
162. Andrew T Miller, Steffen Knoop, Henrik I Christensen, and Peter K Allen. Automatic grasp planning using shape
primitives. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), volume 2,
pages 1824–1829. IEEE, 2003.
163. Brian Mirtich and John Canny. Easily computable optimum grasps in 2-d and 3-d. In Proceedings of the 1994 IEEE
International Conference on Robotics and Automation, pages 739–747. IEEE, 1994.
164. Bhubaneswar Mishra. Grasp metrics: Optimality and complexity. In Algorithmic Foundations of Robotics, pages
137–166. AK Peters, 1995.
165. Rasoul Mojtahedzadeh, Abdelbaki Bouguerra, Erik Schaffernicht, and Achim J Lilienthal. Support relation analysis
and decision making for safe robotic manipulation tasks. Robotics and Autonomous Systems, 71:99–117, 2015.
166. George E Monahan. State of the art—a survey of partially observable markov decision processes: theory, models, and
algorithms. Management science, 28(1):1–16, 1982.
167. A Morales, P Azad, T Asfour, D Kraft, S Knoop, R Dillmann, A Kargov, CH Pylatiuk, and S Schulz. An anthropomor-
phic grasping approach for an assistant humanoid robot. In International Symposium on Robotics (ISR), 2006.
168. Douglas Morrison, Peter Corke, and Jürgen Leitner. Closing the loop for robotic grasping: A real-time, generative grasp
synthesis approach. arXiv preprint arXiv:1804.05172, 2018.
169. Douglas Morrison, Peter Corke, and Jürgen Leitner. Multi-view picking: Next-best-view reaching for improved grasp-
ing in clutter. In 2019 International Conference on Robotics and Automation (ICRA), pages 8762–8768. IEEE, 2019.
170. Douglas Morrison, Peter Corke, and Jürgen Leitner. Learning robust, real-time, reactive robotic grasping. The Interna-
tional journal of robotics research, 39(2-3):183–201, 2020.
171. Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipu-
lation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2901–2910, 2019.
172. Adithyavairavan Murali, Arsalan Mousavian, Clemens Eppner, Chris Paxton, and Dieter Fox. 6-dof grasping for target-
driven object manipulation in clutter. In 2020 IEEE International Conference on Robotics and Automation (ICRA),
pages 6232–6238. IEEE, 2020.
173. Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression
understanding. In European Conference on Computer Vision, pages 792–807. Springer, 2016.
174. Shree K Nayar, Hiroshi Murase, and Sameer A Nene. Learning, positioning, and tracking visual appearance. In
Proceedings of the 1994 IEEE International Conference on Robotics and Automation, pages 3237–3244. IEEE, 1994.
175. Van-Duc Nguyen. Constructing force-closure grasps. The International Journal of Robotics Research, 7(3):3–16, 1988.
176. Peiyuan Ni, Wenguang Zhang, Xiaoxiao Zhu, and Qixin Cao. Pointnet++ grasping: Learning an end-to-end spatial
grasp generation algorithm from sparse point clouds. In 2020 IEEE International Conference on Robotics and Automa-
tion (ICRA), pages 3619–3625. IEEE, 2020.
177. Nattee Niparnan and Attawith Sudsang. Computing all force-closure grasps of 2d objects from contact point set. In
2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1599–1604. IEEE, 2006.
178. Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language
processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2):604–624, 2020.
179. Swagatika Panda, AH Abdul Hafez, and CV Jawahar. Learning support order for manipulation in clutter. In 2013
IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 809–815. IEEE, 2013.
180. Dongwon Park and Se Young Chun. Classification based grasp detection using spatial transformer network. arXiv
preprint arXiv:1803.01356, 2018.
181. Dongwon Park, Yonghyeok Seo, and Se Young Chun. Real-time, highly accurate robotic grasp detection using fully
convolutional neural network with rotation ensemble module. In 2020 IEEE International Conference on Robotics and
Automation (ICRA), pages 9397–9403. IEEE, 2020.
182. Dongwon Park, Yonghyeok Seo, Dongju Shin, Jaesik Choi, and Se Young Chun. A single multi-task deep neural net-
work with post-processing for object detection with reasoning and robotic grasp detection. In 2020 IEEE International
Conference on Robotics and Automation (ICRA), pages 7300–7306. IEEE, 2020.
183. Young C Park and Gregory P Starr. Grasp synthesis of polygonal objects using a three-fingered robot hand. The
International journal of robotics research, 11(3):163–184, 1992.
184. Rohan Paul, Jacob Arkin, Derya Aksaray, Nicholas Roy, and Thomas M Howard. Efficient grounding of abstract
spatial concepts for natural language interaction with robot platforms. The International Journal of Robotics Research,
37(10):1269–1299, 2018.
185. Fabian Paus and Tamim Asfour. Probabilistic representation of objects and their support relations. In International
Symposium on Experimental Robotics, pages 510–519, 2021.
186. Raphael Pelossof, Andrew Miller, Peter Allen, and Tony Jebara. An svm learning approach to robotic grasping. In
IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, volume 4, pages
3512–3518. IEEE, 2004.
187. Justus H Piater and Roderic A Grupen. Learning appearance features to support robotic manipulation. Cognitive Vision
Workshop, pages 19–20, 2001.
188. Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours.
In 2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016.
189. Florian T Pokorny, Kaiyu Hang, and Danica Kragic. Grasp moduli spaces. In Robotics: Science and Systems, 2013.
190. Nancy S Pollard. Closure and quality equivalence for efficient synthesis of grasps from examples. The International
Journal of Robotics Research, 23(6):595–613, 2004.
191. Jean Ponce and Bernard Faverjon. On computing three-finger force-closure grasps of polygonal objects. IEEE Trans-
actions on robotics and automation, 11(6):868–881, 1995.
192. Jean Ponce, Steve Sullivan, Attawith Sudsang, Jean-Daniel Boissonnat, and Jean-Pierre Merlet. On computing four-
finger equilibrium and force-closure grasps of polyhedral objects. The International Journal of Robotics Research,
16(1):11–35, 1997.
193. Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification
and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,
2017.
194. Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a
metric space. arXiv preprint arXiv:1706.02413, 2017.
195. Yuzhe Qin, Rui Chen, Hao Zhu, Meng Song, Jing Xu, and Hao Su. S4g: Amodal single-view single-shot se (3) grasp
detection in cluttered scenes. In Conference on robot learning, pages 53–65. PMLR, 2020.
196. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda
Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv
preprint arXiv:2103.00020, 2021.
197. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
198. Deepak Rao, Quoc V Le, Thanathorn Phoka, Morgan Quigley, Attawith Sudsang, and Andrew Y Ng. Grasping novel
objects with depth segmentation. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages
2578–2585. IEEE, 2010.
199. Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolutional neural networks. In 2015 IEEE
International Conference on Robotics and Automation (ICRA), pages 1316–1322. IEEE, 2015.
200. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
201. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
202. F Reuleaux. The kinematics of machinery. Macmillan and Company, 1876. Republished by Dover.
203. Elon Rimon and Joel Burdick. On force and form closure for multiple finger grasps. In Proceedings of IEEE Interna-
tional Conference on Robotics and Automation, volume 2, pages 1795–1800. IEEE, 1996.
204. Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel
Dupoux. Intphys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616,
2018.
205. Máximo A Roa and Raúl Suárez. Independent contact regions for frictional grasps on 3d objects. In 2008 IEEE
International Conference on Robotics and Automation, pages 1622–1627. IEEE, 2008.
206. Máximo A Roa and Raúl Suárez. Computation of independent contact regions for grasping 3-d objects. IEEE Transac-
tions on Robotics, 25(4):839–850, 2009.
207. Máximo A Roa and Raúl Suárez. Grasp quality measures: review and performance. Autonomous robots, 38(1):65–88,
2015.
208. Alberto Rodriguez, Matthew T Mason, and Steve Ferry. From caging to grasping. The International Journal of Robotics
Research, 31(7):886–900, 2012.
209. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmen-
tation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241.
Springer, 2015.
210. Carlos Rosales, Raúl Suárez, Marco Gabiccini, and Antonio Bicchi. On the synthesis of feasible and prehensile robotic
grasps. In 2012 IEEE International Conference on Robotics and Automation, pages 550–556. IEEE, 2012.
211. Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization,
Monte-Carlo simulation, and machine learning, volume 133. Springer, 2004.
212. Anis Sahbani, Sahar El-Khoury, and Philippe Bidaud. An overview of 3d object grasp synthesis algorithms. Robotics
and Autonomous Systems, 60(3):326–336, 2012.
213. Marcos Salganicoff, Lyle H Ungar, and Ruzena Bajcsy. Active learning for vision-based robot grasping. Machine
Learning, 23(2):251–278, 1996.
214. Ashutosh Saxena, Justin Driemeyer, Justin Kearns, and Andrew Y Ng. Robotic grasping of novel objects. In Proceed-
ings of the 19th International Conference on Neural Information Processing Systems, pages 1209–1216, 2006.
215. Ashutosh Saxena, Justin Driemeyer, and Andrew Y Ng. Robotic grasping of novel objects using vision. The Interna-
tional Journal of Robotics Research, 27(2):157–173, 2008.
216. J Schill, J Laaksonen, M Przybylski, V Kyrki, T Asfour, and R Dillmann. Learning continuous grasp stability for a
humanoid robot hand based on tactile sensing. In 2012 4th IEEE RAS & EMBS International Conference on Biomedical
Robotics and Biomechatronics (BioRob), pages 1901–1906. IEEE, 2012.
217. Alexander M Schmidts, Dongheui Lee, and Angelika Peer. Imitation learning of human grasping skills from motion
and force data. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1002–1007. IEEE,
2011.
218. Lin Shao, Fabio Ferreira, Mikael Jorda, Varun Nambiar, Jianlan Luo, Eugen Solowjow, Juan Aparicio Ojea, Oussama
Khatib, and Jeannette Bohg. Unigrasp: Learning a unified model to grasp with multifingered robotic hands. IEEE
Robotics and Automation Letters, 5(2):2286–2293, 2020.
219. Quanquan Shao and Jie Hu. Combining rgb and points to predict grasping region for robotic bin-picking. arXiv preprint
arXiv:1904.07394, 2019.
220. Quanquan Shao, Jie Hu, Weiming Wang, Yi Fang, Wenhai Liu, Jin Qi, and Jin Ma. Suction grasp region prediction
using self-supervised learning for object picking in dense clutter. In 2019 IEEE 5th International Conference on
Mechatronics System and Robots (ICMSR), pages 7–12. IEEE, 2019.
221. Karun B Shimoga. Robot grasp synthesis algorithms: A survey. The International Journal of Robotics Research,
15(3):230–266, 1996.
222. Mohit Shridhar and David Hsu. Interactive visual grounding of referring expressions for human-robot interaction. arXiv
preprint arXiv:1806.03831, 2018.
223. Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. arXiv
preprint arXiv:2109.12098, 2021.
224. Mohit Shridhar, Dixant Mittal, and David Hsu. Ingress: Interactive visual grounding of referring expressions. The
International Journal of Robotics Research, 39(2-3):217–232, 2020.
225. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014.
226. Gordon Smith, Eric Lee, Ken Goldberg, Karl Bohringer, and John Craig. Computing parallel-jaw grips. In Proceedings
1999 IEEE International Conference on Robotics and Automation (Cat. No. 99CH36288C), volume 3, pages 1897–
1903. IEEE, 1999.
227. Yanan Song, Liang Gao, Xinyu Li, and Weiming Shen. A novel robotic grasp detection method based on region
proposal networks. Robotics and Computer-Integrated Manufacturing, 65:101963, 2020.
228. Darrell Stam, Jean Ponce, and Bernard Faverjon. A system for planning and executing two-finger force-closure grasps
of curved 2d objects. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems,
volume 1, pages 210–217. IEEE, 1992.
229. S Stansfield. Visually-aided tactile exploration. In Proceedings. 1987 IEEE International Conference on Robotics and
Automation, volume 4, pages 1487–1492. IEEE, 1987.
230. Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-
linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
231. Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, and Dieter Fox. Contact-graspnet: Efficient 6-dof grasp
generation in cluttered scenes. arXiv preprint arXiv:2103.14127, 2021.
232. Tamara Supuk, Timotej Kodek, and Tadej Bajd. Estimation of hand preshaping during human grasping. Medical
engineering & physics, 27(9):790–797, 2005.
233. John D Sweeney and Rod Grupen. A model of shared grasp affordances from demonstration. In 2007 7th IEEE-RAS
International Conference on Humanoid Robots, pages 27–35. IEEE, 2007.
234. Marek Teichmann. A grasp metric invariant under rigid motions. In Proceedings of IEEE International Conference on
Robotics and Automation, volume 3, pages 2143–2148. IEEE, 1996.
235. Andreas ten Pas, Marcus Gualtieri, Kate Saenko, and Robert Platt. Grasp pose detection in point clouds. The Interna-
tional Journal of Robotics Research, 36(13-14):1455–1473, 2017.
236. Andreas Ten Pas and Robert Platt. Using geometry to detect grasp poses in 3d point clouds. In Robotics Research,
pages 307–324. Springer, 2018.
237. Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose
estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
238. Joaquin Vanschoren. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.
239. Chenxi Wang, Hao-Shu Fang, Minghao Gou, Hongjie Fang, Jin Gao, and Cewu Lu. Graspness discovery in clutters
for fast and accurate grasp detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 15964–15973, 2021.
240. Dexin Wang, Chunsheng Liu, Faliang Chang, Nanjun Li, and Guangxin Li. High-performance pixel-level grasp detec-
tion based on adaptive grasping and grasp-aware network. IEEE Transactions on Industrial Electronics, 2021.
241. Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
242. Shengfan Wang, Xin Jiang, Jie Zhao, Xiaoman Wang, Weiguo Zhou, and Yunhui Liu. Efficient fully convolution neural
network for generating pixel wise robotic grasps with high resolution images. In 2019 IEEE International Conference
on Robotics and Biomimetics (ROBIO), pages 474–480. IEEE, 2019.
243. Tao Wang, Chao Yang, Frank Kirchner, Peng Du, Fuchun Sun, and Bin Fang. Multimodal grasp data set: A novel visual–
tactile data set for robotic manipulation. International Journal of Advanced Robotic Systems, 16(1):1729881418821571,
2019.
244. Yao Wang, Yangtao Zheng, Boyang Gao, and Di Huang. Double-dot network for antipodal grasp detection. In 2021
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4654–4661. IEEE, 2021.
245. Yefei Wang, Kaili Wang, Yi Wang, Di Guo, Huaping Liu, and Fuchun Sun. Audio-visual grounding referring expression
for robotic manipulation. arXiv preprint arXiv:2109.10571, 2021.
246. Zhichao Wang, Zhiqi Li, Bin Wang, and Hong Liu. Robot grasp detection using multimodal deep convolutional neural
networks. Advances in Mechanical Engineering, 8(9):1687814016668077, 2016.
247. Wei Wei, Yongkang Luo, Fuyu Li, Guangyun Xu, Jun Zhong, Wanyi Li, and Peng Wang. Gpr: Grasp pose refinement
network for cluttered scenes. arXiv preprint arXiv:2105.08502, 2021.
248. Keenon Werling, Dalton Omens, Jeongseok Lee, Ioannis Exarchos, and C Karen Liu. Fast and feature-complete differ-
entiable physics for articulated rigid bodies with contact. arXiv preprint arXiv:2103.16021, 2021.
249. Chaozheng Wu, Jian Chen, Qiaoyu Cao, Jianchi Zhang, Yunxin Tai, Lin Sun, and Kui Jia. Grasp proposal networks:
An end-to-end solution for visual learning of robotic grasps. arXiv preprint arXiv:2009.12606, 2020.
250. Yongxiang Wu, Fuhai Zhang, and Yili Fu. Real-time robotic multi-grasp detection using anchor-free fully convolutional
grasp detector. IEEE Transactions on Industrial Electronics, 2021.
251. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey
on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
252. Yu Xiang, Christopher Xie, Arsalan Mousavian, and Dieter Fox. Learning rgb-d feature embeddings for unseen object
instance segmentation. arXiv preprint arXiv:2007.15157, 2020.
253. Christopher Xie, Yu Xiang, Arsalan Mousavian, and Dieter Fox. Unseen object instance segmentation for robotic
environments. IEEE Transactions on Robotics, 2021.
254. Xu Xie, Changyang Li, Chi Zhang, Yixin Zhu, and Song-Chun Zhu. Learning virtual grasp with failed demonstrations
via bayesian inverse reinforcement learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 1812–1817. IEEE, 2019.
255. Ruinian Xu, Fu-Jen Chu, and Patricio A Vela. Gknet: grasp keypoint network for grasp candidates detection. arXiv
preprint arXiv:2106.08497, 2021.
256. Zhenjia Xu, Beichun Qi, Shubham Agrawal, and Shuran Song. Adagrasp: Learning an adaptive gripper-aware grasping
policy. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4620–4626. IEEE, 2021.
257. Mengyuan Yan, Iuri Frosio, Stephen Tyree, and Jan Kautz. Sim-to-real transfer of accurate grasping with eye-in-hand
observations and continuous control. arXiv preprint arXiv:1712.03303, 2017.
258. Xinchen Yan, Mohi Khansari, Jasmine Hsu, Yuanzheng Gong, Yunfei Bai, Sören Pirk, and Honglak Lee. Data-efficient
learning for sim-to-real robotic grasping using deep point cloud prediction networks. arXiv preprint arXiv:1906.08989,
2019.
259. Daniel Yang, Tarik Tosun, Benjamin Eisner, Volkan Isler, and Daniel Lee. Robotic grasping through combined image-
based grasp proposal and 3d reconstruction. In 2021 IEEE International Conference on Robotics and Automation
(ICRA), pages 6350–6356. IEEE, 2021.
260. Nan Ye, Adhiraj Somani, David Hsu, and Wee Sun Lee. Despot: Online pomdp planning with regularization. Journal
of Artificial Intelligence Research, 58:231–266, 2017.
261. Tian Ye, Xiaolong Wang, James Davidson, and Abhinav Gupta. Interpretable intuitive physics model. In Proceedings
of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
262. Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention
network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 1307–1315, 2018.
263. Tianhe Yu, Pieter Abbeel, Sergey Levine, and Chelsea Finn. One-shot hierarchical imitation learning of compound
visuomotor tasks. arXiv preprint arXiv:1810.11043, 2018.
264. Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot
imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.
265. Andy Zeng, Shuran Song, Stefan Welker, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Learning synergies
between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 4238–4245. IEEE, 2018.
266. Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R Hogan, Maria Bauza, Daolin Ma, Orion Taylor,
Melody Liu, Eudald Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping
and cross-domain image matching. In 2018 IEEE international conference on robotics and automation (ICRA), pages
3750–3757. IEEE, 2018.
267. Hanbo Zhang, Xuguang Lan, Site Bai, Lipeng Wan, Chenjie Yang, and Nanning Zheng. A multi-task convolutional
neural network for autonomous robotic grasping in object stacking scenes. In 2019 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pages 6435–6442. IEEE, 2019.
268. Hanbo Zhang, Xuguang Lan, Site Bai, Xinwen Zhou, Zhiqiang Tian, and Nanning Zheng. Roi-based robotic grasp
detection for object overlapping scenes. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pages 4768–4775. IEEE, 2019.
269. Hanbo Zhang, Xuguang Lan, Xinwen Zhou, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Robotic grasping in
multi-object stacking scenes based on visual reasoning. Scientia Sinica Technologica, 48(12):1341–1356, 2018.
270. Hanbo Zhang, Xuguang Lan, Xinwen Zhou, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Visual manipulation
relationship network for autonomous robotics. In 2018 IEEE-RAS 18th International Conference on Humanoid Robots
(Humanoids), pages 118–125. IEEE, 2018.
271. Hanbo Zhang, Xuguang Lan, Xinwen Zhou, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Visual manipulation
relationship recognition in object-stacking scenes. Pattern Recognition Letters, 140:34–42, 2020.
272. Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, and Nanning Zheng. Invigorate: Interactive visual
grounding and grasping in clutter. arXiv preprint arXiv:2108.11092, 2021.
273. Hanbo Zhang, Deyu Yang, Han Wang, Binglei Zhao, Xuguang Lan, Jishiyu Ding, and Nanning Zheng. Regrad: A
large-scale relational grasp dataset for safe and object-specific robotic grasping in clutter. IEEE Robotics and Automation
Letters, 2022.
274. Hanbo Zhang, Xinwen Zhou, Xuguang Lan, Jin Li, Zhiqiang Tian, and Nanning Zheng. A real-time robotic grasping
approach with oriented anchor box. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
275. Li Zhang and Jeffrey C Trinkle. The application of particle filtering to grasping acquisition with visual occlusion and
tactile sensing. In 2012 IEEE International Conference on Robotics and Automation, pages 3805–3812. IEEE, 2012.
276. Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation
learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on
Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018.
277. Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and
Data Engineering, 2020.
278. Binglei Zhao, Hanbo Zhang, Xuguang Lan, Haoyu Wang, Zhiqiang Tian, and Nanning Zheng. Regnet: Region-based
grasp network for end-to-end grasp detection in point clouds. In 2021 IEEE International Conference on Robotics and
Automation (ICRA), pages 13474–13480. IEEE, 2021.
279. Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
280. Xinwen Zhou, Xuguang Lan, Hanbo Zhang, Zhiqiang Tian, Yang Zhang, and Nanning Zheng. Fully convolutional
grasp detection network with oriented anchor box. In 2018 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pages 7223–7230. IEEE, 2018.
281. Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018.
282. Xiangyang Zhu and Jun Wang. Synthesis of force-closure grasps on 3-d objects based on the q distance. IEEE
Transactions on robotics and Automation, 19(4):669–679, 2003.
283. R Zollner, O Rogalla, R Dillmann, and JM Zollner. Dynamic grasp recognition within the framework of programming
by demonstration. In Proceedings 10th IEEE International Workshop on Robot and Human Interactive Communication.
ROMAN 2001 (Cat. No. 01TH8591), pages 418–423. IEEE, 2001.
284. Guoyu Zuo, Jiayuan Tong, Hongxing Liu, Wenbai Chen, and Jianfeng Li. Graph-based visual manipulation relationship
reasoning network for robotic grasping. Frontiers in Neurorobotics, 15, 2021.