This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, for non-commercial use, provided the original work is properly cited.
Original Paper
GP-Net: A Lightweight Generative
Convolutional Neural Network with
Grasp Priority
Yuxiang Yang1,2 , Yuhu Xing1 , Jing Zhang2 and Dacheng Tao2∗
1 School of Electronics and Information, Hangzhou Dianzi University, Hangzhou, China
2 School of Computer Science, The University of Sydney, Australia
ABSTRACT
Grasping densely stacked objects may cause collisions and result in
failures, degrading the functionality of robotic arms. In this paper, we
propose a novel lightweight generative convolutional neural network with
grasp priority called GP-Net to solve multiobject grasp tasks in densely
stacked environments. Specifically, a calibrated global context (CGC)
module is devised to model the global context while obtaining long-range
dependencies to achieve salient feature representation. A grasp priority
prediction (GPP) module is designed to assign high grasp priorities to
top-level objects, resulting in better grasp performance. Moreover, a
new loss function is proposed, which can guide the network to focus
on high-priority objects effectively. Extensive experiments on several
challenging benchmarks including REGRAD and VMRD demonstrate
the superiority of our proposed GP-Net over representative state-of-the-
art methods. We also tested our model in a real-world environment and
obtained an average success rate of 83.3%, demonstrating that GP-Net
has excellent generalization capabilities in real-world environments as
well. The source code will be made publicly available.
1 Introduction
The ability to grasp objects is one of the most important and fundamental
capabilities of intelligent robots [1, 30–32, 50]. As deep learning techniques
have made great progress in visual perception, various deep learning methods
have been applied to grasp techniques [9, 11, 28, 37, 43]. Six-degree-of-freedom
(6DoF) grasp pose estimation methods [15, 33, 40] focus on constructing point clouds of objects and diverse 6DoF grasp parameters in the simulation environment. In the real grasping environment, these methods filter the grasp parameters with the aid of the pose estimate of the target object's point cloud and finally select the optimal grasp parameters. These 6DoF pose estimation methods rely on a known point cloud of the target object, which limits their performance in practical applications.
To address these problems, researchers simplified the process of robotic grasping by using 4DoF parameters, i.e., the x-coordinate of the grasp, the y-coordinate of the grasp, the grasp angle, and the grasp width.
Mahler et al. [23] first proposed a two-stage 4DoF grasp detection network.
The two-stage grasp detection network first generates the candidate regions
through a deep network, and then evaluates the feature vectors of the candidate
regions to generate grasp representation. However, these two-stage networks [9,
23] bring significant computational overhead, which impairs real-time efficiency.
Recently, Morrison et al. [25] proposed a lightweight generative grasping
convolutional neural network (GG-CNN) for real-time robotic grasping. This
method generated pixel-level grasp images mapped to 4DoF grasp parameters,
thus solving the real-time problem in actual grasping. Kumra et al. [18] added a residual module to GG-CNN [25], which significantly improved grasping performance with little impact on real-time efficiency. Chalvatzaki et al. [6] focused on grasp orientation, making the network attend more closely to the grasp direction while remaining real-time. Xu et al. [41] proposed a keypoint detection algorithm that reduces the detection difficulty and further improves the real-time efficiency of the network.
However, these methods are all trained in simple scenarios with a single object.
Grasping densely stacked objects may cause collisions and result in failures,
degrading the effectiveness of these methods.
In fact, the grasp order is particularly important in complex multiobject
stacking scenes. Recently, visual manipulation relationship detection methods
[27, 45] have been proposed to predict the grasp order in multiobject stacking
scenes, which consist of multiple stages, i.e., object detection, grasp detection,
and relational inference. Such a multistage framework reduces the real-time
efficiency of these methods. Moreover, the generalization ability of object
detection and relational inference in complex multiobject stacking scenarios is
a bottleneck, especially for unknown objects. Hence, it remains a challenge to
obtain a highly robust grasp performance while maintaining real-time efficiency
in complex multiobject stacking environments.
2 Related Work
Multitask learning (MTL) can improve the performance of the primary task
through collaborative training on auxiliary tasks [12, 21, 52, 53]. By sharing
features across multiple tasks, the network is guided to learn a common
representation among them, which may reduce overfitting and thus generalize better on the original task [39]. Hence, a growing number of MTL methods
have been used in the field of robotic grasping to improve performance. For
example, Prew et al. [29] achieved higher grasp detection performance using
depth prediction as a secondary task. Nguyen et al. [26] improved the grasp
detection performance with the aid of the bounding box generation task. Yu
et al. [44] improved grasping by using object segmentation as a secondary task. In this paper, we design a novel grasp priority prediction
auxiliary task for the 4DoF grasp detection network, which can obtain more
accurate grasp parameters by sharing parameters among the tasks. Moreover,
the auxiliary task in our network has little impact on the network inference time.
In the top-down grasping model, the robotic grasping problem can be defined
as a parameter estimation problem with four variables [18, 25]:
Gr = (S, Θr, Wr, Q),    (1)

where S is the grasp center position, Θr is the grasp rotation angle, Wr is the gripper opening width, and Q is the grasp quality score. After introducing the grasp priority, the quality score Q is replaced by the priority-fused score Q∗, yielding

Gr = (S, Θr, Wr, Q∗).    (2)
Figure 2: GP-Net framework. The input RGB-D image needs to go through three parts: feature extraction, attention block, and generation block. The grasp quality score image Qi, the grasp priority image Pi, the grasp cosine angle image Θi^cos, the grasp sine angle image Θi^sin, and the grasp width image Wi are obtained from the generation block. These image features are fused to obtain the final valid grasp parameters.
We can get the coordinate (u, v) of the pixel with the maximum value in Q∗i. The x- and y-axes of S in Equation (2) can be obtained from (u, v) using a coordinate transformation, and the z-axis of S can be obtained from the depth map. Θr in Equation (2) can be obtained from the corresponding pixel (u, v) of the grasp angle image Θi, and Wr in Equation (2) can be obtained from the same pixel (u, v) of the grasp width image Wi. Thus, we obtain all the grasp parameters.
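To make this decoding step concrete, the following sketch (ours, not from the paper) shows how the grasp parameters could be read out of the output maps, assuming NumPy arrays, a pinhole camera model with intrinsics fx, fy, cx, and cy, and an illustrative function name decode_grasp:

```python
# A minimal sketch of decoding the final grasp from the network output images.
# The pinhole back-projection and the function name are assumptions for illustration.
import numpy as np

def decode_grasp(q_star, theta, width, depth, fx, fy, cx, cy):
    """Return (x, y, z, grasp angle, grasp width) for the best grasp in the image."""
    # (u, v): pixel with the maximum value in the fused quality map Q*_i
    v, u = np.unravel_index(np.argmax(q_star), q_star.shape)

    # z-axis of S comes from the depth map at that pixel
    z = depth[v, u]

    # x- and y-axes of S via the usual pinhole back-projection
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Theta_r and W_r are read from the same pixel of the angle and width images
    return x, y, z, theta[v, u], width[v, u]
```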
3.2 GP-Net
The input RGB-D image is first passed through the feature extraction network to obtain a feature map of size 56 × 56. We then feed the feature
map into the long-range attention module. Specifically, to obtain more spatial
feature information while keeping the network lightweight, we divide the
obtained feature map into two parts. One part is self-calibrated, and the
other part undergoes a convolution operation. The results of the two parts are
concatenated after passing through a self-attention module. The concatenated
feature maps are then passed through transposed convolution to generate
images containing grasp information [3, 5, 16, 34].
Finally, the grasp quality score Qi and the grasp priority Pi are combined
to obtain our new grasp quality score Q∗i with more prominent features. We
extract the angle in the form of two components, Θi^cos and Θi^sin, whose values are combined to form the required angle Θi. The pixel with the largest value in the grasp quality score image Q∗i gives the 2D coordinate of the grasp center, and the pixel values at the same position in the grasp width image Wi and the angle image Θi give the grasp width and angle centered on that 2D coordinate in the image coordinate system.
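As an illustration of this post-processing, the short sketch below assumes that Qi and Pi are fused by element-wise multiplication (the paper only says they are combined) and that, as in GG-CNN-style networks, the two angle heads encode cos(2Θ) and sin(2Θ); both choices are assumptions rather than the paper's stated implementation:

```python
# A hedged sketch of fusing the quality and priority maps and recovering the angle.
import torch

def fuse_outputs(q, p, cos_map, sin_map):
    # New quality score Q*_i: emphasise pixels that are both graspable and high priority.
    q_star = q * p  # assumption: element-wise product

    # Recover the grasp angle Theta_i from the two angle components; the factor 0.5
    # undoes the assumed 2*theta encoding that keeps the angle representation unique.
    theta = 0.5 * torch.atan2(sin_map, cos_map)
    return q_star, theta
```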
Compared with non-local attention, on the one hand, we do not create query pairs, which reduces the amount of computation.
On the other hand, our module generates an attention weight for each point of
the feature map. Long-range dependencies built in this way can obtain more
global information while keeping the network lightweight. Hence, our CGC
is able to model effective long-range dependencies like SNL blocks [5, 35] while saving computation like SE blocks [13].
Specifically, the architecture of our CGC is given in Figure 3. We divide the feature map obtained from the feature extraction module into two parts. The purpose of this design is that, on the one hand, we need to extract deeper information and, on the other hand, we need a branch that preserves the feature representation from the upstream feature extraction module. In the upper branch, shown in Figure 3, convolution kernels W2, W3, and W4 are applied to extract deeper feature representations. In detail, in the self-
calibration module, we conduct convolutional feature transformation in two
different scale spaces to efficiently gather informative contextual information
for each spatial location, i.e., an original scale space and a small latent space
after down-sampling. The embeddings after W3 in the small latent space have
large fields-of-view and are used as references to guide the original feature
space. Our self-calibrating convolution can achieve the purpose of enlarging
the receptive field through the intrinsic communication of features, which
enhances the diversity of output features. In the global context block, we
generate the global attention map by context modeling, which is shared with
all locations. The implementation of context modeling relies on the 1 × 1
convolution kernel to extract the weights on the feature map. All locations
share an attention map, which is less computationally intensive and allows
global information to be encoded. The lower branch, shown in Figure 3, is designed to preserve the original spatial context information.
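To ground this description, here is a minimal PyTorch-style sketch of a CGC-like block, loosely following Figure 3; the channel split, kernel sizes, down-sampling ratio, and the exact placement of the global context block are our assumptions, not the paper's specification:

```python
# A hedged sketch of a CGC-style block: a self-calibrated upper branch (W2-W4),
# a plain-convolution lower branch, and a shared global context attention map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContext(nn.Module):
    """GCNet-style context modeling: one attention map shared by all locations."""
    def __init__(self, channels):
        super().__init__()
        self.context = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 conv -> spatial weights
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.LayerNorm([channels, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        attn = self.context(x).view(b, 1, h * w).softmax(dim=-1)    # shared attention map
        ctx = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2))  # (b, c, 1)
        ctx = self.transform(ctx.view(b, c, 1, 1))
        return x + ctx                                              # broadcast to every location

class CGC(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        half = channels // 2
        # Upper branch: self-calibrated convolutions W2-W4.
        self.w2 = nn.Conv2d(half, half, 3, padding=1)
        self.w3 = nn.Conv2d(half, half, 3, padding=1)  # operates in the down-sampled latent space
        self.w4 = nn.Conv2d(half, half, 3, padding=1)
        self.r = r
        # Lower branch: a plain convolution preserving the upstream features.
        self.w1 = nn.Conv2d(half, half, 3, padding=1)
        self.gc = GlobalContext(channels)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        # Self-calibration: a low-resolution embedding with a large field of view
        # gates the original-scale response, enlarging the effective receptive field.
        small = F.avg_pool2d(x1, self.r)
        gate = torch.sigmoid(x1 + F.interpolate(self.w3(small), size=x1.shape[-2:]))
        upper = self.w4(self.w2(x1) * gate)
        lower = self.w1(x2)
        return self.gc(torch.cat([upper, lower], dim=1))
```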
In previous grasping works [6, 18], depth information is often used to obtain the
order of the entire grasp task in the face of multiobject stacking environments,
and many studies obtain the grasp order from visual manipulation relationships [45]. However, such relationship reasoning mainly benefits multistage grasping networks. For real-time, closed-loop, lightweight grasping networks, judging the grasp order remains a challenge.
We constructed the grasp priority prediction (GPP) module by drawing
upon the human experience of grasping in real life. Specifically, when facing
the grasp challenge of multiple objects stacked in real scenarios, humans
tend to prioritize the topmost objects for grasping, thus making the whole
grasping process more stable and avoiding collisions. Therefore, the topmost
object in a multiobject stacking environment has the highest grasp priority.
For complex stacking environments, we represent the topmost object in a
multiobject stacking scene by constructing a mask image P , i.e., the grasp
priority. In the grasp priority mask P, the larger the pixel value is, the closer the object is to the top layer of the stacked scene, and the higher its grasp priority.
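To illustrate how such a priority mask might be rasterised, the sketch below (ours) builds P from per-object segmentation masks plus a flag indicating whether an object is topmost; the 1.0/0.2 priority values are an assumption, since the paper only requires larger values for objects closer to the top layer:

```python
# A hedged sketch of constructing a grasp priority mask P from segmentation
# masks and stacking information (e.g., as provided by REGRAD annotations).
import numpy as np

def build_priority_mask(seg_masks, is_topmost, shape):
    """seg_masks: list of HxW boolean masks, one per object.
    is_topmost: list of bools, True if no other object rests on this one."""
    priority = np.zeros(shape, dtype=np.float32)
    for mask, top in zip(seg_masks, is_topmost):
        # Topmost objects receive the highest priority; objects stacked underneath
        # keep a low value so the network is not encouraged to grasp them first.
        priority[mask] = 1.0 if top else 0.2
    return priority
```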
The loss function of GG-CNN [25] is the sum of the mean squared errors of the output images over the four parameters Qi, Θi^cos, Θi^sin, and Wi. In contrast, our priority-aware loss is defined as:

L_GP-Net = ||Qi − Qgt||² + ||Pi − Pgt||² + Pgt ||Θi^cos − Θgt^cos||² + Pgt ||Θi^sin − Θgt^sin||² + Pgt ||Wi − Wgt||²,

where the priority ground-truth Pgt re-weights the angle and width terms so that pixels belonging to high-priority (top-level) objects dominate the training signal.
Using the segmentation and stacking annotations of the dataset, we can obtain pixel-level segmentation of the top-level objects and derive the priority ground-truth Pgt.
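The following PyTorch-style sketch mirrors the reconstructed objective above; since the exact weighting in the paper may differ, treat it as an assumption-laden illustration rather than the published loss:

```python
# A hedged sketch of a priority-weighted training objective. All tensors are
# pixel-wise maps of shape (B, 1, H, W).
import torch.nn.functional as F

def gp_net_loss(q, p, cos_t, sin_t, w, q_gt, p_gt, cos_gt, sin_gt, w_gt):
    # Unweighted terms supervise grasp quality and grasp priority everywhere.
    loss = F.mse_loss(q, q_gt) + F.mse_loss(p, p_gt)
    # P_gt re-weights the angle and width errors so that pixels on high-priority
    # (top-level) objects dominate the gradient.
    loss = loss + (p_gt * (cos_t - cos_gt) ** 2).mean()
    loss = loss + (p_gt * (sin_t - sin_gt) ** 2).mean()
    loss = loss + (p_gt * (w - w_gt) ** 2).mean()
    return loss
```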
4 Experiment
4.1 Datasets
In this paper, the REGRAD dataset [47] is used to train the network. REGRAD
is a simulation dataset that consists of 50K kinds of objects with 100M grasping
labels. In addition, the REGRAD dataset also contains the operational
relationships between different objects and their segmentation. Using this
information, we construct the grasp priority masks, as shown in Figure 4,
which are used as the grasp priority ground-truth Pgt . We can also obtain
the grasp ground-truth maps, i.e., the center quality Qgt, the angle components Θgt^cos and Θgt^sin, and the width Wgt, from the grasping labels in the dataset.
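As a rough illustration of how such ground-truth maps could be generated (following a GG-CNN-style encoding; the single-pixel labels and the 2Θ encoding are assumptions about details the text does not spell out):

```python
# A hedged sketch of rasterising ground-truth maps from grasp labels that have
# already been converted to (row, col, angle in radians, width in pixels).
import numpy as np

def build_gt_maps(grasps, shape):
    q = np.zeros(shape, np.float32)
    cos_t = np.zeros(shape, np.float32)
    sin_t = np.zeros(shape, np.float32)
    w = np.zeros(shape, np.float32)
    for r, c, angle, width in grasps:
        q[r, c] = 1.0
        # Encode 2*angle so that a grasp and its 180-degree rotation coincide.
        cos_t[r, c] = np.cos(2 * angle)
        sin_t[r, c] = np.sin(2 * angle)
        w[r, c] = width
    return q, cos_t, sin_t, w
```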
The rectangle metric [14] is used to report the performance of different methods.
Specifically, a grasp is considered valid when it satisfies the following two
conditions: (1) the intersection over union (IoU) score between the ground
truth and the predicted grasp rectangle is more than 25%, and (2) the offset
between the predicted grasp orientation and the ground truth is less than 30◦ .
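A simple way to implement this check (our sketch; the rasterisation-based IoU and the use of skimage are assumptions) is:

```python
# A hedged sketch of the rectangle metric: IoU > 25% and angle offset < 30 degrees.
import numpy as np
from skimage.draw import polygon  # assumed available for rasterising rectangles

def grasp_is_valid(pred_rect, gt_rect, pred_angle, gt_angle, shape=(480, 640)):
    """pred_rect, gt_rect: 4x2 arrays of rectangle corners in (row, col) order."""
    canvas_p = np.zeros(shape, bool)
    canvas_g = np.zeros(shape, bool)
    rr, cc = polygon(pred_rect[:, 0], pred_rect[:, 1], shape)
    canvas_p[rr, cc] = True
    rr, cc = polygon(gt_rect[:, 0], gt_rect[:, 1], shape)
    canvas_g[rr, cc] = True
    iou = (canvas_p & canvas_g).sum() / max((canvas_p | canvas_g).sum(), 1)

    # Angle offset taken modulo pi: a grasp rotated by 180 degrees is the same grasp.
    diff = abs(pred_angle - gt_angle) % np.pi
    angle_ok = min(diff, np.pi - diff) < np.deg2rad(30)
    return iou > 0.25 and angle_ok
```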
We evaluated our network on the REGRAD dataset [47], recording the grasp
success rate in different scenarios. To compare with previous work, we evaluated
some grasp networks on the REGRAD dataset as well. For a fair comparison,
we trained these networks on the REGRAD dataset. Our method improves the
grasping success rate by 29.3% over the GG-CNN network on the REGRAD
dataset. Compared with GR-ConvNet, it improves the grasp accuracy from
67.1% to 82.6%. Our method achieves state-of-the-art grasp performance on
the REGRAD dataset compared with other networks of the same type. To
evaluate the effectiveness of different modules in our model, we performed
ablation experiments in two cases, i.e., removing the CGC module and removing
the GPP module. After removing the CGC module and GPP module, the
overall grasping performance of the algorithm showed a relatively large drop.
It is also noteworthy that the removal of the GPP module resulted in a larger
drop in performance and a more pronounced impact on the overall system. The
results in Table 1 show that the auxiliary task of grasp priority significantly
enhances the robustness of grasping, with a 19.4% improvement in the success
rate of grasping in the same test set, compared with the network without the
GPP module. The experimental results show that the CGC module and GPP
module play an important role in the whole system and can help achieve a
better grasp performance in complex multiobject stacking scenes.
In addition, we evaluate the performance of the network with different input
modalities. The modalities that the model was tested on included unimodal and multimodal inputs.
Figure 5: The comparison of using different loss functions to train GP-Net. In the multi-object scenario, the priority loss function obtains a higher grasp success rate.
The results demonstrate that our GP-Net can effectively grasp the topmost
object, which is an extremely reasonable grasp in a multiobject stacked scene,
and can effectively avoid collisions. The evaluation results on the VMRD
dataset also demonstrate that our GP-Net generalizes well to new objects that
it has never seen before.
We built a real robot arm test environment, where the robot arm is a UR10
and the camera is an Intel RealSense D435, and the objects used in testing are
shown in Figure 7 (a). During testing, incorrect grasping is defined as shown
in Figure 7 (b) where the grasped object is not a top-level object. Successful
grasping is defined as shown in Figure 7 (c), where the topmost object is
grasped.
Figure 6: Visual results on the VMRD dataset [46]. “Quality” is the original grasp quality
score Q, while “Quality∗ ” is the grasp quality score Q∗ after combining with the predicted
priority mask. It can be seen that Q∗ has more prominent features on the top-level objects.
In addition, the angle and width feature maps show a strong suppression effect on the background region of non-top-level objects under the supervision of the priority-optimized loss function. GP-Net thus produces more reasonable grasp detections in complex stacking scenes.
Figure 7: Real scenario testing. (a) The objects used in our testing, (b) Incorrect grasping
where the grasped object is not a top-level object, and (c) Successful grasping.
Figure 8: Qualitative analysis of real scenarios. “Quality” is the original grasp quality score
Q, and “Quality∗ ” is the grasp quality score Q∗ after combining with the predicted priority.
Note that Q∗ gives the topmost objects a higher grasp priority.
It can be seen that our method obtains satisfactory grasping results in real-world
complex stacking scenarios, with an average grasping success rate of 83.3%.
5 Conclusions
Acknowledgements
Part of this work was done during Yuxiang Yang’s visit at The University of
Sydney. This work was supported by the Zhejiang Provincial Natural Science
Foundation Key Fund of China (LZ23F030003).
References
[20] W. Lin, S. Lee, et al., “Visual Saliency and Quality Evaluation for 3D
Point Clouds and Meshes: An Overview,” APSIPA Transactions on
Signal and Information Processing, 11(1), 2022.
[21] T. Liu, D. Tao, M. Song, and S. J. Maybank, “Algorithm-dependent
generalization bounds for multi-task learning,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, 39(2), 2016, 227–41.
[22] B. Ma, J. Zhang, Y. Xia, and D. Tao, “Auto learning attention,” Advances in Neural Information Processing Systems, 33, 2020, 1488–500.
[23] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea,
and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps
with synthetic point clouds and analytic grasp metrics,” arXiv preprint
arXiv:1703.09312, 2017.
[24] J. Maitin-Shepard, M. Cusumano-Towner, J. Lei, and P. Abbeel, “Cloth
grasp point detection based on multiple-view geometric cues with appli-
cation to robotic towel folding,” in 2010 IEEE International Conference
on Robotics and Automation (ICRA), 2010, 2308–15.
[25] D. Morrison, P. Corke, and J. Leitner, “Learning robust, real-time, reac-
tive robotic grasping,” The International Journal of Robotics Research,
39(2-3), 2020, 183–201.
[26] A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis, “Object-
based affordances detection with convolutional neural networks and dense
conditional random fields,” in 2017 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), 2017, 5908–15.
[27] D. Park, Y. Seo, D. Shin, J. Choi, and S. Y. Chun, “A single multi-
task deep neural network with post-processing for object detection with
reasoning and robotic grasp detection,” in 2020 IEEE International
Conference on Robotics and Automation (ICRA), 2020, 7300–6.
[28] L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta, “The curious
robot: Learning visual representations via physical interactions,” in
European Conference on Computer Vision, 2016, 3–18.
[29] W. Prew, T. Breckon, M. Bordewich, and U. Beierholm, “Improving
Robotic Grasping on Monocular Images Via Multi-Task Learning and
Positional Loss,” in 2020 25th International Conference on Pattern
Recognition (ICPR), 2021, 9843–50.
[30] A. Rakshit, A. Konar, and A. K. Nagar, “A hybrid brain-computer
interface for closed-loop position control of a robot arm,” IEEE/CAA
Journal of Automatica Sinica, 7(5), 2020, 1344–60.
[31] A. Saxena, J. Driemeyer, and A. Y. Ng, “Robotic grasping of novel
objects using vision,” The International Journal of Robotics Research,
27(2), 2008, 157–73.
[32] K. B. Shimoga, “Robot grasp synthesis algorithms: A survey,” The
International Journal of Robotics Research, 15(3), 1996, 230–66.