
Received September 11, 2020, accepted September 23, 2020, date of publication October 5, 2020, date of current version October 15, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.3028740

Object Detection Recognition and Robot Grasping Based on Machine Learning: A Survey

QIANG BAI 1, SHAOBO LI 1,2,3, JING YANG 1,3, (Member, IEEE), QISONG SONG 1, ZHIANG LI 1, AND XINGXING ZHANG 1
1 School of Mechanical Engineering, Guizhou University, Guiyang 550025, China
2 Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang 550025, China
3 Guizhou Province Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China

Corresponding author: Shaobo Li ([email protected])


This work was supported in part by the National Key Technologies Research and Development Program of China under
Grant 2018AAA0101800, in part by the National Natural Science Foundation of China under Grant 51475097 and Grant 91746116,
in part by the Ministry of Industry and Information Technology of the People’s Republic of China Talents under Grant [2016]213, and in
part by the Science and Technology Project of Guizhou Province Talents under Grant [2015]4011 and Grant [2016]5013.

ABSTRACT With the rapid development of machine learning, its power in the field of machine vision has become increasingly evident. Combining machine vision and robotics to achieve grasping as precise and fast as that of humans requires high-precision target detection and recognition, accurate localization and reasonable grasp strategy generation, which is the ultimate goal of researchers worldwide and one of the prerequisites for the large-scale application of robots. Traditional machine learning has a long history and good achievements in the fields of image processing and robot control. The CNN (convolutional neural network) algorithm enables training on large-scale image datasets, overcomes the disadvantages of traditional machine learning on large datasets, and greatly improves accuracy, which has made CNNs a global research hotspot. However, the increasing difficulty of acquiring labeled data limits their development. Therefore, unsupervised learning, self-supervised learning and reinforcement learning, which are less dependent on labeled data, have also undergone rapid development and achieved good performance in the fields of image processing and robot grasping. Considering the inherent limitations of vision, this paper summarizes the research achievements of tactile feedback in the fields of target recognition and robot grasping and finds that the combination of vision and tactile feedback can improve the success rate and robustness of robot grasping. This paper provides a systematic summary and analysis of the research status of machine vision and tactile feedback in the field of robot grasping and establishes a reasonable reference for future research.

INDEX TERMS Machine learning, recognition, grasping, robot, tactile feedback, vision.

I. INTRODUCTION
Vision is the main way in which humans receive all types of information, followed by tactile feedback. One goal of researchers is to equip robots with vision systems that have high accuracy and robustness, similar to human beings, to help people complete all types of work. Thus, machine vision has always been an important research topic in the fields of artificial intelligence and robotics. With the rapid development of machine learning, machine vision has been widely and successfully applied in various image processing tasks, such as defect detection, target detection, medical image judgment [1]–[14], etc. To this end, researchers hope to achieve great breakthroughs in machine vision to allow for precise recognition, positioning and grasp strategy generation and the realization of stable grasping by robots, which could lead to wide application.

Although the above papers provide a wide range of research and surveys of machine learning and machine vision in plain image processing, there are very few surveys of machine learning used for object detection recognition and robot grasping. Accurate and fast object recognition and grasping based on vision are the basis of robot applications in both industry and real-life scenarios.

The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.


This paper mainly summarizes the research achievements of six mainstream methods in object detection recognition, positioning, grasp strategy generation and grabbing, including traditional machine learning, deep learning, unsupervised learning, self-supervised learning, reinforcement learning and visual-tactile fusion. Machine learning is the inevitable product of artificial intelligence development reaching a certain stage and has been put forward and developed for decades. The most substantial advantage of traditional machine learning (support vector machine (SVM), random forest, decision tree, clustering, and Bayesian algorithms) is that it requires only a small amount of data and has strong interpretability and fast running speed [15]–[17]. However, with the increase in the amount of data, the performance of these algorithms becomes limited and stagnates instead of continuing to improve [18], [19]. For a long time after the birth of the neural network algorithm in the 1980s, SVMs and other machine learning algorithms had an advantage. However, the gradient vanishing problem of neural networks led to difficulties in deep network training [20], [21] and revealed limitations in the number of samples and computing power. In 2012, the success of the AlexNet network led to the comeback of the deep neural network [22]. It is widely used in various fields of machine vision, and its performance continues to increase as datasets grow, avoiding the disadvantages of traditional machine learning on large datasets. Deep learning needs numerous labeled data, but it is not easy to label all of the data, which has led to the emergence of unsupervised and self-supervised learning algorithms. Unsupervised learning mainly addresses situations in which the input data are not labeled and the output is not determined [23], [24]. This approach classifies the samples according to their similarity. However, unsupervised learning has no label data at all, which may lead to slow speed and low precision [25]. Self-supervised learning uses the input data to generate supervisory information and benefits almost all types of downstream tasks [26], [27]. With Google's successful application of reinforcement learning in the game of Go, reinforcement learning has attracted the worldwide attention of researchers. Reinforcement learning considers sequence problems and takes a long-term perspective on returns [28], while supervised learning generally considers one-off problems and focuses on only short-term and immediate returns. This long-term perspective of reinforcement learning is very important for determining the optimal solution to many problems. The key point of the above algorithms is to process the image collected by the camera, realize object detection, recognition, positioning and grasp strategy generation, and then guide the robot to complete the grasp. However, noncontact object perception always has inherent defects, especially in unstructured environments and real-life scenes, and it is difficult to accurately predict the weight, shape and grasping strategy of the object [29]. Based on the above situation, adding pressure sensors to the dexterous hand to provide it with tactile feedback and combining it with vision has become a new direction in robot grasping research [30], [31].

This paper is organized as follows. The first part introduces the advantages and disadvantages of the six methods and the main content of this paper. The second part discusses the research achievements of several mainstream traditional machine learning methods in image processing, object recognition and guided robot grasping. The third part summarizes the performance of the convolutional neural network (CNN) algorithm in object detection, recognition, positioning and grasp strategy generation. In the fourth part, aiming to address the difficulty of acquiring label data, the paper describes the performance of unsupervised learning, self-supervised learning and reinforcement learning in the fields of vision and grasping. The fifth part discusses the inherent defects of vision and summarizes the research achievements of robot tactile feedback and the combination of vision and tactile feedback. In the sixth part, the future development prospects of machine vision in robot object recognition and grasping are proposed based on the above analysis. Finally, conclusions are drawn in the seventh part.

II. CLASSICAL MACHINE LEARNING
It has been nearly 70 years since Arthur Samuel put forward the concept of ''machine learning'' in 1952. In the 1980s, machine learning became an independent discipline and developed rapidly. Since 2006, due to the demand of big data analysis, neural networks based on machine learning have attracted more attention and become the basis of deep learning theory. Currently, the research of machine learning is mainly divided into two directions: the first is traditional machine learning, which mainly studies the learning principle and pays attention to exploring humanoid learning mechanisms [32]–[36]; the second is the research of machine learning in big data environments, which mainly focuses on how to use information effectively and how to acquire hidden, effective and understandable knowledge from massive amounts of data [37]–[41]. From the perspective of methodology, machine learning can be divided into linear models and nonlinear models. Linear models are relatively simple, but they are the basis of nonlinear models, and many nonlinear models are transformed from linear models [42]–[46]. Nonlinear models can be divided into traditional machine learning models (SVM, KNN, decision tree, etc.) and deep learning models. Fig. 1 lists the currently mature traditional machine learning algorithms and briefly describes their principles and characteristics [47]–[51]. It is found that the functions of different algorithms are varied, indicating that each algorithm has different application scenarios. Although deep learning plays a dominant role in the field of machine vision, deep learning is data-driven and has poor performance on small datasets [52]–[54]. However, traditional machine learning can adapt to a variety of datasets; especially in scenarios with small amounts of data (such as the medical field), machine learning has better performance [55], [56]. In this case, the advantages of traditional machine learning algorithms are highlighted. In addition, the traditional machine learning model is small, and the requirement for computer hardware is not high, which yields a strong speed advantage in the field of manipulator grasping based on vision [57]–[59].


FIGURE 1. Introduction of traditional machine learning.

According to the characteristics of different machine learning algorithms, they can be applied in all aspects of manipulator grasping to improve accuracy and robustness.

A. SUPPORT VECTOR MACHINE (SVM)
The SVM has strong generalization performance and can address machine learning problems with high-dimensional data and small samples, so it is widely used in the field of image processing. Based on RGB images and point cloud images, Yuan et al. [12] used the SVM-rank algorithm to recognize object features and generate the grabbing strategy and then realize the accurate grabbing of objects by a five-fingered dexterous hand.
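To make the SVM-based pipelines above concrete, the sketch below shows how candidate grasp regions, each reduced to a fixed-length feature vector, might be classified as graspable or not with scikit-learn; the synthetic features and labels, the 32-dimensional feature size and the RBF-kernel choice are illustrative assumptions rather than the setup of any paper cited here.

```python
# Minimal sketch: scoring candidate grasp regions with an SVM classifier.
# Assumes each candidate region has already been reduced to a fixed-length
# feature vector (e.g., SURF/HOG statistics); features and labels are synthetic here.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))                  # 200 candidate regions, 32-D features (placeholder)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = graspable, 0 = not (toy labels)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, probability=True))
clf.fit(X, y)

candidate = rng.normal(size=(1, 32))            # feature vector of a new candidate region
print("graspable probability:", clf.predict_proba(candidate)[0, 1])
```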


Ergene and Durdu [60] used the bag of words (BoW) method and an SVM to achieve feature extraction and object classification based on a grid and then guide the manipulator to achieve the classification and grabbing of a pen, water cup and stapler. The accuracy was 83%. Hu et al. [61] developed an operation and grasp control system based on sensor-motor fusion for a robot hand-eye system, proposed a motion recognition method for a multifinger manipulator based on an AdaBoost-SVM, and proved the high responsiveness and flexibility of this method. Valente et al. [62] used the competitive Hopfield neural network to collect several points on the edge of the object to build an approximate polygon, used the radial basis function-global ridge regression (RBF) network to process the polygon, and selected appropriate grasping points to guide the grasping of the manipulator.

The SVM is a type of supervised learning method that has the advantages of good classification performance and simple structure but is difficult to train on large datasets and has poor performance on multiclassification problems. According to related research [12], [60], [61], SVMs have the disadvantages of complex feature engineering and poor generalization performance in target recognition, location and grasping. However, improved SVMs can be used in robot grasping control algorithms and achieve good results.

B. CLUSTERING ALGORITHM
The clustering algorithm has the advantages of simplicity and easy implementation and can utilize large datasets, so it is widely used. Hannat et al. [63] presented a real-time method for visual categorization to achieve robot grasping. This method uses speeded up robust feature (SURF) points to describe the feature data of objects and uses the K-means algorithm to extract the vocabulary. The results of their object recognition experiments show an average accuracy between 95% and 100%. Harada et al. [64] first clustered the polygon model of the object and the surrounding environment and then separated the environment and the object through different clustering algorithms to achieve successful grasping and stable placement. Verma et al. [65] proposed that an algorithm combining density clustering and homography transformation can obtain the maximally stable extremal regions of the object and then realize the accurate positioning of the object, which provides powerful assistance for the successful grasping of the manipulator. Zhang and Shen [66] extracted effective local features from photos of the object. After clustering, the key points of each image are mapped into a uniform-dimension histogram vector, and the histogram is used as the input vector of a multiclass SVM algorithm to establish the training classifier model and realize the real-time recognition of moving objects. Kouskouridas et al. [67] combined shape retrieval technology with a classification and clustering algorithm for attitude estimation of objects. Wiesmann et al. [68] proposed an event-driven embedded system for feature extraction and object recognition during robot grasping. Skotheim et al. [69] proposed a flexible 3D object positioning system that can make the manipulator assemble, grasp and place in a 3D environment. The system is improved based on a robust clustering algorithm and an attitude verification algorithm, which significantly improves the accuracy and robustness of the system.

The clustering algorithm is a type of unsupervised algorithm with a long history, and it is widely used because it does not need training datasets and has a simple structure and fast speed. In tasks related to target detection and recognition, the clustering algorithm is mainly used for feature extraction and clustering, and it achieves the segmentation of the background and the localization of the target. However, existing research results [63], [66], [67], [69] have indicated that the clustering algorithm usually needs to be used together with other algorithms to achieve the classification and grasping of different targets.

C. BAYESIAN ALGORITHM
The Bayesian algorithm plays an important role in manipulator grasp planning. The naive Bayesian model originated from classical mathematical theory; it has a solid mathematical foundation and stable classification efficiency, performs well on small-scale datasets, and can handle multiclassification tasks. Budiharto [70] proposed a fast object detection algorithm based on stereo vision and used the Bayesian algorithm to reduce camera noise and achieve robust tracking. Wang et al. [71] proposed an online estimation method for a robot visual servo system based on an unscented particle filter and the Jacobian matrix. First, the definition of the total Jacobian matrix is given, and the estimation of the total Jacobian matrix is transformed into a Bayesian filtering framework. Then, the paper proposes to estimate the Jacobian matrix by an unscented particle filter and use the unscented Kalman filter equations to propagate and update each particle. Bekiroglu et al. [72] proposed a probabilistic framework for grasp modeling and stability assessment, which integrates supervised learning and unsupervised learning, and Bayesian networks are used to model the conditional relationship between tasks and multiple sensory streams (vision, proprioception and touch). The obtained model can not only predict the success rate of grasping but also provide insight into the dependency between the related variables and features of object grabbing.

The Bayesian algorithm is widely used in noise reduction, servo control and grasping probability prediction in the research of target detection and recognition and robot grasping, which is mainly due to its solid mathematical foundation and its ability to address multiclassification tasks.

D. PRINCIPAL COMPONENT ANALYSIS (PCA)
In addition to the above algorithms, PCA also has applications in the fields of vision and robotics. PCA finds the principal axis direction, which is used to effectively represent the common characteristics of the same type of samples.
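As a concrete illustration of the idea that PCA recovers a principal axis that can orient a grasp, the sketch below fits PCA to a 2D point set standing in for a segmented object and derives a candidate grasp angle perpendicular to the main axis; the toy point set and the perpendicular-grasp heuristic are illustrative assumptions, not a method from the cited works.

```python
# Minimal sketch: using PCA to estimate an object's principal axis from 2-D points
# (e.g., pixel coordinates of a segmented object) and a grasp angle across that axis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Toy "object": an elongated blob of points standing in for a segmentation mask.
points = rng.normal(size=(500, 2)) * np.array([30.0, 8.0]) + np.array([160.0, 120.0])

pca = PCA(n_components=2).fit(points)
major_axis = pca.components_[0]                      # direction of largest variance
axis_angle = np.arctan2(major_axis[1], major_axis[0])
grasp_angle = axis_angle + np.pi / 2                 # heuristic: close the gripper across the major axis

print("object centroid:", pca.mean_)
print("grasp angle (deg):", np.degrees(grasp_angle) % 180)
```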


Song et al. [73] developed a general framework to estimate the graspability of an object from its 2D data, which includes the identification of the similarity of the local features of the object and the generation of the object grabbing strategy based on the experience obtained from pre-learning. Zhang et al. [74] proposed a shared-control wheelchair manipulator, which can automatically detect a water cup based on vision and help the disabled achieve the task of drinking water. In this scheme, a CNN and PCA are used separately to identify the object and to estimate its attitude and direction. Mattar [75] proposed a learning mechanism for stable grasping and control of a manipulator. Based on a PCA neural network and the Widrow-Hoff method to learn a large number of patterns of prosthetic behavior, good grasp control of the prosthetic is realized.

PCA is an unsupervised learning method without parameter limitations, but it is seldom used in the image processing field. To achieve ideal robot grasping operation, PCA is commonly used together with a CNN.

Machine learning algorithms have a long history of development and have made outstanding achievements in their respective fields. According to the algorithm principles and related research (Table 1), it is found that target detection, recognition and image processing are not the strong points of machine learning. First, machine learning algorithms require an arduous amount of feature engineering, which greatly increases the difficulty and cost of image processing. Second, machine learning requires a variety of algorithms to work together, or with CNNs, to achieve complete recognition, positioning and grasping, which increases the difficulty of model building and training. Finally, with the explosive growth of data in the era of big data, the disadvantages of traditional machine learning have become increasingly prominent.

TABLE 1. Comparison of machine learning application scenarios.

III. CONVOLUTIONAL NEURAL NETWORK (CNN)
The CNN is one of the most representative neural networks in the field of deep learning and has made many breakthroughs in the field of image analysis and processing. Based on the standard image annotation set, ImageNet, the CNN has many achievements, including image feature extraction and classification, scene and target recognition, and so on. Compared with traditional image processing algorithms, the CNN has the advantages of no preprocessing requirements and high precision [76]–[80], [82], [83]. In 1998, Yann LeCun et al. proposed a gradient-based back-propagation algorithm (LeNet-5) for supervised training of networks [84]. Yann LeCun is known as the father of the CNN for his outstanding contributions to machine learning and computer vision. Due to the lack of large-scale training datasets and hardware, LeNet-5 was not ideal for complex problems. In 2012, the AlexNet proposed by Alex Krizhevsky et al. won the image classification championship on the ImageNet training set, making the CNN a key research direction in computer vision. AlexNet uses the rectified linear unit (ReLU) instead of the sigmoid as the activation function, and it achieves good results and alleviates the problem of gradient disappearance when the network is deep [22]. At the same time, the use of the GPU-based Compute Unified Device Architecture (CUDA) greatly accelerates the training speed of neural networks. Based on the above advantages, AlexNet has been applied in defect detection, location and visual tracking of dynamic objects [85], [86]. In 2014, the GoogLeNet network proposed by Google [87] won the ILSVRC competition, and its error rate was lower than that of VGGNet proposed in the same year. Generally, the position and size of the same object in different images vary greatly, and an accurate convolution operation is needed to recognize this type of object. To solve the problem that large convolution kernels usually tend to perceive global information while small convolution kernels mainly capture local information, the idea of GoogLeNet is to use multiple convolution kernels of different sizes in the same layer to capture information, and this structure is called Inception [88]–[90]. Due to the good performance of GoogLeNet in image recognition, it has also achieved good accuracy in robot target detection [91]. VGGNet achieved second place in the classification task of the ILSVRC competition in 2014 (first place was GoogLeNet) and first place in the positioning task. At the same time, the model has good generalization ability on other datasets, and VGGNet proved that a deeper network can improve the recognition effect of the network to a certain extent [92]. Because of its simple structure and strong feature extraction ability, VGGNet has a wide range of application scenarios. It is often used as the backbone of target detection models (Fast-RCNN, single-shot multibox detector (SSD), etc.) to extract features [93], [94] and for target detection in robot grasping [95], [96]. The ResNet deep residual network proposed in 2015 won first place in the classification task of the ImageNet competition [97]. Because of its simple and practical structure, many target detection, segmentation and recognition algorithms are built on the basis of ResNet50 or ResNet101 [98], [99]. The residual design mainly solves the performance degradation problem of deep networks and reduces the computation through long skip connections. Even if the number of model layers is very deep, it can ensure normal training.
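To make the residual design concrete: the identity skip connection lets each block learn only a residual correction, which is what keeps very deep networks trainable. The sketch below shows a basic ResNet-style block in PyTorch; it is a simplified illustration of the idea, not the exact block definition used in the cited ResNet paper.

```python
# Minimal sketch of a ResNet-style basic block: two 3x3 convolutions plus an
# identity "skip" connection, so the block only has to learn a residual F(x).
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(x + residual)   # skip connection: output = x + F(x)

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```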


The SSD algorithm proposed in 2016 is improved on the basis of VGG-16 and uses multiscale feature maps with a priori (default) boxes for target detection [100]. The entire process of SSD requires only one stage, so its most substantial advantage is that it runs faster. First, dense sampling is carried out at different positions of the image according to different scales and aspect ratios, then features are extracted by a CNN and directly classified and regressed [101], [102]. However, uniform dense sampling leads to an imbalance of positive and negative samples, which makes training more difficult and reduces model accuracy. The You Only Look Once (YOLO) algorithm proposed in 2016 is a typical one-stage method for target detection; the core idea is to transform the object detection problem into a regression problem. The model can directly predict the bounding box and category probability from the input image by using a CNN structure [103]. The execution speed is fast, and very high detection accuracy can be achieved by using a regression method. From YOLOv1 in 2016 to YOLOv3 in 2018, the YOLO algorithm has continuously absorbed the advantages of similar algorithms (such as the feature pyramid network (FPN) and the Fast Region-based CNN (Fast-RCNN)) and achieved higher detection speed and accuracy through its own continuous improvement, which is more in line with the real-time requirements of industry for target detection algorithms compared with other algorithms [104]. As two algorithms proposed in the same year, SSD and YOLO have made outstanding achievements in the fields of image processing and vision, and they show good performance in target recognition, location and grasp strategy generation [104]–[110]. The greatest contribution of the RetinaNet algorithm put forward by Tsung-Yi Lin et al. in 2018 is the proposal of the focal loss to solve the problem of class imbalance [111], thus enabling its accuracy to exceed that of classic two-stage target detection models. Both one-stage and two-stage detection algorithms are built on an anchor mechanism (e.g., Fast-RCNN, RetinaNet, YOLO, or SSD), and these anchors are mainly used to find the location of the box; however, all of these algorithms incur excessive costs because of the anchor mechanism. This mechanism has two disadvantages. First, many anchors are generated in the network, and most of these anchors cannot box the target; therefore, most of them are negative samples, with few positive samples. This leads to the problem of unbalanced positive and negative samples and consumes an extensive amount of computation. Second, the anchor mechanism introduces a vast number of hyperparameters into the network, which often makes the adjustment of these hyperparameters very complicated and increases the complexity of the network. Based on the above problems, Hei Law et al. proposed an anchor-free mechanism in 2019, which uses the upper left corner and the lower right corner to predict the bounding box instead of implementing anchors [112]. Fig. 2 lists the major milestones of the CNN algorithm from 1998 to 2019 and illustrates the core structure of the various improved algorithms. The recognition accuracy and operation speed of these algorithms have been greatly improved by these developments. To date, various improved algorithms based on CNNs continue to emerge and constitute one of the main research directions in the field of vision.

A. ROBOT GRASP POINT AND GRASP STRATEGY
To solve the problem of robot grasping angle prediction, Cheng and Meng [113] proposed a two-stage cascaded training solution. First, the neural network performs 20,000 iterations to obtain the ability to locate the object, and some parameters in the network are frozen. Second, a scale factor of 1.14 (a hyperparameter) is multiplied by the sin(θ) and cos(θ) of the ground truth value. Through these two cascaded training processes and 500 additional iterations, the network obtains strong direction prediction ability. Zunjani et al. [114] found that robots need to predict the ideal grasp rectangle according to the intended use of the object to achieve an optimal grabbing strategy. They input the object image and intention-type metadata into the fully connected layer of the CNN, which then achieves the ideal rectangle prediction. Corona et al. [115] designed a hierarchical model composed of three CNNs for the problem of grasping deformable objects such as textiles, which can be trained by using synthetic images and real images. Through the three steps of object recognition, first grabbing point and second grabbing point, accurate grabbing of the object can be achieved. Gaona and Lin [116] proposed an estimator-based particle swarm (PS) optimization algorithm with a CNN for fast and robust reasoning about robot grasping points. The cost function of the PS is mainly considered from two aspects: first, the CNN divides the grabbing features into good features and poor features; and second, a magnet mechanism is designed to make particles converge to the object center. The algorithm also includes a confidence factor to reduce misjudgment between grabbing points and nongrabbing points. Yamazaki [117] proposed a method to detect the grabbing point on irregularly shaped knitted fabrics. Combining grabbing point detection with a shape classifier, a CNN is used to classify the shape and extract the feature vector of the detected object shape. Using this feature, the grabbing points are calculated as image coordinates, and the effectiveness of this method is proven.

A reasonable grasping strategy and grasping points are the basic requirements for the robot to grasp the target based on vision, and they correspond to nondeformable objects and deformable objects, respectively. An end-to-end deep learning model is constructed based on the CNN algorithm, and the images collected by the camera are input into the model to produce a reasonable output of the grabbing strategy and grabbing points. However, at present, there are two main problems. First, the image processing effect is poor if the noise is large, so image preprocessing and noise reduction are necessary to realize the grabbing strategy. Second, it is necessary to manually design reasonable label features to make the model achieve better results on the test set and in practical applications.
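The angle-regression trick mentioned above, predicting the sin and cos of the grasp angle (optionally rescaled) and recovering the angle afterwards, can be sketched as follows; the scale factor and noise level are illustrative, and the snippet is not the training code of the cited work.

```python
# Minimal sketch: encoding a grasp angle as (scaled) sin/cos regression targets
# and decoding a network prediction back to an angle with atan2.
import numpy as np

SCALE = 1.14  # illustrative hyperparameter applied to the ground-truth sin/cos

def encode_angle(theta_rad: float) -> np.ndarray:
    """Ground-truth targets for the two regression outputs."""
    return SCALE * np.array([np.sin(theta_rad), np.cos(theta_rad)])

def decode_angle(pred: np.ndarray) -> float:
    """Recover the angle; atan2 is insensitive to the common scale factor."""
    return float(np.arctan2(pred[0], pred[1]))

theta = np.deg2rad(35.0)
targets = encode_angle(theta)
noisy_pred = targets + np.random.default_rng(0).normal(scale=0.02, size=2)
print("recovered angle (deg):", np.rad2deg(decode_angle(noisy_pred)))
```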


FIGURE 2. Development track of machine vision based on CNNs. (a) LeNet-5 [84]. (b) AlexNet [22]. (c) GoogLeNet [87]. (d) VGG-16 [92]. (e) Faster-RCNN [94]. (f) ResNet [97]. (g) YOLO [103]. (h) SSD [100]. (i) YOLOv3 [119]. (j) RetinaNet [111]. (k) CornerNet [112].
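The focal loss that RetinaNet (Fig. 2(j)) introduces to counter the positive/negative imbalance discussed above down-weights easy, well-classified examples so that training focuses on hard ones. Below is a minimal binary-classification sketch of the idea; the α and γ values are the commonly quoted defaults, and the code is an illustration rather than the reference implementation.

```python
# Minimal sketch of the binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits/targets: same shape; targets are 0 (background) or 1 (object)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1.0 - prob) * (1.0 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

logits = torch.tensor([3.0, -2.5, 0.1])   # mostly easy samples plus one uncertain sample
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```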


B. MULTITASK COOPERATIVE OPERATION
Haochen et al. [118] established a neural network for object recognition, location and attitude detection using the CNN algorithm. Pose detection is treated as a classification problem in this model, and multiple tasks, such as recognition and location, are combined at the same level to achieve good performance on printed circuit board (PCB) datasets. Chen et al. [120] introduced a CNN-based grasp path to predict multigrasp tasks, mapped the grasp candidate options to the grasp path, generated the mapped grasps, and took the deviation between them as the estimation error for back-propagation. Experiments on datasets and real scenes show that this method can improve detection accuracy and extends well to occluded objects.

Complex system engineering is required to realize target grabbing based on vision, which involves a series of steps, such as recognition, positioning and pose detection, that are all in the field of image processing. Therefore, building a model based on CNNs to realize the real-time processing of multiple tasks and the probability ranking of output results is an important research direction.

C. OBJECT 3D SHAPE CONSTRUCTION
Roy et al. [121] used a CNN (VGG16) to classify the objects grasped by the manipulator into four categories, cylindrical, spherical, cubic and conical, and then generated four different grasping strategies. This method achieves 93% accuracy in real-time object recognition and grasping. Yan et al. [122] introduced a deep geometry-aware grasping network (DGGN), which divides learning into two steps. First, the 3D shape model and scene are generated and reconstructed from RGB-D data, and the geometry representation is thereby acquired. Second, the results are predicted by learning the geometry-aware representation within the model. Satish et al. [123] learned a deep policy from comprehensive point cloud training datasets and used an analytic algorithm with a random noise model to randomly sample, grasp and reward the domain to explore how the distribution of synthetic training examples affects the speed and reliability of the robot learning strategy. A comprehensive data sampling distribution is proposed in this paper, which combines grasping samples from the policy action set and guide samples from the supervisor with high robustness. This method is used to train the robot grasping strategy based on a fully convolutional network architecture, which evaluates millions of grasping options in four degrees of freedom (three-dimensional position and planar direction). The experimental results show that the fully convolutional grasp quality CNN (FC-GQ-CNN) has better speed and reliability. Liang et al. [124] proposed an end-to-end grasp evaluation model (PointNetGPD) to solve the problem of estimating grasp configurations directly from point cloud maps. The model is lightweight and takes the original point cloud as input, which allows it to directly process and evaluate the 3D point cloud inside the gripper. Even if the point cloud is very sparse, it can capture the complex geometry of the contact area between the gripper and the object.

Dividing objects into several categories according to their general shapes and then generating different grabbing strategies based on the category is a good approach, but its generality is poor. It is very important to realize 3D reconstruction of the object based on vision. The RGB-D image collected by the depth camera is input into the deep learning model to realize 3D reconstruction, which can improve the success rate and speed of grasping.

D. MOTION PATH
To solve the problem of dexterous hand grasp force when performing tasks, Sun et al. [125] proposed a motion reproduction system based on several motion and depth data streams. At the same time, a CNN is used to estimate the motion instructions from the depth image, and the force data are saved to generate the labeled training datasets. Deng et al. [126] proposed a learning framework combining semantic reach-to-grasp (RTG) with trajectory generation, aiming for the successful realization of semantic reach-to-grasp in unstructured environments. First, an object detection model based on deep learning is used to detect the objects of interest, and the trained network based on the Bayesian search algorithm is used to find the most successful grasping configuration from the object segmentation image. Second, a model-based trajectory generation method is designed for the robot's reaching motion, which is inspired by the theory of the human internal model to generate trajectories satisfying the constraints; the effectiveness of this method has been proven.

Different grasping forces are the key to successfully grasping different objects. Associating scene images with force data and using a CNN model to complete training can improve the adaptability of the robot grasping force. The combination of a CNN and traditional machine learning algorithms can realize the ranking of several options and output the optimal value.
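Associating depth images with recorded force data, as described above, amounts to training a regression CNN. The sketch below shows one plausible minimal setup in PyTorch, a small convolutional encoder regressing a scalar grasp force from a single-channel depth image; the architecture, the force unit and the synthetic data are assumptions for illustration, not the system of the cited papers.

```python
# Minimal sketch: regressing a grasp force (scalar) from a 1-channel depth image.
import torch
import torch.nn as nn

class ForceRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)  # predicted force (assumed unit: newtons)

    def forward(self, depth):
        return self.head(self.features(depth).flatten(1))

model = ForceRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
depth_batch = torch.rand(8, 1, 96, 96)       # placeholder depth images
force_batch = torch.rand(8, 1) * 10.0        # placeholder recorded forces
loss = nn.functional.mse_loss(model(depth_batch), force_batch)
loss.backward()
optimizer.step()
print("training loss:", loss.item())
```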


E. REAL-TIME MOTION
González-Díaz et al. [127] proposed a real-time solution to the problem of grasping actions in egocentric video. First, aiming to address the problem of deciding which object will be grabbed and when to trigger the grabbing operation for a given classification, the paper determines the grabbing area based on a gaze-guided CNN focusing on an object. Second, the obtained fixation sequence is noisy because of distraction and visual fatigue, and gaze is not always reliable for the object of interest. To solve this problem, video-level annotation is used to represent the object to be grabbed, and a corresponding loss function is used in a deep CNN. To detect when a person removes an object, the predictive ability of long short-term memory networks is used to analyze gaze and visual dynamics. The results show that this method has better performance than other methods on real datasets. Farag et al. [128] proposed a real-time object detection algorithm based on a selective compliance assembly robot arm (SCARA) for robot grasping and positioning on industrial assembly lines. The motion of the SCARA robot is composed of two parts: target detection based on deep learning and position measurement based on edge detection.

Real-time performance is very important for robot grasping, and good real-time performance can guide the robot to realize the recognition, positioning and grasping of dynamic objects. Based on the CNN AlexNet, the researchers used the transfer learning method to establish a target detection model, the knowledge and statistics superimposing network (KSSNet), which achieved a 100% success rate in target detection, location and grasping.

The target detection, recognition, location and grasp strategy generation involved in robot visual grasping all lie in the field of image processing, and the CNN has strong performance in this field. Therefore, the CNN is widely used in the field of visual grasping and works well. As shown in Table 2, from the proposal of the first full-fledged CNN in 1998 to the RetinaNet network in 2018, deep learning has been developing rapidly, and accuracy and speed have greatly improved.

TABLE 2. Comprehensive performance comparison of mainstream CNN models.

At present, CNN research is generally based on supervised learning, which needs a large number of labeled datasets for model training. However, with the continuous development of computer vision, it is increasingly difficult to obtain valuable labeled datasets, and most of the labeled data are calibrated by humans, which greatly increases the difficulty, cost and inconsistency of labeled data acquisition. For these reasons, neural networks that do not need or rely on labeled data have become a worldwide priority research direction. These algorithms need little or no labeled data, or do not need manually labeled data, which greatly reduces the need for human intervention in the model training process.

IV. DIFFERENT MACHINE VISION ALGORITHMS WITHOUT LABELED DATA
Supervised learning (especially the CNN) has made remarkable achievements in the field of vision after nearly ten years of rapid development, but it has also attracted some criticism. Labeled data are very important for the training of supervised learning, and the labeled data of traditional supervised learning need to be labeled manually, which not only leads to high cost but also appears less intelligent. With the rapid increase in artificial intelligence applications, especially machine vision, researchers hope to achieve model training without a large number of manually annotated datasets. Unsupervised learning can complete training based on unlabeled data, so it can realize object recognition and grasping very intelligently [129]–[132]. Self-supervised learning is a special case of supervised learning that does not need a large number of manually labeled datasets to realize model training [133]–[137].


FIGURE 3. Comparison between supervised and unsupervised learning.

Reinforcement learning learns an optimal policy, which enables an agent to perform an action according to the current state in a specific environment so as to obtain the maximum return. Reinforcement learning was not a focus in the early stage, but with Google's successful application in Atari games and Go, this branch of machine learning has attracted much attention. With the development of deep reinforcement learning, researchers have combined it with machine vision [138]–[142] in the hope of removing the need for labeled data and manual intervention to achieve intelligence.

A. UNSUPERVISED LEARNING
Unsupervised learning is one of the most difficult and important problems in machine vision and machine learning. Many researchers believe that learning from a large amount of unlabeled data can help solve problems concerning intelligence and the nature of learning. In addition, unsupervised learning has practical application value in many fields of computer vision and robot grasping because of the low cost and ease of collecting unlabeled image datasets. Fig. 3 makes it easy to see why researchers consider unsupervised learning more intelligent.

Unsupervised learning can be regarded as a branch of traditional machine learning. Dimension reduction and clustering are well-known unsupervised learning methods, but traditional unsupervised learning is mainly significant for data analysis. With the rapid development of deep learning and the difficulty of label data acquisition, the combination of deep learning and unsupervised learning has gradually become a reasonable research direction. Lenz et al. [143] designed a system to achieve robot grasping from RGB-D images by using deep learning. This method can label data without manual work. To quickly select the grabbing options, the paper proposes a two-step cascaded deep learning network. The first network quickly selects several grabbing strategies with high probability, and the second network takes the output of the first network as its input and calculates the optimal grabbing strategy. Ardon et al. [144] proposed a method to detect and extract multiple grabbing signals from visual input. This method does not need manually defined label data but collects distribution, location and executable grasp label data from 1269 objects to obtain their relationship with the input. Based on these datasets, the model not only learns to grasp the object but also has better generalization ability in different environments. Detry et al. [129] designed a new method of object recognition and grabbing based on dimension reduction and a clustering algorithm and let the model learn from a group of grasping examples to improve its generalization ability. Unsupervised learning has the advantage of object classification based on multimodal information because it does not require label data [130]–[132]. However, due to the inherent defects of vision and the development of sensor technology, integrating vision, tactile feedback and hearing to help the robot achieve accurate recognition and grasping of objects has become a popular direction.

Because unsupervised learning does not need labeled data, it has good generalization and can extend some features of known objects to similar objects to achieve the grasping of unknown objects. In addition, as a pretraining method, unsupervised learning has played an important role in the success of deep neural networks.

B. SELF-SUPERVISED LEARNING
Self-supervised learning mainly uses pretext tasks to mine its own supervision information from large-scale unlabeled datasets, and the neural network is trained on this constructed supervision information to learn representations that are valuable for downstream tasks. As shown in Fig. 4, the assessment of self-supervised learning ability is mainly completed through a pretraining-fine-tuning mode.


FIGURE 4. Process of self-supervised learning.

First, the network is trained on a pretext task using a large number of unlabeled datasets (with supervision information constructed automatically from the data), and a pretrained model is obtained. Then, for new downstream tasks, the algorithm adopts a method similar to supervised learning: parameters are obtained through transfer learning and then fine-tuned. Thus, the ability of self-supervised learning is mainly reflected by the performance on downstream tasks.

Nguyen et al. [133] adopted a self-supervised learning method in which the training datasets are automatically labeled by the model. In this paper, a continuous-level neural network is proposed to reduce the runtime of the grabbing task by eliminating nongraspable samples from the reasoning process, and the network can estimate 18 grasping postures and classify 4 objects at the same time. The experimental results show that the accuracy of the network is 94.8% for grasping posture estimation and 100% for object classification within 0.65 seconds. Murali et al. [134] proposed a new method to accelerate the self-supervised learning process and mapped visual information to a high-level, high-dimensional movement space to realize the training strategy of the model. Florence et al. [135] used self-supervised correspondence to improve the generalization ability and sample efficiency of visually driven policy learning. Yang et al. [137] proposed a critic-policy form to design a deep learning method for a new problem named ''grasping the invisible,'' where a robot is tasked with grasping an initially invisible object via a sequence of nonprehensile (e.g., pushing) and prehensile (e.g., grasping) actions. In this paper, the Bayesian algorithm and a classifier model are combined, the self-supervised method is used to train the motion critic and the classifier through the interaction between the robot and the environment, and a good success rate is achieved in the experiments.

Self-supervised learning is a type of unsupervised learning that realizes supervised training through the automatic generation of labels. Self-supervised learning not only achieves high accuracy and speed in object recognition, classification and grasping attitude estimation but also has good generalization performance.
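A common way to construct the pretext supervision described above is to have the network predict a transformation applied to an unlabeled image, for example its rotation, and then reuse the learned encoder for a downstream grasping or recognition task. The sketch below illustrates that rotation-prediction idea in PyTorch; it is a generic example of a pretext task, not the specific scheme of the works cited in this subsection.

```python
# Minimal sketch of a self-supervised pretext task: predict which rotation
# (0/90/180/270 degrees) was applied to an unlabeled image; the label is free.
import torch
import torch.nn as nn

encoder = nn.Sequential(              # small backbone whose weights are reused downstream
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
rotation_head = nn.Linear(32, 4)      # pretext head: 4 rotation classes
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(rotation_head.parameters()), lr=1e-3)

images = torch.rand(16, 3, 64, 64)    # unlabeled images (placeholder)
k = torch.randint(0, 4, (16,))        # self-generated labels: number of 90-degree turns
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(images, k)])

loss = nn.functional.cross_entropy(rotation_head(encoder(rotated)), k)
loss.backward()
optimizer.step()
# After pretraining, `encoder` can be fine-tuned on the labeled downstream task.
print("pretext loss:", loss.item())
```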


FIGURE 5. Mainstream reinforcement learning algorithms.

C. REINFORCEMENT LEARNING
Reinforcement learning has achieved good results in many decision-making fields, especially in games, where it has reached or even surpassed the human level. However, it is not widely used in the field of machine vision, which may be because vision does not seem to correspond directly to a decision-making environment or to interpretable action steps similar to those seen in games. Even so, because reinforcement learning does not need label data and works in a way similar to human beings, it has aroused researchers' enthusiasm to apply it to the visual field. Fig. 5 lists several mainstream reinforcement learning algorithms and their core structures. From the initial Q-learning to the recently popular deep reinforcement learning, it shows that reinforcement learning is developing rapidly. The training process of reinforcement learning, with little or no human intervention, has fascinated many researchers. As early as 2014, the Google DeepMind team applied deep reinforcement learning to the attention mechanism [145]. In 2018, Yu et al. [146] applied deep reinforcement learning to image inpainting and achieved good results. James et al. [147] proposed a new benchmark and learning environment for challenging robotic learning, RLBench, which is designed to accelerate progress in the field of visually guided manipulation. The above research lays a foundation for the application of deep reinforcement learning in machine vision to guide robots in recognizing and grasping objects.

Using model-free deep reinforcement learning, Zeng et al. [148] found that it is feasible for robots to learn cooperative grasping strategies by training two fully convolutional networks, the first mapping from vision to action and the other used for robot grasping. These two networks are jointly trained in a Q-learning framework, and self-supervised training is carried out entirely by a trial-and-error method. In the trial-and-error method, the successful completion of the action is rewarded, and the learning strategy promotes such actions in this way. Wang et al. [149] proposed a method combining Q-learning and visual servoing to solve the grasping problem of wheeled mobile robots and realized robust robot grasping. Gu et al. [150] proposed a new deep reinforcement learning algorithm based on off-policy training of deep Q-functions that can adapt to complex 3D manipulation tasks.

Breyer et al. [151] proposed an object grabbing algorithm based on reinforcement learning. In this paper, the image collected by a depth camera is mapped to a closed-loop control strategy of motion commands, and several different methods are compared to ensure the rationality of the algorithm. Katyal et al. [152] used deep reinforcement learning to make a robot immune to changes of the manipulator or environment and achieve robustness to environmental changes without clear prior knowledge or fine kinematic knowledge of the human arm structure and without careful hand-eye calibration. Ghadirzadeh et al. [153], to address the inherent delay in motion perception processes, proposed a data-based deep predictive policy training (DPPT) framework, which maps observed images to a sequence of motor activations. The system consists of three subnetworks, namely, the perception, policy and behavior superlayers, and each task is trained by policy search reinforcement learning.


TABLE 3. Analysis of advantages and disadvantages of unlabeled data algorithms.

former greatly improved the performance of the agent compared with the latter. Beltran-Hernandez et al. [155] proposed a reinforcement learning model based on a policy-search algorithm, which shows good robustness when generalizing from simple object shapes to complex ones. Li et al. [156] put forward a reinforcement learning strategy for the operation and grasping of a mobile manipulator to address the problem of a human-like mobile robot learning complex grasping actions in a human environment. This strategy reduces the complexity of visual feedback and can deal with changing operation dynamics and uncertain external interference. Miljković et al. [157] proposed an intelligent visual servo controller for robots based on reinforcement learning, developed two temporal-difference algorithms (Q-learning and SARSA) combined with a neural network, and tested them in different visual control scenes. Compared with a traditional image-based visual servo system, the proposed algorithm performs better for low-cost visual system manipulators.

Bousmalis et al. [158] studied how to extend randomized simulation environments and domain adaptation methods to train a grasping system to grasp new objects from monocular RGB images. Using only unlabeled real-world data and the grasp generative adversarial network (GraspGAN) algorithm proposed in that paper, the grasping performance is similar to that obtained with 939,777 labeled real-world samples. James et al. [159] proposed randomized-to-canonical adaptation networks (RCANs) to address the difficulty of acquiring labeled real-world data in robotics and achieved real-world performance using only nonreal-world data. The paper trained a vision-based closed-loop grasping reinforcement learning agent in simulation and then transferred it to the real world, achieving very good performance and proving the effectiveness of this sim-to-real approach. Hellman et al. [160] proposed a contextual multiarmed bandit (C-MAB) reinforcement learning algorithm that integrates vision and tactile feedback to realize the closure of a transparent and easily deformable zipper bag. Platt [161] took tactile feedback as the main information source, combined with partial visual information, and achieved better performance in experiments on grasping planar objects. Merzic et al. [162] used model-free deep reinforcement learning to combine vision and tactile feedback into a control strategy; the results show that tactile feedback can significantly improve grasping robustness for objects with pose uncertainty and complex features.

Traditional reinforcement learning is limited to small action and sample spaces and is usually applied in discrete settings. However, tasks that are more complex and closer to real conditions often involve large state spaces and continuous action spaces, and when the input data are images or sound, their dimensionality is high, which traditional reinforcement learning has difficulty handling. Deep reinforcement learning combines deep learning with reinforcement learning so that the two complement each other and achieve better performance.
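To make this pattern concrete, the minimal sketch below is an illustration only, not code from any of the cited works; the image size, discrete action set, reward and network dimensions are placeholder assumptions. A convolutional network maps a raw camera image to Q-values over a small set of candidate grasp actions, and the network is updated with the standard temporal-difference target.

import torch
import torch.nn as nn

class GraspQNet(nn.Module):
    """Maps a 64 x 64 grayscale image to Q-values over a small set of grasp actions."""
    def __init__(self, num_actions=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(32 * 13 * 13, num_actions)

    def forward(self, image):
        return self.head(self.features(image))

def td_update(net, optimizer, batch, gamma=0.99):
    """One Q-learning step on (image, action, reward, next_image, done) tensors."""
    img, act, rew, next_img, done = batch
    q = net(img).gather(1, act.unsqueeze(1)).squeeze(1)          # Q(s, a) of the action taken
    with torch.no_grad():                                        # temporal-difference target
        target = rew + gamma * (1.0 - done) * net(next_img).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

net = GraspQNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
batch = (torch.rand(4, 1, 64, 64), torch.randint(0, 8, (4,)),    # a fake experience batch
         torch.rand(4), torch.rand(4, 1, 64, 64), torch.zeros(4))
print(td_update(net, opt, batch))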


As shown in Table 3, these three types of algorithms do not need manually labeled data, which is a great advantage over traditional CNN algorithms. The three algorithms have not only achieved outstanding results in their own fields but also performed well in vision and robotics. Clustering algorithms from unsupervised learning are widely used in the vision field; by fusing clustering with deep learning, accurate recognition and classification of objects and recognition of the robot's running posture and trajectory can be realized, although the efficiency is low. Data are easy to obtain, but labeling them is costly, so researchers hope that supervised learning can train models with good generalization performance from only a few labeled samples. If a good feature representation can be obtained, it will be conducive to fine-tuning on downstream tasks and to multitask training, which is also the core idea of self-supervised learning. Self-supervised learning takes unlabeled datasets as input, automatically constructs labels from the structure or characteristics of the data itself, and then carries out training similar to supervised learning. Based on these advantages, self-supervised learning has achieved good training effects and high-precision target recognition and positioning in the vision field, but it still has the problem of label rationality.
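As a concrete illustration of constructing labels from the data itself, the sketch below sets up a rotation-prediction pretext task on unlabeled images; the backbone, image size and optimizer settings are assumptions for illustration and are not taken from the cited works. Each image is rotated by 0, 90, 180 or 270 degrees, the rotation index serves as a free label, and the trained backbone can later be fine-tuned on a small labeled grasping or recognition dataset.

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_rotation_batch(images):
    """images: [B, C, H, W] with no labels. Returns rotated copies and free labels 0-3."""
    rotated, labels = [], []
    for k in range(4):                                   # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

backbone = nn.Sequential(                                # feature extractor reused downstream
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
rotation_head = nn.Linear(16, 4)                         # pretext-task classifier
optimizer = torch.optim.Adam(list(backbone.parameters()) + list(rotation_head.parameters()))

x, y = make_rotation_batch(torch.rand(8, 3, 32, 32))     # unlabeled images
loss = F.cross_entropy(rotation_head(backbone(x)), y)    # supervised-style loss on free labels
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(float(loss))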
The principles of reinforcement learning make it less dominant in the fields of vision and target detection and recognition. The integration of reinforcement learning and deep learning is the mainstream research direction and has achieved good performance in many decision-making fields. A visual perception model based on deep reinforcement learning can predict all possible actions in the current state from only the original image as input; therefore, deep reinforcement learning has produced research achievements in action-conditional video prediction tasks. In addition, deep reinforcement learning based on the policy gradient (e.g., trust region policy optimization (TRPO), generalized advantage estimation (GAE), stochastic value gradient (SVG), and asynchronous advantage actor-critic (A3C)) realizes behavior control of robots and has been verified in actual application scenarios. However, the low sampling efficiency of reinforcement learning makes training difficult, and a reasonable reward function and network structure need to be designed to achieve good results.

V. FUSION OF VISUAL AND TACTILE FEEDBACK
After years of development, object recognition and location based on machine vision have achieved great success, which lays a solid foundation for research on robot grasping. At present, representative object detection algorithms (e.g., Faster-RCNN [94], SSD [100], and YOLOv3 [119]) can quickly identify and locate objects, but precise location alone cannot make a manipulator grasp stably in complex environments. Judging from people's own experience in grasping objects, a series of attributes, such as the hardness and quality of objects, are needed to ensure a successful grasp. In addition, the accuracy of machine vision is greatly affected by the surrounding environment: when the robot is applied in a variable light source environment, such as everyday life scenes, the robustness of machine vision is low [193]–[197], and it is difficult to achieve stable grasping by machine vision alone when the object deforms easily [198]. To solve these problems, researchers in the fields of robotics and vision consider adding tactile sensors to the robot to achieve more stable grasping. The research directions are mainly divided into purely tactile object perception [199]–[203] and visual-tactile fusion [204]–[207] for object recognition and grasping.

A. TACTILE FEEDBACK
For human beings, tactile feedback is the second most important signal receptor after vision and plays an important role in daily life. With the development of tactile sensor technology [208]–[212], researchers hope that robots can also have the same tactile perception ability as humans and thus move further toward intelligence. Applying tactile technology to the robot alone is also attractive, since it avoids the fusion of different signals and improves the processing speed of the system. As shown in Fig. 6, Sundaram et al. [213] proposed a low-cost and highly robust tactile glove, which weaves an array pressure sensor onto the surface of a flexible glove that is then worn on the hands of the experimenter to collect tactile data for different objects. By touching different objects, different pressure point cloud images are obtained and fed into a neural network for training to realize object recognition and weight estimation without vision.
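A hypothetical sketch of this kind of pipeline is given below; it is not the network of Sundaram et al. [213], and the 32 x 32 taxel resolution, class count and architecture are assumptions made only for illustration. A small convolutional network takes one frame of the glove's pressure array and jointly outputs an object class and a scalar weight estimate.

import torch
import torch.nn as nn

class TactileNet(nn.Module):
    """Pressure frame (1 x 32 x 32) -> object class logits and a scalar weight estimate."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 16 x 16
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8 x 8
            nn.Flatten(),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # which object is being touched
        self.weight_head = nn.Linear(32 * 8 * 8, 1)           # estimated weight (regression)

    def forward(self, pressure):
        z = self.encoder(pressure)
        return self.classifier(z), self.weight_head(z)

net = TactileNet()
frame = torch.rand(4, 1, 32, 32)             # a batch of simulated pressure maps
logits, weight = net(frame)
print(logits.argmax(dim=1), weight.squeeze(1))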


FIGURE 6. Tactile sensor schematic.

FIGURE 7. Framework of visual-tactile fusion for object recognition.

Rasouli et al. [199] developed a neuromorphic system for tactile pattern recognition, aiming to address the low efficiency and limited capability of artificial tactile sensors. The system achieved 92% classification accuracy in a texture recognition task and showed that there is a tradeoff between response time and classification accuracy. Ward-Cherrier et al. [200] developed the gripping platform GR2, which demonstrated reorientation of a grasped object through active tactile manipulation using a new tactile sensor. The active tactile manipulation proposed in this study is model-free and can be used to study the operating principles of a dexterous hand. Bimbo et al. [201] proposed a method to localize the grasped object in the robot's hand, which includes computing the covariance of the tactile sensor's pressure data and the eigenbasis vectors of its main axes. Liu et al. [202] regarded tactile data as a time series, used dynamic time warping to evaluate the differences between sequences, and proposed a joint kernel sparse coding model for the representation and classification of tactile data. Bhattacharjee et al. [203] used the first two seconds of force, heat, and motion sensing data collected by a robot in a real environment to address the impact of the surroundings on tactile perception when the robot works in a human environment (such as a home), and characterized data-driven approaches to various tactile perception tasks (nearest neighbor, SVM, hidden Markov model, and long short-term memory). The results show the value of multimodal tactile perception and data-driven methods for short-term contact tactile perception.
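As a self-contained illustration of the time-series view, the following sketch computes a dynamic time warping distance between two tactile sequences of different lengths; the sensor dimensionality and the Euclidean frame-to-frame cost are assumptions for illustration and do not reproduce the joint kernel sparse coding model of Liu et al. [202]. The resulting elastic distance can feed a nearest-neighbor or kernel-based classifier.

import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two tactile sequences.
    a: [Ta, D] and b: [Tb, D], each row one frame of D sensor readings."""
    ta, tb = len(a), len(b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])          # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],              # insertion
                                 cost[i, j - 1],              # deletion
                                 cost[i - 1, j - 1])          # match
    return float(cost[ta, tb])

# Two recordings of different lengths from an assumed 16-taxel sensor
seq_a = np.random.rand(40, 16)
seq_b = np.random.rand(55, 16)
print(dtw_distance(seq_a, seq_b))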
The research history of machine tactile feedback is relatively short and clearly lags behind machine vision. This lag is mainly due to the limited hardware performance of tactile sensors and the confusion of sensor types; the shortage of research content and methods for tactile technology has also held tactile research back. With the rapid development of intelligent robots, tactile feedback has gradually attracted the attention of researchers, and many fruitful research achievements have appeared. At present, research on tactile technology mainly focuses on three areas:
1. Hardware improvement of tactile sensors. Through better hardware, the sensitivity of the sensor can be improved, and multiple types of data can be collected at the same time (e.g., temperature, pressure, friction, etc.).
2. Precise extraction of object features based on tactile feedback, so that model-free, stable operation of objects (e.g., grasping, classification, recognition, attitude estimation, etc.) can be achieved and the generalization of tactile feedback improved.
3. Combination of tactile feedback and deep learning to realize the acquisition of tactile datasets and training on them, after which a deep neural network (DNN) realizes weight perception, grasping and classification of objects.
Tactile feedback is second only to vision in information perception, but its research and application lag far behind those of vision, mainly due to the poor universality and reliability of tactile feedback. Applying tactile technology in multisensor sensing systems to realize complementary information perception is a reasonable future research direction.

B. FUSION OF VISION AND TACTILE FEEDBACK
The integration of vision and tactile feedback helps the robot achieve better grasping, which is also more in line with human expectations for robots. However, the research history of robot tactile feedback is relatively short, and many types of sensors and multisensor data fusion approaches are involved, which makes tactile research difficult, scattered and unsystematic. As shown in Fig. 7, an object recognition and grasping system based on visual-tactile fusion is generally divided into four steps. First, 2D vision processing is used to determine the object's position and boundary area, and 3D vision is then used to determine the object's center of mass as the starting point of tactile detection. Second, tactile exploration is carried out for features and positions (e.g., pits, holes or occluded areas) that are hard to determine by vision, to further determine the object's surface features. Third, the information collected by vision and tactile feedback is fused to generate an accurate 3D point cloud. Fourth, an appropriate grasping strategy is generated to guide the robotic arm to complete the grasp based on the visual centroid and tactile features.
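A toy, self-contained skeleton of these four steps is sketched below; the thresholding "detector", the simulated tactile probes and the nearest-point "planner" are placeholder assumptions standing in for the real detection, exploration and grasp-planning components of the surveyed systems.

import numpy as np

def locate_object(depth):
    """Step 1: use vision to get the object's region and 3D centroid.
    Here the 'object' is simply every pixel closer than the mean depth."""
    mask = depth < depth.mean()
    ys, xs = np.nonzero(mask)
    centroid = np.array([xs.mean(), ys.mean(), depth[mask].mean()])
    return mask, centroid

def tactile_explore(centroid, num_probes=5):
    """Step 2: probe around the centroid for features vision cannot resolve.
    A real system would command the arm; here contacts are simulated."""
    offsets = np.random.uniform(-1.0, 1.0, size=(num_probes, 3))
    return centroid + offsets

def fuse_point_cloud(depth, mask, tactile_points):
    """Step 3: merge visual surface points and tactile contact points."""
    ys, xs = np.nonzero(mask)
    visual_points = np.stack([xs, ys, depth[ys, xs]], axis=1)
    return np.vstack([visual_points, tactile_points])

def plan_grasp(cloud, centroid):
    """Step 4: pick the fused point nearest the centroid as a toy grasp target."""
    idx = np.argmin(np.linalg.norm(cloud - centroid, axis=1))
    return cloud[idx]

depth = np.random.rand(64, 64) + 1.0
depth[20:40, 25:45] -= 0.8                     # a nearer 'object' region
mask, centroid = locate_object(depth)
cloud = fuse_point_cloud(depth, mask, tactile_explore(centroid))
print("grasp target:", plan_grasp(cloud, centroid))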


Calandra et al. [204] studied how robots can learn to use tactile information iteratively to effectively adjust their grasping strategy, proposing an end-to-end action-conditional model that learns the grasping strategy from raw visual-tactile data. Guo et al. [205] proposed a vision-tactile combination method based on deep learning for robot grasp detection, and experiments show that tactile data help deep learning to learn better object characteristics for grasp detection tasks. Li et al. [206] designed a slip detection algorithm using the GelSight tactile sensor and a camera installed on the side of the gripper, without knowing the physical parameters of the object in advance. Using the image sequences collected by the two sensors, a DNN is trained to classify the grasped objects and to evaluate the stability of the grasping process.
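The sketch below shows a generic two-stream pattern assumed here for illustration; it is not the network of Guo et al. [205] or Li et al. [206]. One branch encodes the camera image, another encodes the tactile image, the features are concatenated, and a small classifier predicts whether the current grasp is stable or slipping.

import torch
import torch.nn as nn

def small_encoder(in_channels):
    """Shared conv block: image (in_channels x 64 x 64) -> 128-d feature vector."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, 5, stride=2), nn.ReLU(),
        nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 128), nn.ReLU(),
    )

class VisuoTactileNet(nn.Module):
    """Fuses a camera image and a tactile image (e.g., a GelSight frame) and
    predicts whether the current grasp is stable (1) or slipping (0)."""
    def __init__(self):
        super().__init__()
        self.vision = small_encoder(3)    # RGB stream
        self.touch = small_encoder(1)     # tactile stream
        self.classifier = nn.Sequential(
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2),
        )

    def forward(self, rgb, tactile):
        fused = torch.cat([self.vision(rgb), self.touch(tactile)], dim=1)
        return self.classifier(fused)

net = VisuoTactileNet()
logits = net(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
print(logits.argmax(dim=1))               # predicted stability label per sample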
Garg et al. [207] proposed an adaptive grasping method based on tactile and visual feedback. This method combines model-based partially observable Markov decision process (POMDP) planning with learning in simulation, and it shows strong robustness under uncertainty, strong generalization ability and fast execution for multiple objects. Wang et al. [214] proposed a new method to solve the problems of imprecise visual modeling and low tactile efficiency; by combining vision and tactile feedback and learning prior knowledge of common object shapes from a large shape database, the method can effectively perceive accurate 3D information about the object. Hogan et al. [215] proposed a regrasp control strategy that uses a tactile sensor to adjust the local grasping action. In that paper, local transformations of the measured tactile signal are used to determine the regrasp action and improve the quality of the grasp, and the success rate of visual-tactile fusion is 70% higher than that of vision alone. Sun et al. [216] put forward two different tactile sequence models according to the respective advantages of vision and tactile feedback, proposed an object shape modeling method based on orientation description histogram features, and then considered the accuracy of the grasping point and rapid planning of hand kinematics to achieve the grasping operation.

The research results of the above papers show that the fusion of vision and tactile feedback improves the robustness and success rate of robot grasping, indicating that the introduction of tactile feedback provides a new direction for robot grasping research. The grasping of deformable objects has always been a difficult problem, since the operation requires accurately estimating the real-time state of the objects. At present, the main research direction is machine vision, but vision is very sensitive to occlusion, which is inevitable when the robot moves. Compared with vision, tactile feedback is highly robust, so adding tactile feedback can address this problem well. Sanchez et al. [198] proposed a modular pipeline that can track the shape of deformable objects online by coupling a tactile sensor with a deformation model, achieving robust grasping through the combination of vision and tactile feedback. Jain et al. [217] proposed a simulation-based learning method that uses a simulated five-fingered dexterous hand to train deep visuomotor policies for various manipulation tasks and found that using tactile sensing information gives tasks with highly occluded objects faster learning and better asymptotic performance. Yu et al. [194] proposed a framework that fuses vision and tactile feedback to estimate, in real time, the attitude and contact state of objects relative to the environment, aiming at the application of inserting objects picked up by a suction cup into a small space. A fusion algorithm based on iSAM (an online estimation technique) is adopted in the framework to fuse robot motion measurements, geometric contact between the object and the container, and visual tracking, and a data-driven method is then proposed to infer the contact information and achieve better grasping and placement. Santina et al. [218] proposed a data-driven autonomous grasping mechanism for a humanoid soft hand to improve grasping performance. The nail of the humanoid soft hand is equipped with an inertial measurement unit to detect contact with objects, and a classifier obtained by a deep neural network takes the visual information of the grasped object as input and predicts the grasping action. Hang et al. [219] proposed a unified framework for grasp planning and in-hand grasp adaptation based on visual, tactile and proprioceptive feedback. The main purpose of the framework is to solve the problems of object deformation, sliding and external interference to achieve stable grasping.

TABLE 4. Comparative analysis of vision and tactile feedback.

As shown in Table 4, visual and tactile feedback are the basic ways for a human or robot to perceive the environment or a target, and they are key research fields for scholars across the globe. Because of their different principles and data structures, each has advantages and disadvantages in perception and recognition, so combining them is a reasonable choice. The combination of vision and tactile feedback realizes complementary advantages and can achieve more accurate object recognition, real-time state estimation, grasp force adjustment, 3D object modeling, grasp pose detection and other functions, but the process of analyzing such multivariate data is difficult. At present, the mainstream research direction of visual-tactile fusion is to realize the direct input of visual-tactile data and the output of results via end-to-end deep learning. However, some problems remain, such as the lack of a general research framework, confusion over methods, and challenges related to unified evaluation.

VI. DISCUSSION AND FUTURE DIRECTIONS
The ultimate goal of researchers is to create machine vision and robots that have the same visual recognition and grasping ability as human beings; this is an important step that must be achieved before robots can be widely applied, from industry to daily life.


Although there has been great progress in object recognition, location, grasping speed and accuracy, there is still a vast gap between robots and human beings when facing unstructured life scenes, which is an important reason why robots cannot yet be applied in daily life. Based on the development status of machine vision and an analogy with human vision, the following thoughts are put forward regarding the future development of robot grasping.
1. Vision is still the mainstream technology. Because of its noncontact and high-efficiency characteristics, vision has great advantages. With the development of camera technology, the collection of environmental and object information will become more accurate and robust, which will greatly advance machine vision.
2. Tactile feedback will become an important part of robot grasping systems. Because of the inherent defects of vision, it is difficult to generate an appropriate grasping strategy in complex environments from the object characteristics collected by vision alone. Hence, the combination of vision and tactile feedback will be an important future development direction, so that accurate recognition, positioning and stable grasping of objects can be achieved.
3. CNNs will still develop rapidly over a short period of time but may be replaced in the future. The CNN model has evolved step by step from giant to lightweight networks while achieving continuously higher accuracy. However, it needs a massive amount of labeled data for training, which is time consuming; real artificial intelligence (AI) needs the ability to perform few-shot learning.
4. Reinforcement learning and unsupervised learning will develop rapidly. Because of their low dependence on labeled data, their training process is relatively intelligent, which meets people's expectations of AI.

VII. CONCLUSION
Machine vision and robotics are two research directions that inspire researchers all over the world. People hope to combine these two streams of research to create robots with the same target recognition and grasping ability as humans, which could lead to the partial realization of futuristic scenes from movies or science fiction. In this paper, the mainstream machine vision technology applied to robots is reviewed in detail, including traditional machine learning; CNNs, which have achieved good results in recent years; and reinforcement learning, unsupervised learning and self-supervised learning, which avoid the limitations of labeled data. In view of the limitations of vision, this paper also summarizes the development of tactile feedback in detail. This survey provides a detailed reference for evaluating current research on robot grasping based on machine vision and tactile feedback. Future research directions of machine vision and robot grasping are also considered.

REFERENCES
[1] L. Bozhkov and P. Georgieva, ''Overview of deep learning architectures for EEG-based brain imaging,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN). Rio de Janeiro, Brazil: IEEE, Jul. 2018, pp. 1–7.
[2] X. Shen, H.-S. Kim, S. Komatsu, A. Markman, and B. Javidi, ''Spatial-temporal human gesture recognition under degraded conditions using three-dimensional integral imaging: An overview,'' in Proc. 17th Workshop Inf. Opt. (WIO). Québec, QC, Canada: IEEE, Jul. 2018, pp. 13938–13951.
[3] B. Gite, K. Nikhal, and F. Palnak, ''Evaluating facial expressions in real time,'' in Proc. Intell. Syst. Conf. (IntelliSys). London, U.K.: IEEE, Sep. 2017, pp. 849–855.
[4] P. Panchal, V. C. Raman, and S. Mantri, ''Plant diseases detection and classification using machine learning models,'' in Proc. 4th Int. Conf. Comput. Syst. Inf. Technol. Sustain. Solution (CSITSS). Bengaluru, India: IEEE, Dec. 2019, pp. 1–6.
[5] M. Gao, J. Jiang, G. Zou, V. John, and Z. Liu, ''RGB-D-based object recognition using multimodal convolutional neural networks: A survey,'' IEEE Access, vol. 7, pp. 43110–43136, 2019.
[6] H. Wang, H. Du, Y. Zhao, and J. Yan, ''A comprehensive overview of person re-identification approaches,'' IEEE Access, vol. 8, pp. 45556–45583, 2020.
[7] M. E. Celebi, N. Codella, and A. Halpern, ''Dermoscopy image analysis: Overview and future directions,'' IEEE J. Biomed. Health Inform., vol. 23, no. 2, pp. 474–478, Mar. 2019.
[8] H. Greenspan, B. van Ginneken, and R. M. Summers, ''Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique,'' IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1153–1159, May 2016.
[9] D. Zhao, Y. Chen, and L. Lv, ''Deep reinforcement learning with visual attention for vehicle classification,'' IEEE Trans. Cognit. Develop. Syst., vol. 9, no. 4, pp. 356–367, Dec. 2017.
[10] W. Zhang, K. Song, X. Rong, and Y. Li, ''Coarse-to-fine UAV target tracking with deep reinforcement learning,'' IEEE Trans. Autom. Sci. Eng., vol. 16, no. 4, pp. 1522–1530, Oct. 2019.
[11] N. Hajj and M. Awad, ''On biologically inspired stochastic reinforcement deep learning: A case study on visual surveillance,'' IEEE Access, vol. 7, pp. 108431–108437, 2019.
[12] H. Yuan, D. Li, and J. Wu, ''Efficient learning of grasp selection for five-finger dexterous hand,'' in Proc. IEEE 7th Annu. Int. Conf. CYBER Technol. Autom., Control, Intell. Syst. (CYBER). Honolulu, HI, USA: IEEE, Jul. 2017, pp. 1101–1106.
[13] J. Yang, S. Li, Z. Gao, Z. Wang, and W. Liu, ''Real-time recognition method for 0.8 cm darning needles and KR22 bearings based on convolution neural networks and data increase,'' Appl. Sci., vol. 8, no. 1857, pp. 1–18, 2018.
[14] J. Yang, S. Li, Z. Wang, and G. Yang, ''Real-time tiny part defect detection system in manufacturing using deep learning,'' IEEE Access, vol. 7, pp. 89278–89291, 2019.
[15] A. Wang, M. Chu, M. Sha, and L. Liu, ''A new process industry fault diagnosis algorithm based on ensemble improved binary-tree SVM,'' Chin. J. Electron., vol. 24, no. 2, pp. 258–262, Apr. 2015.
[16] J. Li, N. Allinson, D. Tao, and X. Li, ''Multitraining support vector machine for image retrieval,'' IEEE Trans. Image Process., vol. 15, no. 11, pp. 3597–3601, Nov. 2006.
[17] E. Pasolli, F. Melgani, and Y. Bazi, ''Support vector machine active learning through significance space construction,'' IEEE Geosci. Remote Sens. Lett., vol. 8, no. 3, pp. 431–435, May 2011.
[18] D. Singh, D. Roy, and C. K. Mohan, ''DiP-SVM: Distribution preserving kernel support vector machine for big data,'' IEEE Trans. Big Data, vol. 3, no. 1, pp. 79–90, Mar. 2017.
[19] J. Ruan, H. Jiang, X. Li, Y. Shi, F. T. S. Chan, and W. Rao, ''A granular GA-SVM predictor for big data in agricultural cyber-physical systems,'' IEEE Trans. Ind. Informat., vol. 15, no. 12, pp. 6510–6521, Dec. 2019.
[20] X. Hu, P. Niu, J. Wang, and X. Zhang, ''A dynamic rectified linear activation units,'' IEEE Access, vol. 7, pp. 180409–180416, 2019.
[21] B. Zhang, M. Zhu, M. Yu, D. Pu, and G. Feng, ''Extreme residual connected convolution-based collaborative filtering for document context-aware rating prediction,'' IEEE Access, vol. 8, pp. 53604–53613, 2020.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ''ImageNet classification with deep convolutional neural networks,'' Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.

[23] L. Xiang, G. Zhao, Q. Li, W. Hao, and F. Li, ‘‘TUMK-ELM: A fast unsu- [45] M. Butcher and A. Karimi, ‘‘Linear parameter-varying iterative learning
pervised heterogeneous data learning approach,’’ IEEE Access, vol. 6, control with application to a linear motor system,’’ IEEE/ASME Trans.
pp. 35305–35315, 2018. Mechatronics, vol. 15, no. 3, pp. 412–420, Jun. 2010.
[24] M. Usama, J. Qadir, A. Raza, H. Arif, K.-L.-A. Yau, Y. Elkhatib, [46] J.-G. Hsieh, Y.-L. Lin, and J.-H. Jeng, ‘‘Preliminary study on Wilcoxon
A. Hussain, and A. Al-Fuqaha, ‘‘Unsupervised machine learning for learning machines,’’ IEEE Trans. Neural Netw., vol. 19, no. 2,
networking: Techniques, applications and research challenges,’’ IEEE pp. 201–211, Feb. 2008.
Access, vol. 7, pp. 65579–65615, 2019. [47] J. Song, F. Dong, J. Zhao, H. Wang, Z. He, and L. Wang, ‘‘An efficient
[25] J.-Y. Zhu, J. Wu, Y. Xu, E. Chang, and Z. Tu, ‘‘Unsupervised object multiobjective design optimization method for a PMSLM based on an
class discovery via saliency-guided multiple class learning,’’ IEEE Trans. extreme learning machine,’’ IEEE Trans. Ind. Electron., vol. 66, no. 2,
Pattern Anal. Mach. Intell., vol. 37, no. 4, pp. 862–875, Apr. 2015. pp. 1001–1011, Feb. 2019.
[26] C. Liu, L. Song, J. Zhang, K. Chen, and J. Xu, ‘‘Self-supervised learning [48] N. D. Vanli, M. O. Sayin, I. Delibalta, and S. S. Kozat, ‘‘Sequential
for specified latent representation,’’ IEEE Trans. Fuzzy Syst., vol. 28, nonlinear learning for distributed multiagent systems via extreme learn-
no. 1, pp. 47–59, Jan. 2020. ing machines,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3,
[27] A. Zhao, J. Dong, and H. Zhou, ‘‘Self-supervised learning from multi- pp. 546–558, Mar. 2017.
sensor data for sleep recognition,’’ IEEE Access, vol. 8, pp. 93907–93921, [49] M. H. C. Law and A. K. Jain, ‘‘Incremental nonlinear dimensionality
2020. reduction by manifold learning,’’ IEEE Trans. Pattern Anal. Mach. Intell.,
[28] W. Abdullah Al and I. D. Yun, ‘‘Partial policy-based reinforcement vol. 28, no. 3, pp. 377–391, Mar. 2006.
learning for anatomical landmark localization in 3D medical images,’’ [50] H. Liu, Z. Liu, S. Liu, Y. Liu, J. Bin, F. Shi, and H. Dong, ‘‘A nonlinear
IEEE Trans. Med. Imag., vol. 39, no. 4, pp. 1245–1255, Apr. 2020. regression application via machine learning techniques for geomagnetic
[29] H. Liu, Y. Yu, F. Sun, and J. Gu, ‘‘Visual–tactile fusion for object data reconstruction processing,’’ IEEE Trans. Geosci. Remote Sens.,
recognition,’’ IEEE Trans. Autom. Sci. Eng., vol. 14, no. 2, pp. 996–1008, vol. 57, no. 1, pp. 128–140, Jan. 2019.
Apr. 2017. [51] G. Chen, J. Du, L. Sun, W. Zhang, K. Xu, X. Chen, G. T. Reed, and
[30] X. Li, H. Liu, J. Zhou, and F. Sun, ‘‘Learning cross-modal visual-tactile Z. He, ‘‘Nonlinear distortion mitigation by machine learning of SVM
representation using ensembled generative adversarial networks,’’ Cog- classification for PAM-4 and PAM-8 modulated optical interconnection,’’
nit. Comput. Syst., vol. 1, no. 2, pp. 40–44, Jul. 2019. J. Lightw. Technol., vol. 36, no. 3, pp. 650–657, Feb. 1, 2018.
[31] P. Falco, S. Lu, C. Natale, S. Pirozzi, and D. Lee, ‘‘A transfer learn- [52] K. Gao, W. Guo, X. Yu, B. Liu, A. Yu, and X. Wei, ‘‘Deep induction
ing approach to cross-modal object recognition: From visual observa- network for small samples classification of hyperspectral images,’’ IEEE
tion to robotic haptic exploration,’’ IEEE Trans. Robot., vol. 35, no. 4, J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 3462–3477,
pp. 987–998, Aug. 2019. 2020.
[32] F. D. Ledezma and S. Haddadin, ‘‘FOP networks for learning humanoid [53] D. Zhang, W. Ding, C. Liu, H. Wang, and B. Zhang, ‘‘Modulated auto-
body schema and dynamics,’’ in Proc. IEEE-RAS 18th Int. Conf. correlation convolution networks for automatic modulation classification
Humanoid Robots (Humanoids). Beijing, China: IEEE, Nov. 2018, based on small sample set,’’ IEEE Access, vol. 8, pp. 27097–27105, 2020.
pp. 1–9.
[54] Q. Zhou and X. He, ‘‘Broad learning model based on enhanced features
[33] M. C. Capolei, N. A. Andersen, H. H. Lund, E. Falotico, and S. Tolu, learning,’’ IEEE Access, vol. 7, pp. 42536–42550, 2019.
‘‘A cerebellar internal models control architecture for online sensorimotor
[55] J. Xu, Y. Y. Tang, B. Zou, Z. Xu, L. Li, Y. Lu, and B. Zhang,
adaptation of a humanoid robot acting in a dynamic environment,’’ IEEE
‘‘The generalization ability of SVM classification based on Markov sam-
Robot. Autom. Lett., vol. 5, no. 1, pp. 80–87, Jan. 2020.
pling,’’ IEEE Trans. Cybern., vol. 45, no. 6, pp. 1169–1179, Jun. 2015.
[34] F. Keyrouz, ‘‘A novel robotic sound localization and separation using non-
causal filtering and Bayesian fusion,’’ in Proc. IEEE 26th Int. Workshop [56] C. Lu, A. Devos, J. A. K. Suykens, C. Arus, and S. Van Huffel, ‘‘Bag-
Mach. Learn. Signal Process. (MLSP). Vietri sul Mare, Italy: IEEE, ging linear sparse Bayesian learning models for variable selection in
Sep. 2016, pp. 1–6. cancer diagnosis,’’ IEEE Trans. Inf. Technol. Biomed., vol. 11, no. 3,
pp. 338–347, May 2007.
[35] E. Sauser and A. Billard, ‘‘Biologically inspired multimodal inte-
gration: Interferences in a human-robot interaction game,’’ in Proc. [57] A. Luo, F. An, X. Zhang, and H. J. Mattausch, ‘‘A hardware-efficient
IEEE/RSJ Int. Conf. Intell. Robots Syst. Beijing, China: IEEE, Oct. 2006, recognition accelerator using Haar-like feature and SVM classifier,’’
pp. 5619–5624. IEEE Access, vol. 7, pp. 14472–14487, 2019.
[36] M. Toussaint and C. Goerick, ‘‘Probabilistic inference for structured [58] R. Trinchero, P. Manfredi, I. S. Stievano, and F. G. Canavero, ‘‘Machine
planning in robotics,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. learning for the performance assessment of high-speed links,’’ IEEE
San Diego, CA, USA: IEEE, Oct. 2007, pp. 3068–3073. Trans. Electromagn. Compat., vol. 60, no. 6, pp. 1627–1634, Dec. 2018.
[37] Q. Zhang, L. T. Yang, and Z. Chen, ‘‘Deep computation model for unsu- [59] A. J. Siddiqui, A. Mammeri, and A. Boukerche, ‘‘Real-time vehicle make
pervised feature learning on big data,’’ IEEE Trans. Services Comput., and model recognition based on a bag of SURF features,’’ IEEE Trans.
vol. 9, no. 1, pp. 161–171, Feb. 2016. Intell. Transp. Syst., vol. 17, no. 11, pp. 3205–3219, Nov. 2016.
[38] W. Wang and M. Zhang, ‘‘Tensor deep learning model for heterogeneous [60] M. C. Ergene and A. Durdu, ‘‘Robotic hand grasping of objects classified
data fusion in Internet of Things,’’ IEEE Trans. Emerg. Topics Comput. by using support vector machine and bag of visual words,’’ in Proc.
Intell., vol. 4, no. 1, pp. 32–41, Feb. 2020. Int. Artif. Intell. Data Process. Symp. (IDAP). Malatya, Turkey: IEEE,
[39] Y. Lei, F. Jia, J. Lin, S. Xing, and S. X. Ding, ‘‘An intelligent fault diag- Sep. 2017, pp. 1–5.
nosis method using unsupervised feature learning towards mechanical [61] Y. Hu, Z. Li, G. Li, P. Yuan, C. Yang, and R. Song, ‘‘Development
big data,’’ IEEE Trans. Ind. Electron., vol. 63, no. 5, pp. 3137–3147, of sensory-motor fusion-based manipulation and grasping control for a
May 2016. robotic hand-eye system,’’ IEEE Trans. Syst., Man, Cybern. Syst., vol. 47,
[40] G. A. Susto, A. Schirru, S. Pampuri, and S. McLoone, ‘‘Supervised no. 7, pp. 1169–1180, Jul. 2017.
aggregative feature extraction for big data time series regression,’’ IEEE [62] C. M. o. Valente, A. Schammass, A. F. R. Araujo, and G. A. P. Caurin,
Trans. Ind. Informat., vol. 12, no. 3, pp. 1243–1252, Jun. 2016. ‘‘Intelligent Grasping Using Neural Modules,’’ in Proc. IEEE Int. Conf.
[41] N. Yu, Z. Li, and Z. Yu, ‘‘Survey on encoding schemes for genomic data Syst., Man, Cybern. Tokyo, Japan: IEEE, Oct. 1999, pp. 780–785.
representation and feature learning—From signal processing to machine [63] M. Hannat, N. Zrira, Y. Raoui, and E. H. Bouyakhf, ‘‘A fast object
learning,’’ Big Data Mining Anal., vol. 1, no. 3, pp. 191–210, 2018. recognition and categorization technique for robot grasping using the
[42] F. Ye, Z. Zhang, K. Chakrabarty, and X. Gu, ‘‘Board-level functional visual bag of words,’’ in Proc. 5th Int. Conf. Multimedia Comput. Syst.
fault diagnosis using multikernel support vector machines and incremen- (ICMCS). Marrakech, Morocco: IEEE, Sep. 2016, pp. 173–178.
tal learning,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., [64] K. Harada, T. Tsuji, K. Nagata, N. Yamanobe, H. Onda, T. Yoshimi, and
vol. 33, no. 2, pp. 279–290, Feb. 2014. Y. Kawai, ‘‘Object placement planner for robotic pick and place tasks,’’ in
[43] D. Elizondo, ‘‘The linear separability problem: Some testing methods,’’ Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. Vilamoura, Portugal: IEEE,
IEEE Trans. Neural Netw., vol. 17, no. 2, pp. 330–344, Mar. 2006. Oct. 2012, pp. 980–985.
[44] A. J. Stimpson and M. L. Cummings, ‘‘Assessing intervention timing [65] N. K. Verma, A. Mustafa, and A. Salour, ‘‘Stereo-vision based object
in computer-based education using machine learning algorithms,’’ IEEE grasping using robotic manipulator,’’ in Proc. 11th Int. Conf. Ind. Inf. Syst.
Access, vol. 2, pp. 78–87, 2014. (ICIIS). Roorkee, India: IEEE, Dec. 2016, pp. 95–100.


[66] J. Zhang and L. Shen, ‘‘Clustering and recognition for automated tracking [86] L. Xu, L. Wang, Y. Zhang, and S. Cheng, ‘‘Visual tracking based
and grasping of moving objects,’’ in Proc. IEEE Workshop Electron., on siamese network of fused score map,’’ IEEE Access, vol. 7,
Comput. Appl. Ottawa, ON, Canada: IEEE, May 2014, pp. 222–229. pp. 151389–151398, 2019.
[67] R. Kouskouridas, A. Amanatiadis, and A. Gasteratos, ‘‘Guiding a robotic [87] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
gripper by visual feedback for object manipulation tasks,’’ in Proc. IEEE V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’ in
Int. Conf. Mechatronics. Istanbul, Turkey: IEEE, Apr. 2011, pp. 433–438. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). Boston, MA,
[68] G. Wiesmann, S. Schraml, M. Litzenberger, A. N. Belbachir, USA: IEEE, Jun. 2015, pp. 1–9.
M. Hofstatter, and C. Bartolozzi, ‘‘Event-driven embodied system [88] Q. Gao, J. Liu, Z. Ju, and X. Zhang, ‘‘Dual-hand detection for human–
for feature extraction and object recognition in robotic applications,’’ robot interaction by a parallel network based on hand detection and
in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. body pose estimation,’’ IEEE Trans. Ind. Electron., vol. 66, no. 12,
Workshops. Providence, RI, USA: IEEE, Jun. 2012, pp. 76–82. pp. 9663–9672, Dec. 2019.
[69] O. Skotheim, M. Lind, P. Ystgaard, and S. A. Fjerdingen, ‘‘A flexible [89] S. Yang, G. Lin, Q. Jiang, and W. Lin, ‘‘A dilated inception network
3D object localization system for industrial part handling,’’ in Proc. for visual saliency prediction,’’ IEEE Trans. Multimedia, vol. 22, no. 8,
IEEE/RSJ Int. Conf. Intell. Robots Syst. Vilamoura, Portugal: IEEE, pp. 2163–2176, Aug. 2020.
Oct. 2012, pp. 3326–3333. [90] X. Jin, L. Wu, X. Li, X. Zhang, J. Chi, S. Peng, S. Ge, G. Zhao, and S. Li,
[70] W. Budiharto, ‘‘Robust vision-based detection and grasping object for ‘‘ILGNet: Inception modules with connected local and global features for
manipulator using SIFT keypoint detector,’’ in Proc. Int. Conf. Adv. Mech. efficient image aesthetic quality classification using domain adaptation,’’
Syst. Kumamoto, Japan: IEEE, Aug. 2014, pp. 448–452. IET Comput. Vis., vol. 13, no. 2, pp. 206–212, Mar. 2019.
[71] F. Wang, F. Sun, J. Zhang, B. Lin, and X. Li, ‘‘Unscented particle filter for [91] W. Dongyu, H. Fuwen, T. Mikolajczyk, and H. Yunhua, ‘‘Object detec-
online total image Jacobian matrix estimation in robot visual servoing,’’ tion for soft robotic manipulation based on RGB-D sensors,’’ in Proc.
IEEE Access, vol. 7, pp. 92020–92029, 2019. WRC Symp. Adv. Robot. Autom. (WRC SARA). Beijing, China: IEEE,
[72] Y. Bekiroglu, D. Song, L. Wang, and D. Kragic, ‘‘A probabilis- Aug. 2018, pp. 52–58.
tic framework for task-oriented grasp stability assessment,’’ in Proc. [92] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for
IEEE Int. Conf. Robot. Autom. Karlsruhe, Germany: IEEE, May 2013, large-scale image recognition,’’ in Proc. Int. Conf. Learn. Represent.
pp. 3040–3047. (ICLR), San Diego, CA, USA, 2015, pp. 1–14.
[73] H. O. Song, M. Fritz, D. Goehring, and T. Darrell, ‘‘Learning to detect [93] W. Guan, T. Wang, J. Qi, L. Zhang, and H. Lu, ‘‘Edge-aware convolution
visual grasp affordance,’’ IEEE Trans. Autom. Sci. Eng., vol. 13, no. 2, neural network based salient object detection,’’ IEEE Signal Process.
pp. 798–809, Apr. 2016. Lett., vol. 26, no. 1, pp. 114–118, Jan. 2019.
[74] Z. Zhang, S. Mao, K. Chen, L. Xiao, B. Liao, C. Li, and P. Zhang, [94] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time
‘‘CNN and PCA based visual system of a wheelchair manipulator robot object detection with region proposal networks,’’ IEEE Trans. Pattern
for automatic drinking,’’ in Proc. IEEE Int. Conf. Robot. Biomimetics Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
(ROBIO). Kuala Lumpur, Malaysia: IEEE, Dec. 2018, pp. 1280–1286. [95] Z. Zhao, T. Cai, F. Chang, and X. Cheng, ‘‘Real-time surgical instru-
[75] E. Mattar, ‘‘PCA Learning for Non-brain Waves-Controlled Robotic ment detection in robot-assisted surgery using a convolutional neural
Hand (Prosthesis): Grasp Stabilization and Control,’’ in Proc. UKSim- network cascade,’’ Healthcare Technol. Lett., vol. 6, no. 6, pp. 275–279,
AMSS 16th Int. Conf. Comput. Modeling Simulation. Cambridge, U.K.: Dec. 2019.
IEEE, Mar. 2014, pp. 211–216. [96] L. Liu, X. Tang, J. Xie, X. Gao, W. Zhao, F. Mo, and G. Zhang, ‘‘Deep-
[76] T. Ishii, R. Nakamura, H. Nakada, Y. Mochizuki, and H. Ishikawa, learning and depth-map based approach for detection and 3D localization
‘‘Surface object recognition with CNN and SVM in Landsat 8 images,’’ of small traffic signs,’’ IEEE J. Sel. Topics Appl. Earth Observ. Remote
in Proc. 14th IAPR Int. Conf. Mach. Vis. Appl. (MVA). Miraikan, Japan: Sens., vol. 13, pp. 2096–2111, 2020.
IEEE Press, May 2015, pp. 341–344. [97] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for
[77] Y. Shin and I. Balasingham, ‘‘Comparison of hand-craft feature based image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
SVM and CNN based deep learning framework for automatic polyp (CVPR). Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 1–12.
classification,’’ in Proc. 39th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. [98] Y. Liao, P. Xiong, W. Min, W. Min, and J. Lu, ‘‘Dynamic sign lan-
(EMBC). Seogwipo, South Korea: IEEE, Jul. 2017, pp. 3277–3280. guage recognition based on video sequence with BLSTM-3D residual
[78] A. Wibisono, M. S. Saputri, P. Mursanto, J. Rachmad, Alberto, networks,’’ IEEE Access, vol. 7, pp. 38044–38054, 2019.
A. T. W. Yudasubrata, F. Rizki, and E. Anderson, ‘‘Deep learning and [99] X. Ou, P. Yan, Y. Zhang, B. Tu, G. Zhang, J. Wu, and W. Li, ‘‘Moving
classic machine learning approach for automatic bone age assessment,’’ in object detection method via ResNet-18 with Encoder–Decoder structure
Proc. 4th Asia–Pacific Conf. Intell. Robot Syst. (ACIRS). Nagoya, Japan: in complex scenes,’’ IEEE Access, vol. 7, pp. 108152–108160, 2019.
IEEE, Jul. 2019, pp. 235–240. [100] W. Liu, ‘‘SSD: Single Shot MultiBox Detector,’’ in Proc. Eur. Conf.
[79] P. Wang, L. Li, Y. Jin, and G. Wang, ‘‘Detection of unwanted traffic Comput. Vis. (ECCV), Amsterdam, The Netherlands, 2016, pp. 1–17.
congestion based on existing surveillance system using in freeway via [101] X. Li, C. Liu, S. Dai, H. Lian, and G. Ding, ‘‘Scale specified single
a CNN-architecture trafficnet,’’ in Proc. 13th IEEE Conf. Ind. Electron. shot multibox detector,’’ IET Comput. Vis., vol. 14, no. 2, pp. 59–64,
Appl. (ICIEA). Wuhan, China: IEEE, May 2018, pp. 1134–1139. Mar. 2020.
[80] Y. Wang, C. Wang, L. Luo, and Z. Zhou, ‘‘Image classification based on [102] L. Chen, Z. Zhang, and L. Peng, ‘‘Fast single shot multibox detector
transfer learning of convolutional neural network,’’ in Proc. Chin. Control and its application on vehicle counting system,’’ IET Intell. Transp. Syst.,
Conf. (CCC). Guangzhou, China: IEEE, Jul. 2019, pp. 7506–7510. vol. 12, no. 10, pp. 1406–1413, Dec. 2018.
[81] S. Sudha, K. B. Jayanthi, C. Rajasekaran, and T. Sunder, ‘‘Segmen- [103] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look
tation of RoI in medical images using CNN-a comparative study,’’ in once: Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput.
Proc. TENCON-IEEE Region 10th Conf. (TENCON). Kochi, India: IEEE, Vis. Pattern Recognit. (CVPR). Las Vegas, NV, USA: IEEE, Jun. 2016,
Oct. 2019, pp. 767–771. pp. 779–788.
[82] B. Jiang, J. He, S. Yang, H. Fu, T. Li, H. Song, and D. He, ‘‘Fusion of [104] Y. Yu, K. Zhang, H. Liu, L. Yang, and D. Zhang, ‘‘Real-time visual local-
machine vision technology and AlexNet-CNNs deep learning network ization of the picking points for a ridge-planting strawberry harvesting
for the detection of postharvest apple pesticide residues,’’ Artif. Intell. robot,’’ IEEE Access, vol. 8, pp. 116556–116568, 2020.
Agricult., vol. 1, pp. 1–8, Mar. 2019. [105] L. Yang, M. Li, X. Song, Z. Xiong, C. Hou, and B. Qu, ‘‘Vehicle speed
[83] A. Ibrahim, A. Dalbah, A. Abualsaud, U. Tariq, and A. El-Hag, ‘‘Appli- measurement based on binocular stereovision system,’’ IEEE Access,
cation of machine learning to evaluate insulator surface erosion,’’ IEEE vol. 7, pp. 106628–106641, 2019.
Trans. Instrum. Meas., vol. 69, no. 2, pp. 314–316, Feb. 2020. [106] T. Kitayama, H. Lu, Y. Li, and H. Kim ‘‘Detection of grasping posi-
[84] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn- tion from video images based on SSD,’’ in Proc. 18th Int. Conf. Con-
ing applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, trol, Autom. Syst. (ICCAS). Daegwallyeong, South Korea, Oct. 2018,
pp. 2278–2324, 1998. pp. 1472–1475.
[85] M. Zhou, Z. Pan, Y. Liu, Q. Zhang, Y. Cai, and H. Pan, ‘‘Leak detection [107] Y. Chao, X. Chen, and N. Xiao, ‘‘Deep learning-based grasp-detection
and location based on ISLMD and CNN in a pipeline,’’ IEEE Access, method for a five-fingered industrial robot hand,’’ IET Comput. Vis.,
vol. 7, pp. 30457–30464, 2019. vol. 13, no. 1, pp. 61–70, Feb. 2019.


[108] G. Wu, W. Chen, H. Cheng, W. Zuo, D. Zhang, and J. You, ‘‘Multi-object [130] T. Nakamura, T. Nagai, and N. Iwahashi, ‘‘Multimodal categorization
grasping detection with hierarchical feature fusion,’’ IEEE Access, vol. 7, by hierarchical Dirichlet process,’’ in Proc. IEEE/RSJ Int. Conf. Intell.
pp. 43884–43894, 2019. Robots Syst. San Francisco, CA, USA: IEEE, Sep. 2011, pp. 1520–1525.
[109] K. Choi, J. K. Suhr, and H. G. Jung, ‘‘Map-matching-based cascade [131] T. Nakamura, T. Nagai, and N. Iwahashi, ‘‘Multimodal object catego-
landmark detection and vehicle localization,’’ IEEE Access, vol. 7, no. 1, rization by a robot,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst.
pp. 127874–127894, 2019. San Diego, CA, USA: IEEE, Oct. 2007, pp. 2415–2420.
[110] Y. Xu, L. Wang, A. Yang, and L. Chen, ‘‘GraspCNN: Real-time grasp [132] T. Nagai and N. Iwahashi, ‘‘Object categorization using multimodal
detection using a new oriented diameter circle representation,’’ IEEE information,’’ in Proc. TENCON-IEEE Region 10th Conf. Hong Kong:
Access, vol. 7, pp. 159322–159331, 2019. IEEE, 2006, pp. 1–4.
[111] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, ‘‘Focal loss for [133] V.-T. Nguyen, C. Lin, C.-H.-G. Li, S.-M. Guo, and J.-J.-J. Lien, ‘‘Visual-
dense object detection,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, guided robot arm using self-supervised deep convolutional neural net-
no. 2, pp. 318–327, Feb. 2020. works,’’ in Proc. IEEE 15th Int. Conf. Autom. Sci. Eng. (CASE).
[112] H. Law and J. Deng, ‘‘CornerNet: Detecting objects as paired keypoints,’’ Vancouver, BC, Canada: IEEE, Aug. 2019, pp. 1415–1420.
Int. J. Comput. Vis., vol. 128, no. 3, pp. 642–656, Mar. 2020. [134] A. Murali, L. Pinto, D. Gandhi, and A. Gupta, ‘‘CASSL: Curricu-
[113] H. Cheng and M. Q.-H. Meng, ‘‘A grasp pose detection scheme with lum accelerated self-supervised learning,’’ in Proc. IEEE Int. Conf.
an end-to-end CNN regression approach,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA). Brisbane, QLD, Australia: IEEE, May 2018,
Robot. Biomimetics (ROBIO). Kuala Lumpur, Malaysia: IEEE: Malaysia, pp. 6453–6460.
Dec. 2018, pp. 544–549. [135] P. Florence, L. Manuelli, and R. Tedrake, ‘‘Self-supervised correspon-
[114] F. H. Zunjani, S. Sen, H. Shekhar, A. Powale, D. Godnaik, and dence in visuomotor policy learning,’’ IEEE Robot. Autom. Lett., vol. 5,
G. C. Nandi, ‘‘Intent-based object grasping by a robot using deep no. 2, pp. 492–499, Apr. 2020.
learning,’’ in Proc. IEEE 8th Int. Advance Comput. Conf. (IACC). [136] M. Yan, Y. Zhu, N. Jin, and J. Bohg, ‘‘Self-supervised learning of state
Greater Noida, India: IEEE, Dec. 2018, pp. 246–251. estimation for manipulating deformable linear objects,’’ IEEE Robot.
[115] E. Corona, G. Alenya, A. Gabas, and C. Torras, ‘‘Active garment recog- Autom. Lett., vol. 5, no. 2, pp. 2372–2379, Apr. 2020.
nition and target grasping point detection using deep learning,’’ Pattern [137] Y. Yang, H. Liang, and C. Choi, ‘‘A deep learning approach to grasping
Recognit., vol. 74, pp. 629–641, Feb. 2018. the invisible,’’ IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 2232–2239,
[116] A. Gaona and H.-I. Lin, ‘‘Robotic grasping estimation by evolutionary Apr. 2020.
deep networks,’’ in Proc. Int. Autom. Control Conf. (CACS). Taoyuan, [138] G. Zhang, H. Li, and Odbal, ‘‘Research on fuzzy enhanced learning model
Taiwan: IEEE, Nov. 2018, pp. 1–7. of multienhanced signal learning automata,’’ IEEE Trans. Ind. Informat.,
[117] K. Yamazaki, ‘‘Selection of grasp points of cloth product on a table based vol. 15, no. 11, pp. 5980–5987, Nov. 2019.
on shape classification feature,’’ in Proc. IEEE Int. Conf. Inf. Autom.
[139] S. Jeong, M. Lee, H. Arie, and J. Tani, ‘‘Developmental learning of
(ICIA). Macau, China: IEEE, Jul. 2017, pp. 136–141.
integrating visual attention shifts and bimanual object grasping and
[118] L. Haochen, Z. Bin, S. Xiaoyong, and Z. Yongting, ‘‘CNN-based model manipulation tasks,’’ in Proc. IEEE 9th Int. Conf. Develop. Learn.
for pose detection of industrial PCB,’’ in Proc. 10th Int. Conf. Intell. Ann Arbor, MI, USA: IEEE, Aug. 2010, pp. 165–170.
Comput. Technol. Autom. (ICICTA). Changsha, China: IEEE, Oct. 2017,
[140] W. Yuan, K. Hang, D. Kragic, M. Y. Wang, and J. A. Stork, ‘‘End-to-
pp. 390–393.
end nonprehensile rearrangement with deep reinforcement learning and
[119] J. Redmon and A. Farhadi, ‘‘YOLOv3: An incremental improve-
simulation-to-reality transfer,’’ Robot. Auto. Syst., vol. 119, pp. 119–134,
ment,’’ 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/
Sep. 2019.
1804.02767
[141] K. Terada, H. Takeda, and T. Nishida, ‘‘An acquisition of the relation
[120] L. Chen, P. Huang, and Z. Meng, ‘‘Convolutional multi-grasp detec-
between vision and action using self-organizing map and reinforcement
tion using grasp path for RGBD images,’’ Robot. Auto. Syst., vol. 113,
learning,’’ in Proc. 2nd Int. Conf. Knowl.-Based Intell. Electron. Syst.
pp. 94–103, Mar. 2019.
Adelaide, SA, Australia: IEEE, Apr. 1998, pp. 429–434.
[121] R. Roy, A. Kumar, M. Mahadevappa, and C. S. Kumar, ‘‘Deep learning
[142] T. Lampe and M. Riedmiller, ‘‘Acquiring visual servoing reaching and
based object shape identification from EOG controlled vision system,’’ in
grasping skills using neural reinforcement learning,’’ in Proc. Int. Joint
Proc. IEEE Sensors. New Delhi, India: IEEE, Oct. 2018, pp. 1–4.
Conf. Neural Netw. (IJCNN). Dallas, TX, USA: IEEE, Aug. 2013.
[122] X. Yan, J. Hsu, M. Khansari, Y. Bai, A. Pathak, A. Gupta, J. Davidson, and
H. Lee, ‘‘Learning 6-DOF grasping interaction via deep geometry-aware [143] I. Lenz, H. Lee, and A. Saxena, ‘‘Deep learning for detecting robotic
3D representations,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA). grasps,’’ Int. J. Robot. Res., vols. 4–5, no. 34, pp. 705–724, 2015.
Brisbane, QLD, Australia: IEEE, May 2018, pp. 3766–3773. [144] P. Ardon, E. Pairet, R. P. A. Petrick, S. Ramamoorthy, and K. S. Lohan,
[123] V. Satish, J. Mahler, and K. Goldberg, ‘‘On-policy dataset synthe- ‘‘Learning grasp affordance reasoning through semantic relations,’’ IEEE
sis for learning robot grasping policies using fully convolutional deep Robot. Autom. Lett., vol. 4, no. 4, pp. 4571–4578, Oct. 2019.
networks,’’ IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 1357–1364, [145] V. Mnih, N. Heess, and A. Graves, ‘‘Recurrent models of visual atten-
Apr. 2019. tion,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014. vol. 3, no. 6, pp. 1–9.
[124] H. Liang, X. Ma, S. Li, M. Gorner, S. Tang, B. Fang, F. Sun, and [146] K. Yu, C. Dong, L. Lin, and C. C. Loy, ‘‘Crafting a toolchain for image
J. Zhang, ‘‘PointNetGPD: Detecting grasp configurations from point restoration by deep reinforcement learning,’’ in Proc. IEEE/CVF Conf.
sets,’’ in Proc. Int. Conf. Robot. Autom. (ICRA). Montreal, QC, Canada: Comput. Vis. Pattern Recognit. Salt Lake City, UT, USA: IEEE, Jun. 2018,
IEEE, May 2019, pp. 3629–3635. pp. 2443–2452.
[125] X. Sun, T. Nozaki, T. Murakami, and K. Ohnishi, ‘‘Grasping point esti- [147] S. James, Z. Ma, D. Rovick Arrojo, and A. J. Davison, ‘‘RLBench:
mation based on stored motion and depth data in motion reproduction sys- The robot learning benchmark & learning environment,’’ IEEE Robot.
tem,’’ in Proc. IEEE Int. Conf. Mechatronics (ICM). Ilmenau, Germany: Autom. Lett., vol. 5, no. 2, pp. 3019–3026, Apr. 2020.
IEEE, Mar. 2019, pp. 471–476. [148] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser,
[126] Z. Deng, X. Zheng, L. Zhang, and J. Zhanga, ‘‘A learning framework for ‘‘Learning synergies between pushing and grasping with self-supervised
semantic reach-to-grasp tasks integrating machine learning and optimiza- deep reinforcement learning,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots
tion,’’ Robot. Auton. Syst., vol. 108, pp. 140–152, Oct. 2018. Syst. (IROS). Madrid, Spain: IEEE, Oct. 2018, pp. 4238–4245.
[127] I. González-Díaz, J. Benois-Pineau, J.-P. Domenger, D. Cattaert, and [149] Y. Wang, H. Lang, and C. W. de Silva, ‘‘A hybrid visual servo con-
A. de Rugy, ‘‘Perceptually-guided deep neural networks for ego-action troller for robust grasping by wheeled mobile robots,’’ IEEE/ASME Trans.
prediction: Object grasping,’’ Pattern Recognit., vol. 88, pp. 223–235, Mechatronics, vol. 15, no. 5, pp. 757–769, Oct. 2010.
Apr. 2019. [150] S. Gu, E. Holly, T. Lillicrap, and S. Levine, ‘‘Deep reinforcement learning
[128] M. Farag, A. N. A. Ghafar, and M. H. Alsibai, ‘‘Real-time robotic for robotic manipulation with asynchronous off-policy updates,’’ in Proc.
grasping and localization using deep learning-based object detection tech- IEEE Int. Conf. Robot. Autom. (ICRA). Singapore: IEEE, May 2017,
nique,’’ in Proc. IEEE Int. Conf. Autom. Control Intell. Syst. (I2CACIS). pp. 3389–3396.
Selangor, Malaysia: IEEE, Jun. 2019, pp. 139–144. [151] M. Breyer, F. Furrer, T. Novkovic, R. Siegwart, and J. Nieto, ‘‘Com-
[129] R. Detry, C. H. Ek, M. Madry, J. Piater, and D. Kragic, ‘‘Generalizing paring task simplifications to learn closed-loop object picking using
grasps across partly similar objects,’’ in Proc. IEEE Int. Conf. Robot. deep reinforcement learning,’’ IEEE Robot. Autom. Lett., vol. 4, no. 2,
Autom. Saint Paul, MN, USA: IEEE, May 2012, pp. 3791–3797. pp. 1549–1556, Apr. 2019.


[152] K. Katyal, I.-J. Wang, and P. Burlina, ‘‘Leveraging deep reinforcement [171] S. Duffner and C. Garcia, ‘‘Visual focus of attention estimation with
learning for reaching robotic tasks,’’ in Proc. IEEE Conf. Comput. Vis. unsupervised incremental learning,’’ IEEE Trans. Circuits Syst. Video
Pattern Recognit. Workshops (CVPRW). Honolulu, HI, USA: IEEE, Technol., vol. 26, no. 12, pp. 2264–2272, Dec. 2016.
Jul. 2017, pp. 490–491. [172] B. C. Kwon, B. Eysenbach, J. Verma, K. Ng, C. De Filippi, W. F. Stewart,
[153] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Bjorkman, ‘‘Deep predic- and A. Perer, ‘‘Clustervision: Visual supervision of unsupervised cluster-
tive policy training using reinforcement learning,’’ in Proc. IEEE/RSJ ing,’’ IEEE Trans. Vis. Comput. Graphics, vol. 24, no. 1, pp. 142–151,
Int. Conf. Intell. Robots Syst. (IROS). Vancouver, BC, Canada: IEEE, Jan. 2018.
Sep. 2017, pp. 2351–2358. [173] X. Li, H. Zhang, R. Zhang, and F. Nie, ‘‘Discriminative and uncorrelated
[154] K. N. Nguyen, J. Yoo, and Y. Choe, ‘‘Speeding up affordance learning feature selection with constrained spectral analysis in unsupervised learn-
for tool use, using proprioceptive and kinesthetic inputs,’’ in Proc. Int. ing,’’ IEEE Trans. Image Process., vol. 29, pp. 2139–2149, 2020.
Joint Conf. Neural Netw. (IJCNN). Budapest, Hungary: IEEE, Jul. 2019, [174] M. A. Lee, Y. Zhu, P. Zachares, M. Tan, K. Srinivasan, S. Savarese,
pp. 1–8. L. Fei-Fei, A. Garg, and J. Bohg, ‘‘Making sense of vision and touch:
QIANG BAI received the B.Sc. degree from Zaozhuang University, in 2015, and the double master's (M.Sc.) degree from Guizhou University and Yuan Ze University, in 2018. He is currently pursuing the Ph.D. degree with the School of Mechanical Engineering, Guizhou University, Guiyang, China. From September 2016 to August 2017, he was jointly educated at Yuan Ze University. His research interests include machine learning, robotics, grasping, vision, and localization.

SHAOBO LI is a Professor with the School of Mechanical Engineering, Guizhou University (GZU), China. From 2007 to 2015, he was the Vice Director of the Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, GZU. Since 2015, he has been the Dean of the School of Mechanical Engineering, GZU. His research has been supported by the National Natural Science Foundation of China (NSFC) and the National High-Tech Research and Development Program (863 Program). His main research interests include intelligent manufacturing and big data.

JING YANG (Member, IEEE) received the B.Sc. degree from Anyang Normal University, in 2015, and the Ph.D. degree from the School of Mechanical Engineering, Guizhou University. He is currently a Lecturer with Guizhou University. From August 2018 to September 2019, he was awarded a scholarship by the China Scholarship Council (CSC) under the State Scholarship Fund to study at Oklahoma State University as a joint Ph.D. student with the Institute for Mechatronic Engineering, where he joined Prof. Guoliang Fan's group. He has published over ten papers in reputed journals and conferences. His main research interests include machine vision, deep learning, and smart manufacturing applications. He has also served as a reviewer for several journals, such as IEEE ACCESS and the IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING.

QISONG SONG received the B.S. degree in mechanical engineering from the Harbin University of Science and Technology, Harbin, China, in 2018. He is currently pursuing the M.S. degree in mechanical engineering with Guizhou University, Guiyang, China. His research interests include mobile robot path planning.

ZHIANG LI received the B.S. degree in mechanical engineering from Guizhou University, in 2018, where he is currently pursuing the master's degree. His research interests include trajectory planning and machine vision for manipulators.

XINGXING ZHANG received the B.S. degree in mechanical engineering from Nanjing Normal University, in 2018. She is currently pursuing the master's degree with the School of Mechanical Engineering, Guizhou University. Her research interests include robotics and tactile sensing.
