
Research Article

International Journal of Advanced Robotic Systems
May–June 2020: 1–11
© The Author(s) 2020
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/1729881420925308
journals.sagepub.com/home/arx

Optimized visual recognition algorithm in service robots

Jun W Wu1, Wei Cai1, Shi M Yu2, Zhuo L Xu1 and Xue Y He1

Abstract
Vision-based detection methods often require consideration of the robot's field of view. Panoramic images, for example, introduce image distortion, which negatively affects target recognition and spatial localization. Furthermore, the original you only look once (YOLO) method does not perform well for image recognition in panoramic images. Consequently, some failures have been reported when implementing visual recognition on the robot. The present study optimizes the conventional you only look once algorithm and proposes a modified you only look once algorithm. Comparison with the experimental results shows that the modified you only look once method, running on a graphics processing unit, reaches a panoramic recognition speed of 32 frames per second, which meets the real-time requirements of diverse applications. The accuracy of object detection with the proposed modified you only look once method exceeds 70% in the studied cases.

Keywords
Robot, visual recognition, YOLO algorithm, panoramic shooting, deep learning

Date received: 5 September 2019; accepted: 19 April 2020

Topic: Robot Manipulation and Control


Topic Editor: Henry Leung
Associate Editor: Bin He

1 Shanghai Tiger Technology Co., Ltd, Shanghai, People's Republic of China
2 Shanghai High School International Division, Shanghai, People's Republic of China

Corresponding author:
Jun W Wu, Shanghai Tiger Technology Co., Ltd, Shanghai, People's Republic of China.
Email: [email protected]

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

Introduction

Currently, service robots are becoming popular in numerous applications across diverse industries. Meanwhile, these devices are continuously miniaturized and become more intelligent. Focusing on the requirements of service robots indicates that although the robot's functionality is continuously getting stronger, there still exists a serious lack of mature interaction capabilities. Among the functions required for service robots, robotic grasping is the primary step of interacting with the real world. In fact, the robot needs to recognize the position of objects, which requires the use of sensors and image recognition approaches.1 Concerning object detection methods, a remarkable research model2 has emerged in the field of computer vision. The majority of studies performed so far apply sensors, mostly cameras, to manually extract features, and then conventional machine learning methods are applied to learn the relational mapping3 between artificial features and grasping poses from supervised data. Compared to the conventional grasping approaches of industrial robots, which create model libraries based on known objects,4 this method can transfer obtained experiences to grasp unknown objects.5 However, since the characteristics of the artificial design are restrained by human cognition, there are some limitations in robot grasping, including the complexity of making more complicated representations.6
Therefore, the current technology is only valid for a specific object or a certain task.7 Since each object is distinct, object detection may be affected by different factors, including shape, size, perspective variation, and external illumination. Accordingly, the extracted features cannot be generalized, resulting in poor robustness. Therefore, finding an algorithm for each specific object is highly demanded, which is a great challenge when adapting to new objects.8

With recent advancements in deep learning, which is mainly divided into convolutional neural networks (CNNs) and region proposal networks (RPNs),9 object detection has made a huge breakthrough. This is especially pronounced in deep learning that promotes the progress of automatic object recognition and localization.10 To this end, different schemes, including the region-based convolutional neural network (R-CNN),11 Fast R-CNN,12 Faster R-CNN,13 and you only look once (YOLO) detectors,14 have been proposed so far. Among them, the YOLO algorithm is a unified and real-time object detection model, which is based on a single neural network proposed by Redmon and Farhadi.14 In fact, the YOLO model is a new object detection method that combines object determination and object recognition into one end-to-end detection scheme, which has superior characteristics, including fast detection and high accuracy. Currently, the R-CNN algorithms are mainly applied as the mainstream for the deep learning object detection of robotic grasping. However, the corresponding detection speed does not satisfy the real-time requirements. Therefore, the YOLO model has significant potential in robot vision applications. Since the YOLO model learns to predict bounding boxes from supervised data, it is a challenge to generalize prior learning to new objects or objects with unusual aspect ratios or irregular configurations. Since the proposed architecture has multiple downsampling layers from the input image, the YOLO model applies relatively coarse features for predicting bounding boxes. Therefore, the YOLO model is weak in generalization when dealing with irregular aspect ratios in which the same type of object appears in other situations.15

On the other hand, the camera sight is another challenge to solve. Normally, a captured image has a limited field of vision that shows less image information than a panoramic image. When target objects are located out of sight, the robot cannot detect them. Consequently, vision-based detection methods often require consideration of the size of the camera sight. Considering this drawback, the panoramic image plays an important role in the object detection16–18 and spatial positioning of the robot. Considering the limitations in the internal space, design, and initial cost of the robot, the robot is usually not equipped with a panoramic camera. However, image stitching technology is quite mature and leaves no visible traces, so it is widely applied in the fields of image processing, computer graphics, and virtual reality. Panoramic image stitching refers to the process of seamlessly stitching a series of ordinary images with overlapping borders to obtain a panoramic image. Moreover, the panoramic field of view is obtained through the panoramic image stitching technology, which can easily detect targets in the surroundings. Panoramic photography is performed by rotating the camera evenly on a horizontal plane. It should be indicated that the effect achieved by panoramic photography is similar to the rotation of the eyes or head watching around. Unlike normal images, the length–width ratio of the panoramic picture changes greatly. Therefore, the original YOLO image recognition cannot be applied to this picture size, which causes very poor performance of the object detection. The network structure of the YOLO is flexible, and such problems can be resolved by applying appropriate changes to the YOLO. To solve the effects originating from the panoramic images, it is intended to build a deep neural network based on the idea of changing the structure of grid cells. It is expected to improve the detection accuracy by modifying the network structure to achieve a high recognition accuracy in real time.

Figure 1. Faster R-CNN.11 R-CNN: region-based convolutional neural network.

Multiobjective recognition algorithms based on the deep learning

To evaluate the performance of the modified YOLO (M-YOLO) algorithm in robotic grasping, the YOLO and Faster R-CNN algorithms are employed to perform the comparison.

Faster R-CNN

Faster R-CNN is a deep learning method for identifying objects in the image. Figure 1 demonstrates the entire framework of the Faster R-CNN. Figure 1 indicates that the Faster R-CNN consists of four main layers as the following:
Figure 2. Flowchart of the YOLO algorithm.12 YOLO: you only look once.

1. Convolution layers for feature extraction. As an object detection method based on the CNN, Faster R-CNN initially uses a set of basic convolutions, rectified linear unit activation functions, and pooling layers to extract feature maps of the input image. These feature maps are used in the subsequent RPN layers19 and fully connected layers.
2. Region proposal network. The RPN is mainly applied to generate region proposals. First, it generates a pile of anchor boxes. After performing the clipping and filtering processes, a Softmax layer is employed to determine which anchors belong to the foreground or background. Meanwhile, the bounding box regression modifies the anchor boxes to form more accurate proposals. More specifically, it is relative to the next box regression of the fully connected layer behind it.
3. Region of interest pooling. This layer uses the feature map generated by the RPN proposals and the last layer of the visual geometry group network (VGG16)20 to obtain a feature map with fixed-size proposals, which can be used to identify and locate objects by the fully connected operation.
4. Classifier. The output of the region of interest pooling layer is formed into a fixed-size feature map for the fully connected operation. A Softmax layer is applied to classify the specific categories.21,22 Meanwhile, the L1 loss is used for completing the bounding box regression operation to obtain the precise position of the object.23

YOLO

Apart from the Faster R-CNN scheme, the YOLO model is another recognition algorithm for multitarget deep learning.24 Considering the superior characteristics of the YOLO scheme, it is applied in the present study for robotic grasping. Figure 2 presents the flowchart of the YOLO model. It indicates that the YOLO model consists of three steps as the following:

1. The original image of the camera is captured and then it is divided into a grid with S × S resolution.
2. Each grid cell predicts B bounding boxes with confidence scores. Each record has five parameters, including x, y, w, h, and p × IOU (intersection over union), where p denotes the probability that the current position contains an object and the IOU measures the predicted overlap with the ground truth. Moreover, x and y are the center coordinates, while w and h are the width and height of the bounding box, respectively.
3. Then the class probability of each grid prediction is calculated as Ci = p(classi | object).

The YOLO algorithm uses a CNN to implement the multi-object recognition model. In the present study, the pattern analysis, statistical modeling, and computational learning visual object classes (PASCAL VOC)25 data set is utilized to evaluate the model. The initial convolutional layers of the network extract features from the image, while the fully connected layers predict the output class probability and the corresponding image coordinates. Figure 3 indicates that to perform the image classification, the network structure reference employs the GoogleNet model with 24 convolution layers and 2 fully connected layers.4

During the training session, the sum-squared error is weighted equally in large and small bounding boxes. Furthermore, the error metric indicates that small deviations of large bounding boxes are less relevant than those of small bounding boxes. To solve this problem, the regression is performed based on the square root of the width and height of the bounding boxes, instead of the width and height directly.
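As an illustration of the grid-based prediction format described in the steps above, the following NumPy sketch decodes an S × S × (B·5 + C) output tensor into class-specific confidence scores, p(classi | object) × p(object) × IOU. The tensor layout and the 0.2 score threshold are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

S, B, C = 7, 2, 20                          # grid size, boxes per cell, PASCAL VOC classes

# Assumed layout per cell: B boxes of (x, y, w, h, confidence), then C class probabilities.
pred = np.random.rand(S, S, B * 5 + C)      # stand-in for a network output (7 x 7 x 30)

boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # per-box (x, y, w, h, confidence)
class_probs = pred[..., B * 5:]                 # per-cell p(class_i | object)

# Class-specific score for every box: p(class_i | object) * p(object) * IOU,
# where the last two factors are folded into the predicted box confidence.
scores = class_probs[:, :, None, :] * boxes[..., 4:5]   # shape (7, 7, 2, 20)

keep = scores > 0.2                         # simple score threshold before suppression
print("candidate detections above threshold:", int(keep.sum()))
```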
Figure 3. Convolutional network used by the YOLO algorithm.12 YOLO: you only look once.

" pffiffiffiffi qffiffiffiffi 2 #


s2 X
X B h i s2 X
X B
obj 2 2 obj pffiffiffiffiffi pffiffiffiffiffi 2
Cost ¼ lcoord I ij ðxi  xi Þ þ ðyi  yi Þ þ lcoord I ij !i  !i þ hi  hi
i¼0 j¼0 i¼0 j¼0
s2 X
X B  2 s2 X
X B  2 s2
X X
þ I ijobj C i  Ci þ lnoobj I ijnobj C i  Ci þ I iobj ðpi ðcÞ  pi ðcÞÞ 2
i¼0 j¼0 i¼0 j¼0 i¼0 c2Classes

where I_i^obj and I_ij^obj denote the appearance of an object in the ith cell and in the jth bounding box predictor of the ith cell, respectively. Moreover, λ_coord denotes the coordinate error weight. In the present study, the error weights are set to λ_coord = 5 and λ_noobj = 0.5.

When the object is in the cell, the loss function will punish the classification error. Moreover, the loss function penalizes the coordinate error of the bounding box for an active predictor. After the training session, the regression equations can be applied to predict the object category and the object coordinates in the image coordinate system in real time. Then the three-dimensional position of the object in the camera coordinate system26 can be obtained by calculating the center depth.

Differences between the YOLO and Faster R-CNN algorithms

In this section, it is intended to reveal the differences between the YOLO and the Faster R-CNN schemes. First, the R-CNN is a feature extractor. In fact, a selective search is normally applied to extract a certain number (e.g. 2000) of region proposals, which are then convolved to extract features of the fc7 layer for the classification and regression of the coordinates. The support vector machine classification method is utilized instead of the conventional Softmax model. The main contribution of this algorithm is to propose an effective feature utilization method. The majority of researchers in this area use features of the fc7 layer in engineering practice based on the Fast R-CNN algorithm. The Fast R-CNN scheme uses a network to implement all parts except region proposal extraction. Unlike the conventional R-CNN, the classification and coordinate regression losses in the Fast R-CNN scheme first update the network parameters by back propagation. Second, not all region proposals but the whole image is put through the network when extracting the feature. Then the feature of each proposal is extracted through coordinate mapping. It should be indicated that this type of extraction has two advantages, including fast performance and a wide operational range. Since a picture passes through the network only once, the feature is affected by the receptive field so that it can fuse features of the adjacent background to "see" farther. Finally, studies show that it is almost impossible to operate a real-time detection by applying the selective search method.27 Therefore, it is intended to replace the selective search method with the RPN and to share the feature extraction layer28 with the Fast R-CNN classification and regression network to reduce the computational expense. Experimental results also show that applying the Fast R-CNN improves the speed and accuracy of the prediction. It is found that the RPN is the essence of Faster R-CNN. Moreover, it is the main reason for the high accuracy and low speed of the R-CNN algorithms compared to that of the YOLO scheme.
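Both families of detectors prune their dense candidate boxes with non-maximum suppression before producing final outputs (as noted below, the roughly 20 K anchors of Faster R-CNN are reduced to about 300 proposals this way). The following is a minimal sketch of greedy non-maximum suppression for illustration; it is not the authors' implementation, and the example boxes are invented.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = list(np.argsort(scores)[::-1])      # highest score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order if iou(boxes[best], boxes[j]) < thresh]
    return keep

# Two strongly overlapping candidates and one separate box.
boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 160]], dtype=float)
print(nms(boxes, np.array([0.9, 0.6, 0.8])))    # -> [0, 2]
```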
Figure 4. Object recognition flowchart of the M-YOLO method. M-YOLO: modified you only look once.

On the other hand, one of YOLO's contributions is to convert detection problems into regression problems. In fact, Faster R-CNN is divided into two steps, including extracting the region proposals and classifying them. The former step judges whether an anchor is the foreground or the background, while the main purpose of the later step is to see what the foreground is. The YOLO scheme generates the coordinates and probabilities of each type directly through regression.

YOLO is characterized by its fast speed, regression mechanism, and the extraction process of the region proposals. Unlike other algorithms, YOLO roughly divides the image into 7 × 7 grids to analogize the Fast R-CNN. Each location may belong to two objects by default, and 98 region proposals are extracted. In fact, Faster R-CNN uses a sliding window mechanism, and each location of the feature map29 is mapped back to the image. Moreover, 9 anchors are returned per location, totaling about 20 K anchors, and 300 region proposals are eventually obtained through nonmaximum suppression. There is a huge difference in the candidate boxes between the two algorithms, so the reasonable accuracy of the Faster R-CNN can be justified. Since each location should be refined, the efficiency further decreases. Therefore, the recognition speed of the R-CNN cannot meet the real-time requirements. In addition, YOLO30 streamlines the network, has slightly less computational expense than VGG, and may result in some speedup. However, the importance of the computational expense is less than the other two points mentioned above. In the following sections, the performance of each algorithm will be compared with experimental data.

Figure 5. Bottle grasping.

M-YOLO algorithm based on the integrated convolution network

In the present study, the robot iKudo is employed to grasp objects from a complex background containing chairs, tables, and other objects. This section focuses on designing the M-YOLO algorithm.

Figures 4 and 5 indicate that the iKudo has a manipulator and a camera for grasping objects and taking images. The general steps of object grasping31 are presented in the flowchart. A mission is initially assigned to the robot as a command to grasp an object, for example, a bottle. Once the robot receives the task, it starts to look for the object. There might be some similar objects in the scenario, while the robot should pick a particular object. Therefore, the robot starts to rotate to scan the surroundings through a depth camera and generate a panoramic image.
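The paper does not give the stitching implementation; as a rough illustration of how the overlapping frames captured during the rotation could be merged, OpenCV's high-level Stitcher API can be used as sketched below. The file paths are placeholders, and the API variant is an assumption (some OpenCV 3.x builds expose cv2.createStitcher() instead of cv2.Stitcher_create()).

```python
import glob
import cv2

# Overlapping frames captured while the robot rotates (hypothetical file layout).
frames = [cv2.imread(path) for path in sorted(glob.glob("scan/*.jpg"))]

# High-level stitching pipeline (feature matching, warping, blending); OpenCV 4.x API.
stitcher = cv2.Stitcher_create()        # default PANORAMA mode
status, panorama = stitcher.stitch(frames)

if status == 0:                         # 0 == Stitcher OK (success)
    cv2.imwrite("panorama.jpg", panorama)   # wide image handed to the detector
else:
    print("stitching failed with status", status)
```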
The panoramic cylindrically expanded image is obtained by projecting the annular panoramic image onto a cylindrical surface with a specified radius from the mirror. A two-dimensional rectangular cylindrical panorama32 is obtained by cutting and spreading the cylindrical surface along the radial direction. Figure 6 illustrates the obtained panoramic image. Based on the obtained panoramic image, the object detection33 is triggered and the robot is asked to process the image with object detection algorithms to detect the particular object. As a result, the robot finds the object location and starts the navigation to the target object by using simultaneous localization and mapping technology. Finally, once the iKudo robot arrives at a position acceptable for object grasping, the manipulator starts to grasp the detected object.

Figure 6. An example panoramic image.

The YOLO system divides the image into S × S grid cells, where each grid cell predicts two bounding boxes so that there are 2 × S × S bounding boxes in total. These generated proposals cover the entire area of the image. If the target center falls in a grid cell, the corresponding cell will be in charge of detecting the object. In this case, the two bounding boxes of such a grid cell are expressed as p(object) = 1. In each iteration, the network predicts the location and confidence of 2 × S × S bounding boxes and 20 class probabilities of S × S grid cells, excluding the background classification. The class probabilities of the bounding boxes are predicted by the class probabilities of the grid cells corresponding to the bounding box centers. The fully connected layer requires a fixed-size vector as an input. In other words, this layer requires original images with a fixed size. To this end, the input size of the YOLO is set to 448 × 448. Since panoramic images contain many objects to be detected, applying the conventional grid generation method to panoramic images forces a single grid cell to predict two objects, which can easily cause detection failures. This may be attributed to the grid numbers, because the number of target objects in a single grid cell increases as the grid number decreases.34

However, when the panoramic picture of the robot is processed by YOLO, the aspect ratio of an object in the captured image is not an accurate reflection of its ground truth, and features that look high and thin are formed. Therefore, YOLO is less adaptive to the new samples. To solve this problem, the number of forecasting frames in the longitudinal direction is changed, such as in the scene where the height of the object is compressed in the actual image.

M-YOLO, which is proposed in the present study, reduces the number of objects detected in a single grid cell by increasing the number of transverse grids. In other words, it changes the grid from S × S to S × nS, where the number n is an adaptive number (see Figure 7), and changes the network structure35 by adding a feature layer at the end of YOLO's original network structure. It is worth noting that in the present study, the parameters n and S are set to 2 and 7, respectively. The new network structure consists of 19 convolution layers and 5 max pooling layers. Figure 8 shows that these layers generate an improved M-YOLO network based on the YOLO network to meet the requirements of robot-based panoramic image detection. The main purpose of establishing the M-YOLO algorithm is to provide a more generalized model in comparison with conventional models. Moreover, the established algorithm should allow the model to flexibly detect images with large aspect ratios such as panoramic images. Ideally, this means that as long as the input image size is regular, the M-YOLO and YOLO algorithms have the same performance. However, in processing images with large aspect ratios such as panoramic images, the M-YOLO algorithm is supposed to show a better performance and a higher processing speed compared with the YOLO algorithm.

To train the M-YOLO network, a Python crawler is applied to generate a training set for the experiment. Then a standard VOC data set is constructed to obtain pretraining parameters after 100 cycles (one epoch). It should be indicated that the input image size of the network is randomly changed every 10 cycles.

Build a VOC data set

According to the architecture of the VOC data set, the present study constructs a new data set36 by using OpenCV to initially read all of the images under a folder. Then, the images are renamed and converted to the VOC format. Figure 9 shows that after preparing the data, it is required to place the image files according to the structure of the VOC data set.

Labeling the object area

The object area in the original image is marked with the labeling software for training.37 The fundamental method is to frame the object area and double-click the category. Then, the complete image is marked and saved. Twenty common types of objects are selected as categories, such as mineral water bottles, parcels, boxes, and chairs. The annotation information is saved in the format of the PASCAL VOC data set, which contains the category of the target and its bounding border. Then, the actual data of the target are divided by the width and height of the image so that the data are in the range of 0–1. The data can be read faster during the training, and images of different sizes can be trained. The format is five parameters for a set of data, including index (sequence of the category), x (x coordinate of the target center), y (y coordinate of the target center), w (width of the target), and h (height of the target).
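A minimal sketch of this normalization step (the corner-to-center conversion referred to as Equation 3 in the following text) is given here; the function name and the example numbers are illustrative, not taken from the authors' tooling.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert corner coordinates (pixels) to normalized (x, y, w, h) in [0, 1]."""
    x = (xmin + xmax) / 2.0 / img_w   # x coordinate of the target center
    y = (ymin + ymax) / 2.0 / img_h   # y coordinate of the target center
    w = (xmax - xmin) / img_w         # width of the target
    h = (ymax - ymin) / img_h         # height of the target
    return x, y, w, h

# Example: a bottle occupying pixels (520, 90)-(580, 260) in a 1200 x 300 panorama.
index = 4                                            # category sequence number (illustrative)
label = (index, *voc_to_yolo(520, 90, 580, 260, 1200, 300))
print(label)                                         # (4, 0.458..., 0.583..., 0.05, 0.566...)
```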
Figure 7. Improved grid of the M-YOLO. M-YOLO: modified you only look once.
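To make the grid change in Figure 7 concrete, the following sketch compares the number of grid cells and raw output values of the YOLO and M-YOLO heads under the settings reported in the text (S = 7, n = 2, B = 2 boxes per cell, C = 20 classes). The exact layout of the authors' modified head is not published, so this only illustrates the dimensional consequence of moving from S × S to S × nS cells.

```python
S, n, B, C = 7, 2, 2, 20            # grid size, transverse multiplier, boxes per cell, classes

# Original YOLO head: S x S cells, each with B*(5) box terms and C class probabilities.
yolo_cells = S * S                          # 49 cells, 98 predicted boxes
yolo_outputs = S * S * (B * 5 + C)          # 1470 output values

# M-YOLO head: the transverse grid count is doubled (S x nS cells).
myolo_cells = S * (n * S)                   # 98 cells, 196 predicted boxes
myolo_outputs = S * (n * S) * (B * 5 + C)   # 2940 output values

print(yolo_cells, yolo_outputs)     # 49 1470
print(myolo_cells, myolo_outputs)   # 98 2940
```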

Figure 8. M-YOLO network structure. M-YOLO: modified you only look once.

Figure 9. Build a VOC data set. VOC: visual object classes.

Equation 3 indicates that xmax and ymax are the coordinates of the lower right corner of the frame, while xmin and ymin are the coordinates of the upper left corner of the frame. Moreover, width and height are the image width and height.

Training

The computing environment utilized during the training session is configured with a Windows 10 system, an Intel Core i7-8700 central processing unit (CPU), an NVIDIA GeForce GTX1080 graphics processing unit (GPU), and 16 GB RAM, with CUDA 8.0 and Python 3.5 installed. Moreover, the corresponding libraries such as Tensorflow v1.1, Numpy, OpenCV3, and the CPython extension are used. The robot vision system is connected to the computer through a 1394 capture card. The YOLO parameters are set with a momentum of 0.9 and a weight decay of 0.0005 to prevent overfitting. Table 1 presents the parameters for training.

Experimental test and the result analysis

In the experiment, a testing data set consisting of 1000 images is built. It should be indicated that the resolution of each image is 300 × 1200. These images consist of 200 images for each class, including occlusion, no occlusion, normal illumination, and weak illumination. The control experiments are set up and tested under YOLO, M-YOLO, and the mainstream detection method of Faster R-CNN. Among them, YOLO and M-YOLO used the Darknet framework, while Faster R-CNN used the Tensorflow framework. Moreover, during the experiments, the bottles and chairs are selected as statistical objects since both of them have obvious motion characteristics.

There are three measurements for the object detection algorithm, including rapidity, accuracy, and robustness. It should be indicated that the detection frame rate per second (FPS), the accuracy rate, and the average overlap rate are considered as comparison parameters.

Processing speed

The three detectors YOLO, M-YOLO, and Faster R-CNN are tested on both the CPU and the GPU. Moreover, the detection frame rate is calculated according to the average time consumption of the test pictures.
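A minimal sketch of how such an average per-frame time and frame rate can be measured is given below; the detect function and the image list are placeholders, not the authors' code. For reference, the FPS values reported in Table 2 convert to per-frame times as 1/0.87 ≈ 1.15 s and 1/0.81 ≈ 1.23 s, the values used later in this section.

```python
import time

def measure_fps(detect, images):
    """Average inference time over a list of images and the resulting frame rate."""
    start = time.perf_counter()
    for img in images:
        detect(img)                      # placeholder for the actual detector call
    elapsed = time.perf_counter() - start
    per_frame = elapsed / len(images)    # average seconds per image
    return per_frame, 1.0 / per_frame    # (seconds per frame, FPS)

# Example with a dummy detector standing in for M-YOLO inference.
dummy_detect = lambda img: time.sleep(0.001)
sec_per_frame, fps = measure_fps(dummy_detect, [None] * 100)
print(f"{sec_per_frame:.4f} s/frame, {fps:.1f} FPS")
```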
Table 1. Key parameter selection.

Batch size   Num   Momentum   Decay    Learn_rate
64           5     0.9        0.0005   0.001

Table 2. Comparison of detection speed.

                          Model
Processing unit   YOLO    M-YOLO   Faster R-CNN
CPU (FPS)         0.87    0.81     0.11
GPU (FPS)         35.34   31.82    3.94

YOLO: you only look once; M-YOLO: modified you only look once; R-CNN: region-based convolutional neural network; CPU: central processing unit; GPU: graphics processing unit; FPS: frames per second.

Table 3. Comparison of detection accuracy.

                       Model
         YOLO            M-YOLO          Faster R-CNN
Rate     Chair   Bottle  Chair   Bottle  Chair   Bottle
FP       0.29    0.27    0.12    0.11    0.13    0.15
FN       0.17    0.15    0.21    0.24    0.16    0.15
Total    0.46    0.42    0.33    0.35    0.29    0.30

YOLO: you only look once; M-YOLO: modified you only look once; R-CNN: region-based convolutional neural network; FP: false positive; FN: false negative.
cessing unit; GPU: graphics processing unit; FPS: frames rate per second. studied case, the scanning and merging processes in the
standard YOLO algorithm take 10 frames with an overlap
of 50%, while only a single frame (panoramic image) is
that data with different input sizes adversely affect the processed in the M-YOLO algorithm. A similar result is
processing speed of the model. Therefore, input images obtained when the experiment is conducted on the GPU. As
a result, it is concluded that the M-YOLO algorithm has a
with the same input size (448  448) are utilized for all
high efficiency rate and reasonable generalization ability
models. However, since the YOLO algorithm has to pro-
for object detection in the rotating case studies.
cess more images than the M-YOLO algorithm does, the
total processing time of YOLO and M-YOLO algorithms
are measured to perform the evaluation.Table 2 presents the Accuracy
obtained statistical results in this regard. It is observed that
all three detectors do not meet the real-time performances The accuracy of the object detection is reflected by the
under CPU. However, YOLO and M-YOLO detection error rate. The lower the error rate, the more reliable and
speeds exceed 30 FPS under GPU, which is fully satisfied accurate the detection model.38 Among the obtained
in real time. Moreover, the current detection method of results, false positive (FP) indicates that the object is clas-
Faster R-CNN is less than 5 FPS. As a result, YOLO and sified incorrectly, while a false negative (FN) indicates that
M-YOLO algorithms have a more reasonable performance the object cannot be located. Images are tested by using
in comparison with that of the Faster R-CNN scheme. YOLO, M-YOLO, and Faster R-CNN.Table 3 presents
In the present study, the M-YOLO method is performed respective error rates. It is concluded that:
on a webcam to evaluate the corresponding real-time per-
1. The error rate is about 40% by using YOLO for the
formance for fetching images from the camera and display
detection process. Since the object is in a “short and
the detected objects.
thin” state, the conventional detector cannot be
Table 2 indicates that performing the object detection
adapted. However, by using the M-YOLO detector,
process on each panoramic image by the M-YOLO takes
the error rate is reduced to 30% by increasing the
about 1 s on CPU and 35 ms on GPU. Moreover, when the
number of longitudinal prediction frames to fit the
standard YOLO algorithm is applied to scan the real-time “short and thin” object. Therefore, the performance
rotation of the robot, several frames should be processed so of the M-YOLO detector is superior to that of the
that longer processing time is required in comparison with YOLO detector.
the M-YOLO algorithm. In fact, the M-YOLO algorithm 2. Comparing the M-YOLO detector with the Faster
only processes a single frame (panoramic image). Table 2 R-CNN detector, the Faster R-CNN error rate is less
indicates that the processing speed of the standard YOLO than 30%, which is the best detector in terms of accu-
algorithm on the CPU is 0.87 FPS, while that of the racy. The error rate of the M-YOLO detector is about
M-YOLO algorithm is 0.81 FPS. In other words, process- 3% higher than that of the Faster R-CNN detector.
ing one frame through the YOLO and M-YOLO algorithms However, considering the significant advantages of
takes 1.15 s and 1.23 s, respectively. Therefore, it is con- the M-YOLO detection speed, it performs more prac-
cluded that the performance of the YOLO and M-YOLO tically in terms of robot vision detection.
algorithms in this case study is similar. However, for the
object detection in the rotating case, since the YOLO algo- With these comparisons shown above in section A and
rithm has to detect several frames during the rotation, the section B, it could be concluded that the M-YOLO method
processing time is about 11.49 s, while that of the M-YOLO is a fast, accurate object detector, which is ideal for robot
algorithm is 1.23 s. This remarkable difference in the vision applications.
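As a small bookkeeping note on Table 3 (a sketch, not the authors' evaluation script), the "Total" row is simply the sum of the false-positive and false-negative rates measured over the test images; the image and error counts below are illustrative.

```python
def error_rates(n_images, n_false_positive, n_false_negative):
    """Error bookkeeping matching Table 3: total error = FP rate + FN rate."""
    fp_rate = n_false_positive / n_images
    fn_rate = n_false_negative / n_images
    return fp_rate, fn_rate, fp_rate + fn_rate

# Example reproducing the M-YOLO chair column of Table 3 (counts are illustrative).
print(error_rates(1000, 120, 210))   # (0.12, 0.21, 0.33)
```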
Figure 10. Test results in four cases. (a) Occlusion, (b) no occlusion, (c) normal illumination, and (d) insufficient illumination.

Table 4. Robustness quantitative test.

                       Case
Rate                   Normal   Overlap   Weak light
Accuracy rate          0.74     0.65      0.69
Average overlap rate   0.78     0.71      0.75

Robustness

The bottle is selected as the object to be detected, and other objects such as the chair, box, and table are interference objects. The accuracy and the average overlap rate are used as quantitative statistical indicators. The overlap ratio refers to the ratio of the overlap between the detected region and the real region. The higher the value, the more accurate the region of the detection result.

The test results (Figure 10) show that M-YOLO suffers a certain degree of missed detection when the object is occluded. The prediction accuracy and average overlap rate are 6% and 7% lower than in the normal environment, respectively. However, the average overlap rate is higher than 65%. When the lack of illumination leads to a certain degree of convergence between the object and the background, the detection accuracy decreases by 5%, and the average overlap rate decreases by 3%. It is concluded that in the occlusion and underlit environments, the M-YOLO method has a lower performance. However, it still maintains a higher accuracy and average overlap rate in comparison with the Faster R-CNN. Table 4 presents the robustness of the M-YOLO method.

Conclusion

The original YOLO image recognition does not perform well on panoramic images, which causes some failures when implementing the visual recognition39 on the robot. The present study proposes a real-time object detection
method based on the improved YOLO algorithm, which is named the M-YOLO method. The experimental results demonstrate that the M-YOLO method run on the GPU can detect objects in panoramic shots at a rate of up to 32 FPS, which exceeds the real-time requirement, and also shows a good generalization ability for processing both regular and panoramic images. Moreover, it maintains an object recognition accuracy rate of over 70%. However, there are still some problems. For example, the current M-YOLO model only detects a small number of objects and does not obtain a reasonable result in object-intensive scenarios. In future studies, increasing the number of detected objects and training for them should be considered. Moreover, the general ability of the M-YOLO model should be developed to make it perform well in object-intensive scenarios.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Jun W Wu https://orcid.org/0000-0002-8139-6304

References

1. He B, Wang S, and Liu YJ. Underactuated robotics: a review. Int J Adv Robot Syst 2019; 16(4): 1–29.
2. Ferrari V, Fevrier L, Jurie F, et al. Groups of adjacent contour segments for object detection. IEEE Trans Pattern Anal Mach Intell 2007.
3. Shotton J. Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: Proceedings of the 9th European conference on computer vision, Berlin: Springer, 2006.
4. Redmon J and Farhadi A. YOLO9000: better, faster, stronger. In: IEEE conference on computer vision and pattern recognition (CVPR), 2017, pp. 6517–6525.
5. Miller A, Knopp S, Christensen HI, et al. Automatic grasp planning using shape primitives. In: Proceedings of the IEEE international conference on robotics & automation, Taipei, 14–19 September 2003.
6. Goldfeder C, Allen PK, Lackner C, et al. Grasp planning via decomposition trees. In: IEEE international conference on robotics & automation, Roma, 10–14 April 2007.
7. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
8. Balasubramanian R, Xu L, Brook PD, et al. Physical human interactive guidance: identifying grasping principles from human-planned grasps. IEEE Trans Robot 2012; 28(4): 899–910.
9. Lecun Y, Bengio Y, and Hinton G. Deep learning. Nature 2015; 521(7553): 436.
10. Blaschko MB and Lampert CH. Learning to localize objects with structured output regression. In: Computer vision ECCV 2008, Berlin: Springer, 2008, pp. 2–15.
11. Girshick R. Fast R-CNN. In: IEEE international conference on computer vision, Santiago, 7–13 December 2015.
12. Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 2017; 39(6): 1137–1149.
13. Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection. In: IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
14. Zhou Y and Tuzel O. VoxelNet: end-to-end learning for point cloud based 3D object detection. In: IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
15. Redmon J and Farhadi A. YOLOv3: an incremental improvement. arXiv 2018.
16. Bourdev L and Malik J. Poselets: body part detectors trained using 3D human pose annotations. In: International conference on computer vision (ICCV), Kyoto, 29 September–2 October 2009.
17. He B, Liu YJ, Zeng LB, et al. Product carbon footprint across sustainable supply chain. J Clean Prod 2019; 241: 118320.
18. Dalal N and Triggs B. Histograms of oriented gradients for human detection. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), San Diego, 20–25 June 2005; 1: 886–893.
19. He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European conference on computer vision (ECCV), Berlin: Springer, 2014.
20. Simonyan K and Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations (ICLR), 2015.
21. Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE conference on computer vision and pattern recognition (CVPR), 2014, pp. 580–587.
22. Liu W, Anguelov D, Erhan D, et al. SSD: single shot multibox detector. Germany: Springer Verlag, 2016, pp. 21–37.
23. Gould S, Gao T, and Koller D. Region-based segmentation and object detection. Adv Neural Inf Process Syst 2009; 4: 655–663.
24. Dean T, Ruzon M, Segal M, et al. Fast, accurate detection of 100,000 object classes on a single machine. In: 2013 IEEE conference on computer vision and pattern recognition (CVPR), Portland, 23–28 June 2013, pp. 1814–1821.
25. Everingham M, Eslami SMA, Van Gool L, et al. The PASCAL visual object classes challenge: a retrospective. Int J Comput Vis 2015; 111(1): 98–136.
26. Han J, Liao Y, Zhang J, et al. Target fusion detection of LiDAR and camera based on the improved YOLO algorithm. Mathematics 2018; 6(10): 213.
27. Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: IEEE Conference, 2015.
28. Zhang X, Yang W, Tang X, et al. A fast learning method for accurate and robust lane detection using two-stage feature extraction with YOLO v3. Sensors 2018; 18(12): 4308.
29. Girshick R. Fast R-CNN. In: IEEE international conference on computer vision, 2015, pp. 1440–1448.
30. Al-masni MA, Al-antari MA, Park JM, et al. Simultaneous detection and classification of breast masses in digital mammograms via a deep learning YOLO-based CAD system. Comput Methods Programs Biomed 2018; 157: 85–94.
31. Redmon J and Angelova A. Real-time grasp detection using convolutional neural networks. In: IEEE international conference on robotics and automation, Seattle, 26–30 May 2015, pp. 26–30.
32. Sihua H, Xiaofang S, Shaoqing Y, et al. Analysis for the cylinder image quality of hyperbolic-catadioptric panorama image system. Las Inf 2012; 42(2): 187–191.
33. Papageorgiou CP, Oren M, and Poggio T. A general framework for object detection. In: IEEE sixth international conference on computer vision, Bombay, 7 January 1998, pp. 555–562.
34. Shinde S, Kothari A, and Gupta V. YOLO based human action recognition and localization. Procedia Comput Sci 2018; 133: 831–838.
35. Ren S, He K, Girshick RB, et al. Object detection networks on convolutional feature maps. IEEE Trans Pattern Anal Mach Intell 2017; 39(7): 1476–1481.
36. Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015; 115: 211–252.
37. Shen Z, Liu Z, Li J, et al. DSOD: learning deeply supervised object detectors from scratch. In: IEEE international conference on computer vision, 2017, pp. 1919–1927.
38. Felzenszwalb PF, Girshick RB, McAllester D, et al. Object detection with discriminatively trained part based models. IEEE Trans Pattern Anal Mach Intell 2010; 32(9): 1627–1645.
39. Donahue J, Jia Y, Vinyals O, et al. DeCAF: a deep convolutional activation feature for generic visual recognition. In: IEEE conference on computer vision and pattern recognition, 2013.
