Optimized Visual Recognition Algorithm in Service Robots

Jun W Wu, Wei Cai, Shi M Yu, Zhuo L Xu and Xue Y He
Abstract
Vision-based detection methods often require consideration of the robot's field of view. For example, panoramic images suffer from distortion, which negatively affects target recognition and spatial localization. Furthermore, the original you only look once method does not perform reasonably well for image recognition in panoramic images. Consequently, some failures have been reported when implementing visual recognition on the robot. The present study optimizes the conventional you only look once algorithm and proposes a modified you only look once algorithm. Comparison of the obtained results with the experiments shows that the modified you only look once method can run effectively on a graphics processing unit and recognize panoramic images at up to 32 frames per second, which meets the real-time requirements of diverse applications. It is found that the accuracy of object detection with the proposed modified you only look once method exceeds 70% in the studied cases.
Keywords
Robot, visual recognition, YOLO algorithm, panoramic shooting, deep learning
Figure 2. Flowchart of the YOLO algorithm.12 YOLO: you only look once.
1. Convolution layers for feature extraction. As a CNN-based object detection method, Faster R-CNN first uses a set of basic convolutions, rectified linear unit activation functions, and pooling layers to extract feature maps of the input image. These feature maps are then shared by the subsequent RPN layers19 and fully connected layers.
2. Region proposal network. The RPN is mainly applied to generate region proposals. It first generates a set of anchor boxes. After clipping and filtering, a Softmax classifier is employed to determine which anchors belong to the foreground and which to the background. Meanwhile, bounding box regression refines the anchor boxes to form more accurate proposals, relative to the subsequent box regression performed by the fully connected layers behind it.
3. Region of interest pooling. This layer uses the proposals generated by the RPN together with the feature map of the last visual geometry group (VGG16)20 layer to obtain a fixed-size feature map for each proposal, which can then be used to identify and locate objects through the fully connected operation.
4. Classifier. The region of interest pooling output is formed into a fixed-size feature map for the fully connected operation. A Softmax classifier assigns the specific categories,21,22 while the L1 loss is used to complete the bounding box regression operation and obtain the precise position of the object.23
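For orientation, the following is a minimal sketch of this four-stage pipeline using torchvision's reference Faster R-CNN implementation. It is not the implementation used in the paper; the ResNet-50 FPN variant, the placeholder image, and the 0.5 score threshold are illustrative choices.

```python
import torch
import torchvision

# Backbone convolutions -> RPN -> RoI pooling -> classifier are bundled inside
# this single module (see model.backbone, model.rpn, and model.roi_heads).
# Pretrained weights can be requested via the version-appropriate argument.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()
model.eval()

image = torch.rand(3, 480, 640)  # placeholder RGB tensor with values in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]

# Keep only confident detections; each entry has a box, a class label, and a score.
keep = prediction["scores"] > 0.5
boxes, labels = prediction["boxes"][keep], prediction["labels"][keep]
```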
YOLO

Apart from the Faster R-CNN scheme, the YOLO model is another recognition algorithm for multitarget deep learning.24 Considering its superior characteristics, the YOLO scheme is applied in the present study for robotic grasping. Figure 2 presents the flowchart of the YOLO model. It indicates that the YOLO model consists of three steps:

1. The original camera image is captured and then divided into a grid with S × S resolution.
2. Each grid cell predicts B bounding boxes with confidence scores. Each record has five parameters, namely x, y, w, h, and p · IOU (intersection over union), where p denotes the probability that the current position contains an object and the IOU term predicts the probability of the overlap area. Moreover, x and y are the center coordinates, while w and h are the width and height of the bounding box, respectively.
3. The class probability of each grid cell prediction is then calculated as Ci = p(classi | object).

The YOLO algorithm uses a CNN to implement the multi-object recognition model. In the present study, the pattern analysis, statistical modeling, and computational learning visual object classes (PASCAL VOC)25 data set is utilized to evaluate the model. The initial convolutional layers of the network extract features from the image, while the fully connected layers predict the output class probabilities and the corresponding image coordinates. Figure 3 indicates that, to perform the image classification, the network structure follows the GoogLeNet model, with 24 convolution layers and 2 fully connected layers.4

During training, the sum-squared error weights errors in large and small bounding boxes equally, whereas the error metric should reflect that small deviations in large bounding boxes matter less than those in small bounding boxes. To mitigate this problem, the regression is performed on the square root of the width and height of the bounding boxes instead of on the width and height directly. The YOLO algorithm predicts multiple bounding boxes in each grid cell, but during network training only one bounding box predictor is required to be responsible for each object. Subsequently, the following loss function is optimized13:
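This is the sum-squared loss of Redmon et al.,13 reproduced here from that reference:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}}
      \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
 &+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}}
      \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left(C_i-\hat{C}_i\right)^2
  + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```

Here Ci denotes the predicted confidence of cell i and pi(c) its class probability; the indicator and weighting terms are defined below.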
Figure 3. Convolutional network used by the YOLO algorithm.12 YOLO: you only look once.
where 1_i^obj and 1_ij^obj indicate, respectively, whether an object appears in the ith cell and whether the jth bounding box predictor in the ith cell is responsible for that object. Moreover, λ_coord denotes the coordinate error weight. In the present study, the error weights are set to λ_coord = 5 and λ_noobj = 0.5.

The loss function penalizes the classification error only when an object is present in the cell, and it penalizes the coordinate error of a bounding box only for the responsible (active) predictor. After the training session, the regression can be applied in real time to predict the object category and the object coordinates in the image coordinate system. The three-dimensional position of the object in the camera coordinate system26 can then be obtained by calculating the depth at the box center.
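To make the mapping from the network output to detections concrete, the following is a short decoding sketch. It is not the authors' code: the S × S × (B·5 + C) tensor layout, the PASCAL VOC class count, and the 0.2 confidence threshold are assumptions made for illustration.

```python
import numpy as np

def decode_yolo_output(output, S=7, B=2, C=20, conf_threshold=0.2):
    """Decode a YOLO-style output tensor of shape (S, S, B*5 + C).

    Each cell stores B boxes as (x, y, w, h, confidence) followed by C
    class-conditional probabilities p(class_i | object).
    """
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]                    # p(class_i | object)
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # Class-specific confidence = p(class_i | object) * p(object) * IOU
                scores = class_probs * conf
                cls = int(np.argmax(scores))
                if scores[cls] < conf_threshold:
                    continue
                # (x, y) are offsets within the cell; (w, h) are relative to the image.
                cx, cy = (col + x) / S, (row + y) / S
                detections.append((cx, cy, w, h, float(scores[cls]), cls))
    return detections

# Example with a randomly filled tensor in the PASCAL VOC configuration (C = 20).
preds = decode_yolo_output(np.random.rand(7, 7, 2 * 5 + 20))
```

In practice, non-maximum suppression would be applied to the resulting list before the boxes are used for localization.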
Differences between the YOLO and Faster R-CNN algorithms

In this section, it is intended to reveal the differences between the YOLO and the Faster R-CNN schemes. First, the R-CNN is essentially a feature extractor. A selective search is normally applied to extract a certain number (e.g. 2000) of region proposals; each proposal is then passed through the convolutional layers, and the features of the fc7 layer are extracted for the classification and the regression of the coordinates. In the R-CNN, the support vector machine classification method is utilized instead of the conventional Softmax model. The main contribution of this algorithm is to propose an effective feature utilization method, and the majority of researchers in this area use features of the fc7 layer in engineering practice based on the Fast R-CNN algorithm. The Fast R-CNN scheme uses a single network to implement all parts except the region proposal extraction. Unlike the conventional R-CNN, the classification and coordinate regression losses in the Fast R-CNN scheme update the network parameters through back propagation. Second, when extracting features, the whole image, rather than each region proposal, is fed into the network, and the feature of each proposal is then obtained through coordinate mapping. This type of extraction has two advantages: fast performance and a wide operational range. Since the image passes through the network only once, the feature is affected by the receptive field, so it can fuse features of the adjacent background to "see" farther. Finally, studies show that it is almost impossible to achieve real-time detection with the selective search method.27 Therefore, the selective search method is replaced with the RPN, which shares the feature extraction layer28 with the Fast R-CNN classification and regression network to reduce the computational expense. Experimental results also show that this replacement improves the speed and accuracy of the prediction. The RPN is the essence of Faster R-CNN; moreover, it is the main reason for the higher accuracy and lower speed of the Faster R-CNN algorithm compared with the YOLO scheme.

On the other hand, one of YOLO's contributions is to convert the detection problem into a regression problem. In fact, the Faster R-CNN is divided into two steps, namely extracting the region proposals and classifying them. The former step judges whether an anchor belongs to the foreground or the background, while the main purpose of the
Figure 4. Object recognition flowchart of the M-YOLO method. M-YOLO: modified you only look once.
Figure 7. Improved grid of the M-YOLO. M-YOLO: modified you only look once.
Figure 10. Test results in four cases. (a) Occlusion, (b) no occlusion, (c) normal illumination, and (d) insufficient illumination.
Robustness

The bottle is selected as the object to be detected, while other objects such as the chair, box, and table serve as interference objects. The accuracy and the average overlap rate are used as quantitative statistical indicators. The overlap ratio refers to the ratio of overlap between the detected region and the real region: the higher the value, the more accurately the detection result matches the real region.
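The overlap ratio is described only loosely here; assuming the standard intersection-over-union definition and an (x1, y1, x2, y2) box format, a minimal sketch of the metric is:

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2).

    A sketch of the overlap metric used as a robustness indicator; the IoU
    definition and box format are assumptions, not taken from the paper.
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a detected region compared with the annotated real region.
print(overlap_ratio((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47
```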
The test results (Figure 10) show that the M-YOLO method suffers a certain degree of missed detection when the object is occluded: the prediction accuracy and the average overlap rate are 6% and 7% lower than in the normal environment, respectively, although the average overlap rate remains above 65%. When insufficient illumination causes the object to blend into the background to a certain degree, the detection accuracy decreases by 5% and the average overlap rate decreases by 3%. It is concluded that the M-YOLO method performs worse in occluded and poorly lit environments; however, it still maintains a higher accuracy and average overlap rate than the Faster R-CNN. Table 4 presents the robustness results of the M-YOLO method.

Table 4. Robustness quantitative test.

Rate                    Normal    Occlusion    Weak light
Accuracy rate           0.74      0.65         0.69
Average overlap rate    0.78      0.71         0.75
Conclusion

The original YOLO image recognition does not perform well on panoramic images, which causes some failures when implementing visual recognition39 on the robot. The present study proposes a real-time object detection method based on an improved YOLO algorithm, named the M-YOLO method. The experimental results demonstrate that the M-YOLO method, run on the GPU, can process panoramic shots at up to 32 FPS, which exceeds the real-time requirement, and it also shows a good generalization ability for processing both regular and panoramic images. Moreover, it maintains an object recognition accuracy rate of over 70%. However, some problems remain. For example, the current M-YOLO model only detects a small number of object classes and does not obtain a reasonable result in object-intensive scenarios. In future studies, increasing the number of detected object classes and training on them should be considered. Moreover, the generalization ability of the M-YOLO model should be improved so that it performs well in object-intensive scenarios.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Jun W Wu https://orcid.org/0000-0002-8139-6304

References

1. He B, Wang S, and Liu YJ. Underactuated robotics: a review. Int J Adv Robot Syst 2019; 16(4): 1–29.
2. Ferrari V, Fevrier L, Jurie F, et al. Groups of adjacent contour segments for object detection. IEEE Trans Pattern Anal Mach Intell 2007.
3. Shotton J. Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: Proceedings of the 9th European conference on computer vision, Berlin: Springer, 2006.
4. Redmon J and Farhadi A. YOLO9000: better, faster, stronger. In: IEEE conference on computer vision and pattern recognition (CVPR), 2017, pp. 6517–6525.
5. Miller A, Knopp S, Christensen HI, et al. Automatic grasp planning using shape primitives. In: Proceedings of the IEEE international conference on robotics & automation, Taipei, 14–19 September 2003.
6. Goldfeder C, Allen PK, Lackner C, et al. Grasp planning via decomposition trees. In: IEEE international conference on robotics & automation, Roma, 10–14 April 2007.
7. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
8. Balasubramanian R, Xu L, Brook PD, et al. Physical human interactive guidance: identifying grasping principles from human-planned grasps. IEEE Trans Robot 2012; 28(4): 899–910.
9. Lecun Y, Bengio Y, and Hinton G. Deep learning. Nature 2015; 521(7553): 436.
10. Blaschko MB and Lampert CH. Learning to localize objects with structured output regression. In: Computer vision ECCV 2008, Berlin: Springer, 2008, pp. 2–15.
11. Girshick R. Fast R-CNN. In: IEEE international conference on computer vision, Santiago, 7–13 December 2015.
12. Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 2017; 39(6): 1137–1149.
13. Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection. In: IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
14. Zhou Y and Tuzel O. VoxelNet: end-to-end learning for point cloud based 3D object detection. In: IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
15. Redmon J and Farhadi A. YOLOv3: an incremental improvement. arXiv 2018.
16. Bourdev L and Malik J. Poselets: body part detectors trained using 3D human pose annotations. In: International conference on computer vision (ICCV), Kyoto, 29 September–2 October 2009.
17. He B, Liu YJ, Zeng LB, et al. Product carbon footprint across sustainable supply chain. J Clean Prod 2019; 241: 118320.
18. Dalal N and Triggs B. Histograms of oriented gradients for human detection. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), San Diego, 20–25 June 2005, 1: 886–893.
19. He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European conference on computer vision (ECCV), Berlin: Springer, 2014.
20. Simonyan K and Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations (ICLR), 2015.
21. Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE conference on computer vision and pattern recognition (CVPR), 2014, pp. 580–587.
22. Liu W, Anguelov D, Erhan D, et al. SSD: single shot multibox detector. Germany: Springer Verlag, 2016, pp. 21–37.
23. Gould S, Gao T, and Koller D. Region-based segmentation and object detection. Adv Neu Inf Pro Syst 2009; 4: 655–663.
24. Dean T, Ruzon M, Segal M, et al. Fast, accurate detection of 100,000 object classes on a single machine. In: 2013 IEEE conference on computer vision and pattern recognition (CVPR), Portland, 23–28 June 2013, pp. 1814–1821.
25. Everingham M, Eslami SMA, Van Gool L, et al. The PASCAL visual object classes challenge: a retrospective. Int J Comput Vis 2015; 111(1): 98–136.
26. Han J, Liao Y, Zhang J, et al. Target fusion detection of LiDAR and camera based on the improved YOLO algorithm. Mathematics 2018; 6(10): 213.
27. Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: IEEE conference, 2015.
28. Zhang X, Yang W, Tang X, et al. A fast learning method for accurate and robust lane detection using two-stage feature extraction with YOLO v3. Sensors 2018; 18(12): 4308.
29. Girshick R. Fast R-CNN. In: IEEE international conference on computer vision, 2015, pp. 1440–1448.
30. Al-masni MA, Al-antari MA, Park JM, et al. Simultaneous detection and classification of breast masses in digital mammograms via a deep learning YOLO-based CAD system. Comput Methods Programs Biomed 2018; 157: 85–94.
31. Redmon J and Angelova A. Real-time grasp detection using convolutional neural networks. In: IEEE international conference on robotics and automation, Seattle, 26–30 May 2015, pp. 26–30.
32. Sihua H, Xiaofang S, Shaoqing Y, et al. Analysis for the cylinder image quality of hyperbolic-catadioptric panorama image system. Las Inf 2012; 42(2): 187–191.
33. Papageorgiou CP, Oren M, and Poggio T. A general framework for object detection. In: IEEE sixth international conference on computer vision, Bombay, January 1998, pp. 555–562.
34. Shinde S, Kothari A, and Gupta V. YOLO based human action recognition and localization. Procedia Comput Sci 2018; 133: 831–838.
35. Ren S, He K, Girshick RB, et al. Object detection networks on convolutional feature maps. IEEE Trans Pattern Anal Mach Intell 2017; 39(7): 1476–1481.
36. Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015; 115: 211–252.
37. Shen Z, Liu Z, Li J, et al. DSOD: learning deeply supervised object detectors from scratch. In: IEEE international conference on computer vision, 2017, pp. 1919–1927.
38. Felzenszwalb PF, Girshick RB, McAllester D, et al. Object detection with discriminatively trained part based models. IEEE Trans Pattern Anal Mach Intell 2010; 32(9): 1627–1645.
39. Donahue J, Jia Y, Vinyals O, et al. Decaf: a deep convolutional activation feature for generic visual recognition. In: IEEE conference on computer vision and pattern recognition, 2013.