
JMST Advances (2023) 5:69–76
https://doi.org/10.1007/s42791-023-00049-7
Online ISSN 2524-7913 · Print ISSN 2524-7905

REVIEW

Real-time object detection and segmentation technology: an analysis of the YOLO algorithm

Chang Ho Kang1 · Sun Young Kim2

Received: 19 July 2023 / Revised: 6 August 2023 / Accepted: 7 August 2023 / Published online: 8 September 2023
© The Korean Society of Mechanical Engineers 2023

Abstract
In this paper, the YOLO (You Only Look Once) algorithm, a representative algorithm for real-time object detection and segmentation, is analyzed in the order of its development. As its name suggests, the YOLO algorithm can detect objects with a single forward pass, making fast and accurate object detection and segmentation possible. This paper explores the characteristics and history of the YOLO algorithm. The performance of the YOLO algorithm is evaluated using the COCO (Common Objects in Context) data set. By far the most difficult aspect of deep learning is preparing the training data, and the data applicable to each field of application is severely limited. Despite these limitations, the YOLO model still has a substantially faster processing speed than other conventional models and continues to be in widespread use. Each version of the YOLO algorithm has adopted various ideas and techniques for further performance improvements, presenting researchers with new directions for resolving problems in object detection. These advances will continue, and the YOLO algorithm will provide important insights into the ways we understand and recognize images and video.

Keywords YOLO · Deep learning · Real-time · Detection · Segmentation

1 Introduction

Deep learning-based artificial intelligence (AI) is playing a valuable role in our everyday lives. In particular, the detection and classification of objects through the analysis of images and video is becoming ever more precise in the field of computer vision, and central to such R&D are two core elements, namely, object detection and semantic segmentation. Numerous algorithms contribute to the advancement of deep learning-based object detection and semantic segmentation technologies. Among these, the real-time object detection and semantic segmentation technology YOLO (You Only Look Once) has gained particular attention. As its name suggests, the YOLO algorithm can detect objects with a single forward pass, making fast and accurate object detection and segmentation possible. This article explores the characteristics and history of the YOLO algorithm.

2 Characteristics and history of the YOLO algorithm

2.1 Network structure and method of operation of the YOLO algorithm

YOLO is an object detection algorithm released by Joseph Redmon and Ali Farhadi in 2015. Unlike conventional object detection algorithms such as R-CNN [1], YOLO boasts fast speeds and can detect objects across the entire image with a single pass. The YOLO algorithm is based on a convolutional neural network (CNN). As shown in Fig. 1, an image divided into a grid is passed through the neural network, and the final detection output is generated using bounding box (Bbox) prediction techniques. For Bbox calculation, YOLO requires the post-processing steps of IoU (Intersection over Union) and NMS (non-maximum suppression).

Fig. 1  YOLO image processing process [2]

2.2 Characteristics of the YOLO algorithm by version

Eight versions of the YOLO algorithm have been released: YOLOv1 [2], YOLOv2 [3], YOLOv3 [4], YOLOv4 [5], YOLOv5 [6], YOLOv6 [7], YOLOv7 [8], and YOLOv8 [9].

2.2.1 YOLOv1

As aforementioned, YOLOv1 was released in 2015 by Joseph Redmon and Ali Farhadi and is a deep learning-based network for real-time object detection. As shown in Fig. 1, YOLOv1 divides the input image into S × S grid cells, predicting B Bboxes with a confidence score for each, together with class probabilities for each cell. The final output size is S × S × (B × 5 + C), where S is the number of cells along each side, B is the number of Bboxes per cell, and C is the number of classes to be classified.
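For concreteness, the output size can be computed directly; below is a minimal sketch using the canonical settings of the YOLOv1 paper [2] (S = 7, B = 2, C = 20):

```python
# Output tensor size of YOLOv1 for the canonical settings in [2]:
# S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes.
S, B, C = 7, 2, 20

# Each box carries 5 values (x, y, w, h, confidence); each cell also
# predicts C class probabilities, giving an S x S x (B*5 + C) tensor.
depth = B * 5 + C                 # 30
num_outputs = S * S * depth       # 7 * 7 * 30 = 1470

print(f"output tensor: {S} x {S} x {depth} = {num_outputs} values")
```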
The overlap problem arises when processing images in YOLOv1, as shown in Fig. 2. B Bboxes are generated for each grid cell, so adjacent cells would generate Bboxes predicting the same object; this issue was named overlap. To resolve it, a new procedure was adopted: confidence scores and IoUs are calculated, the Bbox with the highest confidence score is selected, and the remaining Bboxes whose IoU with the selected Bbox exceeds a threshold are deleted. This method is called NMS.

Fig. 2  Overlap issue [2]
The network structure of YOLOv1 is shown in Fig. 3. The structure comprises 24 convolution layers and 2 Fully Connected (FC) layers. The structure, called a DarkNet network, uses a network pre-trained on the ImageNet data set. Parameters are reduced by alternating 1 × 1 convolution (reduction) layers with 3 × 3 convolution layers.

Fig. 3  Structure of DarkNet [2]

2.2.2 YOLOv2

YOLOv2 was released in 2016 by Joseph Redmon and Ali Farhadi. YOLOv2 has improved performance over YOLOv1 and proposes DarkNet19, an improvement over DarkNet. DarkNet19 deletes the final FC layer of the conventional network, substituting a 1 × 1 convolution layer, and global average pooling is used to reduce parameters and improve speed. Furthermore, whereas YOLOv1 predicts 2 Bbox coordinates per grid cell, YOLOv2 finds 5 anchor boxes per grid cell. Here, an anchor box is a Bbox pre-defined with varying sizes and aspect ratios, and the user may designate the number of anchor boxes arbitrarily.
ignate the number of anchor boxes arbitrarily. The residual max of the conventional network output layer was unable to
learning structure which gained much attention in 2016 was appropriately detect multiple objects existing in a single box.
used in YOLOv2 as well, as shown in Fig. 4. Here, an inter- To enable this in YOLOv3, instead of using Softmax on the
mediate feature map and final feature map are combined and final loss function, the sigmoid is obtained for all classes,
used, adding the high-resolution feature map of the former carrying out binary classification for each class.
convolution layer to the low-resolution feature map of the
latter convolution layer. High-resolution feature maps imply 2.2.4 YOLOv4
data on small objects and improve small object detection
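A small numerical sketch of this design choice, contrasting the two activations (the four-class setup is illustrative, not the paper's class set):

```python
import numpy as np

# Raw class logits for one predicted box (4 illustrative classes,
# e.g. person, dog, cat, car -- not the actual class set).
logits = np.array([2.0, 1.5, -1.0, -0.3])

# Softmax (YOLOv2 and earlier): probabilities sum to 1, so a single
# class dominates -- multi-label objects cannot be represented.
softmax = np.exp(logits) / np.exp(logits).sum()

# Sigmoid (YOLOv3): each class gets an independent binary decision,
# so several classes can exceed the threshold at once.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
multi_labels = sigmoid > 0.5   # here both "person" and "dog" pass

print(softmax.round(3), sigmoid.round(3), multi_labels)
```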
2.2.4 YOLOv4

YOLOv4 was released in 2020 by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. The aim of YOLOv4 was to address problems in detecting small-sized objects in the existing YOLO series. A large input resolution was used to allow for good detection of various small objects: whereas 224 and 256 resolutions were previously used for training, YOLOv4 used a 512 resolution.

Furthermore, the number of layers was increased to physically enlarge the receptive field, and the number of parameters was also increased, as high expressive power is necessary for the simultaneous detection of objects of various types and sizes in an image.

The network of YOLOv4 was modified for higher accuracy and faster processing than YOLOv3. As shown in Fig. 5, CSPDarkNet53 (CSP: Cross Stage Partial connections), SPP (Spatial Pyramid Pooling), FPN, and PAN (Path Aggregation Network) were applied on top of YOLOv3. Furthermore, YOLOv4 employs quantization technology and data augmentation technology using various data sets to enhance performance while reducing model size.

Fig. 5  YOLOv4 network structure [5]

2.2.5 YOLOv5

YOLOv5 was released by Ultralytics in 2021. Like YOLOv4, YOLOv5 uses CSPNet [12]. By uniformly distributing computations across layers through BottleneckCSP, computation bottlenecks were removed and the utilization of CNN layer computations was improved. YOLOv5 is distinguished from previous YOLO models in that the backbone is scaled by depth multiples and width multiples, classified as YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x by size. Here, s, m, l, and x stand for small, medium, large, and xlarge, respectively. As shown in Fig. 6, YOLOv5s is the fastest but has relatively low accuracy, while YOLOv5x is the slowest but with improved accuracy. This subdivided model line-up is found in the latest version, YOLOv8, as well.

Fig. 6  Performance comparison of YOLOv5 sub-models [6]
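As an illustration of depth and width multiples, here is a sketch using the compound-scaling values listed in the YOLOv5 repository configurations [6] (treat the numbers as indicative; they can differ between releases, and the real implementation also rounds channels to hardware-friendly multiples):

```python
# Depth and width multiples for the YOLOv5 model family, as listed in
# the YOLOv5 repository configs [6] (values indicative, not authoritative).
MULTIPLES = {
    #          depth_multiple, width_multiple
    "yolov5s": (0.33, 0.50),
    "yolov5m": (0.67, 0.75),
    "yolov5l": (1.00, 1.00),
    "yolov5x": (1.33, 1.25),
}

def scale_stage(base_repeats: int, base_channels: int, model: str):
    """Scale one backbone stage: the depth multiple controls how many
    blocks are repeated, the width multiple controls channel count."""
    depth, width = MULTIPLES[model]
    repeats = max(round(base_repeats * depth), 1)
    channels = int(base_channels * width)
    return repeats, channels

# A stage with 9 blocks and 256 channels in the base configuration:
for name in MULTIPLES:
    print(name, scale_stage(9, 256, name))
```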

2.2.6 YOLOv6

YOLOv6 was released by Meituan in 2022. YOLOv6 employs several quantization and distillation methods for system deployment, improving performance. As for the network structure, the EfficientRep backbone was used, as shown in Figs. 7 and 8; Rep-PAN was used for the neck, and an efficient decoupled head was used. Most important to the network structure is the CSPStackRep block, which combines the CSP and RepVGG [13] methods. The small YOLOv6 models use a typical single-path backbone, and the large models comprise efficient multiple-branch blocks.

YOLOv6 is designed for dynamic adjustment of labels and teacher knowledge to allow for efficient learning at all training steps when carrying out self-distillation. With RepOptimizer and an improved quantization scheme based on per-channel distillation for object detection, YOLOv6 has improved performance over previous versions.

Fig. 7  YOLOv6 network’s main structure [7]

Fig. 8  YOLOv6 network structure including RepVGG [14]

2.2.7 YOLOv7

YOLOv7 was released in 2022 by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao [8] and was engineered using trainable bag-of-freebies methods, which improve accuracy without increasing inference cost while preserving real-time object detection. YOLOv7 is characterized in that, when training with multiple output layers, instead of collapsing the layers into a single layer and using the ground truth as-is, new soft labels are generated considering the prediction of the model and the distribution of the ground truth. A dynamic label assignment strategy was proposed to handle the different outputs of different branches, which conventional label assignment methods struggle with, and these labels are computed efficiently using extend and compound scaling methods.
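A purely illustrative sketch of the soft-label idea follows (the linear blending rule here is hypothetical; YOLOv7's lead-guided, coarse-to-fine assigner is more involved):

```python
import numpy as np

def soft_labels(hard_label: np.ndarray, model_pred: np.ndarray,
                alpha: float = 0.7) -> np.ndarray:
    """Blend a one-hot ground-truth label with the model's own predicted
    distribution to obtain a soft training target (illustrative only)."""
    return alpha * hard_label + (1.0 - alpha) * model_pred

gt = np.array([0.0, 1.0, 0.0])       # one-hot ground truth
pred = np.array([0.1, 0.7, 0.2])     # lead-head prediction
print(soft_labels(gt, pred))         # [0.03, 0.91, 0.06]
```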
YOLOv7 proposes Extended-ELAN (E-ELAN) to enable effective learning even when using deep layers. The network structure of YOLOv7 is shown in Fig. 9 [15]. In the network structure aspect, E-ELAN changes the architecture of the computational block but does not change the architecture of the transition layer. The same group parameters and channel multipliers are applied to all blocks of the computational layer, and the feature maps calculated in each computational block are combined as groups and connected according to the configured parameters. Here, the channel count of each feature map group is identical to that of the original architecture. Finally, the feature map groups are added to carry out merge cardinality.

Fig. 9  YOLOv7 network structure [15]

2.2.8 YOLOv8

YOLOv8 was released by Ultralytics in 2023 and implemented as an integrated framework for object detection, instance segmentation, and image classification model training through a newly released repository. Various versions of YOLOv8 exist, including YOLOv8n, YOLOv8m, and YOLOv8x. The network structure of YOLOv8 is based on CSPDarkNet53, as shown in Fig. 10 [16]. CSPDarkNet53 is a network structure that improves on the DarkNet53 structure and contributes to the performance enhancements in YOLOv8. YOLOv8 modifies the network structure of YOLOv5 by replacing the C3 module with a C2f module and replacing the first 6 × 6 convolution layer of the backbone with a 3 × 3 convolution layer. Furthermore, two convolution layers are deleted, the first 1 × 1 convolution layer of ConvBottleneck is replaced by a 3 × 3 convolution layer, the objectness branch is deleted, and a separate head is used. YOLOv8 is an anchor-free model, where the object center is predicted directly instead of using an anchor box offset, improving NMS speed.
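Because YOLOv8 ships as an integrated framework, inference reduces to a few lines with the Ultralytics Python API (the weight file name is a pretrained checkpoint published by Ultralytics; the image path is a placeholder):

```python
# pip install ultralytics
from ultralytics import YOLO

# Load pretrained nano detection weights released by Ultralytics.
model = YOLO("yolov8n.pt")

# Run inference on an image (placeholder path); returns a list of Results.
results = model("example.jpg")

for r in results:
    for box in r.boxes:
        # Corner coordinates, confidence, and class index per detection.
        print(box.xyxy, box.conf, box.cls)
```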
3 Performance comparison and limitations of the YOLO algorithm

The performance of the YOLO algorithm is evaluated using the COCO (Common Objects in Context) data set. The COCO data set is a large data set comprising 80 object classes. YOLOv1, the base version, can recognize 24 object classes and has a 21.6% mAP (mean Average Precision) measured using the COCO data set. YOLOv2 can recognize 90 object classes, with a COCO data set mAP of 30.2%. YOLOv3 can recognize 1000 object classes, with a mAP of 57.9%. YOLOv4 has improved performance over YOLOv3 but can recognize 80 object classes, with a mAP of 60.0%. YOLOv5 achieves substantial performance improvements over YOLOv4 and achieved a mAP of 83.5% with the COCO data set. YOLOv6 achieved a COCO data set mAP of 84.4%. YOLOv7 achieved a mAP of 85.4%. YOLOv8 achieves further performance improvements over YOLOv7 and achieved a mAP of 86.4%.

Yet, the YOLO algorithm has shortcomings, which can be summarized as follows. First, spatial restrictions exist: only a limited number of Bbox predictions are allowed per grid cell, making it difficult to distinguish objects that are close together. Second, multiple down-samplings are used, so insufficient detail is often apparent. The third issue is inaccurate localization. Finally, as Bbox training is carried out from data, the algorithm has difficulty detecting objects in a test data set that do not exist in the training data. By far the most difficult aspect of deep learning is preparing the training data, and the data applicable to each field of application is severely limited. Despite these limitations, the YOLO model still has a substantially faster processing speed than other conventional models and continues to be in widespread use.

Fig. 10  YOLOv8 network structure [16]

4 Conclusion

The YOLO algorithm has made large contributions to the field of object detection. Enabling real-time object detection, the algorithm has applications in numerous fields including self-driving cars, security cameras, and robot technology. Each version of the YOLO algorithm has adopted various ideas and techniques for further performance improvements, presenting researchers with new directions for resolving problems in object detection. These advances will continue, and the YOLO algorithm will provide important insights into the ways we understand and recognize images and video.

Acknowledgements This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT, the Republic of Korea (No. 2021R1C1C1009219, No. 2021R1F1A1063298).

Funding This study was supported by the Ministry of Science and ICT, South Korea, No. 2021R1C1C1009219 (Sun Young Kim) and No. 2021R1F1A1063298 (Chang Ho Kang).

References

1. K. He, G. Gkioxari, P. Dollár, R. Girshick, in Proceedings of the IEEE International Conference on Computer Vision, 2961–2969 (2017)
2. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016)
3. J. Redmon, A. Farhadi, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7263–7271 (2017)
4. A. Farhadi, J. Redmon, arXiv preprint arXiv:1804.02767 (2018)
5. A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, arXiv preprint arXiv:2004.10934 (2020)


6. G. Jocher, A. Chaurasia, A. Stoken, J. Borovec, Y. Kwon, K. Michael, M. Jain, et al., Zenodo (2021)
7. C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, et al., arXiv preprint arXiv:2209.02976 (2022)
8. C.Y. Wang, A. Bochkovskiy, H.Y.M. Liao, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7464–7475 (2023)
9. J. Terven, D. Cordova-Esparza, arXiv preprint arXiv:2304.00501 (2023)
10. Z. Wu, C. Shen, A. Van Den Hengel, Pattern Recogn. 90, 119–133 (2019)
11. T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
12. C.Y. Wang, H.Y.M. Liao, Y.H. Wu, P.Y. Chen, J.W. Hsieh, I.H. Yeh, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 390–391 (2020)
13. X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13733–13742 (2021)
14. M. Contributors, YOLOv6 by MMYOLO. https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov6. Accessed 13 May 2023
15. M. Contributors, YOLOv7 by MMYOLO. https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov7. Accessed 13 May 2023
16. M. Contributors, YOLOv8 by MMYOLO. https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov8. Accessed 13 May 2023

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Chang Ho Kang received a B.S. degree in mechanical and aerospace engineering from Sejong University, in 2009, and a Ph.D. degree from the Department of Mechanical and Aerospace Engineering, Seoul National University, in 2016. From 2016 to 2018, he was a Postdoctoral Researcher with the BK21+ Transformative Training Program for Creative Mechanical and Aerospace Engineers, at Seoul National University. From 2018 to 2019, he was a Research Professor at the Research Institute of Engineering and Technology, Korea University. He is currently an Assistant Professor at the Department of Mechanical System Engineering (Department of Aeronautics, Mechanical and Electronic Convergence Engineering), Kumoh National Institute of Technology. His research interests include GNSS receivers, digital signal processing, nonlinear filtering, and deep learning.

Sun Young Kim received a B.S. degree in Electronic Engineering from Kookmin University in 1999 and M.S. and Ph.D. degrees in Mechanical and Aerospace Engineering from Seoul National University in 2015 and 2019, respectively. From 2019 to 2020, she was a Postdoctoral Researcher at the School of Intelligent Mechatronics Engineering, Sejong University, Seoul, Republic of Korea. From 2020 to 2023, she was an Assistant Professor at the School of Mechanical Engineering, Kunsan National University, Jeollabuk-do, Republic of Korea. Since 2023, she has been an Associate Professor at the School of Mechanical Engineering, Kunsan National University, Jeollabuk-do, Republic of Korea. Her research interests include navigation systems, filtering, localization, GNSS interference detection and mitigation, multi-sensor fusion, multi-target tracking, target detection and classification, SLAM, autonomous vehicles, and deep learning.

Authors and Affiliations

Chang Ho Kang1 · Sun Young Kim2

* Sun Young Kim
  [email protected]

  Chang Ho Kang
  [email protected]

1 Department of Mechanical Systems Engineering (Department of Aeronautics, Mechanical and Electronic Convergence Engineering), Kumoh National Institute of Technology, Gumi 39177, Republic of Korea

2 School of Mechanical Engineering, Kunsan National University, Gunsan 54150, Republic of Korea