Real Time Object Detection
REVIEW
Received: 19 July 2023 / Revised: 6 August 2023 / Accepted: 7 August 2023 / Published online: 8 September 2023
© The Korean Society of Mechanical Engineers 2023
Abstract
In this paper, the YOLO (You Only Look Once) algorithm, a representative algorithm for real-time object detection
and segmentation technology, is analyzed in order of its development. As its name suggests, the YOLO algorithm
can detect objects with a single forward pass, making fast and accurate object detection and segmentation possible. This
paper explores the characteristics and history of the YOLO algorithm. The performance of the YOLO algorithm is evaluated
using the COCO (Common Objects in Context) data set. By far the most difficult aspect of deep learning is preparing the
training data, and the data applicable to each field of application is severely limited. Despite these limitations, the YOLO
model still has a substantially faster processing speed than other conventional models and continues to be in widespread
use. Each version of the YOLO algorithm has adopted various ideas and techniques for further performance improvements,
presenting researchers with new directions for resolving problems in object detection. These advances will continue, and the
YOLO algorithm will provide important insights into the ways we understand and recognize images and video.
1 Introduction
The analysis of images and video is becoming ever more precise in the field of computer vision, and central to such R&D are two core elements, namely, object detection and semantic segmentation. Numerous algorithms contribute to the advancement of deep learning-based object detection and semantic segmentation technologies. Among these, the real-time object detection and semantic segmentation technology YOLO (You Only Look Once) has gained particular attention. As its name suggests, the YOLO algorithm can detect objects with a single forward pass, making fast and accurate object detection and segmentation possible. This article explores the characteristics and history of the YOLO algorithm.

2.1 Network structure and method of operation of the YOLO algorithm

YOLO is an object detection algorithm released by Joseph Redmon and Ali Farhadi in 2015. Unlike conventional object detection algorithms, such as R-CNN [1], YOLO boasts fast speeds and can detect objects in the entire image with a single pass. The YOLO algorithm is based on a convolutional neural network (CNN). As shown in Fig. 1, an image divided into a grid is passed through the neural network, and the final detection output is generated using bounding box (Bbox) prediction techniques. For Bbox calculation, YOLO requires the post-processing steps of IoU (Intersection over Union) and NMS (non-maximum suppression).

Fig. 1 YOLO image processing process [2]

2.2 Characteristics of the YOLO algorithm by version

Eight versions of the YOLO algorithm have been released (YOLOv1 [2], YOLOv2 [3], YOLOv3 [4], YOLOv4 [5], YOLOv5 [6], YOLOv6 [7], YOLOv7 [8], YOLOv8 [9]).

2.2.1 YOLOv1

As aforementioned, YOLOv1 was released in 2015 by Joseph Redmon and Ali Farhadi and is a deep learning-based network for real-time object detection. As shown in Fig. 1, YOLOv1 divides the input image into S × S grid cells, predicting B Bboxes, a confidence score, and class probabilities for each cell. The final output size is S × S × (B × 5 + C). Here, S is the number of cells per side, B is the number of Bboxes per cell, and C is the number of classes to be classified.
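The output dimensions can be checked with a short calculation. For the configuration used in the original paper on PASCAL VOC (S = 7, B = 2, C = 20), each cell predicts two boxes with five values each (x, y, w, h, confidence) plus 20 class probabilities:

```python
def yolov1_output_shape(S: int, B: int, C: int) -> tuple:
    """YOLOv1 output tensor shape: each of the S x S cells predicts
    B boxes x (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)

# The original paper's PASCAL VOC setting gives a 7 x 7 x 30 output tensor.
print(yolov1_output_shape(S=7, B=2, C=20))  # (7, 7, 30)
```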
The overlap problem arises when images are processed in YOLOv1, as shown in Fig. 2. The NMS method mentioned above was used to address this issue. B Bboxes are generated for each grid cell, so adjacent cells often generate Bboxes that predict the same object; this issue was named overlap. To resolve it, a new procedure was adopted: confidence scores and IoUs are calculated, the Bbox with the highest confidence score is selected, and the remaining Bboxes whose IoU with the selected Bbox exceeds a threshold are deleted. This method is called NMS.
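A minimal NumPy sketch of this greedy procedure is given below; the corner-format boxes [x1, y1, x2, y2] and the 0.5 threshold are illustrative assumptions rather than values fixed by the YOLOv1 paper:

```python
import numpy as np

def box_area(b):
    """Area of boxes in [x1, y1, x2, y2] format (one box or an array)."""
    return (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (box_area(box) + box_area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-confidence box and delete
    the remaining boxes whose IoU with it exceeds the threshold."""
    order = np.argsort(scores)[::-1]  # indices by descending confidence
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```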
The network structure of YOLOv1 is shown in Fig. 3. The structure comprises 24 convolution layers and 2 fully connected (FC) layers. The structure, called the DarkNet network, uses a network pre-trained on the ImageNet data set. Parameters are reduced by placing a 1 × 1 reduction convolution layer in front of a 3 × 3 convolution layer.
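The parameter saving from such a reduction layer is easy to verify. With illustrative channel counts (not the exact DarkNet dimensions), inserting a 1 × 1 layer that halves the channels before a 3 × 3 layer cuts the weight count by roughly 44%:

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

direct  = conv_weights(512, 512, 3)                              # 2,359,296
reduced = conv_weights(512, 256, 1) + conv_weights(256, 512, 3)  # 1,310,720
print(direct, reduced)
```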
2.2.4 YOLOv4

For training, YOLOv4 used a 512 × 512 input resolution. Furthermore, the number of layers was increased to physically enlarge the receptive field, as the sketch below illustrates, and the number of parameters was also increased, as high expressive power is necessary for the simultaneous detection of objects of various types and sizes in an image.

The network of YOLOv4 was modified for higher accuracy and faster processing than YOLOv3. As shown in Fig. 5, CSPDarkNet53 (Cross Stage Partial connections), SPP (Spatial Pyramid Pooling), FPN (Feature Pyramid Network), and PAN (Path Aggregation Network) were applied on top of YOLOv3. Furthermore, YOLOv4 employs quantization technology and data augmentation technology using various data sets to enhance performance while reducing model size.
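The claim that additional layers physically enlarge the receptive field can be made concrete with the standard receptive-field recurrence; the layer stacks below are illustrative, not YOLOv4's actual configuration:

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions given as (kernel, stride)
    pairs, using the standard recurrence rf += (k - 1) * jump."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 5))             # 11: five 3x3 layers
print(receptive_field([(3, 1)] * 10))            # 21: doubling depth widens the view
print(receptive_field([(3, 2)] + [(3, 1)] * 9))  # 39: a strided layer amplifies later growth
```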
2.2.5 YOLOv5

YOLOv5 was released by Ultralytics in 2021. Like YOLOv4, YOLOv5 uses CSPNet [12]. By uniformly distributing computations across the layers through the BottleneckCSP module, computation bottlenecks were removed, and the utilization of CNN layer computations was improved. YOLOv5 is distinguished from previous YOLO models in that the backbone is scaled by depth multiples and width multiples, and the model is classified as YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x by size. Here, s, m, l, and x stand for small, medium, large, and xlarge, respectively. As shown in Fig. 6, YOLOv5s is the fastest but has relatively low accuracy, while YOLOv5x is the slowest but with improved accuracy. This subdivided model line-up is found in the latest version, YOLOv8, as well.

Fig. 6 Performance comparison of YOLOv5 sub-models [6]
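How two scalars generate the s/m/l/x line-up can be sketched as follows. The depth and width multiples below match the values published in the Ultralytics YOLOv5 configuration files, but the example stage (9 blocks, 512 channels) is hypothetical:

```python
import math

# (depth_multiple, width_multiple) as published in the YOLOv5 model YAMLs
MULTIPLES = {"s": (0.33, 0.50), "m": (0.67, 0.75), "l": (1.00, 1.00), "x": (1.33, 1.25)}

def scale(base_repeats, base_channels, variant):
    """Scale a stage's repeat count and channel width for a YOLOv5 variant."""
    d, w = MULTIPLES[variant]
    repeats = max(round(base_repeats * d), 1)        # depth: number of blocks
    channels = math.ceil(base_channels * w / 8) * 8  # width: rounded to a multiple of 8
    return repeats, channels

# A hypothetical CSP stage with 9 blocks and 512 channels in the base model:
for v in "smlx":
    print(v, scale(9, 512, v))
# s (3, 256), m (6, 384), l (9, 512), x (12, 640)
```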
2.2.6 YOLOv6

YOLOv6 was released by Meituan in 2022. YOLOv6 employs several quantization and distillation methods for system deployment, improving performance. As for the network structure, the EfficientRep backbone was used, as shown in Figs. 7 and 8; Rep-PAN was used for the neck, and an efficient decoupled head was used. Most important to the network structure is the CSPStackRep block, which combines the CSP and RepVGG methods [13]. The small YOLOv6 models use a typical single-path backbone, while the large models comprise efficient multiple-branch blocks.

YOLOv6 is designed to dynamically adjust the labels and the teacher's knowledge, allowing knowledge to be learned efficiently at all training stages when carrying out self-distillation. With RepOptimizer and an improved quantization system based on per-channel distillation, YOLOv6 achieves improved performance over previous versions.
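A minimal sketch of this self-distillation idea, assuming a temperature-scaled KL term whose weight decays over training so that the teacher guides early epochs while the ground-truth labels dominate later ones (the cosine schedule and function names are illustrative, not YOLOv6's exact formulation):

```python
import math
import torch
import torch.nn.functional as F

def distillation_weight(epoch, total_epochs, w_max=1.0, w_min=0.0):
    """Cosine decay of the distillation weight over training (illustrative)."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return w_min + (w_max - w_min) * cos

def self_distillation_loss(student_logits, teacher_logits, det_loss,
                           epoch, total_epochs, tau=2.0):
    """Total loss = detection loss + decaying KL term between the student's
    and the (frozen) teacher's class predictions."""
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau  # standard temperature scaling of the distillation term
    return det_loss + distillation_weight(epoch, total_epochs) * kl
```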
6. G. Jocher, A. Chaurasia, A. Stoken, J. Borovec, Y. Kwon, K. Michael, M. Jain, et al., Zenodo (2021)
7. C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, et al., arXiv preprint arXiv:2209.02976 (2022)
8. C.Y. Wang, A. Bochkovskiy, H.Y.M. Liao, in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7464–7475 (2023)
9. J. Terven, D. Cordova-Esparza, arXiv preprint arXiv:2304.00501 (2023)
10. Z. Wu, C. Shen, A. Van Den Hengel, Pattern Recogn. 90, 119–133 (2019)
11. T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
12. C.Y. Wang, H.Y.M. Liao, Y.H. Wu, P.Y. Chen, J.W. Hsieh, I.H. Yeh, in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 390–391 (2020)
13. X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun, in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13733–13742 (2021)
14. M. Contributors, YOLOv6 by MMYOLO. https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov6. Accessed 13 May 2023
15. M. Contributors, YOLOv7 by MMYOLO. https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov7. Accessed 13 May 2023
16. M. Contributors, YOLOv8 by MMYOLO. https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov8. Accessed 13 May 2023

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

… a Research Professor at the Research Institute of Engineering and Technology, Korea University. He is currently an Assistant Professor at the Department of Mechanical System Engineering (Department of Aeronautics, Mechanical and Electronic Convergence Engineering), Kumoh National Institute of Technology. His research interests include GNSS receivers, digital signal processing, nonlinear filtering, and deep learning.

Sun Young Kim received a B.S. degree in Electronic Engineering from Kookmin University in 1999 and M.S. and Ph.D. degrees in Mechanical and Aerospace Engineering from Seoul National University in 2015 and 2019, respectively. From 2019 to 2020, she was a Postdoctoral Researcher at the School of Intelligent Mechatronics Engineering, Sejong University, Seoul, Republic of Korea. From 2020 to 2023, she was an Assistant Professor at the School of Mechanical Engineering, Kunsan National University, Jeollabuk-do, Republic of Korea. Since 2023, she has been an Associate Professor at the School of Mechanical Engineering, Kunsan National University, Jeollabuk-do, Republic of Korea. Her research interests include navigation systems, filtering, localization, GNSS interference detection and mitigation, multi-sensor fusion, multi-target tracking, target detection and classification, SLAM, autonomous vehicles, and deep learning.