
You Only Look Once - Object Detection Models: A Review

Aabidah Nazir
Dept. of Computer Sciences, University of Kashmir, Srinagar, India
[email protected]

Mohd. Arif Wani
Dept. of Computer Sciences, University of Kashmir, Srinagar, India
[email protected]

Abstract— Object detection is the task of detecting instances of particular classes in an image. The You Only Look Once (YOLO) object detection algorithms have become popular in recent years due to their high accuracy and fast inference speed. In this review, an overview of the YOLO variants, including YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6 and YOLOv7, is presented, and the variants are compared on the basis of various evaluation metrics. We begin by discussing the basic principles and architecture of YOLO, which involves a single network that predicts bounding boxes and class probabilities directly from full images. In addition, the changes made in each version of YOLO, such as incorporating skip connections, feature pyramid networks, and anchor boxes to improve accuracy and speed, are discussed. A critical comparative analysis of the YOLO variants is performed, highlighting the trade-offs between accuracy and speed. Finally, we highlight some future research directions for YOLO variants, such as improving their robustness to different environmental conditions like motion blur and lighting, and integrating them with other computer vision tasks like image segmentation, image classification and object tracking. This work will help a researcher to select the version that is best suited for a given application.

Keywords— Object Detection, YOLO, DarkNet, E-ELAN, You Only Look Once

I. INTRODUCTION

A human eye's ability to differentiate, recognise, and classify the objects it sees is trivial. However, as real-world objects are extremely versatile and can take on a great diversity of shapes, sizes, textures, and colours, it is challenging for machines to understand them. Recent improvements in computer vision have nevertheless changed the process of object detection. Object recognition and tracking technologies are prevalent in autonomous vehicles, medical diagnosis, tracking of sports balls, and video surveillance systems, among other applications [1]. In a digital picture or video frame, object detection algorithms locate objects and draw a bounding box around them, with a tag indicating the class to which each detected object belongs. However, some items might not be picked up by the sensing pipeline, which can be critical for autonomous vehicles as they need to operate with complete precision. For instance, a death involving a self-driving car has been reported: unable to sense its surroundings, an Uber self-driving car struck a pedestrian. As a result, perception deserves much more attention, because it can make or break lives. Modern state-of-the-art models such as Region-based Convolutional Neural Networks (R-CNN), YOLO, Single-Shot Multi-box Detection (SSD), and other cutting-edge models were discovered as a result of advances in neural networks and deep learning [2]. These detectors' primary responsibilities include producing bounding boxes, calculating class probabilities and, based on the class probabilities, calculating a confidence score. This document provides an outline of the operation of, and the differences between, the various object detection algorithms of YOLO and its variants. The mean Average Precision (mAP), test duration, and memory requirements are used to assess these models, so that the model which most closely matches the requirements of a given application can be chosen and used accordingly. The YOLO algorithm and its modifications are briefly evaluated in this study; through the evaluation, the similarities and dissimilarities between the YOLO versions and other convolutional neural network (CNN) detectors become clear. The development of YOLO is still in progress with new variants, which makes this review pertinent.

YOLO is a novel approach to object detection. Earlier works repurposed classifiers for object detection; YOLO instead frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. By analysing the entire image, a single neural network simultaneously predicts the classes and the bounding boxes, and being a single network allows the complete detection pipeline to be optimised end to end. This unified architecture operates very quickly: an image is processed by the base YOLO model in real time at a rate of 45 frames per second [3]. Fast YOLO, a scaled-down version that operates at an astounding 155 frames per second, achieves twice the mAP of other contemporary real-time detectors. YOLO produces more localization errors than contemporary detection algorithms, but it is less likely to predict false positives on the background. YOLO also generalises from natural images to further domains such as artwork and has become a widely preferred representation for objects. It outperforms competing detection approaches like the Deformable Parts Model (DPM) and R-CNN [4].

The purpose of the research work on YOLO and its variants is to improve the performance and efficiency of object detection algorithms. YOLO is a popular real-time object detection algorithm that uses a single neural network to predict the bounding boxes and class probabilities of objects in an image. YOLO can process images extremely quickly and is able to detect multiple objects in a single image. Over the years, researchers have proposed various improvements and modifications to the original YOLO algorithm, such as YOLOv2, YOLOv3, and YOLOv4.
These variants aim to address some of the limitations of the original YOLO algorithm, such as improving accuracy, reducing computational complexity, and increasing the detection speed. Overall, the research work on YOLO and its variants aims to improve the state-of-the-art performance of object detection algorithms, making them more efficient, accurate, and scalable for real-world applications.

The rest of this review paper is organised as follows: in Section II, the object detection model YOLO and its variants are explained in detail with their architectural modifications; in Section III, a comparative analysis of YOLO and its versions is presented with their evaluation metrics; in Section IV, the review is concluded with an assessment of the performance of YOLO and its variants.

II. YOLO AND ITS VARIANTS

YOLO is a family of object detection models developed by Joseph Redmon and his team at the University of Washington; the first model, YOLOv1, was implemented in Redmon's Darknet framework.

YOLOv1 was released in 2016 and was the first real-time object detection model to achieve high accuracy. It achieved this by dividing the input image into a grid of cells and predicting the object class and bounding box for each cell.

YOLOv2 was released in 2017 and made several improvements over YOLOv1. It introduced a new backbone called Darknet-19, which had fewer layers than the 24-convolutional-layer network used in YOLOv1, making it faster and more efficient. YOLOv2 also used anchor boxes to improve the accuracy of bounding box predictions and introduced batch normalization to improve training.

YOLOv3, released in 2018, was a significant improvement over YOLOv2. It introduced a new detection architecture built on Darknet-53, which was deeper and more powerful than the Darknet-19 architecture used in YOLOv2. YOLOv3 also used feature pyramid networks (FPN) to detect objects at different scales, improving accuracy further.

YOLOv4, released in 2020, was another significant improvement over YOLOv3. It introduced several new techniques, including the CSP (cross-stage partial) architecture to improve efficiency and accuracy, SPP (spatial pyramid pooling) to improve detection of objects at different scales, and the Mish activation function to improve training stability.

YOLOv5, also released in 2020, was developed by Ultralytics, an AI software company, and was not an official release by the original YOLO creators. It was designed to be smaller, faster, and more accurate than YOLOv4, and it introduced automatic learning of anchor boxes from the training data, which improved accuracy.

YOLOv6 and YOLOv7 are likewise not official releases by the original YOLO creators. YOLOv6 was developed by Meituan [12] and introduced several new features, including a self-distillation strategy to improve detection accuracy; YOLOv7 was developed by the authors of YOLOv4 [15] and builds on the E-ELAN computational block. Across all versions, YOLO makes localization errors but anticipates fewer false positives in the background [5].

A. YOLO (V1)

This object identification network was created by Redmon et al. [6] to avoid the enormous run-time complexity of R-CNN and its suggested modifications. Unlike R-CNN [7] and its modified versions, YOLO divides a complete image into S×S grid cells and predicts a fixed number 'm' of bounding boxes inside each cell, allowing it to localise and classify objects without the need for region proposals. Each bounding box predicts offset values and class probabilities, and bounding boxes that predict class probabilities below a specific threshold are suppressed. Fig. 1 provides a visual breakdown of the procedure involved in object detection using YOLO version 1, and a decoding sketch follows the limitations below.

The major limitations of this variant are:
• A YOLO detector can only detect one object per grid cell, hence the greatest number of objects it can detect always depends on the grid's dimensions: if the grid size is S×S, then at most S×S objects can be found.
• Because each cell can detect at most one object, YOLO makes incorrect detections when more than one object falls within a single grid cell.

Fig. 1. Architecture of YOLO v1
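To make the grid mechanism concrete, the following is a minimal decoding sketch, not the paper's code: the grid size, box count, confidence threshold, and random stand-in output are all assumptions.

```python
# Minimal sketch of decoding a YOLOv1-style output grid: S x S cells, B boxes
# per cell sharing C class probabilities, and suppression of low-confidence
# boxes. All shapes and the threshold are illustrative assumptions.
import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes
pred = np.random.rand(S, S, B * 5 + C)    # stand-in for the network output

detections = []
for row in range(S):
    for col in range(S):
        cell = pred[row, col]
        class_probs = cell[B * 5:]                      # shared by the cell's boxes
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:(b + 1) * 5]  # box relative to the cell
            scores = conf * class_probs                 # class-specific confidence
            cls = int(np.argmax(scores))
            if scores[cls] >= 0.25:                     # assumed threshold
                detections.append((row, col, x, y, w, h, cls, float(scores[cls])))

print(f"kept {len(detections)} of {S * S * B} candidate boxes")
```

Because the class probabilities are shared by all boxes in a cell, this decoding also exposes the one-object-per-cell limitation listed above.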
B. YOLO (V2)

An enhanced variant of YOLO, also known as YOLO9000, was proposed by Redmon et al. [8]. It not only outperforms modern models like Fast R-CNN and Faster R-CNN in terms of performance and efficiency but also completes detection in an acceptable length of time. To address the drawbacks of YOLO version 1, the creators of this version of the detector made a number of adjustments to the architecture.

Some remarkable architectural modifications relative to YOLO version 1 are:
• Batch normalisation is added after every convolutional layer, which improves the detector and removes the possibility of overfitting, all without the need for dropout layers.
• As opposed to YOLO version 1, this enhanced model eliminates the fully connected layers from the architecture: instead of a fully connected layer on top of the convolutional layers predicting the offset values, anchor boxes are used to predict the objectness score. Even though YOLO version 2's mAP is slightly lower than YOLO version 1's, the inclusion of anchor boxes raises the Recall value.
• Direct Location Prediction: This attribute mostly relates to the method's stability, since the addition of anchor boxes to some extent increases the model's instability. The authors used a logistic activation to confine the bounding box coordinates inside [0, 1] in order to boost the model's stability.
• Fine-grained features: A pass-through layer is added to the network; instead of running separate networks on features of varying resolutions, the low- and high-resolution features are stacked along the channel dimension and concatenated.

Fig. 2. Architecture of YOLO v2

• The original YOLO detector employs images with a dimension of 224×224 for training and 448×448 for testing, and its performance is reduced by this abrupt increase in image resolution during the testing phase. In YOLO v2, the network is fine-tuned on images of size 448×448 for approximately 10 epochs to overcome this issue. As a result, the fall in mean Average Precision (mAP) caused by the rapid rise in image dimension in YOLO is resolved.
• YOLO version 1 trains the network using manually chosen prior bounding boxes, but the authors of [8] produce the priors using the k-means technique in conjunction with their suggested distance metric, to make learning more straightforward:

Distance(box, centroid) = 1 - IOU(box, centroid)

The standard value of k in [8] is 5, as this attains a reasonable trade-off between the complexity and the performance of the network. A sketch of this clustering step is given below.
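The following is a minimal sketch of the clustering step on assumed (synthetic) box dimensions, not the authors' code; its only YOLO-specific ingredient is the IOU-based distance above in place of Euclidean distance.

```python
# Minimal sketch of k-means anchor selection with
# Distance(box, centroid) = 1 - IOU(box, centroid). Data is synthetic.
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, treating boxes as if they share one corner."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # minimising 1 - IOU is the same as maximising IOU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# hypothetical ground-truth (width, height) pairs, normalised to [0, 1]
gt = np.random.default_rng(1).uniform(0.05, 0.95, size=(500, 2))
print(kmeans_anchors(gt, k=5))  # five anchor shapes, the value of k used in [8]
```

Using IOU as the similarity measure keeps large boxes from dominating the clustering, which is exactly the problem Euclidean distance has on box dimensions.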
C. YOLO (V3)

An updated version of the YOLO detector created by Redmon et al. [9] is called YOLO version 3. Because a softmax classifier permits the prediction of only one class per item and cannot effectively handle multiclass prediction, YOLO version 3 does not employ a softmax to predict the classes of observed objects. To get around this problem, YOLO version 3 uses distinct logistic classifiers for each class, and it uses a hybrid feature extraction strategy that combines DarkNet-19 with a residual network, in contrast to YOLO v2, which utilizes Darknet-19 alone as its feature extractor. The proposed YOLO v3 design features a number of shortcut connections, which improves its performance when identifying small items but diminishes it when detecting large and medium objects. Although YOLOv3 was speedier than YOLOv2, it didn't offer any revolutionary improvements over its predecessor.

There are, however, notable gaps between YOLOv3 and the other variants with reference to precision/accuracy, speed, and class specificity:

• Precision/Accuracy for tiny/small objects: The AP for tiny or small objects increased by 13.3 in YOLOv3, a significant improvement over YOLOv2. Even so, RetinaNet still outperforms it across all object sizes (small, medium, and large) in terms of average precision (AP).
• Speed: YOLOv3 uses Darknet-53 as its backbone feature extractor. Darknet-53 is more formidable than Darknet-19 and also more efficacious than backbones such as ResNet-101 or ResNet-152, as it utilizes 53 convolutional layers instead of the precursory 19. With respect to the values of IOU and mAP, YOLOv3 is expeditious and specific.

Fig. 3. Architecture of YOLO v3
• Specificity of Classes: Binary cross-entropy and independent logistic classifiers are utilized by YOLOv3 to predict classes during training. These changes enable the use of complicated datasets for YOLOv3 model training, including Microsoft's Open Images Dataset (OID), whose photos carry a large number of overlapping labels, such as "man" and "person".

YOLO v3 thus enables classes to be more detailed using a multi-label technique, allowing several class labels for each bounding box, whereas when a softmax is utilised, each bounding box can only belong to one class. A sketch of this multi-label formulation is given below.
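A minimal sketch of the contrast, using assumed random scores rather than a trained model:

```python
# Minimal sketch contrasting softmax single-label prediction with
# YOLOv3-style independent logistic classifiers, which allow one box
# to carry several labels (e.g. "man" and "person"). Not the paper's code.
import torch

num_classes = 80
logits = torch.randn(num_classes)  # raw class scores for one predicted box

# Softmax: probabilities are coupled and sum to 1, so only one class can "win".
softmax_probs = torch.softmax(logits, dim=0)

# Independent logistic (sigmoid) classifiers: each class is scored on its own,
# so several classes can exceed the threshold simultaneously.
sigmoid_probs = torch.sigmoid(logits)
predicted_labels = (sigmoid_probs > 0.5).nonzero().flatten()

# During training, each class is a separate binary problem, trained with
# binary cross-entropy instead of categorical cross-entropy.
targets = torch.zeros(num_classes)
targets[[0, 1]] = 1.0  # hypothetical overlapping labels for the same box
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
print(predicted_labels, float(loss))
```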
D. YOLO (V4)

The design of YOLO version 4 [10] was inspired by a number of Bag-of-Specials and Bag-of-Freebies object detection techniques. Its accuracy is increased, at the cost of higher training time, by the Bag-of-Freebies methods, and, at the cost of a slightly higher inference time, by the Bag-of-Specials methods.

There were also improvements to the YOLO version 4 model in the form of genetic algorithms for selecting the best hyper-parameter values, SAT (Self-Adversarial Training) and Mosaic techniques for data augmentation, as well as adjustments to existing methods such as Cross mini-Batch Normalization and the Spatial Attention Module.

The CSPResNeXt50 and CSPDarknet53 backbones are both built on top of DenseNet. A dense network consists of convolutional blocks connected together, solving the vanishing gradient problem (back-propagating loss signals through a deep network is difficult), improving feature propagation, promoting reuse of features, and reducing the number of network parameters. CSPResNeXt50 and CSPDarknet53 modify DenseNet so that the feature map of the base layer is split into two parts, of which one copy is sent through the dense block while the other passes directly to the next step; a sketch of this split follows Fig. 4 below.

EfficientNet performs better in terms of image classification than other networks. However, the authors of YOLO v4 postulated that other backbone networks might show better performance in object detection models and chose to analyse them all. Based on a large body of experimental results, the YOLOv4 network uses CSPDarknet53 as its backbone.

Fig. 4. Architecture of YOLO v4
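The following is a minimal sketch, not the actual CSPDarknet53 code, of the defining cross-stage-partial idea: split the incoming feature map in two, route one half through the block stack, and concatenate the untouched half at the end.

```python
# Minimal sketch of a cross-stage-partial (CSP) block. The inner block stack
# here is a toy stand-in for the dense/residual stages of the full network.
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.blocks = nn.Sequential(            # stand-in for the dense block
            nn.Conv2d(half, half, kernel_size=3, padding=1),
            nn.BatchNorm2d(half),
            nn.LeakyReLU(0.1),
        )
        self.transition = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        part1, part2 = torch.chunk(x, 2, dim=1)  # split the base feature map
        part2 = self.blocks(part2)               # one copy goes through the blocks
        out = torch.cat([part1, part2], dim=1)   # the other skips straight ahead
        return self.transition(out)              # merge in a transition layer

x = torch.rand(1, 64, 32, 32)
print(CSPBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```

The skipped half gives gradients a short path to the base layer while halving the computation inside the block stack, which is where the reported efficiency gain comes from.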
E. YOLO (v5)

This is the first version of YOLO [11] to be created in the PyTorch framework, as opposed to earlier versions, which were developed using the Darknet research framework. Due to PyTorch's ease of configuration compared with Darknet, version 5 of YOLO is significantly more production-ready than earlier iterations.

The run-time of this version of YOLO is another noteworthy advancement: in comparison to the earlier versions, YOLO version 5 is considerably faster. When implemented in the same PyTorch library as YOLO version 5, YOLO v4's inference speed is 50 frames per second, while YOLO v5's is 140 frames per second. YOLO v5 is also smaller in size than YOLO v4 and is thus faster.

Like other single-stage object detectors, YOLO v5 contains three influential components, Backbone, Neck, and Head, composed as in the sketch at the end of this subsection. The intent of the Backbone is to extract salient information from the supplied input image; for this purpose, Cross Stage Partial networks (CSP) are employed as the YOLO v5 backbone. CSPNet has exhibited a considerable reduction in processing time.

The key purpose of the Neck is to fabricate feature pyramids, which allow models to handle objects across scales: recognizing the same object at different scales and sizes is useful, and feature pyramid models perform well on unobserved data. In the YOLO v5 model, PANet is used as the neck to acquire the feature pyramid.

The Head performs the final detection step. Using anchor boxes on the features, it produces the final output vectors, which comprise bounding boxes, objectness scores, and class probabilities. The heads of the YOLO V3 and V4 versions are interchangeable with the head of YOLO v5.
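A minimal sketch of this Backbone-Neck-Head composition, with toy stand-in layers rather than the real YOLOv5 modules:

```python
# Minimal sketch of the single-stage detector layout described above:
# Backbone extracts features, Neck refines the pyramid, Head emits the
# prediction vectors. All layers are toy stand-ins, not YOLOv5's.
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    def __init__(self, num_classes=80, num_anchors=3):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for the CSP backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.neck = nn.Sequential(                # stand-in for the PANet neck
            nn.Conv2d(64, 64, 1), nn.SiLU(),
        )
        # per anchor: 4 box coordinates + 1 objectness score + class scores
        self.head = nn.Conv2d(64, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

out = TinyDetector()(torch.rand(1, 3, 640, 640))
print(out.shape)  # (1, anchors * (5 + classes), H/4, W/4)
```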
F. YOLO (v6)

YOLOv6 [12] accomplishes the best trade-off in terms of metrics like speed and accuracy. Modern quantization techniques, such as QAT (quantization-aware training) and PTQ (post-training quantization), are investigated and integrated into YOLOv6 to increase inference speed without significantly degrading performance. The main modifications of YOLOv6 are:

• Networks of different sizes are redesigned for industrial applications in a variety of contexts. To achieve the best speed-accuracy trade-off, the designs at different sizes differ, with small models having a simple single-path backbone and larger models being constructed from efficient multi-branch blocks.
• A self-distillation approach is a feature of YOLO v6, used for both the classification and the regression tasks. To help the student learn more effectively across all training phases, the weighting between the teacher's knowledge and the labels is dynamically adjusted.
• The latest object detection methods for loss functions, label assignment, and data augmentation are validated before deciding which ones to use to improve performance.
• RepOptimizer [13] and channel-wise distillation [14] are used to reform the quantization strategy for detection, resulting in an even faster and more precise object detector with 43.3% MS COCO AP and a throughput of 869 FPS at a batch size of 32. A basic quantization sketch follows Fig. 5 below.

Fig. 5. Architecture of YOLO v6
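As a loose illustration of PTQ, and not the YOLOv6 pipeline (which relies on RepOptimizer and QAT), PyTorch's built-in dynamic quantization can shrink a toy module's weights to int8:

```python
# Minimal sketch of post-training quantization on a toy module using
# PyTorch's built-in dynamic quantization. Only shows the basic int8 idea.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 85))

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # store Linear weights as int8
)

x = torch.rand(1, 256)
print(quantized(x).shape)  # same interface as the float model, smaller weights
```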
G. YOLO (v7)

The computer vision and machine learning communities are buzzing about the YOLO v7 [15] model. This most recent and fast YOLO algorithm outperforms earlier object detection algorithms and YOLO iterations in terms of speed and precision. It can be trained on small datasets without any pre-learned weights, and it requires hardware that is several times less expensive. YOLOv7 is therefore anticipated to overtake YOLO v4 as the state-of-the-art for real-time demands and to become the standard for object detection in industry in the future. The authors of YOLOv7 build on prior research on the subject, taking into account the memory requirements for keeping layers in memory and the distance over which a gradient can propagate back through the layers: the shorter the gradient path, the more effectively the network can learn. They settle on E-ELAN, an extended variant of the ELAN computational block, as their final layer aggregation. Object detection models frequently take into account the depth, width, and resolution of the network being trained; in YOLOv7, the authors concatenate layers while scaling the network's depth and width simultaneously, and ablation studies demonstrate that this method maintains the ideal model design while scaling to various sizes.

Re-parameterization techniques build a model that is more resilient to the broad patterns it is trying to capture by averaging a group of model weights. Module-level re-parameterization, in which individual network components have their own re-parameterization strategies, has been the recent focus of research; a minimal example of the idea is sketched after Fig. 6. The authors of YOLOv7 also experiment with various levels of supervision for an auxiliary head, before deciding on a coarse-to-fine definition in which supervision is handed back from the lead head at varying granularities.

Fig. 6. Architecture of YOLO v7
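A minimal sketch of module-level re-parameterization in the spirit of the techniques above: folding a BatchNorm into its preceding convolution so that inference runs a single fused operator. This illustrates the general idea only, not YOLOv7's exact scheme.

```python
# Minimal sketch: fuse Conv2d + BatchNorm2d into one equivalent Conv2d.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    # per-output-channel scale applied by the BatchNorm at inference time
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias
    return fused

conv, bn = nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32)
bn.eval()  # use running statistics, as at inference
x = torch.rand(1, 16, 8, 8)
fused = fuse_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))  # True: same function, one layer
```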
The main YOLO network improvements from V1 to V7 can be summarised as follows. In YOLO, the division into grids is the key element of object detection and of the confidence loss. In YOLO V2, fully convolutional networks, two-stage training, and anchors obtained using k-means are employed. Multi-scale detection with FPN is employed in YOLO V3. SPP, the Mish activation function, and data augmentation improvements are utilised in YOLO V4. The YOLO V5 model, with its variable model size management, utilizes the Hardswish activation function and data augmentation, as well as Mosaic/Mixup and the GIOU (Generalized Intersection over Union) loss function, sketched after Table I. YOLO v6 is quantized with QAT and PTQ for better speed and accuracy. The latest version, YOLO v7, is settled on E-ELAN.

The comparison of the architectures of YOLO and its variants is given in Table I.

TABLE I. COMPARISON OF YOLO ARCHITECTURES

Object Detector | Architecture | Limitations
YOLO v1 | Inspired by GoogleNet; uses the DarkNet framework | Localization error
YOLO v2 | Inspired by VGG; uses the DarkNet-19 framework | Small-sized objects can't be identified
YOLO v3 | Inspired by Feature Pyramid Network; uses the DarkNet-53 framework | Lacks accuracy with medium and large sized objects
YOLO v4 | Inspired by Path Aggregation Network; uses CSPDarknet-53 | Difficult to deploy on embedded devices because of large size
YOLO v5 | Uses Focus structure with CSPDarkNet-53 | Accuracy of detection and inference speed are not optimal
YOLO v6 | EfficientRep backbone and Rep-PAN neck | Speed and accuracy are lower
YOLO v7 | Extended Efficient Layer Aggregation Network (E-ELAN) and model scaling for concatenation-based models | -
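For reference, a minimal sketch of the GIOU loss mentioned above, for a single pair of axis-aligned boxes in (x1, y1, x2, y2) form:

```python
# Minimal sketch of the Generalized IoU loss: IoU minus a penalty based on
# the smallest enclosing box, so even non-overlapping boxes get a gradient.
import torch

def giou_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    inter_w = (torch.min(a[2], b[2]) - torch.max(a[0], b[0])).clamp(min=0)
    inter_h = (torch.min(a[3], b[3]) - torch.max(a[1], b[1])).clamp(min=0)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box, which penalises distant non-overlapping pairs
    enc = (torch.max(a[2], b[2]) - torch.min(a[0], b[0])) * \
          (torch.max(a[3], b[3]) - torch.min(a[1], b[1]))
    giou = iou - (enc - union) / enc
    return 1 - giou

a = torch.tensor([0.0, 0.0, 2.0, 2.0])
b = torch.tensor([1.0, 1.0, 3.0, 3.0])
print(giou_loss(a, b))  # ~1.0794 for these partially overlapping boxes
```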

III. RESULTS AND DISCUSSION


YOLO is a major development in the field of object identification: it is the first object detector model that identifies objects in a single stage and also treats the detection of objects as a regression problem. The object detection model's architecture [16] needs to take only a single look at the image to identify the objects' locations and class labels. When benchmarked on a Titan X GPU, the basic YOLO model predicts images at a rate of 45 FPS (frames per second), as it is designed to train end to end in a way that is similar to image classification. The fact that YOLO achieved a mAP (mean average precision) of 63.4, more than twice as much as the other real-time detectors, makes it even more exceptional. A sketch of how such FPS figures can be measured is given below.
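A minimal sketch of such an FPS measurement; the model choice (a YOLOv5 checkpoint loaded via torch.hub) and the 640×640 dummy input are assumptions, and the number obtained depends entirely on the hardware:

```python
# Minimal sketch of a throughput (FPS) measurement for a detector.
import time
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # assumed model choice
model.eval()
img = torch.rand(1, 3, 640, 640)  # dummy pre-processed batch

with torch.no_grad():
    for _ in range(10):                    # warm-up runs, excluded from timing
        model(img)
    n, start = 100, time.perf_counter()
    for _ in range(n):
        model(img)
    elapsed = time.perf_counter() - start

print(f"{n / elapsed:.1f} FPS")            # frames per second on this machine
```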

YOLOv2 was trained on detection datasets like MS COCO; by simultaneously training it on the ImageNet and MS COCO datasets, YOLO9000 was created to predict over 9000 different object categories. The enhanced YOLOv2 model outperformed cutting-edge approaches in both accuracy and speed by utilising a variety of unique techniques. The multi-scale training approach [17] lets the network predict at several input scales, allowing time and accuracy to be traded off. On the VOC 2007 dataset [18], YOLOv2 achieved 76.8 mAP at 416×416 input resolution and 67 frames per second; on the same dataset with 544×544 inputs, YOLOv2 obtained 78.6 mAP at 40 FPS.

The developers of YOLOv3, presented as an incremental improvement, adopted many strategies from YOLOv1 and YOLOv2 while significantly modifying the network architecture. Darknet-53 [19], a new network architecture, was unveiled; it is a much larger, more precise, and swifter network than the one it replaced, and it has been trained at different image resolutions, including 320×320 and 416×416. On the Titan X GPU, YOLOv3 runs at 45 FPS and reaches 28.2 mAP at 320×320 resolution, making it both precise and rapid.

YOLOv4 is the result of numerous studies and experiments that combine a variety of small, one-of-a-kind methods to increase the convolutional neural network's accuracy and speed. In addition to CSP (Cross-Stage Partial connections), WRC (Weighted Residual Connections), SAT (Self-Adversarial Training), CmBN (Cross mini-Batch Normalization), DropBlock regularization, and Mosaic data augmentation, other methods can also be applied. It was found that YOLOv4 achieves optimal accuracy and speed of object detection, running twice as rapidly as EfficientDet with comparable performance [20]. Version 4 of the YOLO model [21] integrates features that improve model training as well as object detection accuracy.

The experimental findings of YOLO v5 are reported on datasets like PASCAL VOC and MS COCO [22]. The approach has higher large-object recognition accuracy compared to the four earlier YOLO models; its reported mAP [0.5:0.95] on the MS COCO dataset is 5.4% better than that of YOLOv5's predecessors, which deserves greater attention [23].

On the COCO dataset, YOLOv6-N [24] achieves 35.9% AP at a throughput of 1234 frames per second on an NVIDIA Tesla T4 GPU. At 495 FPS and 43.5% AP, YOLOv6-S beats other significant detectors at that scale, and the modified version [25] of YOLOv6-S does even better on throughput, at 43.3% AP and 869 FPS. Additionally, YOLOv6-M/L perform better in terms of accuracy (49.5% and 52.3%, respectively) than other modern object detectors with equivalent inference speeds.

With performance spanning from 5 FPS to 160 FPS and the greatest accuracy (56.8% AP) of all modern real-time object detectors above 30 FPS on a V100 GPU, YOLOv7 exceeds all known object detectors [26]. The object detector YOLOv7-E6, at 56 FPS on a V100 and 55.9% AP, is 509% faster and 2 percentage points more accurate than the transformer-based detector SWIN-L Cascade-Mask R-CNN, which obtains 53.9% AP at 9.2 FPS on an A100. YOLOv7 thus surpasses the object detectors trained on the MS COCO dataset with reference to accuracy and speed.

The comparative analysis of YOLO models on the PASCAL VOC dataset is given in Table II.

TABLE II. COMPARISON OF YOLO MODELS USING PASCAL VOC DATASET

Object Detector Model | Mean Average Precision | No. of Frames Per Second
YOLO version 1 | 63.4 | 45
YOLO version 2 with input image 288×288 | 69 | 91
YOLO version 2 with input image 352×352 | 73.7 | 81
YOLO version 2 with input image 416×416 | 76.8 | 67
YOLO version 2 with input image 480×480 | 77.8 | 59
YOLO version 2 with input image 544×544 | 78.6 | 40

The comparative analysis of YOLO models on the MS COCO dataset is given in Table III.

TABLE III. COMPARISON OF YOLO MODELS USING MS COCO DATASET

Object Detector Model | Mean Average Precision | No. of Frames Per Second
YOLO v2 with input image 608×608 | 48.1 | 40
YOLO v3 | 57.9 | 20
YOLO v4 | 65.7 | 33
YOLO v5 | 50.7 | 48
YOLO v6 | 43.1 | 520
YOLO v7 | 56.8 | 160

There are several environmental conditions that affect the robustness of the YOLO (You Only Look Once) object detection algorithm (Table IV).

TABLE IV. COMPARISON OF MODELS WITH DIFFERENT ENVIRONMENTAL CONDITIONS

Accuracy (%) under motion blur and under other environmental conditions (occlusion, truncation), per dataset:

Model | MS COCO: Motion Blur | MS COCO: Other | KITTI: Motion Blur | KITTI: Other | SUN-RGBD: Motion Blur | SUN-RGBD: Other
YOLO v1 | 13.7 | 33.1 | 20.3 | 63.4 | 19.8 | 38.8
YOLO v2 | 20.1 | 44 | 25.2 | 70.9 | 24.7 | 49.5
YOLO v3 | 21.8 | 45.5 | 25.3 | 71.4 | 26.4 | 51.9
YOLO v4 | 33.8 | 55.3 | 29.8 | 73.2 | - | -
YOLO v5 | 31.3 | 53.4 | 29.8 | 73 | - | -

Here are some of them:

• Good lighting conditions: YOLO performs best when there is ample lighting and the objects are clearly visible; poor lighting conditions can cause the algorithm to miss objects or detect false positives.
• Minimal occlusion: The algorithm performs better when objects are not heavily occluded or partially obstructed by other objects. Heavy occlusion can make it difficult for YOLO to detect objects accurately.
• Clear and uncluttered background: Objects are easier to detect when they are set against a clear and uncluttered background. A busy or cluttered background can make it difficult for YOLO to accurately identify objects.
• Adequate training data: Adequate training data is crucial for YOLO to accurately detect objects. The more varied the training data, the better the algorithm can perform.
• No motion blur: Motion blur can make it difficult for YOLO to accurately detect objects. A still image or a video with minimal motion blur is ideal for the algorithm.
• Limited camera distortion: YOLO works best when there is minimal camera distortion. Camera distortion can cause objects to appear distorted, making it difficult for the algorithm to accurately detect and classify them.

Overall, YOLO performs best when the environmental conditions are optimized for object detection.

IV. CONCLUSION AND FUTURE SCOPE

In this paper, we have performed a critical analysis of object detection using various YOLO variants. We reviewed each variant from YOLO V1 to YOLO V7, examining in detail how they perform single-stage object detection. We found that YOLO models achieve high accuracy and fast inference speed, making them suitable for various real-world applications. We also found that YOLO V1 and V2 are able to accurately detect large objects, but their accuracy reduces for small objects. The later variants, YOLO V3-V7, actively focused on this limitation and achieve good accuracy on small-object detection, but identifying multiple small objects in a group is still a challenge. The current state-of-the-art variants of YOLO, including YOLOv4, YOLOv5, and YOLOv6, have introduced several improvements in terms of accuracy, speed, and efficiency. The future scope of YOLO variants is vast, with several research directions being explored to enhance their capabilities. One such direction is to integrate YOLO with other computer vision tasks, such as semantic segmentation and instance segmentation, to build more comprehensive models. Additionally, researchers are also working on improving the robustness of these models to different environmental conditions and occlusion. Overall, YOLO and its variants are powerful tools for real-time object detection, and their future looks promising with ongoing research and advancements.

ACKNOWLEDGEMENT

The authors are thankful to the Artificial Intelligence Research Center at the Department of Computer Science, University of Kashmir, for acquiring a high-performance NVIDIA server (DGX A100) under the RUSA 2.0 grant and providing access to it for the smooth conduct of this research.

REFERENCES

[1] J. V. Raju, P. Rakesh, and N. Neelima, "Driver drowsiness monitoring system," Intelligent Manufacturing and Energy Sustainability, pp. 675-683, 2020.
[2] T. Iqball and M. A. Wani, "Weighted ensemble model for image classification," International Journal of Information Technology, vol. 15, pp. 557-564, 2023.
[3] M. A. Sofi and M. A. Wani, "Protein secondary structure prediction using data-partitioning combined with stacked convolutional neural networks and bidirectional gated recurrent units," International Journal of Information Technology, vol. 14, no. 5, pp. 2285-2295, 2022.
[4] M. Maity, S. Banerjee, and S. Sinha Chaudhuri, "Faster R-CNN and YOLO based vehicle detection: A survey," in Proc. of the 5th International Conference on Computing Methodologies and Communication (ICCMC 2021), Apr. 2021, pp. 1442-1447.
[5] S. Geethapriya, N. Duraimurugan, and S. P. Chokkalingam, "Real time object detection with YOLO," International Journal of Engineering and Advanced Technology, vol. 8, pp. 578-581, 2019.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904-1916, Sep. 2015.
[8] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6517-6525.
[9] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," unpublished, Apr. 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[10] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," Apr. 2020. [Online]. Available: http://arxiv.org/abs/2004.10934
[11] G. Jocher, "ultralytics/yolov5," GitHub, Aug. 21, 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[12] C. Li et al., "YOLOv6: A single-stage object detection framework for industrial applications," Sep. 2022. [Online]. Available: http://arxiv.org/abs/2209.02976
[13] X. Ding, H. Chen, X. Zhang, K. Huang, J. Han, and G. Ding, "Re-parameterizing your optimizers rather than architectures," May 2022. [Online]. Available: http://arxiv.org/abs/2205.15242
[14] C. Shu, Y. Liu, J. Gao, Z. Yan, and C. Shen, "Channel-wise knowledge distillation for dense prediction," in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[15] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," Jul. 2022. [Online]. Available: http://arxiv.org/abs/2207.02696
[16] I. Bello et al., "Revisiting ResNets: Improved training and scaling strategies," 2021. [Online]. Available: https://github.com/tensorflow/tpu/tree/master/
[17] K. Chen, W. Lin, J. Li, J. See, J. Wang, and J. Zou, "AP-loss for accurate one-stage object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 11, pp. 3782-3798, Nov. 2021, doi: 10.1109/TPAMI.2020.2991457.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[19] "Darknet: Open source neural networks in C," pjreddie.com. [Online]. Available: http://pjreddie.com/darknet/

[20] J. Huang et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7310-7311.
[21] J. Guo et al., "Hit-Detector: Hierarchical trinity architecture search for object detection," in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[22] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Computer Vision - ECCV 2014, 2014.
[23] G. Jocher, "Releases · ultralytics/yolov5," GitHub, 2022. [Online]. Available: https://github.com/ultralytics/yolov5/releases
[24] D. Feng et al., "Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 3, pp. 1341-1360, Mar. 2021.
[25] M. Hu et al., "Online convolutional re-parameterization," in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 558-567.
[26] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
