This is an Open Access article, distributed under the terms of the Creative Commons
Attribution licence (http://creativecommons.org/licenses/by-nc/4.0/), which permits
unrestricted re-use, distribution, and reproduction in any medium, for non-commercial use,
provided the original work is properly cited.
Overview Paper
YOLOv1 to YOLOv10: The Fastest and
Most Accurate Real-time Object
Detection Systems
Chien-Yao Wang1,2* and Hong-Yuan Mark Liao1,2,3
1 Institute of Information Science, Academia Sinica, Taiwan
2 National Taipei University of Technology, Taiwan
3 National Chung Hsing University, Taiwan
ABSTRACT
This is a comprehensive review of the YOLO series of systems.
Different from previous literature surveys, this review article re-examines the
characteristics of the YOLO series from the latest technical point of view. At the
same time, we analyze how the YOLO series has continued to influence and promote
real-time computer vision research and how it has led to the subsequent development
of computer vision and language models. We take a closer look at how the methods
proposed by the YOLO series in the past ten years have affected the development of
subsequent technologies, and we show the applications of YOLO in various fields.
We hope this article can serve as a useful guide for subsequent real-time computer
vision development.
1 Introduction
2 YOLO series
YOLO is synonymous with the most advanced real-time object detectors of our
time. The biggest difference between YOLO and traditional object detection
systems is that it abandons the two-stage approach, which first finds the locations
in the image where objects may be present and then analyzes the content of these
locations individually. Instead, YOLO proposes a unified one-stage object detection
method that is streamlined and efficient, which has made YOLO widely used on edge
devices and in real-time applications. Next we introduce several representative YOLO
versions, as listed in Table 1. Unlike previous literature reviews, we put our emphasis
on state-of-the-art object detection methods and review the advantages and
disadvantages of these methods.
YOLOv1: Redmon et al. [82] proposed the first one-stage object detector in 2015;
the architecture of YOLOv1 is illustrated in Figure 1. As shown in the figure, an
input image first passes through a CNN for feature extraction and then through two
fully connected layers to obtain global features. The global features are then reshaped
back to the two-dimensional space for per-grid prediction. YOLOv1 has the
following important features:
One-Stage Object Detector. As shown in Figure 1, YOLOv1 directly classifies each
grid cell of the feature map and also predicts B bounding boxes. Each bounding box
predicts the object center (b_x, b_y), the object size (b_w, b_h), and an object score
(b_obj). This one-stage prediction does not rely on the selective search that must be
executed in the object proposal generation stage of two-stage detectors, which avoids
missed detections caused by insufficient hand-crafted cues. In addition, the one-stage
method avoids the large number of parameters and computations generated by the
fully connected layers of the second stage, and it avoids the irregular RoI operations
required when connecting the two stages. Therefore, YOLO's design can capture
features and make predictions more promptly and effectively. Below we take a closer
look at the most important concepts in YOLOv1, namely anchor-free bounding box
regression, IoU-aware objectness, and global context features.
Anchor-free Bounding Box Regression. YOLOv1 decodes each bounding box directly
from the responsible grid cell without relying on anchor priors:

b_x = t_x + c_x,
b_y = t_y + c_y,
b_w = t_w^2,
b_h = t_h^2,                                   (1)

where (c_x, c_y) is the coordinate of the grid cell and (t_x, t_y, t_w, t_h) are the raw
predictions.
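To make the decoding concrete, the following is a minimal PyTorch sketch of Equation (1); the grid size, tensor layout, and function name are illustrative assumptions rather than YOLOv1's actual implementation.

```python
import torch

def decode_yolov1_boxes(t, grid_size):
    """Minimal sketch of the Equation (1) decoding, assuming t holds raw
    predictions (tx, ty, tw, th) for every cell of an S x S grid.
    tx, ty are offsets inside the responsible cell; tw, th are the square
    roots of the box width and height."""
    S = grid_size
    # cell indices c_y, c_x for every grid position
    cy, cx = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    tx, ty, tw, th = t.unbind(-1)
    bx = tx + cx            # box center x, in grid-cell units
    by = ty + cy            # box center y, in grid-cell units
    bw = tw ** 2            # predicting sqrt(w) makes small-box errors count more
    bh = th ** 2
    return torch.stack([bx, by, bw, bh], dim=-1)

# usage: raw predictions for a 7x7 grid with one box per cell
boxes = decode_yolov1_boxes(torch.rand(7, 7, 4), grid_size=7)
```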
IoU-aware Objectness. In order to more accurately measure the quality of a bounding
box prediction, YOLOv1 predicts the IoU value between a bounding box and its
assigned ground-truth bounding box and uses this value as the soft label of the
objectness predicted by the IoU-aware branch. Finally, the confidence score of a
bounding box is determined by the product of the objectness score and the
classification probability.
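A minimal sketch of how such a confidence score can be computed is shown below; the tensor shapes and the function name are illustrative assumptions.

```python
import torch

def confidence_scores(objectness_logit, class_logits):
    """Hedged sketch: the IoU-aware objectness (trained against the IoU with
    the assigned ground truth as a soft label) is multiplied by the
    classification probability to obtain a per-class confidence."""
    objectness = torch.sigmoid(objectness_logit)         # predicted IoU / objectness
    class_probs = torch.softmax(class_logits, dim=-1)    # per-class probabilities
    return objectness.unsqueeze(-1) * class_probs        # confidence for every class

scores = confidence_scores(torch.rand(5), torch.rand(5, 20))   # 5 boxes, 20 classes
# during training, the objectness logit would be supervised with the IoU between
# the predicted box and its assigned ground truth as a soft target
```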
Global Context Feature. To ensure that a grid cell does not only see local features,
which would cause prediction errors, YOLOv1 uses fully connected layers to retrieve
global context features. With this design, no matter what the underlying CNN
architecture is, each grid cell can see a sufficient range of features to predict the
target object. Compared with Fast R-CNN, this design reduces background errors by
more than half.
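The following is a hypothetical YOLOv1-style head illustrating this idea: fully connected layers mix all spatial positions so every cell sees the whole image, and the output is reshaped back to an S x S grid of per-cell predictions. The channel counts and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes (illustrative)
head = nn.Sequential(
    nn.Flatten(),                       # global view: all spatial positions are mixed
    nn.Linear(1024 * S * S, 4096),
    nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (B * 5 + C)),
)

features = torch.rand(1, 1024, S, S)    # assumed backbone output
pred = head(features).view(1, S, S, B * 5 + C)   # per-grid-cell predictions
```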
High-Resolution Pre-training. Because image classification pre-training is usually
performed at a lower resolution than object detection training, the pre-trained model
has never seen the state of larger objects. YOLOv2 performs image classification
pre-training at the same image size used for detection training, so that the object
detection training process does not require additional learning of object information
at new sizes.
Joint Training with WordTree. YOLOv2 organizes the ImageNet label hierarchy into
a WordTree and trains it with a group softmax, and then integrates the categories of
COCO and ImageNet using WordTree. In the end, this technique jointly trains
ImageNet's image classification task and COCO's object detection task. Because of
the above design, YOLOv2 has the ability to detect 9000 categories of objects.
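A hedged sketch of a WordTree-style group softmax is given below: the softmax is taken only over each group of sibling categories, and the absolute probability of a node is its conditional probability multiplied along the path to the root. The `sibling_groups` and `parent_of` structures are hypothetical.

```python
import torch

def wordtree_probs(logits, sibling_groups, parent_of):
    """Hedged sketch of a WordTree-style group softmax over a small tree."""
    cond = torch.empty_like(logits)
    for group in sibling_groups:                 # softmax only within sibling groups
        cond[group] = torch.softmax(logits[group], dim=0)
    probs = cond.clone()
    for node, parent in parent_of.items():       # assumed ordered from root to leaves
        probs[node] = cond[node] * probs[parent] # multiply conditionals down the path
    return probs

logits = torch.rand(5)
groups = [[0, 1], [2, 3, 4]]                     # nodes 2-4 are children of node 0 (illustrative)
parents = {2: 0, 3: 0, 4: 0}
print(wordtree_probs(logits, groups, parents))
```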
2.3 YOLOv3
YOLOv3 [84] was proposed by Redmon and Farhadi in 2018. They integrated
the advanced technology of existing object detection and made corresponding
optimizations to one-stage object detectors. As shown in Figure 3, in terms
of architecture, YOLOv3 mainly combines FPN [61] to enable prediction of
multiple scales at the same time. It also introduces the residual network archi-
tecture and designs DarkNet53. In addition, YOLOv3 made significant changes to
label assignment. The first change is that a ground truth is assigned to only one
anchor, and the second change is switching the IoU-aware objectness from a soft label
to a hard label. To this day, YOLOv3 is still the most popular version of the YOLO
series. In what follows, we detail the special designs of YOLOv3, namely prediction
across scales, high GPU utilization, and SPP.
Figure 3: Architecture of YOLOv3, YOLOv5, and PP-YOLO. The design of YOLOv3 mainly
changes the feature extraction to use SPP (optional) and FPN to extract multi-resolution
features. The initial versions of YOLOv5 and PP-YOLO also follow this architecture.
2.4 Gaussian YOLOv3
Gaussian YOLOv3 [14] proposed an effective way to significantly reduce the false
positives of an object detection process. Gaussian YOLOv3 mainly changes the
decoding method of the prediction head: the bounding box numerical regression
problem is converted into predicting its distribution. Its architecture is shown in
Figure 4.
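The following sketch illustrates the underlying idea with a Gaussian negative log-likelihood for the box coordinates; it is a simplified stand-in, not the exact loss used in Gaussian YOLOv3.

```python
import math
import torch

def gaussian_box_nll(mu, sigma, target, eps=1e-9):
    """Simplified stand-in: each box coordinate is predicted as a mean and a
    standard deviation, and regression becomes a negative log-likelihood. The
    predicted variance then acts as a localization uncertainty that can be
    folded into the detection score to suppress false positives."""
    var = sigma ** 2 + eps
    nll = 0.5 * torch.log(2 * math.pi * var) + (target - mu) ** 2 / (2 * var)
    return nll.sum(dim=-1).mean()

# mu, sigma: (N, 4) predicted mean/std for (tx, ty, tw, th); target: (N, 4) assigned targets
loss = gaussian_box_nll(torch.rand(8, 4), torch.rand(8, 4) + 0.1, torch.rand(8, 4))
```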
2.5 YOLOv4
Since Joseph Redmon withdrew from computer vision research, subsequent versions
of YOLO have mainly been released on the open-source platform GitHub, with the
corresponding papers published after the code was open-sourced. YOLOv4 [2] was
submitted to Joseph Redmon as a draft by Alexey Bochkovskiy in early April 2020
and was officially released on April 23, 2020. YOLOv4 mainly integrates various
technologies developed in different fields of computer vision in recent years to
improve the learning of real-time object detectors. The architectural change of
YOLOv4 is to replace FPN with PAN [66] and introduce CSPNet [106] as the
backbone, as shown in Figure 5. Most subsequent YOLO-like architectures followed
this design.
Figure 5: Architecture of YOLOv4 [2], Scaled-YOLOv4 [104], YOLOv5 r1–r7 [27–36], and
PP-YOLOv2 [51]. YOLOv4 changes the feature integration architecture to Path Aggregation
Network (PAN) [66], and almost all subsequent YOLO versions adopted this design.
b_x = (1 + s_x) σ(t_x) − 0.5 s_x + c_x,
b_y = (1 + s_y) σ(t_y) − 0.5 s_y + c_y,
b_w = p_w e^{t_w},
b_h = p_h e^{t_h}                              (2)
Self-Adversarial Training. YOLOv4 also introduces self-adversarial sample
generation training to enhance the robustness of the object detection system.
Training with Memory Sharing. YOLOv4 is also designed to allow the GPU and CPU
to share memory for storing the information required for gradient updates. This
design allows the training batch size to no longer be limited by GPU memory.
2.6 Scaled-YOLOv4
In 2020, Wang et al. [104] built on the success of YOLOv4 and developed
Scaled-YOLOv4, which can be used on both edge and cloud devices. Thanks to the
active DarkNet and PyTorch YOLOv3 communities, Scaled-YOLOv4 can abandon
the ImageNet pre-training step and directly train from scratch to obtain high-quality
object detection results. In terms of architecture, Scaled-YOLOv4 introduced CSPNet
into PAN, which comprehensively improves speed, accuracy, the number of
parameters, and the number of calculations. Scaled-YOLOv4 also designs model
scaling methods for various edge devices and provides three types of models: P5, P6,
and P7. For training, Scaled-YOLOv4 uses the decoder and label assignment strategy
proposed by the initial version of YOLOv5. Because of the various improvements
mentioned above, Scaled-YOLOv4 achieved the highest accuracy and fastest inference
speed of all object detectors at the time. Below we list several unique designs of
Scaled-YOLOv4:
Compound Model Scaling. Previous model scaling methods only considered the
integer hyperparameters of a given architecture. Scaled-YOLOv4 proposed a model
scaling method that simultaneously considers matching the input image resolution
with the receptive field, and scales the number of model stages to design a more
efficient architecture that can be applied to high-resolution images.
Hardware Friendly Architecture. Taking into account ShuffleNetV2 [70] and
HarDNet's [8] analyses of hardware performance, the highly efficient CSPDark and
CSPOSA modules were designed.
Naïve Once-For-All Model. Since Scaled-YOLOv4 is trained from scratch, the
problem of inconsistent resolution between the pre-trained model and the detection
model no longer exists. However, the problem of inconsistency between user input
images and training data still exists. The model scaling method proposed in
Scaled-YOLOv4 allows users to obtain the best accuracy at the inference stage
without re-training, simply by removing the output of the corresponding stage.
2.7 YOLOv5
YOLOv5 [26] continues the design concept of PyTorch YOLOv3 and has simplified
and revised the way the overall architecture is defined. So far, there are about 10
different release versions. The initial version is designed with an architecture similar
to YOLOv3, while following EfficientDet's [95] model scaling approach to provide
models of multiple sizes.
b_x = 2σ(t_x) − 0.5 + c_x,
b_y = 2σ(t_y) − 0.5 + c_y,
b_w = p_w (2σ(t_w))^2,
b_h = p_h (2σ(t_h))^2                          (3)
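A minimal sketch of the Equation (3) decoder follows; the input and output conventions are illustrative assumptions.

```python
import torch

def decode_yolov5_box(t, cell, anchor):
    """Minimal sketch of the Equation (3) decoder: the scaled sigmoid lets the
    predicted center reach slightly beyond the cell borders, and the width and
    height are bounded multiples (at most 4x) of the anchor size."""
    tx, ty, tw, th = t                          # raw predictions for one box
    cx, cy = cell                               # grid cell indices
    pw, ph = anchor                             # anchor (prior) width and height
    bx = 2 * torch.sigmoid(tx) - 0.5 + cx
    by = 2 * torch.sigmoid(ty) - 0.5 + cy
    bw = pw * (2 * torch.sigmoid(tw)) ** 2
    bh = ph * (2 * torch.sigmoid(th)) ** 2
    return torch.stack([bx, by, bw, bh])

box = decode_yolov5_box(torch.rand(4), cell=(3, 5), anchor=(1.5, 2.0))
```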
Neighborhood Positive Samples. In order to make up for insufficient recall, YOLOv5
proposed adding neighboring grid cells as positive samples. At the same time, to
allow these neighboring cells to correctly predict the center point, it also enlarged the
sigmoid scaling coefficient of the YOLOv4 center-point decoder.
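Below is a hedged sketch of such a neighbor-cell assignment; the exact rule used by YOLOv5 may differ in details such as border handling.

```python
def positive_cells(center_xy, grid_w, grid_h):
    """Hedged sketch: besides the cell containing the box center, the two
    adjacent cells closest to the center (chosen by the fractional offset) are
    also treated as positive samples to raise recall."""
    gx, gy = center_xy                          # box center in grid units, e.g. (3.3, 5.8)
    i, j = int(gx), int(gy)
    frac_x, frac_y = gx - i, gy - j
    cells = [(i, j)]
    cells.append((i - 1, j) if frac_x < 0.5 else (i + 1, j))   # nearest horizontal neighbor
    cells.append((i, j - 1) if frac_y < 0.5 else (i, j + 1))   # nearest vertical neighbor
    return [(x, y) for x, y in cells if 0 <= x < grid_w and 0 <= y < grid_h]

print(positive_cells((3.3, 5.8), grid_w=20, grid_h=20))        # [(3, 5), (2, 5), (3, 6)]
```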
2.8 PP-YOLO
There are four versions of the PP-YOLO series, namely PP-YOLO [68], PP-YOLOv2
[51], PP-PicoDet [119], and PP-YOLOE [116]. PP-YOLO is improved based on
YOLOv3. In addition to using a variety of YOLOv4 training techniques, it also adds
CoordConv [65], Matrix NMS [111], a better ImageNet pre-trained model, and other
improvements, while PP-YOLOv2 further introduces Scaled-YOLOv4's CSPPAN and
other mechanisms. PP-PicoDet uses neural architecture search to design its backbone.
Figure 6: Architecture of PP-YOLOE [116], YOLOv6 2.0 [56], YOLOv7 AF [105], YOLOv8
[37], and YOLO-NAS [93]. PP-YOLOE changed bounding box regression head to TOOD’s
[20] anchor-free distribution-based regressor, and subsequent YOLO versions adopted this
design.
2.9 YOLOR
YOLOR [110] is not an official version of the YOLO series, but its use of a Latent
Variable Model (LVM) as an implicit knowledge encoder can significantly improve
the detection performance of all YOLO series models, as shown in Figure 7.
YOLOR’s multi-task model has also been widely used in subsequent YOLO
versions, and the advanced training technology it proposed has been continued
and promoted in all subsequent versions. Below are some specially designed
features of YOLOR.
Implicit Knowledge Modeling. YOLOR proposed three LVMs to encode
implicit knowledge, including vector-based, neural network-based, and ma-
trix factorization-based. The above three encoding methods can effectively
enhance the feature alignment, prediction refinement, and multi-task learning
capabilities of deep neural networks.
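As a concrete illustration, the following is a hedged sketch of the vector-based variant: a learnable per-channel vector that is broadcast-added to an explicit feature map (a multiplicative variant would scale the features instead). The module name is hypothetical.

```python
import torch
import torch.nn as nn

class ImplicitAdd(nn.Module):
    """Hedged sketch of vector-based implicit knowledge: a learnable
    per-channel vector broadcast-added to an explicit feature map, e.g. for
    feature alignment before a prediction head."""
    def __init__(self, channels):
        super().__init__()
        self.implicit = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        return x + self.implicit

feat = torch.rand(2, 256, 40, 40)               # an explicit feature map from the network
aligned = ImplicitAdd(256)(feat)                # shifted by the learned implicit knowledge
```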
Figure 7: Architecture of YOLOR. YOLOR proposed refining the bounding box regression head
with Latent Variable Models (LVMs).
2.10 YOLOX
Figure 8: Architecture of YOLOX [22]. YOLOX proposed changing the bounding box regression
head to the anchor-free regressor of FCOS [98], which also led the development of the YOLO
series toward anchor-free designs.
2.11 YOLOv6
The initial version of YOLOv6 [73] uses RepVGG [17] as the main architecture.
In versions after version 2.0, such as Li et al. [56, 57], CSPNet [106] was
introduced. YOLOv6 is a system specially designed for industry, so it has
put a lot of effort into quantization issues. The contributions of YOLOv6
include using RepOPT [16] to make the quantized model more stable, and
using quantization aware training (QAT) and knowledge distillation to enhance
the accuracy of the quantized model. YOLOv6 version 3.0 [56] proposed the concept
of anchor-aided training, as shown in Figure 9, to improve the accuracy of the system.
Later, YOLOv6 version 4.0 [58] proposed YOLOv6-lite, a lightweight architecture
based on depth-wise convolution, to target lower-end computing devices. The
following lists some of the unique features proposed by YOLOv6.
Reparameterizing Optimizer. YOLOv6 version 2.0 uses RepOPT to reduce the
accuracy loss after model quantization.
Quantization Aware Training. In YOLOv6 version 2.0, QAT is used to
improve the accuracy of the quantization model.
Knowledge Distillation. YOLOv6 version 2.0 uses self-distillation and channel-wise
distillation to improve model accuracy, and combines them with QAT to reduce the
accuracy loss after model quantization (a sketch of channel-wise distillation is given
after this list).
Anchor-Aided Training. YOLOv6 version 3.0 proposed using anchor-based
head to assist anchor-free head learning, as shown in Figure 9, to improve
accuracy.
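The snippet below sketches the channel-wise distillation idea mentioned above (a spatial softmax per channel plus a KL term); it is a simplified illustration, not YOLOv6's exact implementation.

```python
import torch
import torch.nn.functional as F

def channel_wise_distillation(student_feat, teacher_feat, tau=1.0):
    """Simplified sketch: for every channel, the spatial activations are turned
    into a distribution with a softmax, and a KL term pushes the student's
    per-channel distributions toward the teacher's."""
    n, c, h, w = student_feat.shape
    s = F.log_softmax(student_feat.view(n, c, -1) / tau, dim=-1)
    t = F.softmax(teacher_feat.view(n, c, -1) / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (tau ** 2)

loss = channel_wise_distillation(torch.rand(2, 128, 20, 20), torch.rand(2, 128, 20, 20))
```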
2.12 YOLOv7
YOLOv7 [105] introduces trainable auxiliary architectures that can be removed or
integrated during the inference stage, including YOLOR [110], the recently popular
RepVGG [17] (a re-parameterization sketch is given after Figure 10), and additional
auxiliary losses. Architecturally, as shown in Figure 10, YOLOv7 uses ELAN [107]
to replace the CSPNet used by YOLOv4, and proposes E-ELAN to design large
models. YOLOv7 also provides a compound scaling method for concatenation-based
models.
Figure 9: Architecture of YOLOv6 3.0 and YOLOv6 4.0. YOLOv6 3.0 proposed to guide
the learning of anchor-free bounding box regression head with anchor-based bounding box
regression head.
Figure 10: Architecture of YOLOv7 [105]. YOLOv7 guides learning with a coarse-to-fine
consistent auxiliary head, and it inspired much subsequent research on consistent
multiple-prediction training.
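Because several of these auxiliary structures follow the RepVGG re-parameterization idea, the sketch below shows how a parallel 1x1 branch can be folded into a 3x3 convolution at inference time (BatchNorm folding is omitted for brevity).

```python
import torch
import torch.nn.functional as F

def fuse_rep_branches(w3x3, b3x3, w1x1, b1x1):
    """Sketch of RepVGG-style re-parameterization without BatchNorm: the
    parallel 1x1 branch is merged into the 3x3 branch by padding its kernel to
    3x3 and summing kernels and biases, so the trained multi-branch block
    collapses into a single convolution for inference."""
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])     # (out, in, 1, 1) -> (out, in, 3, 3)
    return w3x3 + w1x1_padded, b3x3 + b1x1

w3, b3 = torch.rand(64, 64, 3, 3), torch.rand(64)
w1, b1 = torch.rand(64, 64, 1, 1), torch.rand(64)
wf, bf = fuse_rep_branches(w3, b3, w1, b1)
x = torch.rand(1, 64, 32, 32)
two_branch = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1)
fused = F.conv2d(x, wf, bf, padding=1)
assert torch.allclose(two_branch, fused, atol=1e-3)            # identical outputs
```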
2.13 DAMO-YOLO
2.14 YOLOv8
YOLOv8 [37] is a refactored version of YOLOv5 [26], which updates the overall API
and makes many underlying code optimizations. Architecturally, it modifies YOLOv7's
ELAN by adding an additional residual connection, while its decoder is the same as
that of YOLOv6 2.0. It is not so much a new detector design as a thoroughly
re-engineered framework built on the ideas of earlier versions.
2.15 YOLO-NAS
YOLO-NAS [93] did not reveal many technical details. It mainly uses its own
AutoNAC NAS to design a quantization-friendly architecture and uses a multi-stage
training process, including pre-training on Objects365 and COCO pseudo-labeled
data, Knowledge Distillation (KD), and Distribution Focal Loss (DFL).
2.16 Gold-YOLO
The overall architecture of Gold-YOLO [103] is similar to that of YOLOv6 3.0. Its
main design is a Gather-and-Distribute mechanism that replaces PAN in the
architecture, and masked image modeling is used for pre-training during the training
process.
Gather-and-Distribute Mechanism. The main architecture of
Gather-and-Distribute is shown in Figure 12. It mainly collects features from
each layer through two gather-and-distribute modules and integrates them
into global features using transformers. The integrated global features are then
distributed to the low-level and high-level layers, where an information injection
module fuses the global features with the features of each layer.
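The following is a loose sketch of this gather-fuse-inject data flow; a 1x1 convolution stands in for the transformer-based fusion and a simple sigmoid gate stands in for the information injection module, so it should be read as an illustration of the pattern rather than Gold-YOLO's actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatherAndDistribute(nn.Module):
    """Loose sketch: pyramid features are resized to a common size, fused into
    one global feature, and injected back into every level through a gate."""
    def __init__(self, channels, levels=3):
        super().__init__()
        self.fuse = nn.Conv2d(channels * levels, channels, kernel_size=1)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):                    # feats: list of pyramid levels, high to low res
        size = feats[1].shape[-2:]               # gather at the middle resolution
        gathered = torch.cat([F.interpolate(f, size=size) for f in feats], dim=1)
        global_feat = self.fuse(gathered)
        out = []
        for f in feats:                          # distribute the global feature back
            g = F.interpolate(global_feat, size=f.shape[-2:])
            out.append(f + torch.sigmoid(self.gate(g)) * g)
        return out

feats = [torch.rand(1, 64, s, s) for s in (80, 40, 20)]
outs = GatherAndDistribute(64)(feats)
```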
Figure 12: Architecture of Gold-YOLO. Gold-YOLO proposed adding global features to the
feature integration architecture.
2.17 YOLOv9
YOLOv9 [109] proposed an important technique for trustworthy learning –
Programmable Gradient Information (PGI), whose architecture is shown in Figure 13.
This design can enhance the interpretability, robustness, and versatility of the model.
PGI uses the concepts of reversible architectures and multi-level information to
maximize both the original data the model can retain and the information needed to
complete the target tasks. YOLOv9 extended ELAN to GELAN and used it to show
that PGI can achieve excellent accuracy, stability, and inference speed on models
with a low number of parameters. Several outstanding features of YOLOv9 are
described below.
Auxiliary Reversible Branch. PGI exploits the properties of reversible architectures
to solve the information bottleneck problem in deep neural networks. This is
completely different from a general-purpose reversible architecture, which simply
maximizes the amount of information to be retained. Instead, PGI shares the
information retained by the reversible architecture with the main branch in the form
of auxiliary information, so that the model retains as much information from the
original data as possible while still keeping the information required for the target
task (a schematic sketch of a training-only auxiliary branch is given at the end of
this subsection).
Multi-level Auxiliary Information. PGI proposed the concept of multi-level auxiliary
information so that every layer of the main branch retains, as much as possible, the
information required for all task objectives. This avoids the problem of past methods,
which tend to lose important information at shallow levels and consequently cannot
obtain sufficient information at deep levels.
Generalize to Downstream Tasks. Because PGI can maximize the
retention of original data information, models trained by PGI achieve more
robust performance in small datasets, transfer learning, multi-task learning,
and adaptation to new downstream tasks.
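The sketch below illustrates the general pattern of a training-only auxiliary branch (not YOLOv9's actual PGI modules): the extra head contributes a loss during training and is skipped at inference, so it adds no runtime cost.

```python
import torch
import torch.nn as nn

class DetectorWithAuxBranch(nn.Module):
    """Schematic sketch of a training-only auxiliary branch: the auxiliary head
    supervises the shared features during training and is not executed (and
    can be removed) at inference time."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, 3, padding=1)    # stand-in backbone
        self.main_head = nn.Conv2d(64, 85, 1)             # main prediction head
        self.aux_head = nn.Conv2d(64, 85, 1)              # auxiliary, training-only head

    def forward(self, x):
        feat = self.backbone(x)
        if self.training:
            return self.main_head(feat), self.aux_head(feat)
        return self.main_head(feat)

model = DetectorWithAuxBranch()
main_out, aux_out = model(torch.rand(1, 3, 64, 64))       # training mode: both heads
# the total loss would combine both outputs, e.g. loss_main + 0.25 * loss_aux
model.eval()
only_main = model(torch.rand(1, 3, 64, 64))               # inference: auxiliary branch unused
```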
Figure 13: Architecture of YOLOv9. YOLOv9 proposed using auxiliary branches to help
learning.
2.18 YOLOv10
Table 2: Architecture of YOLO series. The bold font indicates changes that significantly
affected subsequent versions.
use label assignment methods such as YOLOv4 that can effectively improve
recall. On the contrary, it is recommended to use the dynamic label assignment
method developed after YOLOX. When the aspect ratio of an object is relatively
fixed, it is more effective to use an anchor-based prediction head; when the aspect
ratio of an object is extreme, the anchor-free method is more suitable.
Table 3: The performance of the main architecture of the YOLO series of papers on the
COCO dataset. The bold font indicates Pareto optimal.
Model #Param. (M) FLOPs (G) mAP (%) T4, TRT (ms)
YOLOv4 64.4 142.8 49.7 –
Scaled-YOLOv4 52.9 120.4 50.3 –
YOLOR-CSP 52.9 120.4 50.8 –
YOLOv7 36.9 104.7 51.2 –
YOLOv5-L r6.2 46.5 109.1 49.0 –
YOLOv6-L 2.0 58.5 144.0 51.0 –
YOLOv7-L AF 43.6 130.5 53.0 6.7
YOLOv8-L 43.7 165.2 52.9 8.1
YOLOv6-L 3.0 59.6 150.7 51.8 7.9
YOLOv9-C 25.3 102.1 53.0 6.1
YOLOv9-TR 14.1 67.5 53.1 5.9
YOLOv10-B 19.1 92.0 52.5 5.7
YOLOv10-L 24.4 120.3 53.2 7.2
YOLOv8-L r3.0 25.3 86.9 53.4 6.2
The YOLO series of algorithms is characterized by (1) a relatively simple framework
and (2) relatively easy deployment. In what follows, we describe these characteristics
in detail.
3.1 Simpler
3.2 Better
3.3 Faster
Faster Architecture. Another feature of the YOLO series is its very fast
inference speed, mainly because its architecture is designed for the actual
inference speed of the hardware. The designers of YOLOv3 found that an architecture
combining simple 1×1 and 3×3 convolutions, although it has a lower computational
load, does not necessarily have an advantage in inference speed. Therefore, they
designed DarkNet for real-time
object detection. As for the designers of Scaled-YOLOv4, they referred to
research including ShuffleNetv2 [70] and HarDNet [8], and further analyzed
the criteria that need to be considered for high inference speed architecture for
different levels of devices from edge to cloud. To achieve the same purpose, the
developers of scaled-YOLOv4 designed Fully CSPOSANet and CSPDarkNet.
As for the developers of YOLOv6, they used the efficient RepVGG as the
backbone, while the designers of DAMO-YOLO used NAS technology to
directly search for efficient architectures in CSPNet and ELAN.
3.4 Stronger
Stronger Adaptability. The YOLO series has made great progress and gained a strong
response in the open-source community. The training method integrated into Darknet
and PyTorch YOLOv3 allows the YOLO series to train object detectors without
relying on ImageNet pre-trained models. For this reason, the YOLO series can be
easily applied to data in different domains without requiring a large number of
domain-specific pre-trained models.
The above advantages enable the YOLO series to be widely used in various
application domains. The YOLO series can also be easily adapted to different
datasets; for example, PyTorch YOLOv3 proposed using evolutionary algorithms to
automatically search for hyperparameters, which can then be applied to different
datasets. Moreover, the anchor-free YOLO designs from YOLOX to PP-YOLOE rely
on fewer hyperparameters during training and can therefore be used even more
widely in various application domains.
Stronger Capability. The YOLO series has excellent performance on a variety of
computer vision tasks. For example, after being widely used in the field of real-time
object detection, many other computer vision models based on YOLO have been
developed, including the YOLACT [3] instance segmentation model, the JDE [113]
multi-object tracker, and so on. Taking YOLOR as an example, it began to combine
multiple tasks into the same model: it can perform image recognition, object
detection, and multi-object tracking at the same time, and it significantly improves
the effect of multi-task joint learning. For the same tasks, YOLOv5 trains image
recognition and object detection models separately. In addition, YOLOv7 also
demonstrated outstanding performance across a variety of computer vision domains
and, at the time of its release, was the most accurate real-time object detector.
The YOLO series systems have been widely used in many fields. In this section, we
introduce YOLO's representative works in other computer vision fields and explain
the new architectural or methodological designs these works adopt in order to
achieve real-time performance.
There are also studies that generalize the YOLO series from 2D to 3D. In addition to
ComplexYOLO [91], which combines images and LiDAR as input, and Expandable
YOLO [94], which uses RGB-D images as input, there are also YOLO 6D [97] and
YOLO 3D [86], which simply use images as input.
YOLOV [89] and YOLOV++ [90] can be applied to video object detection, while
StreamYOLO [118] can be used for streaming perception.
Face detection is one of the most popular subfields among the various application
domains of object detection. Face detection models designed based on YOLO
[11, 80, 120] also perform quite well in this field.
4.11 Summarization
also been developed more and more complete. We believe that real-time
multi-modal multi-tasking based on YOLO will have further development in
the future. The third is the combination of YOLO with large language models
and foundation models. The combination of YOLO and SAM in FastSAM
and the combination of YOLO and CLIP in YOLO-World are pioneers of this
type of work. Due to the attention brought by these studies, it is conceivable
that combining YOLO with large language models will be a direction worth
exploring in the future.
5 Conclusions
In this article, we introduce the evolution of the YOLO series over the years,
review these technologies from the perspective of modern object detection
technology, and point out the key contributions they made at each stage. We
analyze YOLO’s influence on the field of modern computer vision from aspects
such as ease of use, accuracy improvement, speed improvement, and versatility
in various fields. Finally, we introduce the YOLO-related models in various
fields. Our purpose is that, through this review article, readers will not only be
inspired by the development of the YOLO series but will also better understand how
to develop various real-time computer vision methods. We also hope to provide
readers with an idea of the different tasks YOLO can be used for and of possible
future directions.
References
[6] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality
object detection”, in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2018, 6154–62.
[7] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S.
Zagoruyko, “End-to-end object detection with transformers”, in Pro-
ceedings of the European conference on computer vision (ECCV), 2020,
213–29.
[8] P. Chao, C.-Y. Kao, Y.-S. Ruan, C.-H. Huang, and Y.-L. Lin, “HarDNet:
A low memory traffic network”, in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2019, 3552–61.
[9] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J.
Shi, W. Ouyang, et al., “Hybrid task cascade for instance segmentation”,
in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2019, 4974–83.
[10] P. Chen, Y. Wang, and H. Liu, “GCN-YOLO: YOLO Based on Graph
Convolutional Network for SAR Vehicle Target Detection”, IEEE Geo-
science and Remote Sensing Letters, 2024.
[11] W. Chen, H. Huang, S. Peng, C. Zhou, and C. Zhang, “YOLO-Face: a
real-time face detector”, The Visual Computer, 37, 2021, 805–13.
[12] Y. Chen, Q. Chen, Q. Hu, and J. Cheng, “DATE: Dual assignment
for end-to-end fully convolutional object detection”, arXiv preprint
arXiv:2211.13859, 2022.
[13] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “YOLO-
World: Real-time open-vocabulary object detection”, in Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2024, 16901–11.
[14] J. Choi, D. Chun, H. Kim, and H.-J. Lee, “Gaussian YOLOv3: An
accurate and fast object detector using localization uncertainty for au-
tonomous driving”, in Proceedings of the IEEE International Conference
on Computer Vision (ICCV), 2019, 502–11.
[15] danielsyahputra, “KAN-YOLO”, 2024, https://github.com/danielsyahp
utra/ultralytics.
[16] X. Ding, H. Chen, X. Zhang, K. Huang, J. Han, and G. Ding, “Re-
parameterizing your optimizers rather than architectures”, in The In-
ternational Conference on Learning Representations (ICLR), 2023.
[17] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “RepVGG: Mak-
ing VGG-style ConvNets great again”, in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2021,
13733–42.
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T.
Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al.,
“An image is worth 16x16 words: Transformers for image recognition at
scale”, in The International Conference on Learning Representations (ICLR), 2021.
[106] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H.
Yeh, “CSPNet: A new backbone that can enhance learning capability
of CNN”, in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW), 2020, 390–1.
[107] C.-Y. Wang, H.-Y. M. Liao, and I.-H. Yeh, “Designing network design
strategies through gradient path analysis”, Journal of Information
Science and Engineering (JISE), 39(4), 2023, 975–95.
[108] C.-Y. Wang, H.-Y. M. Liao, I.-H. Yeh, Y.-Y. Chuang, and Y.-L. Lin,
“Exploring the power of lightweight YOLOv4”, in Proceedings of the IEEE
International Conference on Computer Vision Workshops (ICCVW),
2021, 779–88.
[109] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, “YOLOv9: Learning what
you want to learn using programmable gradient information”, in Pro-
ceedings of the European Conference on Computer Vision (ECCV),
2024.
[110] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, “You only learn one repre-
sentation: Unified network for multiple tasks”, Journal of Information
Science and Engineering (JISE), 39(2), 2023, 691–709.
[111] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “SOLOv2: Dy-
namic and fast instance segmentation”, Advances in Neural Information
Processing Systems (NeurIPS), 33, 2020, 17721–32.
[112] Z. Wang, C. Li, H. Xu, and X. Zhu, “Mamba YOLO: SSMs-Based
YOLO For Object Detection”, arXiv preprint arXiv:2406.05835, 2024.
[113] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, “Towards real-time
multi-object tracking”, in Proceedings of the European conference on
computer vision (ECCV), 2020, 107–22.
[114] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime track-
ing with a deep association metric”, in IEEE International Conference
on Image Processing (ICIP), 2017, 3645–9.
[115] D. Wu, M.-W. Liao, W.-T. Zhang, X.-G. Wang, X. Bai, W.-Q. Cheng,
and W.-Y. Liu, “YOLOP: You only look once for panoptic driving
perception”, Machine Intelligence Research, 19(6), 2022, 550–62.
[116] S. Xu, X. Wang, W. Lv, Q. Chang, C. Cui, K. Deng, G. Wang, Q. Dang,
S. Wei, Y. Du, et al., “PP-YOLOE: An evolved version of YOLO”,
arXiv preprint arXiv:2203.16250, 2022.
[117] X. Xu, Y. Jiang, W. Chen, Y. Huang, Y. Zhang, and X. Sun, “DAMO-
YOLO: A report on real-time object detection design”, arXiv preprint
arXiv:2211.15444, 2022.
[118] J. Yang, S. Liu, Z. Li, X. Li, and J. Sun, “Real-time object detection
for streaming perception”, in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2022, 5385–95.