Mixed Label Assignment Realizes End-to-End Object Detection
J. Chen, C. Shao and Z. Su *
School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China;
[email protected] (J.C.); [email protected] (C.S.)
* Correspondence: [email protected]
Abstract: Currently, detectors have made significant progress in inference speed and accuracy.
However, these detectors require Non-Maximum Suppression (NMS) during the post-processing
stage to eliminate redundant boxes, which limits the optimization of model inference speed. We first
analyzed the reason for the dependence on NMS in the post-processing stage. The result showed that
a score loss in a one-to-many label assignment leads to the presence of high-quality redundant boxes,
making them difficult to remove. To realize end-to-end object detection and simplify the detection
pipeline, we propose a mixed label assignment (MLA) training method, in which one-to-many
label assignment provides rich supervision signals to alleviate performance degradation, while
one-to-one label assignment eliminates the need for NMS in the post-processing stage.
Additionally, a window feature propagation block (WFPB) is introduced, utilizing the inductive
bias of images to enable feature sharing in local regions. Through these methods, we conducted
experiments on the VOC and DUO datasets; our end-to-end detector MA-YOLOX achieved 66.0 mAP
and 52.6 mAP, respectively, outperforming YOLOX by 1.7 and 1.6 mAP. Additionally, our model,
which requires no NMS, ran faster than other real-time detectors.
1. Introduction
Object detection is a fundamental task in the field of computer vision, requiring the
identification and localization of objects within images. Early object detection methods
evolved from two-stage [1,2] to one-stage [3,4]. The two-stage methods utilize Region
Proposal Networks (RPNs) to generate a set of candidate regions containing objects, while
one-stage methods directly produce dense prediction boxes to achieve object localization,
simplifying the object detection process and speeding up inference. In recent years, from
Anchor-Based [5,6] to Anchor-Free [7], lightweight and high-performance network structures
have continuously simplified models, achieving significant progress and superior
performance. However, the series of prediction boxes generated by detectors often contains
a large number of redundant results, necessitating the filtering of these redundant
boxes during the post-processing stage, which involves the manually designed component
known as Non-Maximum Suppression (NMS). NMS effectively removes redundant boxes
with high overlap by calculating the Intersection over Union (IoU) between prediction
boxes for the same object, ensuring that only the optimal detection result for each object is
retained. Conventional detectors rely on NMS during the post-processing stage, making
the detection process cumbersome and not truly end-to-end.
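For illustration, the following is a minimal sketch of the greedy IoU-based NMS procedure referred to above; it shows the standard algorithm rather than the specific implementation used by any of the detectors cited in this paper.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.65):
    """Greedy NMS: keep the highest-scoring box, drop highly overlapping boxes.

    boxes:  (N, 4) array of [x1, y1, x2, y2] corners.
    scores: (N,) confidence scores.
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the chosen box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # discard boxes whose overlap with the chosen box is too high
        order = order[1:][iou <= iou_threshold]
    return keep
```

The IoU threshold of 0.65 shown here matches the evaluation setting reported later in Table 3; each additional pass over the remaining boxes is exactly the post-processing overhead that an end-to-end detector avoids.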
Recently, Transformer-based object detectors (DETR, DEtection TRansformer) [8]
can predict directly without NMS, removing various manually designed components,
greatly simplifying the pipeline of object detection, and achieving end-to-end object
detection. DETR uses a bipartite graph matching algorithm to find a positive sample for
each ground truth box. However, the high computational cost of DETR limits its
effectiveness and prevents it from being fully utilized,
while CNN-based detectors have achieved a reasonable trade-off between detection speed
and accuracy, which can be further enhanced if NMS is not required. DeFCN [9] and
OneNet [10] each achieve end-to-end object detection using fully convolutional networks,
demonstrating that one-to-one label assignment is crucial for implementing end-to-end detection.
However, training with one-to-one label assignment leads to a decline in the
detector's performance. Ref. [11] introduces a PSS module into the detection head of the
FCOS [12] detector to replace NMS by selecting a single positive sample for each instance, but
this approach increases the complexity of the detection head structure. To address these
issues, we propose a new end-to-end training method that preserves the detector's superior
performance while leaving the structure of the detection head unchanged.
CNN-based object detectors generate multiple nearly redundant predictions for each
object, and the post-processing stage uses Non-Maximum Suppression to select the optimal
prediction boxes as detection results. We delved into the reasons for the detector’s reliance
on NMS by examining the post-processing algorithm flow and discovered that the score
loss under one-to-many label assignment is a key factor causing numerous redundant boxes
that cannot be easily eliminated. Previous works that eliminate NMS by employing one-to-one
label assignment support this finding. However, one-to-one label
assignment can lead to a significant decline in detector performance. To address this issue,
we propose an end-to-end training method called mixed label assignment (MLA). This ap-
proach uses one-to-one score loss to prevent the generation of high-quality redundant boxes,
eliminating the need for NMS and realizing end-to-end object detection. It also retains the
one-to-many bounding box regression loss, which provides rich supervisory information to
optimize the model and alleviate performance degradation. Additionally, DETR achieves
strong competitive results through the use of attention mechanisms. However, due to the
high computational cost and memory explosion caused by Attention [13] operating on
large-scale features, it is challenging to embed into every layer of the feature extraction
stage. To leverage the inductive bias of convolutions and images for feature propagation
in local areas, we propose a window feature propagation block (WFPB) that enhances the
feature sharing capability, making it more suitable for the feature extraction stage.
The main contributions of this paper are as follows:
• Propose a novel end-to-end training method, mixed label assignment, which elimi-
nates the need for NMS and simplifies the detection pipeline;
• Introduce a window feature propagation block that is better suited for the feature
extraction stage, enhancing local feature sharing;
• Conduct extensive comparative and ablation experiments on the PASCAL VOC and
DUO datasets, demonstrating the superiority and effectiveness of the proposed method.
2. Related Work
2.1. End-to-End Object Detection
Carion et al. [8] first proposed the Transformer-based object detection model DETR,
which uses Hungarian matching to achieve one-to-one label assignment and thereby realizes
end-to-end object detection. DETR eliminates manually designed components needed in
traditional detectors, such as Non-Maximum Suppression algorithms, thus simplifying
the object detection pipeline. However, DETR still has two issues: heavy computational
burden and slow training convergence. Although Deformable DETR [14] reduces the
computational cost of Transformers by using deformable attention, and DAB-DETR [15]
accelerates training convergence by replacing queries with dynamic anchor box repre-
sentations, DETR’s training cost and inference speed are still significantly higher than
CNN-based object detectors.
3. Methods
Figure 1 presents MA-YOLOX’s structure and training methodology. While the object
detection network employs one-to-many label assignment (Baseline) training to provide
robust feature representations, this approach generates redundant boxes that typically
require NMS filtering for final detection results. Through the analysis of the post-processing
stage, we propose mixed label assignment (MLA), a novel end-to-end training method that
eliminates NMS requirements. Additionally, we integrate a window feature propagation
block (WFPB) to enhance local feature sharing and boost performance. These innovations
enable the detector to achieve superior detection results without NMS post-processing.
Figure 2. Visualization of confidence heatmaps predicted by various methods. The image is sourced
from the VOC2007 test set and contains two example objects, a man and a horse. The methods are,
in order, one-to-many matching (Baseline), one-to-one matching (O2O), and the mixed label
assignment (MLA) proposed in this paper. The heatmaps represent the confidence scores for
predictions at the P4 and P5 scales. The O2O and MLA methods significantly reduce the redundant
predictions of the same object compared to the Baseline.
Figure 3. The end-to-end training method of mixed label assignment (MLA). For an input image,
the matching cost is computed from the model's regression, classification, and confidence predictions
and positive samples are selected; the best positive sample '1' optimizes the confidence prediction
head through one-to-one label assignment, while the positive samples '1, 2, 3, 4' optimize the
regression and classification branches through one-to-many matching.
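To make the division of labour sketched in Figure 3 concrete, the following is a minimal sketch of how the two assignments could be combined in one training step. The helper functions (o2m_assign, o2o_assign), the tensor layout, and the loss weights are illustrative assumptions, not the paper's implementation; in particular, YOLOX uses an IoU-based regression loss, for which the L1 term below is only a stand-in.

```python
import torch
import torch.nn.functional as F

def mla_loss(preds, targets, o2m_assign, o2o_assign, lam_reg=5.0):
    """Mixed label assignment: one-to-many supervision for the regression and
    classification branches, one-to-one supervision for the confidence branch.

    preds:   dict with 'boxes' (N, 4), 'cls' (N, C), 'obj' (N,) over all anchors.
    targets: dict with ground-truth 'boxes' (M, 4) and 'labels' (M,) for one image.
    """
    # One-to-many matching: several positive anchors per ground-truth box,
    # giving rich gradients to the regression and classification branches.
    pos_idx, gt_idx = o2m_assign(preds, targets)
    loss_reg = F.l1_loss(preds["boxes"][pos_idx], targets["boxes"][gt_idx])  # stand-in for an IoU loss
    num_classes = preds["cls"].shape[1]
    cls_target = F.one_hot(targets["labels"][gt_idx], num_classes).float()
    loss_cls = F.binary_cross_entropy_with_logits(preds["cls"][pos_idx], cls_target)

    # One-to-one matching: exactly one positive anchor per ground-truth box, so
    # only the best candidate is pushed towards a high confidence score and
    # duplicate predictions are suppressed by the loss rather than by NMS.
    obj_target = torch.zeros_like(preds["obj"])
    best_idx, _ = o2o_assign(preds, targets)
    obj_target[best_idx] = 1.0
    loss_obj = F.binary_cross_entropy_with_logits(preds["obj"], obj_target)

    return lam_reg * loss_reg + loss_cls + loss_obj
```

The essential design choice is that only the confidence target is one-to-one; regression and classification still receive dense one-to-many supervision, which is what preserves accuracy while removing the need for NMS.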
the model's convergence, and OneNet points out that a matching cost lacking classification
information is one of the reasons hindering end-to-end detection. To select the best positive
and negative samples, for sample i and target j, the matching cost is defined as follows:
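The equation itself is not reproduced in this excerpt. Purely as an illustration, and as an assumption on our part following the typical OneNet/SimOTA-style formulation, such a matching cost usually combines a classification term and a localization term:

$$C_{ij} = L_{cls}\left(p_i, c_j\right) + \lambda\, L_{reg}\left(b_i, \hat{b}_j\right)$$

where $L_{cls}$ measures the cost between the predicted scores $p_i$ of sample $i$ and the class $c_j$ of target $j$, $L_{reg}$ is a localization cost (for example, an IoU-based cost) between the predicted box $b_i$ and the ground-truth box $\hat{b}_j$, and $\lambda$ balances the two terms; under one-to-one assignment, the sample with the lowest cost for each target is taken as its positive sample. As OneNet argues, including the classification term in this cost is what makes NMS-free prediction possible.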
the sampling process. Shared features are then extracted to achieve feature sharing. After
convolution sampling, each window implicitly contains features from the surrounding
windows. At the same time, a large-kernel pooling operation samples the important features
of the local area. By summing the two, each window can capture the important features
within its local area, facilitating feature propagation. The final step is to restore the channels
using a 1 × 1 convolution and to employ residual connections [32] to enhance stability. The
K × K convolution kernel size serves as the window size.
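The exact block structure (branch order, normalization, activation, and whether the convolution is depthwise) is not fully specified in this excerpt, so the sketch below is only one plausible reading of the description above and of Figure 4b; the depthwise K × K convolution, the max-pooling branch, and the omission of normalization layers are assumptions.

```python
import torch
import torch.nn as nn

class WFPB(nn.Module):
    """Window feature propagation block (illustrative sketch).

    A K x K convolution lets each window absorb features from neighbouring
    windows, a large-kernel pooling branch samples salient local features,
    the two branches are summed, and a 1 x 1 convolution restores the channel
    dimension; a residual connection stabilises training.
    """
    def __init__(self, channels, k=7):
        super().__init__()
        pad = k // 2
        # K x K depthwise convolution: feature propagation within each window
        self.conv = nn.Conv2d(channels, channels, kernel_size=k, padding=pad, groups=channels)
        # large-kernel pooling: pick out important features of the local area
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=pad)
        # 1 x 1 convolution to restore/mix channels
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        propagated = self.conv(x) + self.pool(x)
        return x + self.proj(propagated)   # residual connection
```

In this sketch the block preserves spatial resolution and channel count, so it can simply be placed after a downsampling convolution of the backbone, which matches where the paper applies the WFPB in the dark module.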
Figure 4. (a) The basic composition of the YOLOX backbone’s dark module; (b) the window feature
propagation block (WFPB).
The window feature propagation block is incorporated in the feature extraction phase of the
network. The backbone of YOLOX is composed of the basic dark module [33]. The
WFPB is applied after the convolution downsampling of the dark module, performing
feature propagation and cropping on the downsampled feature map to achieve the best results.
4. Experiments
4.1. Applied Datasets
The public datasets PASCAL VOC [34] and DUO dataset [35] were chosen for model
evaluation. The PASCAL Visual Object Classes (VOC) is a well-known computer vision
challenge proposed in 2005 that includes an object detection task. It features a total
of 20 detection categories: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle,
boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and
TV/monitor. These categories encompass commonly seen objects in daily life, including people,
animals, household items, and transportation vehicles. It provides a wealth of resources
for the development of object detection. This paper uses the 2012 training and validation
set, consisting of 11,540 images for training, and tests the results on the 2007 test set of
4952 images.
In contrast to VOC, the DUO dataset is an underwater object detection dataset. The
DUO dataset was proposed by the Underwater Robot Professional Contest in 2021, aimed
at robot picking based on underwater images. It contains 6671 images in the training
set and 1111 images in the testing set. The dataset includes four categories of underwater
targets: holothurians, echinus, scallops, and starfish. These two datasets cover normal
terrestrial scenes and underwater scenes, respectively. Achieving excellent results on both
diverse datasets can better reflect the superiority of the method.
$$R = \frac{TP}{TP + FN} \qquad (6)$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P(R)\,dR \qquad (7)$$
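As a concrete illustration of Equations (6) and (7), the snippet below computes recall, precision, and a single-class average precision from a ranked list of detections. It is only meant to connect the quantities in the equations to code; the official VOC and DUO evaluation toolkits add their own matching rules and interpolation details on top of this.

```python
import numpy as np

def average_precision(tp_flags, num_gt):
    """Single-class AP from detections sorted by descending confidence.

    tp_flags[i] is 1 if detection i matches a ground-truth box, else 0;
    num_gt is the number of ground-truth boxes of that class.
    """
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / num_gt                      # Equation (6): TP / (TP + FN)
    precision = tp / (tp + fp)
    # precision envelope, then integrate P(R) dR (the inner term of Eq. (7))
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))

# Equation (7): mAP is the mean of the per-class APs, e.g.
# mAP = np.mean([average_precision(flags_c, num_gt_c) for flags_c, num_gt_c in per_class])
```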
Table 1. Comparison of different methods on the number of parameters and the accuracy on the
DUO dataset; w/o NMS means without NMS during the validation (val) stage.
We also compared the results of various real-time detectors, YOLOv5, YOLOv7, and
YOLOv8 [41], on PASCAL VOC, as shown in Table 2. VOC was derived from various images
in natural scenes and included many common categories from daily life, making the results
more reflective of the detector’s performance in real-world scenarios. The best result is
displayed in bold in the table; the detection results still outperformed other detectors.
During the evaluation of the detector, NMS was not used, demonstrating the effectiveness
of the mixed label assignment. Additionally, Table 2 also shows the inference speeds of
various real-time detectors. To present the results more intuitively, Figure 5 visualizes
the detection results in terms of speed and accuracy. Although MA-YOLOX's parameter count
and computational complexity were not the lowest, our method demonstrated a
significant advantage in inference speed, achieving a per-image latency of 2.5 ms.
This speed improvement is attributed to the removal of NMS in the post-processing stage,
which eliminates the additional computational overhead introduced by NMS. As a result,
MA-YOLOX is faster than other real-time detectors. Specifically, compared to YOLOX, MA-
YOLOX reduces post-processing time by 0.5 ms by removing NMS, increasing processing
speed by 50%. In comparison, MA-YOLOX is also faster than the YOLOV5, YOLOV7, and
YOLOV8 real-time detectors. This result further proves our core idea: removing NMS
is not only theoretically feasible, but it also significantly optimizes inference speed in
practical applications.
Table 2. Comparisons of different methods on the VOC dataset. Latency^f denotes the latency of the
model's forward pass, without post-processing.
Figure 6. Visualization of detection results with or without NMS by YOLOX and MA-YOLOX.
label assignment, exceeding the Baseline. The increase in parameters and computational
cost was only around 10%. Our method achieved a result of 52.6 AP and a Latency of 2.5 ms
without NMS, improved the detection results by 1.6 AP, and reduced the inference latency
by 0.4 ms, outperforming the Baseline in both performance and speed. Moreover, without
NMS, there was a slight improvement in detection results, indicating that NMS removed
some more accurate predicted boxes, effectively proving the feasibility and effectiveness of
this approach.
Table 3. Ablation studies on VOC; models are evaluated with conf = 0.001 and an IoU threshold of 0.65.
Model     MLA   WFPB   End-to-End   Param. (M)   GFLOPS   mAP    mAP (w/o NMS)   Latency (ms)
YOLOX-S                             8.95         26.80    51.0   18.6            2.9
YOLOX-S                ✓            8.95         26.80    44.9   44.9            2.4
YOLOX-S   ✓            ✓            8.95         26.80    50.6   50.7            2.4
YOLOX-S   ✓     ✓      ✓            10.08        29.57    52.6   52.7            2.5
The best results are in bold.
λ    mAP    mAP50
1    50.1   73.1
3    50.2   73.7
4    50.4   73.5
5    50.6   73.7
7    50.4   73.3
The best results are in bold.
Table 5. The results of the window feature-propagating blocks at different K sizes in VOC.
However, mixed label assignment effectively alleviated this issue. The generalization
experiments further highlight the effectiveness of this method.
5. Conclusions
This paper proposes a novel end-to-end training method called mixed label assign-
ment, which avoids the performance degradation caused by one-to-one label assignment
in traditional methods while preserving the advantages of one-to-many label assignment.
This approach does not require additional branches or training overhead, significantly im-
proving the performance of end-to-end object detection. Furthermore, the window feature
propagation module effectively shares features in local regions by leveraging inductive
bias, and has achieved remarkable results. Our experiments demonstrated the importance
of local region features in image-based detection tasks. The detector based on our method
outperformed the Baseline in both detection results and inference speed. We hope that the
design introduced in this work will contribute to the development of better end-to-end
training methods for object detection.
Author Contributions: Conceptualization, J.C. and C.S.; Methodology, J.C.; Software, J.C.; Validation,
J.C.; Formal Analysis, J.C.; Investigation, J.C.; Resources, J.C.; Data Curation, J.C.; Writing—Original
Draft Preparation, J.C.; Writing—Review and Editing, J.C., C.S. and Z.S.; Visualization, J.C.; Supervi-
sion, Z.S.; Project Administration, Z.S.; Funding Acquisition, Z.S. All authors have read and agreed
to the published version of the manuscript.
Funding: This research was funded by the Jiangsu Provincial Key Research and Development
Program (No. BE2022136), the High-tech Ship Research Projects (No. CBG4N21-4-3), and the Key
Research and Development Program of Zhenjiang City (No. GY2023019).
Data Availability Statement: The PASCAL VOC Datasets were from http://host.robots.ox.ac.uk/
pascal/VOC/voc2012/index.html (accessed on 25 May 2024) and the DUO dataset can be found at
https://github.com/chongweiliu/DUO (accessed on 25 May 2024).
Conflicts of Interest: The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported in this paper.
References
1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587.
2. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083.
3. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of
the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings,
Part I 14; pp. 21–37.
5. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
6. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
7. Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002.
8. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
9. Wang, J.; Song, L.; Li, Z.; Sun, H.; Sun, J.; Zheng, N. End-to-end object detection with fully convolutional network. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15849–15858.
10. Sun, P.; Jiang, Y.; Xie, E.; Shao, W.; Yuan, Z.; Wang, C.; Luo, P. What makes for end-to-end object detection? In Proceedings of the
International Conference on Machine Learning, Online, 18–24 July 2021; pp. 9934–9944.
11. Zhou, Q.; Yu, C.; Shen, C.; Wang, Z.; Li, H. Object Detection Made Simpler by Eliminating Heuristic NMS. arXiv 2021,
arXiv:2101.11782. [CrossRef]
12. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355.
13. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach,
CA, USA, 4–9 December 2017.
14. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv
2020, arXiv:2010.04159.
15. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr.
arXiv 2022, arXiv:2201.12329.
16. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597.
17. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515.
18. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse r-cnn: End-to-end object
detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463.
19. YOLOv5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 May 2024).
20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv
2015, arXiv:1506.01497. [CrossRef] [PubMed]
21. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training
sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 9759–9768.
22. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 303–312.
23. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021
IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
24. Jia, D.; Yuan, Y.; He, H.; Wu, X.; Yu, H.; Lin, W.; Sun, L.; Zhang, C.; Hu, H. Detrs with hybrid matching. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19702–19712.
25. Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; Wang, J. Group detr: Fast detr training with
group-wise one-to-many assignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris,
France, 4–6 October 2023; pp. 6633–6642.
26. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022;
pp. 13619–13627.
27. Zhao, C.; Sun, Y.; Wang, W.; Chen, Q.; Ding, E.; Yang, Y.; Wang, J. MS-DETR: Efficient DETR Training with Mixed Supervision.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024;
pp. 17027–17036.
28. Ge, Z. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
29. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
30. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October
2021; pp. 10012–10022.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
34. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
35. Liu, C.; Li, H.; Wang, S.; Zhu, M.; Wang, D.; Fan, X.; Wang, Z. A dataset and benchmark of underwater object detection for robot
picking. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China,
5–9 July 2021; pp. 1–6.
36. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
37. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666.
38. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed
bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
39. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
17–24 June 2023; pp. 7464–7475.
40. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September
2014; Proceedings, Part V 13; pp. 740–755.
41. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed
on 25 May 2024).
42. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
43. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
44. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024,
arXiv:2405.14458.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.