Article
DC-YOLOv8: Small-Size Object Detection Algorithm Based on
Camera Sensor
Haitong Lou 1 , Xuehu Duan 1 , Junmei Guo 1 , Haiying Liu 1, *, Jason Gu 2 , Lingyun Bi 1 and Haonan Chen 1
Abstract: Traditional camera sensors rely on human eyes for observation. However, human eyes are prone to fatigue when observing objects of different sizes for a long time in complex scenes, and human cognition is limited, which often leads to judgment errors and greatly reduces efficiency. Object recognition technology is an important technology used to judge an object's category on a camera sensor. In order to solve this problem, a small-size object detection algorithm for special scenarios is proposed in this paper. The advantage of this algorithm is that it not only has higher precision for small-size object detection but also ensures that the detection accuracy for each size is not lower than that of existing algorithms. There are three main innovations in this paper: (1) A new downsampling method that better preserves contextual feature information is proposed. (2) The feature fusion network is improved to effectively combine shallow and deep information. (3) A new network structure is proposed to effectively improve the detection accuracy of the model. In terms of detection accuracy, the proposed algorithm outperforms YOLOX, YOLOR, YOLOv3, scaled YOLOv5, YOLOv7-Tiny, and YOLOv8. Three authoritative public datasets are used in the experiments: (a) On the Visdrone dataset (small-size objects), the mAP, precision, and recall of DC-YOLOv8 are 2.5%, 1.9%, and 2.1% higher than those of YOLOv8s, respectively. (b) On the Tinyperson dataset (minimal-size objects), the mAP, precision, and recall of DC-YOLOv8 are 1%, 0.2%, and 1.2% higher than those of YOLOv8s, respectively. (c) On the PASCAL VOC2007 dataset (normal-size objects), the mAP, precision, and recall of DC-YOLOv8 are 0.5%, 0.3%, and 0.4% higher than those of YOLOv8s, respectively.
Keywords: small-size objects; object detection; camera sensor; feature fusion
1. Introduction
Cameras are among the most widely used devices and have become essential in various industries and households, such as robotics, monitoring, transportation, medicine, and autonomous driving [1–5]. A camera sensor is one of the core sensors for these applications; it is composed of a lens, a lens module, a filter, a CMOS (complementary metal oxide semiconductor)/CCD (charge-coupled device) imager, an ISP (image signal processor), and a data transmission part. It works by first collecting images using optical imaging principles and then performing image signal processing. In applications such as traffic, medicine, and autonomous driving, accurately identifying objects is crucial, so the object recognition algorithm is one of the most important parts of a camera sensor.
Traditional video cameras capture a scene and present it on a screen; the shape and type of an object are then observed and judged by the human eye. However, human cognitive ability is limited, and it is difficult to judge the category of an object when the camera resolution is too low. A complex scene also strains the human eye, resulting in an inability to detect some small details. A viable alternative is to use camera sensors to find areas and categories of interest [6].
At present, object recognition through a camera is one of the most challenging topics, and accuracy and real-time performance are the most important indicators for a camera sensor. In recent years, aiming at either high accuracy or real-time operation, researchers have proposed networks such as MobileNet [7–9] and ShuffleNet [10,11], which can run on a CPU, and ResNet [12] and DarkNet [13], which can run on a GPU.
At this stage, the most classical object detection algorithms are divided into two kinds:
two-stage object detection algorithms and one-stage object detection algorithms. Two-stage
object detection algorithms include R-CNN (Region-based Convolutional Neural Net-
work) [14], Fast R-CNN [15], Faster R-CNN [16], Mask R-CNN [17], etc. One-stage object
detection algorithms include YOLO series algorithms (you only look once) [13,18–22], SSD
algorithms (Single Shot MultiBox Detector) [23], and so on. The YOLO series of algorithms
is one of the fastest-growing and best-performing object detection families so far, especially the novel YOLOv8 algorithm released in 2023, which has reached the highest accuracy to date. However, YOLO is designed for objects of all sizes; when the task involves special scenes with objects of a particular small size, its performance is not as good as some existing small-size object detection algorithms [24,25].
In order to solve this problem, this paper proposes an improved algorithm based on YOLOv8. The detection accuracy of this algorithm shows a small but stable improvement for normal-scale objects and is greatly improved for small objects in complex scenes. Small objects occupy few pixels, which makes it difficult for the detector to extract their features accurately and comprehensively. Especially in complex scenes with object overlap, it is even harder to extract information, so the accuracy of most algorithms on small objects is generally low. The proposed algorithm greatly improves the detection accuracy of small objects in complex scenes while keeping the detection accuracy of normal-scale objects stable or slightly improved. Its main contributions are as follows:
(a) The MDC module is proposed to perform the downsampling operation (it concatenates the outputs of a depth-wise separable convolution, a max-pooling layer, and a 3 × 3 convolution with stride = 2). It supplements the information lost by each branch during downsampling, making the contextual information preserved in the feature extraction process more complete.
(b) The C2f module in front of the detection head in YOLOv8 is replaced by the DC module proposed in this paper (a network structure formed by stacking depth-wise separable convolutions and ordinary convolutions). A new network structure is formed by continuously stacking and fusing DC modules. It increases the depth of the whole structure, achieves higher resolution without significant computational cost, and is able to capture more contextual information.
(c) The feature fusion method of YOLOv8 is improved so that shallow and deep information are combined well, the information retained during network feature extraction is more comprehensive, and the problem of missed detections due to inaccurate localization is solved.
This paper is divided into the following parts: Section 2 introduces the reasons
for choosing YOLOv8 as the baseline and the main idea of YOLOv8; Section 3 mainly
introduces the improved method of this paper; Section 4 focuses on the experimental
results and comparative experiments; Section 5 provides the conclusions and directions of
subsequent work and improvement.
2. Related Work
Currently, camera sensors are crucial and widely used in real life, and researchers have applied them to a variety of scenarios. For example, Zou et al. proposed a new camera-sensor-based obstacle detection method for day and night operation of a traditional excavator [1]. Additionally, robust multi-object tracking with radar-camera sensor fusion has been proposed by Sengupta et al. [26]. There is also the camera-sensor approach proposed by Bharati for assisted navigation for people with visual impairments [27]. However, to be applicable in real life, real-time detection is the most important requirement, so we used the most popular one-stage approach. The YOLO family of algorithms is the state of the art for real-time performance.
YOLOv8 adopts a task-aligned assignment strategy: for each ground-truth instance, the anchors with the highest alignment metric are selected as positive samples, the other anchors are treated as negative samples, and the network is then trained through the loss function. After the above improvements, YOLOv8 is 1% more accurate than YOLOv5, making it the most accurate detector so far.
$$t = s^{\alpha} \times u^{\beta} \quad (2)$$
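To make Equation (2) concrete, the following is a minimal sketch of this alignment metric, assuming (as in the task-aligned assigner that YOLOv8 adopts) that s is the predicted classification score for the target class, u is the IoU between the predicted and ground-truth boxes, and α and β are weighting hyperparameters; the function name and the default values shown are illustrative assumptions, not the exact implementation used here.

```python
import torch

def alignment_metric(cls_score: torch.Tensor, iou: torch.Tensor,
                     alpha: float = 0.5, beta: float = 6.0) -> torch.Tensor:
    """Task-alignment metric t = s^alpha * u^beta, computed per anchor.

    cls_score: predicted classification score s for the target class, shape (N,)
    iou:       IoU u between each predicted box and the ground-truth box, shape (N,)
    Anchors with the largest t for a given instance are taken as positive samples.
    """
    return cls_score.pow(alpha) * iou.pow(beta)
```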
In addition, small objects are easily partially blocked by objects of other sizes, making them difficult to distinguish and locate in an image.
In order to solve the above problems, we propose a detection algorithm that greatly improves the detection of small-size objects while maintaining the detection performance on normal-size objects. First, we propose the MDC module for the downsampling operation, which concatenates the outputs of a depth-wise separable convolution, a max-pooling layer, and a 3 × 3 convolution with stride = 2. This fully supplements the information lost by each branch during downsampling and preserves the contextual information of feature extraction more completely. Secondly, the feature fusion method is improved to better combine shallow and deep information, so that the information retained during network feature extraction is more comprehensive, which solves the problem of objects being missed because of inaccurate localization or because the network is misled by large-size targets. Finally, a DC module (depth-wise separable convolution + 3 × 3 convolution) that is continuously stacked and fused is proposed to form a new network structure, and it replaces the C2f module in front of the detection head. This increases the depth of the whole structure and obtains higher resolution without significant computational cost; it captures more contextual information and effectively alleviates the low detection accuracy caused by object overlap.
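To illustrate the MDC idea, here is a minimal PyTorch sketch under our reading of the description above: three parallel stride-2 branches (a depth-wise separable convolution, max pooling, and a 3 × 3 convolution) whose outputs are concatenated. The channel split, normalization, and activation choices are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depth-wise separable convolution: depth-wise 3x3 followed by point-wise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class MDC(nn.Module):
    """Illustrative MDC downsampling block: concatenate three stride-2 branches."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_branch = c_out // 3                                  # assumed roughly even split
        self.branch_dw = DWConv(c_in, c_branch, stride=2)      # depth-wise separable branch
        self.branch_pool = nn.Sequential(                      # max-pooling branch
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),
            nn.Conv2d(c_in, c_branch, 1, bias=False),
        )
        self.branch_conv = nn.Conv2d(c_in, c_out - 2 * c_branch,   # 3x3 convolution, stride 2
                                     3, stride=2, padding=1, bias=False)

    def forward(self, x):
        # Concatenating the three halved-resolution branches keeps the context
        # that any single downsampling operation would discard on its own.
        return torch.cat(
            [self.branch_dw(x), self.branch_pool(x), self.branch_conv(x)], dim=1
        )
```

For example, MDC(64, 128) halves the spatial resolution while increasing the channel count, like a stride-2 convolution, but retains information from all three branches.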
Figure 3. In the figure, orange represents the detector for small objects, blue represents the detector for medium objects, and purple represents the detector for large objects. (a) FPN builds a top-down pathway and adds lateral connections between feature maps at different levels, fusing them into a feature pyramid with multi-scale information so that the high-frequency detail contained in low-level features is better utilized. (b) PAN adds a path aggregation module between the encoder and decoder, aggregating features from different scales into feature maps containing multiple resolutions; this hierarchical feature pyramid structure makes full use of feature information at different scales, thereby improving the accuracy of semantic segmentation. (c) YOLOv8 adopts the combination of SPP and PAN, which improves the accuracy and efficiency of object detection. (d) The feature fusion method proposed in this paper.
Figure 4. (a) The C2f module, which is designed by referring to the ideas of the C3 module and ELAN so that YOLOv8 can obtain richer gradient flow information while remaining lightweight. (b) The network structure proposed in this paper. It not only adopts the ideas of DenseNet and VOVNet but also replaces the original convolution with a parallel cascade of convolutions and depth-wise separable convolutions. (c) The basic block in the network architecture, which is composed of convolutions and depth-wise separable convolutions.
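Following Figure 4c, a minimal sketch of the basic DC block is given below, assuming an ordinary 3 × 3 convolution in parallel with a depth-wise separable convolution and channel concatenation of the two branches (an assumption based on the caption, not the authors' exact layer configuration); DWConv refers to the depth-wise separable convolution defined in the MDC sketch above.

```python
import torch
import torch.nn as nn

class DCBlock(nn.Module):
    """Illustrative DC block: a 3x3 convolution in parallel with a depth-wise separable convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out // 2, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_out // 2),
            nn.SiLU(),
        )
        self.dwconv = DWConv(c_in, c_out - c_out // 2)  # depth-wise separable branch

    def forward(self, x):
        # Concatenating the two branches keeps both standard and depth-wise features,
        # in the spirit of the DenseNet/VOVNet-style feature reuse mentioned above.
        return torch.cat([self.conv(x), self.dwconv(x)], dim=1)
```

Stacking several such blocks and concatenating their intermediate outputs, as DenseNet and VOVNet do, would then form the module that replaces C2f in front of the detection head.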
4. Experiments
The new algorithm was trained and tested on the Visdrone dataset at each improvement stage and compared with YOLOv8. In order to verify that the algorithm improves the detection accuracy of small-size targets without reducing the accuracy at other scales, comparative experiments were carried out on the PASCAL VOC2007 dataset and the Tinyperson dataset. Finally, we selected complex scene pictures in different scenarios to compare the detection effects of the proposed algorithm and the YOLOv8 algorithm in actual scenes.
Repeated experiments showed that the algorithm begins to converge after roughly 120 iterations. Based on the available hardware and multiple experimental attempts, we set the following parameters: batch size = 8 and epochs = 200.
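As an illustration only, a training run with these settings might look as follows using the Ultralytics YOLOv8 Python API; the model configuration file dc-yolov8s.yaml and the dataset file VisDrone.yaml are hypothetical placeholders, not files released with this paper.

```python
from ultralytics import YOLO

# Hypothetical configuration names; the architecture and dataset YAMLs are placeholders.
model = YOLO("dc-yolov8s.yaml")   # modified DC-YOLOv8 architecture definition
model.train(
    data="VisDrone.yaml",         # dataset description (image paths + class names)
    epochs=200,                   # training length used in the paper
    batch=8,                      # batch size used in the paper
    imgsz=640,                    # input resolution (assumed, not stated in the text)
)
```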
$$P = \frac{TP}{TP + FP} \quad (3)$$

$$R = \frac{TP}{TP + FN} \quad (4)$$
TP is the number of correctly predicted bounding boxes, FP is the number of incorrectly
judged positive samples, and FN is the number of undetected targets.
Average precision (AP) is the average accuracy of the model for a single category, and mean average precision (mAP) is the mean of the per-category AP values, where k is the number of categories. AP and mAP are computed as shown in Equations (5) and (6).
$$AP = \int_0^1 p(r)\,dr \quad (5)$$

$$mAP = \frac{1}{k}\sum_{i=1}^{k} AP_i \quad (6)$$
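As a small illustration of Equations (3)–(6), the sketch below computes precision, recall, and a trapezoidal approximation of the area under the precision–recall curve; the exact interpolation protocol of the evaluation code used in the experiments may differ.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall as in Equations (3) and (4)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Approximate AP = integral of p(r) over [0, 1] with the trapezoidal rule (Equation (5))."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

def mean_average_precision(ap_per_class) -> float:
    """mAP = mean of the per-class AP values (Equation (6))."""
    return float(np.mean(ap_per_class))
```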
The Visdrone dataset was collected by the AISKYEYE team at Tianjin University, China. It was acquired by UAVs and has wide coverage; it was collected in different scenes, weather, and lighting conditions, so it contains numerous small-size targets in complex environments. The dataset also provides attributes such as scene visibility, object class, and occlusion. The Visdrone dataset is extensive and authoritative and matches the content studied in this experiment in all respects, so we used it for the control experiments.
In order to clearly show the validity of the experiment, mAP0.5 and mAP0.5:0.9 were used as the evaluation indices. The test results are shown in Table 1.
Table 1 shows that, for the detection of small-size targets in complex scenes, the improved algorithm brings a certain improvement at each stage. Additionally, the recall rate is improved by 2%, which indicates that there is still considerable room for improvement. This demonstrates that the three improvements made in this work are clearly effective: (a) The improved downsampling method fully supplements the information lost in the downsampling process and preserves context information during feature extraction more completely. (b) The improved feature fusion method effectively prevents small targets from being ignored during learning because of their location information. (c) The improved network structure effectively solves the problem of losing important information by being misled by large-size objects during feature extraction. The experimental results show that the improvement at each stage enhances the learning ability of the model.
In order to compare the detection effect of different types of objects in the DC-YOLOv8
algorithm, we recorded the mAP of 10 kinds of objects in the Visdrone dataset, and the
specific results are shown in Figure 6. From the results, we can see that there are four
categories in which the recognition accuracy is higher than the average level of the whole
dataset. The modified algorithm shows steady improvement with larger objects such as cars
and great improvement with smaller objects such as tricycles, bicycles, awning-tricycles, etc.
Figure 6. Comparing the 10 categories of YOLOv8 and DC-YOLOv8: blue is the result of DC-YOLOv8
proposed in this paper, orange is the result of YOLOv8, and gray is the accuracy of the difference
between the two algorithms.
The reasons why the DC-YOLOv8 algorithm performs better than other algorithms were analyzed: (a) Most classical algorithms use FPN + PAN for feature fusion, and small-size targets are easily overwhelmed by normal-size targets as features are extracted layer by layer, so most of their information is lost. The feature fusion method of DC-YOLOv8 fuses shallow information into the final result well and effectively avoids the loss of shallow-layer information. (b) Unimportant information is automatically ignored during feature extraction, and the few pixels of small-size targets tend to be ignored as well, which reduces accuracy. DC-YOLOv8, however, adopts the ideas of DenseNet and VOVNet, using a deeper network that is able to learn more detailed information. The training losses of DC-YOLOv8 and YOLOv8 are shown in Figure 7, from which we can see that DC-YOLOv8 reduces all three loss components: box loss, class loss, and DFL loss.
In order to intuitively see the detection effect of DC-YOLOv8, two sets of complex-scene images were selected for testing. The weight files with the highest accuracies for DC-YOLOv8 and YOLOv8 were retained and used for the test comparison. The images were selected to contain complex scenes with targets of various sizes and with overlap, so that the differences between DC-YOLOv8 and YOLOv8 can be clearly seen.
Among them, Figure 8 shows the comparison in a complex life scene (using the best weights trained on the Visdrone dataset), and Figure 9 shows the detection comparison for normal-sized objects (using the best weights trained on the PASCAL VOC2007 dataset). It can be seen from Figures 8 and 9 that DC-YOLOv8 both detects more targets and detects them with higher accuracy than YOLOv8.
In the first group of comparison experiments, images in the Visdrone dataset with com-
plex scenes, more interference, and overlap were selected. Figure 8 shows that YOLOv8
falsely detected the leftmost part of the picture due to its dark color. False detections
occurred at the middle tent, mainly due to overlapping objects. The right bike was misde-
tected due to the complex and overlapping environment nearby. On the far right, it was
not detected because of incomplete information. In the middle, false detection occurred
because of overlap. It can be seen that although YOLOv8 has many advantages, there are
still some problems with small-size targets. In contrast, DC-YOLOv8 can accurately detect
the right target when only partial information is available and can accurately detect the
target in complex and overlapping scenes without false detections or missed detections.
It can be seen from Figure 8 that the detection effect of DC-YOLOv8 is better than that of
YOLOv8 when the target size is small.
For the second set of comparison experiments, images in the PASCAL VOC2007 dataset
with overlap between multiple people were selected. Figure 9 shows that two people
overlap in front of the middle door, and we can only see the head of the person behind due
to occlusion from the person in front. In this case, the YOLOv8 detector failed to detect the
person behind, with only their head visible. At the position of the cat, there was a false
detection (detecting the human arm as a cat) because the color of the human arm is similar
to that of a cat. In the case of severe overlap on the rightmost side, YOLOv8 did not detect
the person behind. In contrast, in the case of overlap, DC-YOLOv8 accurately detected the
person near the door and the person to the far right, and there was no false detection due
to similar colors. It can be seen from Figure 9 that DC-YOLOv8 also outperforms YOLOv8
in the detection of normal-sized objects.
Figure 8. Experimental comparison of complex scenes in life. From the figure, it can be concluded that the inference times of YOLOv5 and YOLOv7-Tiny are less than that of DC-YOLOv8 and that the inference time of YOLOv8 is the same as that of DC-YOLOv8, but their detection results are not as accurate as those of DC-YOLOv8. (a) Test results of YOLOv5 with an inference time of 10 ms. (b) Test results of YOLOv7-Tiny with an inference time of 8.2 ms. (c) Test results of YOLOv8 with an inference time of 12 ms. (d) Test results of DC-YOLOv8 with an inference time of 12 ms.
Figure 9. Comparison diagram of the normal-size target experiment. From the figure, it can be concluded that the inference times of YOLOv5 and YOLOv7-Tiny are less than that of DC-YOLOv8 and that the inference time of YOLOv8 is the same as that of DC-YOLOv8, but their detection results are not as accurate as those of DC-YOLOv8. (a) Test results of YOLOv5 with an inference time of 9 ms. (b) Test results of YOLOv7-Tiny with an inference time of 8.4 ms. (c) Test results of YOLOv8 with an inference time of 12 ms. (d) Test results of DC-YOLOv8 with an inference time of 12 ms.
5. Conclusions
This paper proposed a small-size object detection algorithm based on a camera sensor; different from traditional camera sensors, we combined a camera sensor with artificial intelligence. Some problems in the newly released YOLOv8 and in existing small-size object detection algorithms were then analyzed and solved. New feature fusion methods and network architectures were proposed, which greatly improved the learning ability of the network. Tests and comparisons were carried out on the Visdrone dataset, the Tinyperson dataset, and the PASCAL VOC2007 dataset. Through analysis and experiments, the feasibility of each part of the optimization was proved. DC-YOLOv8 outperformed other detectors in both accuracy and speed, and small targets in various complex scenes were easier to capture. At present, the application of our proposed algorithm to traffic safety detection is the most efficient. As shown in Figure 6, the accuracy of car detection is the highest, so the algorithm is best suited to traffic safety applications such as detecting traffic signs, vehicles, and pedestrians.
In the future, we will continue to conduct in-depth research on camera sensors and
will strive to outperform existing detectors in detection accuracy at various sizes as soon
as possible.
Author Contributions: J.G. (Junmei Guo), H.L. (Haiying Liu) and J.G. (Jason Gu) gave technical
and writing method guidance as instructors; H.L. (Haitong Lou), X.D., L.B. and H.C. performed the
experiments and writing together as classmates. All authors have read and agreed to the published
version of the manuscript.
Funding: This paper is funded by Junmei Guo. This work was supported by QLUTGJHZ2018019.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Zou, M.Y.; Yu, J.J.; Lv, Y.; Lu, B.; Chi, W.Z.; Sun, L.N. A Novel Day-to-Night Obstacle Detection Method for Excavators based on
Image Enhancement and Multi-sensor Fusion. IEEE Sens. J. 2023, 23, 10825–10835.
2. Liu, H.; Member, L.L. Anomaly detection of high-frequency sensing data in transportation infrastructure monitoring system
based on fine-tuned model. IEEE Sens. J. 2023, 23, 8630–8638. [CrossRef]
3. Zhu, F.; Lv, Y.; Chen, Y.; Wang, X.; Xiong, G.; Wang, F.Y. Parallel Transportation Systems: Toward IoT-Enabled Smart Urban Traffic
Control and Management. IEEE Trans. Intell. Transp. Syst. 2020, 21, 4063–4071. [CrossRef]
4. Thevenot, J.; López, M.B.; Hadid, A. A Survey on Computer Vision for Assistive Medical Diagnosis from Faces. IEEE J. Biomed.
Health Inform. 2018, 22, 1497–1511. [CrossRef] [PubMed]
5. Abadi, A.D.; Gu, Y.; Goncharenko, I.; Kamijo, S. Detection of Cyclist’s Crossing Intention based on Posture Estimation for
Autonomous Driving. IEEE Sens. J. 2023, 2023, 1. [CrossRef]
6. Singh, G.; Stefenon, S.F.; Yow, K.C. Interpretable Visual Transmission Lines Inspections Using Pseudo-Prototypical Part
Network. Mach. Vis. Appl. 2023, 34, 41. [CrossRef]
7. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
8. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 4510–4520.
9. Howard, A.; Wang, W.; Chu, G.; Chen, L.; Chen, B.; Tan, M. Searching for MobileNetV3. In Proceedings of the IEEE Conference
on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1314–1324.
10. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 6848–6856.
11. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet V2: Practical guidelines for efficient cnn architecture design. In Proceedings of
the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Volume 11218, pp. 122–138.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
13. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587.
15. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13
December 2015; pp. 1440–1448.
16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In
Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015;
Volume 28.
17. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer
Vision, Seattle, WA, USA, 13–19 June 2020; Volume 42, pp. 386–397.
18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
19. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, Hawaii, USA, 21–26 July 2017; pp. 6517–6525.
20. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
21. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection
Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
22. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696.
23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Volume 9905,
pp. 21–37.
24. Liu, H.; Duan, X.; Chen, H.; Lou, H.; Deng, L. DBF-YOLO:UAV Small Targets Detection Based on Shallow Feature Fusion. IEEJ
Trans. Electr. Electron. Eng. 2023, 18, 605–612. [CrossRef]
25. Liu, H.; Sun, F.; Gu, J.; Deng, L. SF-YOLOv5: A Lightweight Small Object Detection Algorithm Based on Improved Feature Fusion
Mode. Sensors 2022, 22, 5817. [CrossRef] [PubMed]
26. Sengupta, A.; Cheng, L.; Cao, S. Robust multiobject tracking using mmwave radar-camera sensor fusion. IEEE Sens. Lett. 2022, 6,
1–4. [CrossRef]
27. Bharati, V. LiDAR+ Camera Sensor Data Fusion On Mobiles With AI-based Virtual Sensors to Provide Situational Awareness
for the Visually Impaired. In Proceedings of the 2021 IEEE Sensors Applications Symposium (SAS), Sundsvall, Sweden, 23–25
August 2021; pp. 1–6.
28. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning
capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle,
WA, USA, 13–19 June 2020; pp. 1571–1580.
29. Lin, T.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA, 21–26 July 2017; pp. 2117–2125.
30. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
31. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166.
32. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
33. Cao, Y.; Chen, K.; Loy, C.C.; Lin, D. Prime Sample Attention in Object Detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11583–11591.
34. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA, 21–26 July 2017; pp. 2261–2269.
35. Lee, Y.; Hwang, J.W.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object
detection. In Proceedings of the IEEE Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 752–760.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.