
sensors

Article
Efficient-Lightweight YOLO: Improving Small Object
Detection in YOLO for Aerial Images
Mengzi Hu 1, Ziyang Li 1,*, Jiong Yu 1,2, Xueqiang Wan 1, Haotian Tan 2 and Zeyu Lin 1

1 School of Software, Xinjiang University, Urumqi 830091, China; [email protected] (M.H.); [email protected] (J.Y.); [email protected] (X.W.); [email protected] (Z.L.)
2 College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; [email protected]
* Correspondence: [email protected]

Abstract: The most significant technical challenges of current aerial image object-detection tasks are
the extremely low accuracy for detecting small objects that are densely distributed within a scene
and the lack of semantic information. Moreover, existing detectors with large parameter scales are
unsuitable for aerial image object-detection scenarios oriented toward low-end GPUs. To address this
technical challenge, we propose efficient-lightweight You Only Look Once (EL-YOLO), an innovative
model that overcomes the limitations of existing detectors and low-end GPU orientation. EL-YOLO
surpasses the baseline models in three key areas. Firstly, we design and scrutinize three model
architectures to intensify the model’s focus on small objects and identify the most effective network
structure. Secondly, we design efficient spatial pyramid pooling (ESPP) to augment the representation
of small-object features in aerial images. Lastly, we introduce the alpha-complete intersection over
union (α-CIoU) loss function to tackle the imbalance between positive and negative samples in aerial
images. Our proposed EL-YOLO method demonstrates a strong generalization and robustness for the
small-object detection problem in aerial images. The experimental results show that, with the model
parameters kept below 10 M and the input image size unified at 640 × 640 pixels, the APS of
EL-YOLOv5 reached 10.8% and 10.7%, an improvement in APS of 1.9% and 2.2% compared
to YOLOv5, on the two challenging aerial image datasets DIOR and VisDrone, respectively.

Keywords: aerial images; small-object detection; model architecture; SPP; loss function

1. Introduction
In recent years, aerial image object detection has been widely used due to the rapid
development of unmanned aerial vehicles (UAV) and satellite remote-sensing technology.
As a branch of object detection, aerial image object detection can not only be applied in
areas of defense, such as military monitoring, missile guidance, and UAV combat systems,
but also plays an important role in our day-to-day lives, for example through environmental
management, traffic monitoring, and urban planning. Therefore, aerial image object
detection is of great research value and significance [1].
Traditional methods for object detection rely on manually designed features, which are
inefficient and encounter difficulty in exploiting the relevance of massive image data. Recently,
researchers have introduced deep learning techniques to the field of object detection
because their advanced semantic features and learning capabilities can provide a powerful
technical framework for extracting the rich feature information contained in high-resolution
aerial images. Meanwhile, with the development of deep learning techniques, in addition
to the commonly used convolutional neural network (CNN) [2], recurrent neural network
(RNN) [3], autoencoder (AE) [4], and generative adversarial network (GAN) [5] methods have
been widely used in object detection. In addition, the emergence of challenging natural
image datasets, such as the PASCAL Visual Object Classes (PASCAL VOC) [6,7] and Mi-
crosoft Common Objects in Context (MS COCO) [8], has further advanced the development
of object detection. Accordingly, increasing numbers of object-detection algorithms with


excellent performance on natural images have emerged, with representative algorithms
including Faster Regions with CNN (Faster R-CNN) [9], RetinaNet [10], and the You Only
Look Once (YOLO) series [11–17]. However, aerial and natural images differ significantly,
mainly in terms of the following: Firstly, the number of objects in aerial images is consider-
ably higher than in natural images. Secondly, the distribution of objects in aerial images is
denser than in natural images. Thirdly, aerial images present significant scale variation for
similar objects due to the view angle, and most targets are small. Fourthly, aerial images
are often high-resolution.
The aerial image object-detection task has strict requirements for the detection model.
Firstly, the object-detection model needs to meet real-time processing. Secondly, the object-
detection model needs to consider the model’s parameters. Embedded platforms such as
UAVs and airborne processors have limited computational resources; thus, ensuring that
real-time detection models are installed on these platforms is also a significant challenge.
Currently, in areas such as UAV remote sensing and space exploration, the chips that are
used to implement image-processing techniques such as object detection usually require a
model size of less than 10 MB to be loaded. For example, field programmable gate array
(FPGA) chips tend to be used for small volumes, customization, and real-time demanding
applications [18]. For inference, a sufficiently small model can be stored directly on the
FGPA without the limitation of memory bandwidth. In this paper, we propose a real-
time object detector, which we want to support embedded platforms such as UAVs and
airborne processors.
Consequently, the detection performance of object-detection algorithms, which are
effective for natural scenes, does not meet the needs of practical applications when imple-
mented directly on aerial images. Two main technical challenges exist. Firstly, aerial image
datasets contain many small objects, and small targets usually lack sufficient appearance
information, resulting in high false and missed detection rates for small objects in detection
tasks. Secondly, practical aerial image object-detection tasks necessitate the consideration
of computational costs, as well as the real-time processing of images, and thus present
requirements regarding the number of parameters and the timeliness of the model.
To address the above technical challenges, we have focused on improving the detection
performance of lightweight frameworks for small objects in aerial images. The YOLO
family of detectors are one-stage object detection models that can directly predict both the
position and class of objects from an input image, meeting the requirements of real-time
image processing. According to our literature review, You Only Look Once version 7
(YOLOv7) [17] is the latest version of YOLO and provides the best performance for object-
detection tasks in natural scenes. However, YOLOv7 suffers from a severe problem with
aerial images, namely that YOLOv7 is preoccupied with accuracy and uses too many tricks,
thus consuming too many computational resources and failing to meet the requirements of
a lightweight model. Therefore, considering the complexity and accuracy of the model, we
have chosen the S-scale You Only Look Once version 5 (YOLOv5s) algorithm [15] as the
baseline model for this paper.
Due to the difference between natural and aerial images, YOLOv5s fails at the task of
detecting objects in aerial images. We enhanced the original YOLOv5s model in three ways.
Firstly, YOLOv5s includes a continuous downsampling operation between the backbone
and neck, resulting in instances where small objects that cover less feature information
may directly lose their information. To solve this problem, we changed the connection of
feature maps to indirectly control the ratio of both low-level and deep feature maps in the
model architecture and maximize the retention of small-object feature information, making
the model better adaptable to small objects. Secondly, spatial pyramid pooling (SPP) [19] was
applied to YOLOv5s to enhance the information fusion of local and global features, but this
was not effective in practical small-object detection tasks. Thus, the ESPP approach was
designed to replace SPP and more effectively retain the detailed information of small objects.
Thirdly, the complete-intersection over union (CIoU) [20] of YOLOv5s was replaced by the
α-CIoU loss function [21], which enabled the model to obtain higher-quality anchor frames.
Based on the above, we proposed the efficient-lightweight You Only Look Once version 5
(EL-YOLOv5), which successfully balanced accuracy and speed in object detection. The
experimental results showed that EL-YOLOv5 met the requirements of embedded platform
deployment and outperformed the general YOLOv5 model in terms of aerial image object-
detection accuracy, while the S-scale EL-YOLOv5 (EL-YOLOv5s) also outperformed the
YOLOv7 model of comparable size.
The contributions of this paper are summarized below:
1. We modified the model architecture of YOLOv5. Intending to introduce high-resolution
low-level feature maps, we evaluated three model architectures through several
rounds of experiments and then analyzed the reasons for their superior or inferior
performance and architectural characteristics; finally, we selected the best-performing
model. The top-performing model architecture managed to maximize the precision
of the model in detecting small objects while imposing only a marginal increase in
computational overhead.
2. We designed the ESPP method based on the human visual perception system to
replace the original SPP approach, which enhanced the model’s ability to extract
features from small objects.
3. We used α-CIoU to replace the original localization loss function of the object detector.
The α-CIoU function could control the parameter α to optimize the positive and
negative sample imbalance problem in the bounding box regression task, allowing
the detector to locate small objects more quickly and precisely.
4. Our proposed embeddable S-scale EL-YOLOv5 model attained an APS of 10.8% on
the DIOR dataset and 10.7% on the VisDrone dataset. This is the highest accuracy
achieved to date among the available lightweight models, showcasing the superior
performance of our proposal.

2. Related Work
2.1. General Object Detection
Object detection synthesizes complex vision tasks such as segmentation, recognition,
and detection into a unified problem. The accuracy and real-time performance of such
an approach are critical benchmarks for the efficacy of a comprehensive computer vision
system. In simple terms, object detection solves the problem of where and what objects are
in an image. Currently, object detection based on deep learning techniques can be classified
into two main types: two-stage and one-stage object-detection algorithms. The two-stage
approach [9,22,23] splits the object-detection process into two steps: the task in the first
step is to generate candidate regions via region proposals. The detection
task in the second step uses a detection network to achieve classification recognition and
bounding box regression. The two-stage object detection models are highly accurate, but
the detection speed is often limited accordingly. Representative models include Faster
R-CNN, mask regions with CNN (Mask R-CNN) [22], and cascade regions with CNN
features (Cascade R-CNN) [23].
Unlike two-stage object detection algorithms, one-stage methods [10–17,24] discard
region proposals and directly use bounding box regression for object classification and
localization. One-stage object detection models significantly improve detection efficiency
and reduce computational overhead. However, one-stage methods suffer from a class im-
balance, making two-stage methods superior in terms of detection accuracy. As technology
continues to evolve, one-stage object detection methods are constantly being upgraded,
and their detection accuracy is improving accordingly. Representative models include the
single shot multibox detector (SSD) [24], RetinaNet, and the YOLO series.
YOLO [11], the first model in the YOLO series of one-stage object detection algorithms,
aimed at an extremely low runtime. You Only Look Once version 2 (YOLOv2) [12] used
Darknet-19 as the new feature-extraction network while borrowing ideas from the region
proposal network (RPN) [9] proposed by Faster R-CNN and introducing prior anchors
based on the previous YOLO model. However, YOLOv2 still had a low detection accuracy
for dense objects. To further improve the detection accuracy of the model, You Only Look
Once version 3 (YOLOv3) [13] introduced residual connections from the deep residual
network (ResNet) [25], updated the backbone network from Darknet-19 to Darknet-53, and
borrowed the idea of feature pyramid networks (FPN) [26] to construct three different scales
of feature maps. You Only Look Once version 4 (YOLOv4) [14] was further optimized from
YOLOv3 by incorporating the CSPDarknet53 architecture for the backbone and an FPN
for the neck, combined with path aggregation network (PAN) architecture [27], to enhance
information fusion. Meanwhile, YOLOv4 used mosaic data augmentation and introduced
the Mish activation function and CIoU loss function. YOLOv5 further extended the cross-
stage partial network structure [28] of the YOLOv4 backbone network to the neck and
proposed the spatial pyramid pooling–fast (SPPF) module. You Only Look Once version 6
(YOLOv6) [16] introduced the RepVGG structure [29] into the YOLO model architecture to
enhance the adaptability of the model to GPU devices. The YOLOv7 detection algorithm is
similar to YOLOv5 and was mainly optimized for model structure re-parameterization and
dynamic label assignment problems.
In addition to the continuous optimization of the mainstream YOLO model, certain ex-
cellent algorithms have emerged and optimized the YOLO model in special domains [30–33].
The YOLOv3 four-scale detection layers model (YOLOv3-FDL) [32] significantly improved the ability
of YOLOv3 to detect small crack features in ground-penetrating radar (GPR) images, which previously
suffered from a high missed-detection rate, mainly through multiscale fusion structures, an advanced loss function, and hyperparameter
optimization. Jiawen Wu et al. [33] proposed a local adaptive illumination-drive input-level
fusion (LAIIFusion) module, which can effectively sense the illumination in different scenes
and enable realistic remote-sensing image object-detection tasks to adapt to changing light-
ing conditions. These excellent algorithms demonstrated that YOLO is an object-detection
algorithm with a relatively wide application area and an excellent performance.
It has been found that the detection accuracy of the YOLO algorithms improved
significantly through these continuous improvements, but this comes at the cost of increased
computational overhead and model size, which hinders deploying the
model in a particular application domain. Among the current YOLO families, YOLOv5
not only provides an effective balance between speed and model complexity for object
detection but also offers the easiest deployment. Therefore, considering the needs of the
application domain of aerial image object detection, we have chosen YOLOv5s as the
baseline framework for this study.

2.2. Aerial Image Object Detection


The most prominent feature of aerial images is the high image resolution. Nevertheless,
small objects in aerial images still have a low resolution, often comprising tens or even
just a few pixels, making the learning of small-object feature information difficult for the
model. Object-detection algorithms designed for natural scenes do not perform well on
high-resolution images containing a dense distribution of small targets. Many researchers
have adopted different schemes to address this problem.
DBNet [34] trained its detector based on Cascade R-CNN and used ResNeXt-
101 [35] as the backbone network, incorporating deformable convolutions to enhance the
network's ability to handle multiscale objects in aerial images. The drone networks with
effective fusion strategy (DNEFS) project [34] used YOLOv5, Cascade R-CNN, and FPN
as the baseline models while incorporating attention mechanisms, double-headers, and
other effective strategies to achieve a higher detection accuracy. Transformer prediction
heads–You Only Look Once version 5 (TPH-YOLOv5) [36] addressed the problem of
scale variation in aerial image overhead angles using transformer prediction heads (TPH),
the convolutional block attention model (CBAM) [37], and a series of data-enhancement
strategies based on the YOLOv5 model. The stronger visual information for tiny-object
detection (VistrongerDet) project [38] integrated the FPN, region of interest (ROI), and
head-level data enhancement components to largely mitigate the detrimental effects of


aerial image scale variation and small object size on detection.
An analysis of the above well-performing algorithms revealed two problems with the
current aerial image object-detection algorithms that address the low detection accuracy for
small objects. First, most of these algorithms focused on the average detection accuracy of
the model for aerial images, rather than for small objects. Second, these algorithms tended
to consider only the use of cascaded networks and the addition of new feature enhancement
modules to improve model accuracy, ignoring model complexity and memory loss. At
the same time, most of these algorithms were adapted from two-stage object detection
models, which have a high detection accuracy but incur substantial resource overheads
in the computation process, leading to their low application value. In contrast to existing
research, the present study fully considered the computational overhead and addressed
the problem of low detection precision for densely distributed small objects in aerial image
object-detection tasks.

3. Materials and Methods


3.1. Fundamental Models
3.1.1. Baseline Model
The YOLO family has become a very popular model framework in the field of object
detection in recent years. Compared with the existing YOLO models, YOLOv5 provides a
good balance between memory loss and model accuracy. The structure of YOLOv5 can be
broadly subdivided into three parts according to function: the backbone network, neck,
and prediction head. Firstly, the backbone network mainly extracts features from the input
images to pave the way for subsequent object recognition and localization. Secondly, the
neck further enhances the fusion of the feature information extracted by the backbone and
constructs three scales of feature maps. Finally, the head achieves object classification and
localization based on these feature maps and completes the object detection task.
In conclusion, as a one-stage object detection algorithm, YOLOv5 has been widely
applied in various fields due to its simple network structure and high detection efficiency.
YOLOv5 can control the size and complexity of the model by setting different width and
depth coefficients, which can in turn be divided into four scales: S, M, L, and X. The
network architectures of these different scales are identical. Considering that our model
may need to be deployed in the application domain, we had to limit the computational
resource overhead to a certain extent. We selected YOLOv5s, which is of relatively small
size and low complexity, as the baseline model. The YOLOv5s balances detection precision
and speed, meeting the requirements of a lightweight embedded model.
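For orientation, the width and depth coefficients mentioned above are plain configuration constants in the public YOLOv5 release; the short sketch below reproduces them and shows how a layer's channel count is scaled. It is an illustration of the scaling mechanism only, not code from EL-YOLO.

# Depth/width multipliers used by the public YOLOv5 configuration files
# (yolov5s/m/l/x.yaml); shown only to illustrate how one architecture
# definition is scaled into the four model sizes S, M, L, and X.
YOLOV5_SCALES = {
    "s": {"depth_multiple": 0.33, "width_multiple": 0.50},
    "m": {"depth_multiple": 0.67, "width_multiple": 0.75},
    "l": {"depth_multiple": 1.00, "width_multiple": 1.00},
    "x": {"depth_multiple": 1.33, "width_multiple": 1.25},
}

def scaled_channels(base_channels: int, scale: str) -> int:
    """Scale a layer's channel count by the width multiple (rounded to a multiple of 8)."""
    w = YOLOV5_SCALES[scale]["width_multiple"]
    return max(8, int(round(base_channels * w / 8)) * 8)

print(scaled_channels(1024, "s"))  # 512 channels for the S-scale model

With the S-scale multipliers, a 1024-channel layer in the full-width definition becomes a 512-channel layer, which is the main reason YOLOv5s remains small enough for embedded use.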

3.1.2. Receptive Field Block


The receptive field block (RFB) [39] is a module that enhances the feature representa-
tion of neural networks by mimicking the receptive field mechanism of the human visual
system. The RFB further models the relationship between receptive field size and receptive
field eccentricity in the structure of the human visual system by constructing multibranch
convolutional layers with different sizes of convolutional kernels and atrous pooling or
convolution layers. With this structure, on the one hand, the RFB can improve the feature
representation capability of lightweight models with a lower computational burden. On
the other hand, the RFB module can improve the discriminative power and robustness of
the object features, improving the performance of object detectors. The RFB has proven to
be an effective method when successfully applied to improve one-stage object detectors.
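As a side note on the receptive-field arithmetic that the RFB exploits, the effective extent of a dilated (atrous) convolution kernel is k + (k − 1)(d − 1); the following minimal check is ours and only illustrates that relationship.

def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Effective spatial extent of a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A 3x3 kernel with dilation rates 1, 3, 5, 7 covers 3, 7, 11, 15 pixels,
# which is how a multibranch block widens its receptive fields cheaply.
print([effective_kernel(3, d) for d in (1, 3, 5, 7)])  # [3, 7, 11, 15]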

3.2. Proposed Model

The original YOLOv5s exhibited a suboptimal performance in practical aerial image
object detection tasks, especially for small objects. To counter the shortcomings inherent
to the existing model and address the unique challenges posed by aerial image detection
tasks, we enhanced the model in three key areas: the model architecture, the SPP module,
and the loss function.
The architectural blueprint of the proposed model, EL-YOLOv5, specifically tailored
to the domain of aerial image object detection, is depicted in Figure 1.

[Figure 1: the EL-YOLOv5 backbone, neck, and head for a 608 × 608 × 3 input, with detection heads at 19 × 19, 38 × 38, 76 × 76, and 152 × 152; CBS = Conv + BN + SiLU, C3 = 3 × CBS + Bottleneck, ESPP = efficient spatial pyramid pooling.]
Figure 1. The structure of EL-YOLOv5. In contrast to the baseline model, the 152 × 152 output layer indicates the addition of the remaining connected small-object detection head. Meanwhile, we designed a new ESPP module to improve performance.

3.2.1. Model Architecture
According to the YOLOv5s model architecture, the process of object detection comprises
three main parts: extracting features; building feature maps with different scales; and
regressing feature maps for classification and regression. Accordingly, the feature map is of
great importance for object detection. We can further classify feature maps into low-level
and deep feature maps according to their distance from the input layer.
Low-level feature maps are located close to the input layer and are extracted by the
shallow layers of the model. The shallow network has a smaller receptive field, and the
overlapping area between receptive fields is also smaller. As shown in Figure 2a, low-level
feature maps contain more pixel-wise information, and this fine-grained information
includes the color, texture, edge, and corner information of the image. Generally speaking,
low-level feature maps have a higher resolution and contain more location and detailed
information beneficial for small-object detection. However, because they have undergone
fewer convolutions, their semantic content is lower. On the contrary, deep feature maps
are generally farther away from the input and closer to the output. As the image
information is continuously compressed by convolution, the receptive field of the deep
network increases, as does the overlapping region between receptive fields. As shown
in Figure 2b, deep feature maps contain more semantic information but have a lower
resolution, appearing only as colored spots, and their small-object perception is poorer.
Therefore, coordinating shallow and deep feature maps by improving the model architecture
to increase the detection accuracy for small objects in aerial images is an urgent issue.
The backbone of YOLOv5s contains continuously downsampled convolutional layers,
which have a detrimental effect on the detection accuracy of small objects. On the one hand,
during the feature-extraction stage, successive downsampling continuously reduces the
size of the output feature map. When the convolutional downsampling rate is too large, it
may cause the small objects to be much smaller than the downsampling step size, which
can easily lead to a loss of small-object feature information in the feature-extraction phase.
On the other hand, YOLOv5s selects three deep feature maps for detection in the prediction
stage. However, the further down the feature map is passed, the less information about
small objects is retained, until no feature information about small objects is retained at all.
For the above two reasons, the original YOLOv5s model performs poorly in aerial image
small-object detection.

Figure 2. Example visualization results. (a) Example visualization results of low-level feature maps;
(b) example visualization results of deep feature maps.

To address the above issues, we aimed to introduce a low-level feature map containing
more location information and other details, which would be very beneficial for small-object
detection. As shown in Figure 3, we continuously adjusted the weights of the low-level and
deep feature maps in the model based on the baseline model to increase the sensitivity for
small-object detection as much as possible. In other words, by continuously increasing the
weights of the low-level features, we made the model pay more attention to the feature maps
of small objects, which maximized the capacity for small-object detection.

[Figure 3: the backbone-neck-head layouts of the baseline model and Models 1-3, with detection heads at 19 × 19, 38 × 38, 76 × 76, and 152 × 152.]
Figure 3. The network architecture of the baseline model and three modified model architectures.
The red box represents the shallow feature level, and the green box represents the deep feature level.

Therefore, we aimed to introduce a high-resolution low-level feature map and designed
three model architectures based on the baseline model. Figure 3 shows the following:
Model 1 retained the deep feature map and large-object detection head with a detection
layer scale of 19 × 19 of YOLOv5s and introduced a low-level feature map and small-object
detection head with a detection layer scale of 152 × 152 on the basis of the baseline model.
Model 2 retained the deep feature map, low-level feature map, and small-object detection
head of Model 1 and removed the large-object detection head of the baseline model. Model
3 retained the low-level feature map and small-object detection head from Model 1 while
removing the deep feature map and large-object detection head. The proportion of shallow
feature maps in the overall model architecture increased gradually from the baseline model
to Model 3.
We indirectly controlled the proportion of low-level and deep feature maps in the
model architecture by changing the connectivity of the feature maps. Then, we compared
the experimental results of the model structure with different feature map proportions to
obtain the best model architecture for small-object detection. It is thereby demonstrated
that Model 1 effectively balanced both detailed and semantical information, thus avoiding
the loss of fine-grained data that would otherwise hinder the detection of small objects
during the continuous downsampling process observed in the baseline model. Simultane-
ously, by retaining the deep feature map, Model 1 maintained the capacity to control the
model’s complexity while also somewhat reducing the noise accumulation resulting from
introducing the low-level feature map. Ultimately, Model 1 substantially augmented the
detection accuracy of YOLOv5s for small objects in aerial images while maintaining a small
memory footprint.
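To make the scale bookkeeping concrete, the snippet below reproduces the stride arithmetic implied by Figure 1 for a 608 × 608 input; the stride values are read off the figure rather than taken from released EL-YOLO code.

def detection_grid_sizes(input_size: int = 608, strides=(4, 8, 16, 32)):
    """Map downsampling strides to detection-grid resolutions."""
    return {s: input_size // s for s in strides}

# 608 // 4 = 152, 608 // 8 = 76, 608 // 16 = 38, 608 // 32 = 19,
# matching the 152x152, 76x76, 38x38 and 19x19 heads in Figure 1.
print(detection_grid_sizes())

# An object smaller than the stride of the deepest retained map (32 px here)
# can collapse to less than one cell on the 19x19 grid, which is why the
# stride-4, 152x152 head is added for small objects.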

3.2.2. Efficient Spatial Pyramid Pooling


The SPP module in YOLOv5s fails to be optimally effective in aerial image small-
object detection, for the following two reasons. First, the main reason for the low accuracy
of small-object detection is the lack of sufficient feature information regarding the small
objects themselves. Second, YOLOv5s uses the SPP module to extract information from
different receptive fields, but this module does not fully reflect the semantic relationship
between global and local information. Therefore, the SPP module’s ability to aggregate
multiscale contextual information is inadequate, which makes it difficult for YOLOv5s to
recognize objects with large-scale variability. To solve these problems, we needed to build
a completely new feature fusion module to effectively integrate multiscale object features
and help the model to capture more abundant and complex features without incurring too
large a computational burden.
Current research shows that one can obtain excellent high-level features by increasing
the depth of a model, which results in significant performance gains. However, this
involves correspondingly higher computational costs, substantially slowing down the
model inference.
The paper introducing the RFB module postulated that in the human visual system,
the size of the population receptive field is proportional to the eccentricity of the receptive
field. By constructing a corresponding structure to model the relationship between the
receptive field and the eccentricity of the model, we could enhance the feature represen-
tation capability of the model’s low-level network. Therefore, inspired by the RFB, we
constructed ESPP by taking the complex backgrounds and large-scale variation in aerial
images into account, as shown in Figure 4.
The implementation of ESPP is illustrated in Algorithm 1. Steps 1 and 2 of Algorithm 1
are aimed at obtaining the output channel number Cout1 of the following ordinary convolu-
tion by parameter a, which can effectively control the size of the module. Step 3 performs
an ordinary 1 × 1 convolution on the input to obtain out1 . Steps 4, 6, and 8 construct
the perceptual fields of 3 × 3, 5 × 5, and 7 × 7 convolutions, respectively, to obtain out2 ,
out3 , and out4 through a series of ordinary 3 × 3 convolutions. Steps 9, 10, 11, and 12
perform a 3 × 3 atrous convolution operation for out1 , out2 , out3 , and out4 , respectively,
and set corresponding atrous rates of 1, 3, 5, and 7. Thus, the dependence of the receptive
field on the eccentricity can be efficiently simulated and then obtained for out5 , out6 , out7 ,
and out8 . Step 14 concatenates the branches of different receptive fields in the channel
dimension to obtain out9 . Steps 16 and 17 perform a cross-scale fusion of previous outputs
and the shortcut. Step 19 adds nonlinearity to the output by the rectified linear unit (ReLU)
activation function. Step 20 returns the final output. It is demonstrated that Algorithm 1 can
efficiently deepen the shallow network feature representation of the model to obtain more
boundary information on small objects and improve the small-object detection accuracy.
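To make the step-by-step description above concrete, the following PyTorch sketch mirrors Algorithm 1 as we read it (a 1 × 1 branch, stacked 3 × 3 branches, atrous rates 1/3/5/7, channel-wise concatenation, a 1 × 1 shortcut, and the 0.8-weighted fusion with ReLU). The channel handling and other details are our assumptions and may differ from the authors' implementation.

import torch
import torch.nn as nn

class ESPP(nn.Module):
    """Minimal sketch of the ESPP block described in Algorithm 1 (assumed details)."""

    def __init__(self, c_in: int, c_out: int, a: int = 4):
        super().__init__()
        c_mid = c_out // a                      # step 2: intermediate width Cout1 = Cout / a
        c_branch = c_out // 4                   # each of the 4 branches contributes Cout/4 channels
        self.conv1 = nn.Conv2d(c_in, c_mid, 1)              # step 3: 1x1 conv -> out1
        self.conv3 = nn.Conv2d(c_in, c_mid, 3, padding=1)   # step 4: 3x3 conv -> out2
        self.conv5 = nn.Conv2d(c_mid, c_mid, 3, padding=1)  # step 6: stacked 3x3 -> out3 (5x5 field)
        self.conv7 = nn.Conv2d(c_mid, c_mid, 3, padding=1)  # step 8: stacked 3x3 -> out4 (7x7 field)
        # steps 9-12: dilated 3x3 convolutions with atrous rates 1, 3, 5, 7
        self.dil = nn.ModuleList(
            nn.Conv2d(c_mid, c_branch, 3, padding=r, dilation=r) for r in (1, 3, 5, 7)
        )
        self.shortcut = nn.Conv2d(c_mid, c_out, 1)          # step 16: 1x1 conv on out1
        self.act = nn.ReLU(inplace=True)                    # step 19

    def forward(self, x):
        out1 = self.conv1(x)
        out2 = self.conv3(x)
        out3 = self.conv5(out2)
        out4 = self.conv7(out3)
        branches = [d(o) for d, o in zip(self.dil, (out1, out2, out3, out4))]
        out9 = torch.cat(branches, dim=1)                   # step 14: channel-wise concatenation
        net = 0.8 * out9 + self.shortcut(out1)              # step 17: weighted cross-scale fusion
        return self.act(net)                                 # step 20

# quick shape check: ESPP(256, 256)(torch.randn(1, 256, 19, 19)) -> (1, 256, 19, 19)

The module keeps the input spatial size because each 3 × 3 atrous convolution uses padding equal to its rate.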
Figure 4. The ESPP module was constructed by combining multiple branches with different convolution
kernel sizes and atrous convolution layers. The multiple kernels resemble receptive fields of
different sizes, while the atrous convolution layers assign a separate atrous rate to each branch to
simulate the relationship between receptive field size and eccentricity.

Algorithm 1 Efficient Spatial Pyramid Pooling (ESPP)
Input: The input feature layer, x; the number of input channels of x, Cin; the number of output channels of x, Cout; the parameter that controls the size of the model, a.
Output: The output feature layer, out.
1: Control the number of output feature channels of the convolution process by a.
2: Cout1 = Cout / a. // In general, the parameter a is set to 4 or 8, which gives excellent control over the number of parameters of the model.
3: Make an ordinary 1*1 convolution at x, get the output out1.
4: Make an ordinary 3*3 convolution at x, get the output out2.
5: The stacking of two 3*3 convolutions gives the same perceptual field as one 5*5 convolution:
6: Make an ordinary 3*3 convolution at out2, get the output out3.
7: The stacking of three 3*3 convolutions gives the same perceptual field as one 7*7 convolution:
8: Make an ordinary 3*3 convolution at out3, get the output out4.
9: Make a 3*3 atrous convolution with the atrous rate of 1 at out1, get the first branch output out5.
10: Make a 3*3 atrous convolution with the atrous rate of 3 at out2, get the second branch output out6.
11: Make a 3*3 atrous convolution with the atrous rate of 5 at out3, get the third branch output out7.
12: Make a 3*3 atrous convolution with the atrous rate of 7 at out4, get the fourth branch output out8.
13: Unify the four branch outputs into the same dimension:
14: Concat([out5, out6, out7, out8], dimension), get out9.
15: Integrate feature information:
16: Make an ordinary 1*1 convolution at out1, get the shortcut.
17: net = out9 * 0.8 + shortcut.
18: Get the final output:
19: out = ReLU(net). // add the non-linearity.
20: return out.

The structure of the ESPP method is illustrated in Figure 5. The architecture can be
divided into four main branches. The first branch is a standard 1 × 1 convolution and an
atrous convolution with an atrous rate of 1, which aims to maintain the original receptive
fields. The second to the fourth branches consist of serial 3 × 3 convolution layers and
an atrous convolution layer, aiming to quickly extract feature information from different
receptive fields. ESPP could successfully simulate the relationship between the receptive
field size and the eccentricity of the human visual system while making the following
improvements to the detection accuracy of small objects in aerial images.

Figure 5. The structure of ESPP: the effect of constructing parallel convolutions of 1 × 1, 3 × 3, 5 × 5,
and 7 × 7 by an ordinary convolution of 1 × 1 and a serial convolution of 3 × 3, and the effective
widening of the receptive field by the superposition of the parallel convolutional layer and the
atrous convolutional layer.

Firstly, ESPP did not use a 1 × 1 convolution layer before parallel convolution but
achieved a dimensionality reduction by an intermediate parameter. This technique
circumvented the issue of spatial resolution degradation in feature maps arising from
superfluous convolution operations. Preventing excessive resolution loss is crucial, as it
preserves detailed information on image boundaries. Secondly, we added a 3 × 3 serial
convolution structure, which formed the same receptive field as 3 × 3, 5 × 5, and 7 × 7
convolutions. This operation increased the sampling rate while reducing the computational
overhead, and the serial structure increased the module running speed to some extent.
Thirdly, the atrous rate increased from 1, 3, or 5 to 3, 5, or 7, respectively, which further
captured large-scale information, thus enhancing the detection accuracy for small objects.
In conclusion, the ESPP module designed in this paper could compensate for the
deficiency in information regarding small objects. On the one hand, ESPP utilized higher-
level abstract features as contexts and extracted contextual information from the pixels
surrounding small objects to provide sufficiently detailed information. On the other hand,
the architectural design of ESPP provided access to contextual information at multiple
scales and enabled the spatial-level fusion of local and global information between objects
of different scales. It is finally demonstrated that ESPP can effectively benefit the detection
of small objects and improve the ability of the model to identify objects with considerable
scale variations.

3.2.3. Loss Function


The total loss for the object-detection task consisted of three components: the bounding
box regression loss, the confidence prediction loss, and the classification loss. YOLOv5s uses
the binary cross entropy loss (BCELoss) [15] to represent the confidence and classification
prediction loss, while CIoU loss is employed to denote the loss for bounding box regression.
The CIoU loss function considers three geometric factors, including the minimization of
the normalized central point distance and the consistency of the overlap area and aspect
ratio. Furthermore, CIoU loss enables the algorithm to converge quickly and present minor
regression errors in different scenarios.
However, for aerial image object-detection scenarios, the existing CIoU loss func-
tion fails to obtain anchor boxes with a high regression accuracy. The accuracy of the
corresponding object detection is reduced for two main reasons. On the one hand, aerial
images differ from natural images and are characterized by dense object distributions and
drastic scale variations. In other words, aerial images suffer from a non-uniform sample
distribution. On the other hand, the CIoU loss function does not consider the problem
of balancing samples that are difficult and easy to detect. To improve the accuracy of the
existing detector, we introduced a novel α-CIoU loss function.
Alpha-IoU is a new family of intersection over union (IoU) loss functions [40] obtained
by generalizing the power transformations using existing IoU-based losses. We began by
transforming the vanilla IoU loss, which is expressed as:

$L_{IoU} = 1 - IoU$. (1)

Firstly, we performed a Box–Cox transformation [21] on LIoU to obtain the α-IoU


loss function:
$L_{\alpha\text{-}IoU} = \frac{1 - IoU^{\alpha}}{\alpha}, \quad \alpha > 0.$ (2)
As shown by the above equation, different forms of the IoU loss function could be
obtained by controlling the power parameter α, such as IoU and IoU2 . Then, we introduced
a power regularization term into Equation (2) to generalize the α-IoU loss function to the
following form:
$L_{\alpha\text{-}IoU} = 1 - IoU^{\alpha_1} + \rho^{\alpha_2}(B, B^{gt}),$ (3)
where $\alpha_1 > 0$, $\alpha_2 > 0$, and $\rho^{\alpha_2}(B, B^{gt})$ represents any regularization term calculated based
on $B$ and $B^{gt}$. Equation (3) allows the theoretical generalization of most IoU-based loss
functions according to the power parameter of α. Based on Equation (3), we also generalized
the more complex CIoU loss function with multiple regularization terms using the same
power parameter α to obtain α-CIoU:

$L_{IoU} = 1 - IoU \;\Rightarrow\; L_{\alpha\text{-}IoU} = 1 - IoU^{\alpha},$
$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \beta v \;\Rightarrow\; L_{\alpha\text{-}CIoU} = 1 - IoU^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + (\beta v)^{\alpha},$ (4)
$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \quad \beta = \frac{v}{(1 - IoU) + v}.$

By comparing LIoU to Lα-IoU , we found that α-IoU loss could adapt the loss value of all
objects. The same was true for CIoU and α-CIoU loss. Thus, we took the IoU and α-IoU
loss functions as examples, and the derivation proceeded as follows:

$w_{L_{\tau}} = L_{\alpha\text{-}IoU}/L_{IoU} = 1 + (IoU - IoU^{\alpha})/(1 - IoU),$
$\Rightarrow w_{L_{\tau}}(IoU = 0) = 1,$ (5)
$\Rightarrow \lim_{IoU \to 1} w_{L_{\tau}} = \alpha.$

When 0 < α < 1, the reweighting factor wLτ decreases with the decrease in IoU. When
α > 1, the reweighting factor wLτ increases monotonically with the increase in IoU; thus,
α-CIoU loss could help the detector to focus on high-IoU objects, which means a greater
focus on high-quality detection boxes with minor regression errors.
The number of high-quality anchor boxes with minor regression errors in object
detection is generally much lower than the number of low-quality anchor boxes, and
low-quality objects can produce excessive gradients that affect the training process [41].
Therefore, we increased the loss weight of high-IoU objects by controlling α > 1, which
could significantly improve the training performance in the late stages and further improve
the accuracy of model localization and detection.
In summary, the introduction of α-CIoU could optimize the positive and negative
object imbalance problem in the bounding box regression task. In other words, α-CIoU
could reduce the optimization contribution of low-quality anchor boxes presenting less
overlap with ground-truth boxes, allowing the regression process to focus on high-IoU
objects. Ultimately, it was demonstrated that α-CIoU effectively improves the regression
accuracy of the bounding box of YOLOv5s by adaptively reweighting the losses and
gradients of objects without increasing the number of parameters and the training/inference
time of the model.
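As a concrete reading of Equation (4), the sketch below raises the IoU term and both CIoU penalty terms of a standard CIoU loss to the power α; α = 3 is the value commonly recommended in the alpha-IoU paper. It is a simplified reference implementation under our own assumptions, not the authors' released code.

import math
import torch

def alpha_ciou_loss(box1: torch.Tensor, box2: torch.Tensor, alpha: float = 3.0, eps: float = 1e-7):
    """alpha-CIoU loss for boxes in (x1, y1, x2, y2) format, following Eq. (4).

    A simplified sketch: the IoU, the normalized centre distance, and the
    aspect-ratio penalty are each raised to the power alpha before being combined.
    box2 is treated as the ground-truth box.
    """
    # intersection and union
    inter_w = (torch.min(box1[..., 2], box2[..., 2]) - torch.max(box1[..., 0], box2[..., 0])).clamp(0)
    inter_h = (torch.min(box1[..., 3], box2[..., 3]) - torch.max(box1[..., 1], box2[..., 1])).clamp(0)
    inter = inter_w * inter_h
    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # squared centre distance and enclosing-box diagonal (rho^2 / c^2 in Eq. (4))
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
            (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4

    # aspect-ratio consistency term v and its weight beta
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    beta = v / ((1 - iou) + v + eps)

    return 1 - iou ** alpha + (rho2 / c2) ** alpha + (beta * v) ** alpha

# Example: alpha_ciou_loss(torch.tensor([[0., 0., 10., 10.]]),
#                          torch.tensor([[2., 2., 12., 12.]]))  # one loss value per box pair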

4. Experiments
4.1. Experimental Setup
4.1.1. Datasets and Evaluation Metrics
To confirm the effectiveness of our proposed model, we carried out experiments on
two public aerial image benchmark datasets, DIOR [1] and VisDrone [42].
The DIOR dataset is a large-scale, publicly available image dataset for optical remote-
sensing image object detection containing 23,463 images and over 190,000 instances. The
dataset covers 20 object classes, which include stadiums, dams, baseball fields, etc. The
resolution of the images in the dataset is 800 × 800 pixels. The challenges associated
with object detection in the DIOR dataset are multifaceted. First and foremost, the sheer
volume of object classes, instances, and images presents a substantial task. Secondly, the
objects within the dataset vary considerably in scale, leading to significant disparities in
the imaging results. Thirdly, the objects to be detected exhibit a high degree of inter-class
similarity and intra-class diversity, further complicating the detection process. Figure 6
provides a visual representation of these challenges, showcasing several images for object
detection within the DIOR dataset.

Figure 6. Sample images from the DIOR dataset.

The VisDrone dataset is a UAV-based visual dataset of optical aerial images. The
VisDrone dataset encompasses 10,209 static images, captured through a variety of UAV-
mounted cameras, ensuring extensive coverage. The dataset includes 10 distinct object
classes, such as pedestrians, buses, and trucks. Impressively, each object class averages
over 50,000 instances, contributing to the comprehensive nature of this dataset. The
resolution of the images in the dataset is as high as 2000 × 1500 pixels. The complexities of
object detection in the VisDrone dataset are as follows: Firstly, the dataset presents a vast
array of detection challenges. Secondly, the distribution of these detection objects is not
uniform, adding another layer of difficulty. Thirdly, many of these objects are heavily
obscured, further complicating their identification and detection. Figure 7 provides visual
examples of these challenges, illustrating several instances of object detection within the
VisDrone dataset.

Figure 7. Sample images from the VisDrone dataset.

According to the definition of the absolute size of objects in MS COCO [8], a common
dataset in the field of object detection, small objects comprise less than 32 × 32 pixels,
medium objects between 32 × 32 and 96 × 96 pixels, and large objects more than 96 × 96
pixels. As shown in Table 1, on the one hand, the VisDrone dataset and the DIOR dataset
differ in the number of large and small objects. On the other hand, they could not allow
the baseline model to effectively detect small objects. Therefore, using these datasets to
verify the advantages of the proposed model for small-object detection in aerial images
was reasonable.
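As an illustration of this size partition (and not part of the EL-YOLO implementation), the thresholds quoted above can be applied directly to a bounding box:

def coco_size_category(width_px: float, height_px: float) -> str:
    """Classify an object by the MS COCO absolute-size convention quoted above."""
    area = width_px * height_px
    if area < 32 ** 2:
        return "small"    # fewer than 32 x 32 pixels
    if area <= 96 ** 2:
        return "medium"   # between 32 x 32 and 96 x 96 pixels
    return "large"        # more than 96 x 96 pixels

print(coco_size_category(20, 25))    # -> small
print(coco_size_category(100, 80))   # -> medium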

Table 1. The absolute pixel size distribution of the VisDrone and DIOR training sets.

Dataset <32² Pixels 32²–96² Pixels >96² Pixels
VisDrone 164,627 94,124 16,241
DIOR 12,792 7972 5966

We used the criteria proposed by Microsoft in the publicly available image dataset
MS COCO to evaluate the performance of the object detectors [8]. We selected six main
metrics to measure the performance of the proposed model, namely AP50 , AP75 , AP50:95 ,
APS , APM , and APL . AP50 and AP75 denote average precision (AP) values corresponding to
an IoU of 0.5 and 0.75, respectively. AP50:95 is a primary challenge metric relative to the
previous two metrics. It represents the mean AP value under an IoU from 0.5 to 0.95 in
steps of 0.05. APS , APM , and APL correspond to the mean AP values for small, medium,
and large objects, respectively. Since we focused on improving the detection of small objects
in aerial images, APS was used as the main metric for the experiments. To evaluate the
performance of the model more comprehensively in small-object detection, we introduced
additional metrics, namely P (precision), R (recall), and the F1 score, where the F1 score is
the harmonic mean of the model precision and recall. We also
selected the inference time and parameters to evaluate the detection speed and size of the
model comprehensively.
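For reference, the precision, recall, and F1 score used above follow their standard definitions, with TP, FP, and FN denoting true positives, false positives, and false negatives:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}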

4.1.2. Implementation Details


Our proposed EL-YOLOv5 model was implemented on PyTorch, and the training and
testing of all variant models were completed on an NVIDIA GeForce RTX 3080Ti GPU
with 12 GB memory. In all experiments, we trained the VisDrone and DIOR datasets for
200 epochs with a batch size of 8. We set the initial learning rate to 0.01 and dynamically
decreased it using a cosine annealing decay strategy. Before the images were input into
the model, we performed a unified pre-processing operation on both datasets, resizing
every image to 640 × 640 pixels before passing it to the model for feature extraction and
the subsequent training process. In addition, we used the unified
default data-augmentation strategies and default parameters of the YOLO detector for
all experiments.
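As a rough sketch of this schedule (not the authors' exact training script; the final learning-rate fraction LRF and the SGD momentum are assumed values), the cosine annealing decay from an initial learning rate of 0.01 over 200 epochs can be written in PyTorch as follows:

import math
import torch

EPOCHS, LR0, LRF = 200, 0.01, 0.01          # LRF = assumed final learning-rate fraction
model = torch.nn.Linear(10, 10)             # placeholder for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=LR0, momentum=0.937)

# Cosine decay from LR0 down to LR0 * LRF, evaluated once per epoch.
cosine = lambda e: ((1 - math.cos(e * math.pi / EPOCHS)) / 2) * (LRF - 1) + 1
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine)

for epoch in range(EPOCHS):
    # ... one training epoch over 640 x 640 images with a batch size of 8 ...
    scheduler.step()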
According to the needs of realistic scenarios, we chose YOLOv5s as the baseline model
in the experiments. Furthermore, to verify the robustness of the proposed modifications,
we transferred all of the modifications made on the S scale to the other scales and compared
the experimental results for analysis.
Other detection models, such as the Scaled-You Only Look Once version 4 (Scaled-
YOLOv4) [43], TPH-YOLOv5, and YOLOv7 were validated using the default settings from
the relevant literature.

4.2. Experimental Results


4.2.1. Experimental Results of the Model Architecture
We conducted comparative experiments on the baseline model and three models
with different modified architectures using DIOR and VisDrone datasets, considering five
metrics (AP50 , AP75 , AP50:95 , APS , and parameters). The results are shown in Tables 2 and 3.

Table 2. Experiment results using different model architectures for the DIOR dataset.

Method AP50 (%) AP75 (%) AP50:95 (%) APS (%) Parameters (M)
Baseline Model 79.4 61.8 57.1 8.9 7.11
Model 1 78.7 59.0 54.7 11.1 7.25
Model 2 77.5 58.4 53.9 10.5 5.44
Model 3 65.2 44.4 42.3 8.7 1.77

Table 3. Experiment results using different model architectures for the VisDrone dataset.

Method AP50 (%) AP75 (%) AP50:95 (%) APS (%) Parameters (M)
Baseline Model 27.4 14.2 14.9 8.5 7.08
Model 1 31.6 17.2 17.8 10.6 7.22
Model 2 31.3 16.4 17.3 10.6 5.43
Model 3 30.9 15.5 16.6 10.3 1.75

By comparing the baseline model and Models 1, 2, and 3, as presented in Figure 3, it


can be observed that all of the modified models have improved their APS by approximately
2% compared to the baseline model. Therefore, introducing low-level feature maps in
YOLOv5s effectively enhanced the detection accuracy of the model for small objects. The
comparison between Model 1 and Models 2 and 3 showed that Model 1 had slightly more
parameters than Models 2 and 3 but significantly outperformed them in terms of AP50:95 .
Our analysis suggested that although reducing deep feature maps and detection heads for
large objects reduced the computational resources, it also reduced the depth and complexity
of the model, which indirectly impacted its performance. It also affected the detection
accuracy for small objects when the detector complexity was overly low.
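A quick back-of-the-envelope check makes this intuition concrete (assuming the standard YOLOv5 detection strides of 8, 16, and 32, plus an additional low-level stride-4 map of the kind considered here):

IMG, OBJ = 640, 32                 # input size and a COCO "small" object
for stride in (4, 8, 16, 32):
    cells = IMG // stride          # grid resolution at this detection scale
    span = OBJ / stride            # cells covered by the object per side
    print(f"stride {stride:2d}: {cells}x{cells} grid, a {OBJ}px object spans ~{span:.1f} cells per side")

# At stride 32 a small object collapses into a single grid cell, whereas the
# low-level (high-resolution) maps retain several cells of signal, which is
# why introducing them benefits small-object detection.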
A comparison of Tables 2 and 3 reveals that for the UAV-based VisDrone dataset, the
introduction of low-level feature maps resulted in a more significant improvement in model
performance. On the contrary, for the remote-sensing DIOR dataset, although the modified
models presented an improved APS compared to the baseline model, the AP50:95 of Models
1, 2, and 3 all decreased to different degrees. We hypothesized that two primary factors
led to this phenomenon. On the one hand, as illustrated in Figure 8a,b, the distribution
of small, medium, and large objects in the DIOR dataset was more balanced, whereas, in
the VisDrone dataset, small objects dominated. Our modified models could detect a larger
quantity of small objects. However, from Figure 6, we noted that many small objects in the
DIOR dataset were not annotated. As a result, the AP50:95 in the DIOR dataset decreased.
In contrast, in the unique context of the VisDrone dataset, the model's average precision
was significantly enhanced. On the other hand, the introduction of high-resolution feature
maps provided more detailed information in favor of small objects, but it also introduced
background noise, which was detrimental to the model's performance.

Figure 8. (a) Distribution of objects with different widths and heights in the DIOR dataset;
(b) distribution of objects with different widths and heights in the VisDrone dataset.

Through multiple rounds of experiments, it was clear that Model 1 outperformed the
baseline model in detecting small objects across both datasets. Therefore, after considering
the algorithm models’ average accuracy, complexity, and number of model parameters, we
ultimately chose the network architecture of Model 1.

4.2.2. Experimental Results of ESPP


To demonstrate the benefits of our ESPP design for aerial image object detection, we
selected a series of currently popular spatial pyramid pooling modules for comparison
experiments with ESPP. All included spatial pyramid pooling modules could directly
replace the original SPP method, including SPPF [15], simplified SPPF (SimSPPF) [16],
SPPCSPC [17], atrous spatial pyramid pooling (ASPP) [44], RFB, and ESPP.
We ensured that all hyper-parameters and configurations remained the same and
replaced SPP with each of the abovementioned modules on top of the baseline YOLOv5s
model. We conducted experiments on the DIOR and VisDrone datasets separately, compar-
ing the five metrics AP50 , AP75 , AP50:95 , APS , and the number of parameters. The results
for the impact of the different SPP modules on the performance of YOLOv5s on the DIOR
and VisDrone datasets are shown in Tables 4 and 5.

Table 4. Experimental results using different SPP modules for the DIOR dataset.

Method AP50 (%) AP75 (%) AP50:95 (%) APS (%) Parameters (M)
YOLOv5s + SPP 79.4 61.8 57.1 8.9 7.11
YOLOv5s + SPPF 79.2 61.7 57.1 9.1 7.11
YOLOv5s + SimSPPF 79.4 61.7 57.1 8.8 7.11
YOLOv5s + SPPCSPC 79.1 61.4 56.6 8.7 10.04
YOLOv5s + ASPP 79.2 61.4 56.7 9.0 15.36
YOLOv5s + RFB 78.8 62.1 57.2 8.3 7.77
YOLOv5s + ESPP 79.6 62.9 57.8 9.7 7.44

Table 5. Experimental results using different SPP modules for the VisDrone dataset.

Method AP50 (%) AP75 (%) AP50:95 (%) APS (%) Parameters (M)
YOLOv5s + SPP 27.4 14.2 14.9 8.5 7.08
YOLOv5s + SPPF 27.2 14.2 14.9 8.5 7.08
YOLOv5s + SimSPPF 27.7 14.3 15.1 8.7 7.08
YOLOv5s + SPPCSPC 27.2 14.1 14.9 8.4 10.01
YOLOv5s + ASPP 27.1 14.1 14.9 8.3 15.34
YOLOv5s + RFB 27.1 14.4 15.0 8.5 7.74
YOLOv5s + ESPP 28.4 15.9 16.1 9.2 7.42

As observed in Tables 4 and 5, ESPP performed significantly better than the other
modules on the two aerial image datasets. For the DIOR dataset, ESPP improved the
AP50:95 by 0.7% and the APS by 0.8% compared to SPP. For the VisDrone dataset, ESPP
improved the AP50:95 by 1.2% and the APS by 0.7% compared to SPP. Interestingly, ESPP is
a lightweight module that exhibited only a minor increase in the number of parameters.
The improved accuracy observed in our analysis was due to two main reasons. Firstly,
ESPP improved the representational power of the feature maps through a serial convolution
plus parallel convolution architecture, which enabled the fusion of local and global informa-
tion at the spatial level and improved the average accuracy of the model. Secondly, ESPP
enriched the feature maps with contextual information by introducing atrous convolution
to increase the receptive fields, which was beneficial to small objects. In summary, ESPP is
suitable for application in the network architecture of aerial image small-object detectors.
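To make the description above concrete, the following is a minimal sketch of an ESPP-style block (a serial stem followed by parallel atrous branches); the branch count, kernel sizes, and dilation rates are illustrative assumptions rather than the exact EL-YOLO configuration:

import torch
import torch.nn as nn

class ESPPSketch(nn.Module):
    # Serial 1x1/3x3 stem + parallel dilated 3x3 branches, concatenated and fused.
    def __init__(self, c_in, c_out, dilations=(1, 3, 5)):
        super().__init__()
        c_mid = c_out // 2
        self.stem = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_mid, 3, padding=1, bias=False), nn.BatchNorm2d(c_mid), nn.SiLU(),
        )
        # Atrous (dilated) convolutions enlarge the receptive field at several
        # rates to gather the multi-scale context that benefits small objects.
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_mid, c_mid, 3, padding=d, dilation=d, bias=False) for d in dilations]
        )
        self.fuse = nn.Conv2d(c_mid * (len(dilations) + 1), c_out, 1, bias=False)

    def forward(self, x):
        x = self.stem(x)
        return self.fuse(torch.cat([x] + [b(x) for b in self.branches], dim=1))

# Shape check: spatial resolution is preserved, e.g. on a 20 x 20 deep feature map.
print(ESPPSketch(512, 512)(torch.randn(1, 512, 20, 20)).shape)  # [1, 512, 20, 20]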

4.2.3. Experimental Results of EL-YOLOv5


To verify the robustness of the proposed modified model, we transferred all of the
modifications made on S to other scales and compared the experimental results for analysis.
The experimental results are shown in Tables 6 and 7.

Table 6. Performance comparison of YOLOv5 and EL-YOLOv5 across various scales within the
DIOR dataset.

Method Scales P (%) R (%) AP50:95 (%) APS (%) APM (%) APL (%) F1 Score Parameters (M) Inference Time (ms)
YOLOv5 S 88.0 76.1 57.1 8.9 38.4 69.4 0.82 7.11 14.6
YOLOv5 M 88.8 77.6 60.0 9.9 39.0 72.7 0.83 21.11 22.1
YOLOv5 L 90.2 77.5 61.8 10.4 40.1 74.6 0.83 46.70 25.2
YOLOv5 X 90.3 79.1 63.1 11.2 41.8 76.5 0.84 87.33 29.7
EL-YOLOv5 S 83.6 74.2 55.5 10.8 36.4 67.1 0.79 7.59 18.2
EL-YOLOv5 M 84.8 76.8 58.5 11.6 38.2 70.7 0.81 22.31 26.7
EL-YOLOv5 L 85.8 77.4 60.5 11.7 40.6 72.8 0.81 49.04 30.9
EL-YOLOv5 X 87.8 77.7 61.8 12.0 39.1 74.3 0.82 91.29 37.8

Table 7. Performance comparison of YOLOv5 and EL-YOLOv5 across various scales within the
VisDrone dataset.

Method Scales P (%) R (%) AP50:95 (%) APS (%) APM (%) APL (%) F1 Score Parameters (M) Inference Time (ms)
YOLOv5 S 46.8 36.1 14.9 8.5 22.4 30.2 0.41 7.08 19.7
YOLOv5 M 53.9 38.2 17.9 10.6 26.5 32.1 0.45 21.07 24.0
YOLOv5 L 55.4 39.9 19.4 11.6 28.6 40.7 0.46 46.65 28.1
YOLOv5 X 56.9 40.9 20.0 12.1 29.5 39.0 0.48 87.26 31.8
EL-YOLOv5 S 50.9 39.7 18.4 10.7 27.1 37.9 0.45 7.56 34.0
EL-YOLOv5 M 54.1 44.5 21.4 13.6 30.8 40.7 0.49 22.27 37.1
EL-YOLOv5 L 57.9 45.8 22.9 15.1 32.3 42.0 0.51 48.98 41.0
EL-YOLOv5 X 56.0 48.4 23.7 15.9 33.2 44.7 0.52 91.21 47.3

Tables 6 and 7 show that EL-YOLOv5s enhanced the APS by 1.9% and 2.2% on two
challenging aerial image datasets compared with the YOLOv5s. Thus, it was clear that
our proposed model effectively solved the original YOLOv5 model’s problem of low
accuracy in detecting small objects in aerial images. Although EL-YOLOv5 showed a slight
increase in the inference time compared to the baseline model, the S-scale EL-YOLOv5
model achieved the requirement of real-time processing while maintaining a high level of
detection accuracy. In addition, regarding the number of parameters, UAV on-board chips
and processors typically require a model smaller than 10 MB, and our EL-YOLOv5s
fully met the requirements of embedded deployment. Therefore, our EL-YOLOv5s can run
on most UAV processors. Meanwhile, the optimization effect of EL-YOLOv5 was more
visible on the VisDrone dataset, probably originating from the denser distribution of small
objects in this dataset.
By comparing the APS , APM , and APL in Tables 6 and 7, we further analyzed the scale
problem of the objects in the two datasets. On the one hand, based on Table 1, we found that
low-scale objects dominated in the VisDrone dataset. Therefore, the AP50:95 was effectively
improved when our experiments were optimized for small objects, and correspondingly
the APL registered an approximately 8% growth. Meanwhile, the improvement in detection
accuracy for large-scale objects was much higher than that for low-scale objects, which
indirectly reflected how difficult it is to improve the accuracy for small
objects. On the other hand, we found that the APM and APL both decreased
slightly in the DIOR dataset. Our analysis suggested that the proportion of large-scale
objects was close to that of small objects in the DIOR; therefore, when the experimental
model architecture highlights the optimization of the accuracy for small objects, the APM
and APL would be affected accordingly. Thus, for the detection of large-scale objects in
satellite remote-sensing scenarios, our proposed EL-YOLOv5 exhibits some limitations.
To demonstrate the detection performance of EL-YOLOv5 for different categories of
objects, we selected the more challenging aerial image dataset VisDrone in small-object
detection and the S-scale YOLOv5 model. Then, we conducted experiments on different
categories of objects in this dataset to obtain relevant metrics such as P and R, and the final
experimental results are shown in Table 8. The arrows in Table 8 indicate the increase in
mAP50:95, in percentage points, over YOLOv5s.
Our proposed algorithm exhibited different degrees of improvement for different
categories of objects. Our analysis suggested that firstly, modifying the model architecture
can make the model retain more detailed information favorable to small objects; secondly,
the multibranch structure of ESPP can effectively enhance the feature-extraction ability of
the model for small objects. And finally, the α-CIoU loss function can effectively alleviate the
problem of positive and negative sample imbalance, and all three different improvements
significantly increased the detection accuracy of different categories of objects in the dataset.
Since our EL-YOLOv5 achieved an increase in points in all categories, this also reflected the
model’s generalizability from this angle.
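For completeness, the bounding-box regression loss referred to above follows the α-IoU family [21] applied to CIoU [20] (a power of α = 3 is the commonly used setting); with b and b^gt the predicted and ground-truth box centers, ρ the Euclidean distance, c the diagonal length of the smallest enclosing box, and v the aspect-ratio consistency term of CIoU, the loss takes the general form:

\mathcal{L}_{\alpha\text{-}CIoU} = 1 - IoU^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + (\beta v)^{\alpha}, \qquad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad
\beta = \frac{v}{(1 - IoU) + v}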

Table 8. The detection performance of each category comparison of YOLOv5s and EL-YOLOv5s
within the VisDrone dataset.

Category YOLOv5s: P (%) R (%) mAP50:95 (%) EL-YOLOv5s: P (%) R (%) mAP50:95 (%)
pedestrian 50.7 39.3 15.4 61.6 41.1 18.7 ↑3.3
people 46.2 34.9 10.1 49.4 31.7 10.5 ↑0.4
bicycle 28.0 15.9 3.76 30.5 15.9 5.32 ↑1.56
car 63.7 74.2 47.1 74.4 79.1 53.6 ↑6.5
van 48.6 38.0 24.1 45.7 45.7 28.1 ↑4.0
truck 52.0 33.7 18.0 51.4 37.2 22.9 ↑4.9
tricycle 42.6 24.8 9.68 44.3 29.4 13.5 ↑3.82
awning-tricycle 26.9 13.5 5.74 25.6 22.2 8.36 ↑2.62
bus 60.2 42.2 26.1 51.4 53.1 37.6 ↑11.5
motor 49.2 38.7 14.7 42.9 44.4 18.0 ↑3.3
Figures 9 and 10 illustrated the qualitative results comparison between YOLOv5 and
EL-YOLOv5 for the two datasets. By looking at Figures 9c,d and 10c,d, we found that
EL-YOLOv5 could effectively alleviate the low-scale object missing problem in the two
aerial image datasets. In conclusion, it was highly intuitive that EL-YOLOv5 had a great
advantage in small-object detection compared to the baseline model.

Figure 9. Qualitative results comparison between YOLOv5 and EL-YOLOv5 for the DIOR dataset.
Ground truth and prediction are marked by green and red boxes, respectively. (a) An original image
from the DIOR dataset; (b) the original image with ground truth box; (c) the original image is detected
by YOLOv5; (d) the original image is detected by EL-YOLOv5.

Figure 10. Qualitative results comparison between YOLOv5 and EL-YOLOv5 for the VisDrone
dataset. Ground truth and prediction are marked by green and red boxes, respectively. (a) An original
image from the VisDrone dataset; (b) the original image with ground truth box; (c) the original image
is detected by YOLOv5; (d) the original image is detected by EL-YOLOv5.

4.3. Ablation Experiments

To validate the effectiveness of the modules presented in this paper, we performed
ablation experiments on the DIOR and VisDrone datasets. The experimental results are
shown in Tables 9 and 10.

Table 9. The effects of different module combinations in YOLOv5s for the DIOR dataset.

Method P (%) R (%) AP50:95 (%) APS (%) APM (%) APL (%) F1 Score Parameters (M)
YOLOv5s 88.0 76.1 57.1 8.9 38.4 69.4 0.82 7.11
YOLOv5s + Model 1 84.1 75.8 54.7 11.1 37.2 66.2 0.80 7.25
YOLOv5s + ESPP 88.9 76.1 57.8 9.7 37.4 70.3 0.82 7.44
YOLOv5s + α-CIOU 87.0 75.8 57.5 9.8 38.5 69.8 0.81 7.11
YOLOv5s + ESPP + α-CIOU 89.1 76.9 58.2 10.2 38.8 70.5 0.83 7.44
EL-YOLOv5s 83.6 74.2 55.5 10.8 36.4 67.1 0.79 7.59

Table 10. The effects of different module combinations in YOLOv5s for the VisDrone dataset.

Method P (%) R (%) AP50:95 (%) APS (%) APM (%) APL (%) F1 Score Parameters (M)
YOLOv5s 46.8 36.1 14.9 8.5 22.4 30.2 0.41 7.08 M
YOLOv5s + Model 1 51.8 39.2 17.8 10.6 25.8 34.1 0.45 7.22 M
YOLOv5s + ESPP 46.8 36.9 16.1 9.2 24.5 34.4 0.41 7.42 M
YOLOv5s + α-CIOU 50.3 34.0 16.2 9.5 24.1 31.3 0.41 7.08 M
YOLOv5s + ESPP + α-CIOU 50.8 36.5 16.5 9.9 24.2 35.3 0.43 7.42 M
EL-YOLOv5s 50.9 39.7 18.4 10.7 27.1 37.9 0.45 7.56 M


By looking at Tables 9 and 10, we can see that, firstly, the modification of the model
architecture improved the accuracy of small-object detection more noticeably compared
to the modification of ESPP and the loss function. Secondly, our proposed EL-YOLOv5s
model achieved improvements of 1.9% and 2.2% in the APS compared to YOLOv5s for
both datasets, which illustrated that EL-YOLOv5 could indeed address the problem of
low accuracy for small-object detection in aerial images. Thirdly, for the DIOR dataset,
all our modules improved the small-object detection accuracy, but the noise introduced
by the low-level feature maps when modifying the model architecture also affected the
average accuracy of the model to a certain extent. Fourthly, EL-YOLOv5s fully satisfied
the parameter-count requirements for embedded deployment in realistic application scenarios.
Delving deeper into the table data, it becomes evident that for the VisDrone dataset, a
significant improvement in the AP50:95 was obtained by EL-YOLOv5s—an increase of 3.5%
when compared with the baseline model. In contrast, for the DIOR dataset, enhancements
merely to the model’s ESPP module and loss function led to a 1.1-point increase in the
AP50:95 index. However, further adjustments to the model architecture in the DIOR dataset
resulted in a decrease in the AP50:95 by 1.6%. This may have been due to the difference in
the distribution of large and small objects between the two datasets. Therefore, for aerial
image datasets that are not dominated by small objects, it would be more beneficial to
enhance the feature representation of the model by improving the module in scenarios that
are more demanding for average model accuracy.

4.4. Comparisons with Other Sota Detectors


The EL-YOLOv5 model was compared with other SOTA detectors on the DIOR
and VisDrone datasets to prove the effective performance of the detectors to which our
method was applied. These detectors included Scaled-YOLOv4, TPH-YOLOv5, YOLOv5,
and YOLOv7.
Figures 11a and 12a show that the detection accuracy of our proposed EL-YOLOv5
model for small objects in both datasets was significantly higher than that of advanced
object detectors such as Scaled-YOLOv4, YOLOv5, and TPH-YOLOv5. Regarding the
comparison between EL-YOLOv5 and YOLOv7, EL-YOLOv5 was more dominant on the
S-scale. While EL-YOLOv5 was somewhat less accurate than YOLOv7 as the number of
parameters increased, the extensive parameter number did not align with the requirements
of embedded deployment. That is, when it comes to deployment within an embedded
environment, the EL-YOLOv5 model exhibited significant advantages over YOLOv7.

Figure 11. (a) Comparative analysis of small-object detection accuracy across different detectors using
the DIOR dataset; (b) the small-object detection accuracy within the DIOR dataset for different
detectors with parameter sizes of less than 10 MB.

Figure 12. (a) Comparative analysis of small-object detection accuracy across different detectors
using the VisDrone dataset; (b) the small-object detection accuracy within the VisDrone dataset for
different detectors with parameter sizes of less than 10 MB.

For object-detection tasks in aerial imagery, we must consider the deployment require-
ments of the application area and further control the number of parameters in detector
models. We mentioned near the beginning of this paper that FPGA chips [18] are generally
used to implement cutting-edge technologies such as image processing and object detection.
FPGAs typically possess less than 10 MB of on-chip memory and are devoid of any off-chip
memory or storage. Therefore, lightweight models are more suitable for implementation
using FPGAs, which are not constrained by the width of the storage bandwidth, while
video frames can be processed in real time by FPGAs. Figures 11b and 12b show that our
proposed EL-YOLOv5 model performed optimally in terms of detection accuracy for small
objects when the model parameter count was constrained to less than 10 MB.
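A simple utility (illustrative only, not part of the released code) can be used to check whether a candidate model respects this budget by counting its learnable parameters:

import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    # Learnable parameters in millions, comparable to the "Parameters (M)"
    # columns reported in the tables above.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Example: any nn.Module can be screened against the <10 M embedded budget.
print(f"{count_parameters_m(torch.nn.Conv2d(3, 64, 3)):.4f} M")  # 0.0018 M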

5. Conclusions
In this paper, we proposed EL-YOLO, a modified object-detection model applicable to
the corresponding YOLO models, using S-scale YOLOv5 as the baseline model to address
the problem of low accuracy for small-object detection in aerial images. First, we adapted
three model architectures and validated them through multiple rounds of experiments,
analyzing the reasons for their performance and architectural features and selecting the
best-performing model. The best-performing model was proven to be able to maximize
the accuracy in detecting small objects at a low cost in terms of computational resources.
Second, we designed a novel ESPP method based on a human visual perception system
to replace the original SPP module to further enhance the feature-extraction capability of
the model for small targets. Finally, we introduced the α-CIoU loss function to optimize
the positive and negative sample imbalance problem in the bounding box regression task,
making it easier for the model to pinpoint small objects. Several rounds of experimental
validation for our proposed EL-YOLOv5 model on the DIOR and VisDrone datasets were
conducted, and it was finally demonstrated that the embeddable S-scale EL-YOLOv5 model
achieved an APs of 10.8% on the DIOR dataset and 10.7% on the VisDrone dataset, which
represented the highest accuracy among the existing lightweight model results.
Our proposed EL-YOLO model can be further optimized. For datasets where the
proportion of small objects is not dominant, such as DIOR, we can further design new
feature-enhancement modules to improve the feature-extraction capability of the model
for small objects without modifying the model architecture. Our algorithm can also be
combined with edge computing techniques, which can process massive data volumes close
to their sources, ultimately enabling real-time decisions in real-world scenarios.
In addition, other techniques can be well-suited to solving the aerial image object-detection
problems that were not considered in this study, representing directions for future research.

Author Contributions: Conceptualization, M.H.; methodology, M.H. and Z.L. (Ziyang Li); software,
M.H.; validation, M.H., X.W. and H.T.; resources, X.W.; writing—original draft preparation, M.H.;
writing—review and editing, Z.L. (Ziyang Li) and H.T.; supervision, Z.L. (Zeyu Lin) and J.Y.; project
administration, M.H. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China Grants
62262064, 62266043, and 61966035; in part by the Key R&D projects in the Xinjiang Uygur Autonomous
Region under Grant XJEDU2016S106; in part by the Natural Science Foundation of the Xinjiang Uygur
Autonomous Region of China under Grant 2022D01C56; and by the Xinjiang University doctoral
postgraduate innovation project (Grant Nos.XJU2022BS072).
Data Availability Statement: The datasets used in this study are all public datasets.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark.
ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [CrossRef]
2. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach.
Intell. 2017, 39, 640–651. [CrossRef] [PubMed]
3. Ma, W.; Guo, Q.; Wu, Y.; Zhao, W.; Zhang, X.; Jiao, L. A Novel Multi-Model Decision Fusion Network for Object Detection in
Remote Sensing Images. Remote Sens. 2019, 11, 737. [CrossRef]
4. Xie, W.Y.; Yang, J.; Lei, J.; Li, Y.; Du, Q.; He, G. SRUN: Spectral Regularized Unsupervised Networks for Hyperspectral Target
Detection. IEEE Trans. Geosci. Remote Sens. 2020, 58, 1463–1474. [CrossRef]
5. Zhu, D.J.; Xia, S.X.; Zhao, J.Q.; Zhou, Y.; Jian, M.; Niu, Q.; Yao, R.; Chen, Y. Diverse sample generation with multi-branch
conditional generative adversarial network for remote sensing objects detection. Neurocomputing 2020, 381, 40–51. [CrossRef]
6. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
7. Everingham, M.; Eslami, S.M.A.; Van GoolL, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes
Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [CrossRef]
8. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef]
10. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell.
2017, 99, 2999–3007.
11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
12. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 2017, pp. 6517–6525.
13. Farhadi, A.; Redmon, J. Yolov3, An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition,
Salt Lake City, UT, USA, 18–23 June 2018.
14. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4, Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
15. Jocher, G.; Stoken, A.; Borovec, J.; NanoCode012; Chaurasia, A.; Xie, T.; Liu, C.; Abhiram, V.; Laughing; Tkianai; et al. Ultralyt-
ics/yolov5, v5.5-YOLOv5-P6 1280 Models, AWS, Supervisely and YouTube Integrations; Version 5.5; CERN Data Centre & Invenio:
Prévessin-Moëns, France, 2022.
16. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6, A Single-Stage Object Detection
Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
17. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7, Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696.
18. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer
parameters and <0.5MB model size. arXiv 2016, arXiv:1602.07360.
19. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2014, 8691, 346–361.
20. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for
Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [CrossRef]
21. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. Alpha-IoU: A Family of Power Intersection over Union Losses for Bounding
Box Regression. arXiv 2022, arXiv:2110.13675.
22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE Transactions on Pattern Analysis & Machine
Intelligence, Venice, Italy, 22–29 October 2017.
23. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. arXiv 2017, arXiv:1712.00726.
24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer: Cham,
Switzerland, 2016; Volume 9905, pp. 21–37.
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 July 2016; pp. 770–778.
26. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July
2017; pp. 936–944.
27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
28. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A New Backbone that can Enhance Learning
Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580.
29. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of
the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021;
pp. 13728–13737.
30. Wang, D.; Liu, Z.; Gu, X.; Wu, W.; Chen, Y.; Wang, L. Automatic Detection of Pothole Distress in Asphalt Pavement Using
Improved Convolutional Neural Networks. Remote Sens. 2022, 14, 3892. [CrossRef]
31. Kim, M.; Jeong, J.; Kim, S. ECAP-YOLO: Efficient Channel Attention Pyramid YOLO for Small Object Detection in Aerial Image.
Remote Sens. 2021, 13, 4851. [CrossRef]
32. Liu, Z.; Gu, X.; Chen, J.; Wang, D.; Chen, Y.; Wang, L. Automatic recognition of pavement cracks from combined GPR B-scan and
C-scan images using multiscale feature fusion deep neural networks. Autom. Constr. 2023, 146, 104698. [CrossRef]
33. Wu, J.; Shen, T.; Wang, Q.; Tao, Z.; Zeng, K.; Song, J. Local Adaptive Illumination-Driven Input-Level Fusion for Infrared and
Visible Object Detection. Remote Sens. 2023, 15, 660. [CrossRef]
34. Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van Gool, L.; Han, J.; et al. Visdrone-Det2021,
The Vision Meets Drone Object detection Challenge Results. In Proceedings of the 2021 IEEE CVF International Conference on
Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 2847–2854.
35. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017;
pp. 5987–5995.
36. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5, Improved YOLOv5 Based on Transformer Prediction Head for Object
Detection on Drone-captured Scenarios. In Proceedings of the 2021 IEEE CVF International Conference on Computer Vision
Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788.
37. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the ECCV 2018, 15th
European Conference, Munich, Germany, 8–14 September 2018.
38. Wan, J.; Zhang, B.; Zhao, Y.; Du, Y.; Tong, Z. VistrongerDet: Stronger Visual Information for Object Detection in VisDrone Images.
In Proceedings of the 2021 IEEE CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada,
11–17 October 2021; pp. 2820–2829.
39. Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. arXiv 2018, arXiv:1711.07767.
40. Yu, J.H.; Jiang, Y.N.; Wang, Z.Y.; Cao, Z.M.; Huang, T. UnitBox: An Advanced Object Detection Network. In Proceedings of the
24th ACM International Conference on Multimedia, New York, NY, USA, 15–19 October 2016.
41. Chen, Z.; Zhang, F.; Liu, H.; Wang, L.; Zhang, Q.; Guo, L. Real-time detection algorithm of helmet and reflective vest based on
improved YOLOv5. J. Real-Time Image Process. 2023, 20, 4. [CrossRef]
42. Du, D.; Wen, L.; Zhu, P.; Fan, H.; Hu, Q.; Ling, H.; Shah, M.; Pan, J.; Al-Ali, A.; Mohamed, A.; et al. VisDrone-CC2020, The Vision
Meets Drone Crowd Counting Challenge Results. arXiv 2021, arXiv:2107.08766.
43. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4, Scaling Cross Stage Partial Network. In Proceedings of the 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13024–13033.
44. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Con-
volutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
[CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
