Article
A Modified YOLOv8 Detection Network for UAV Aerial
Image Recognition
Yiting Li 1,2 , Qingsong Fan 2, * , Haisong Huang 2 , Zhenggong Han 2 and Qiang Gu 2
1 College of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, China;
[email protected]
2 Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University,
Guiyang 550025, China; [email protected] (H.H.); [email protected] (Z.H.);
[email protected] (Q.G.)
* Correspondence: [email protected]; Tel.: +86-1333-9600-681
Abstract: UAV multitarget detection plays a pivotal role in civil and military fields. Although deep
learning methods provide a more effective solution to this task, changes in target size and shape,
occlusion, and lighting conditions from the perspective of drones still bring great challenges to
research in this field. Based on the above problems, this paper proposes an aerial image detection
model with excellent performance and strong robustness. First, in view of the common problem that
small targets in aerial images are prone to misdetection and missed detection, the idea of Bi-PAN-FPN
is introduced to improve the neck part in YOLOv8-s. By fully considering and reusing multiscale
features, a more advanced and complete feature fusion process is achieved while maintaining the
parameter cost as much as possible. Second, the GhostblockV2 structure is used in the backbone of the
benchmark model to replace part of the C2f module, which suppresses information loss during long-
distance feature transmission while significantly reducing the number of model parameters; finally,
WiseIoU loss is used as bounding box regression loss, combined with a dynamic nonmonotonic
focusing mechanism, and the quality of anchor boxes is evaluated by using “outlier” so that the
detector takes into account different quality anchor boxes to improve the overall performance of
the detection task. The algorithm’s performance is compared and evaluated on the VisDrone2019
dataset, which is widely used worldwide, and a detailed ablation experiment, contrast experiment,
interpretability experiment, and self-built dataset experiment are designed to verify the effectiveness
and feasibility of the proposed model. The results show that the proposed aerial image detection
model has achieved obvious results and advantages in various experiments, which provides a new
idea for the deployment of deep learning in the field of UAV multitarget detection.
Keywords: UAV; multitarget detection; deep learning; YOLOv8-s; feature fusion
deploy in edge devices, and lightweight object detectors have difficulty improving accu-
racy [13–15]. These factors hinder the development of deep learning methods in the field
of UAV multitarget detection. In June 2021, at the Global Artificial Intelligence Technology
Conference 2021 (GAITC 2021), the Tencent Youtu Lab, and Xiamen University Artificial
Intelligence Research Institute officially released the “Top Ten Artificial Intelligence Trends
in 2021”, pointing out that edge-end models with low computational complexity and small model
size will gradually become a new tool for enterprises to reduce costs and increase efficiency. It is
foreseeable that, in the future, more and more intelligent enterprises will move from rapid early-stage
expansion to a new stage of efficient operation, and in this process the edge deployment of deep
models will undoubtedly become an important means for them. Therefore, it is of great practical
significance to design a
model that takes into account both detection accuracy and light weight to compensate for
the bottleneck of deep learning in the application of UAV aerial images. This paper focuses
on the above problems and improves the versatility and effectiveness of a multitarget
detection model of UAV aerial images.
The following are the main contributions of this work:
1. From the perspective of paying attention to large-size feature maps and introducing
the idea of Bi-PAN-FPN, this work improves the detection ability of the model for
small targets, and at the same time increases the probability and time of multiscale
feature fusion to obtain better feature engineering. This solves the common problem
of easy misdetection and missed detection of small targets in aerial images;
2. This work optimizes the backbone network and loss function of the model. The Ghostblock
unit and Wise-IoU bounding box regression loss are integrated to improve the generalization
performance of the model from the perspectives of feature diversity, long-distance capture of
feature information, and avoidance of excessive penalties from geometric factors. The number
of model parameters is suppressed while the accuracy of the model is improved. This solves the
long-range information loss problem and the balance problem of predicting anchors;
3. The feasibility and effectiveness of the constructed model are verified using ablation
experiments. Compared with the original benchmark network, the MAP performance
of the model on the international open-source dataset VisDrone2019 is improved by
9.06% (test set), the number of parameters is reduced by 13.21% (test set), and the
comprehensive ability is improved significantly.
4. The proposed model is compared with six of the most mainstream and advanced current deep
object detection models to prove its superiority. Furthermore, comparing the interpretability of
three excellent models illustrates the reason for the superiority of this method.
The rest of this paper is organized as follows: Section 2 reviews previous related work.
Section 3 presents an improved aerial image detection model and details the structure and
working mechanism of the model. Section 4 first introduces the experimental environment
and parameter settings and then conducts ablation experiments, comparison experiments,
and interpretability experiments on the international open-source dataset VisDrone2019 to
comprehensively verify the feasibility of the proposed method. Section 5 summarizes the
results of the full text and looks forward to future research directions.
2. Related Work
Target detection from the perspective of UAVs faces many challenges while being
widely used, which has profound practical and research significance. With the continuous
progress of target detection technology, some effective methods have emerged for UAV
image detection tasks [16–20]. For example, reference [16] proposed a drone image object
detection method called UFPMP-Net. In this method, considering the characteristics of
UAV datasets that are small in scale and single-scene compared with natural image datasets,
the unified foreground packing (UFP) module was designed to cluster the subregions given
by the coarse detector to suppress the background. The resulting images were thereafter
Drones 2023, 7, 304 3 of 26
assembled into a mosaic for single inference, which significantly reduced the overall time
cost and improved the accuracy and efficiency of the detector. Aiming at the problem of small
target detection in UAV images, reference [17] proposed a high-resolution detection network
(HRDNet). This network addressed the problem that inputting high-resolution images
to the network can lead to increased computational cost. The network uses two feature
fusion methods, a multidepth image pyramid network (MD-IPN) and a multiscale feature
pyramid network (MS-FPN), to fully optimize feature engineering. It feeds high-resolution
features into a shallow network to reduce computational cost, while low-resolution features
are fed into a deep network to extract more semantics. This processing method enables the
network to improve accuracy in high-resolution image training mode and reduce the harsh
requirements for hardware. Reference [18] proposed a cross-modality fusion transformer
(CFT) combined with an attention mechanism, an efficient cross-modal feature fusion
idea. This method extracts image features based on the transform architecture, which
enables the network to focus on global contextual features. In addition, by designing an
attention mechanism, the network can simultaneously perform intramodal and intermodal
fusion. This significantly improves the comprehensive performance of multispectral target
detection in aerial images. Experiments show that the method has excellent generalization
ability and robustness in a large number of datasets. Reference [19] observed that the
targets under aerial photography have the characteristic of high clustering. It proposed a
clustered detection (ClusDet) network, which completes the end-to-end detection process by
designing a cluster proposal subnetwork (CPNet), a scale estimation subnetwork (ScaleNet),
and a dedicated detection network (DetecNet). When detection begins, the network
focuses on aggregated regions rather than directly detecting individual targets. After that,
these regions are cropped and sent to the fine detector for further detection, which solves the problem
of small target aggregation and uneven distribution in UAV images to a certain extent.
Reference [20] proposed a feature fusion and scaling-based single shot detector (FS-SSD)
to quickly and accurately detect small objects from aerial angles. The method was based
on the SSD detector, which adjusts the feature fusion module by adding an extra branch
of the deconvolution module and average pooling to form a special feature pyramid. In
addition, the method combines the spatial relationship of objects with the detection task,
which further improves the detection accuracy.
Although advanced target detection methods have played a crucial role in promoting
UAV multitarget detection tasks, most of these methods require huge memory overhead
and computing requirements, and it is difficult to directly deploy in low-power image
processors, such as edge devices. The emergence of YOLO series detection networks has
solved this problem. This series of models has currently iterated through eight official
versions and multiple branch versions [21]. The standard YOLO model can usually be
divided into three parts: backbone, neck, and head. Among them, backbone is a feature
extraction network which is used to extract feature information from images [22,23]; neck
can fuse the features extracted from backbone, making the features learned by the network
more diverse and improving the performance of the detection network; head can make
accurate predictions by utilizing previous high-quality feature engineering. Almost every
generation of YOLO models has made corresponding improvements and enhancements
in these three structures. Due to their outstanding performance in detection accuracy
and speed, the YOLO series models have been widely used in industries, remote sensing,
transportation, medicine, and other fields [24]. At present, scholars have conducted corre-
sponding research on the application of YOLO and other lightweight models in the field of
UAV aerial image recognition [25–28]. For example, in reference [25], aiming at the contra-
dictory problem that the resources of the UAV deployment platform are limited but the
requirements for real-time reasoning are relatively high, an adaptive model compression
method was proposed to reduce the number of parameters and computation of the model.
By designing a “transfer factor” in the process of model pruning, this method can judge
whether to prune a certain type of channel through the scale factor and can appropriately
suppress the effect of pruning on the subsequent structure through the transfer factor.
Drones 2023, 7, 304 4 of 26
Thus, the model can automatically prune the convolutional layer channels. The method is
validated in the YOLOv3-SPP3 model. Reference [26] focused on improving the inference
speed of the deep model by comparing the accuracy and real-time performance of several
common detection frameworks; the UAV aerial image detection model UAV-Net was finally
built on the basis of SSD. Due to the improvement of the backbone and neck and the use of
the automatic pruning method, the size of the model is only 0.4 MB, which has excellent
universality. To address the balance between detection accuracy and computational cost,
reference [27] proposed a new large-scale marine target detection method for UAVs based
on YOLOv5. On the one hand, the algorithm introduces the transformer idea to enhance
feature engineering, which improves the detection accuracy of small objects and occluded
objects; on the other hand, the use of linear transformation with a simple structure and fast
calculation replaces part of the convolution structure, reducing the number of parameters
of the model. The experimental results show that compared with other advanced models,
this method has certain advantages in detection accuracy, recall rate, average precision,
and number of parameters. Reference [28] proposed an insulator defect detection method
that integrates mobile edge computing and deep learning. This method is based on the
YOLOv4 detector and uses the lightweight network MobileNetv3 to replace the original
backbone, which greatly reduces the network parameters. In addition, by improving the
activation function in MobileNetv3 and optimizing the loss function of the model, the
comprehensive quality of model checking is further improved. In addition, due to the
introduction of the particle swarm optimization idea, the algorithm can efficiently split the
deep neural network within limited time and computing resources.
object detection: on the one hand, due to the lack of attention to large-scale feature maps, the detection model may ignore some useful features and reduce the detection quality; on the other hand, even if the fusion and supplementation of B, P, and N features are considered, the reuse rate of features is low, and the original features lose some information after a long upsampling and downsampling path. Therefore, the following adjustments were made to the neck structure for the UAV aerial photography dataset:
First, we refocused on large-scale feature maps. An upsampling process was added to the FPN and fused with the B2 layer features in backbone to improve the detection effect of small targets. Similar to the previous upsampling process in FPN, the C2f module was used to further improve the quality of feature extraction after feature fusion. The C2f module is an improvement on the original C3 module, which mainly refers to the advantage of the ELAN structure in YOLOv7 with richer gradient information. This module reduces one standard convolutional layer and makes full use of the bottleneck module to expand the gradient branch to obtain richer gradient flow information while ensuring light weight. Its basic structure is shown in Figure 1.
Figure 2. Improvement scheme at the neck.
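To relate the C2f description above to code, the following is a minimal PyTorch-style sketch of a C2f-like block: a 1 × 1 convolution splits the features into two branches, a chain of bottlenecks progressively adds gradient branches, and all intermediate branches are concatenated before an output convolution. The class and argument names (Bottleneck, C2fLike, n, shortcut) are illustrative assumptions, not the YOLOv8 source code.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional shortcut, as used inside C2f-style blocks."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2fLike(nn.Module):
    """C2f-style block: split features, expand the gradient branch with n bottlenecks,
    then concatenate every intermediate branch before the output convolution."""
    def __init__(self, c_in, c_out, n=2, shortcut=True):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, 2 * self.c, 1, bias=False),
                                 nn.BatchNorm2d(2 * self.c), nn.SiLU())
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))
        self.cv2 = nn.Sequential(nn.Conv2d((2 + n) * self.c, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # two branches of c channels each
        for m in self.m:
            y.append(m(y[-1]))                   # each bottleneck adds a new gradient branch
        return self.cv2(torch.cat(y, dim=1))
```

As a usage example, C2fLike(256, 256, n=3) applied to a 1 × 256 × 40 × 40 tensor returns a tensor of the same spatial size, which is why the module can be dropped into both backbone and neck positions.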
3.2. Improvement of the Backbone

The conventional convolution module and C2f module were used in YOLOv8 to achieve high-quality feature extraction and downsampling of images. However, due to the addition of an upsampling process in the neck part and the use of Bi-PAN-FPN, the number of parameters and the complexity of the model were increased to a certain extent. This article introduces the Ghostblock idea in backbone and uses this structure to replace some C2f modules. Ghostblock is an optimization of the lightweight convolution method GhostNet [32]. Its advantages are mainly reflected in two parts. On the one hand, Ghostblock follows the essence of GhostNet: it first uses conventional convolution to generate the original feature map and then combines various linear transformation operations to enhance the feature map's information, which ensures feature diversity while efficiently extracting features. On the other hand, a decoupled fully connected (DFC) attention mechanism is proposed [33]. This mechanism avoids the limitations of traditional attention algorithms in terms of computational complexity and captures feature information over long distances. These advantages improve the quality of feature engineering of the entire structure. Specifically, the convolution form used in GhostNet is called the cheap operation. Its implementation process is shown in Equations (3) and (4):

$$Y' = X * F_{1\times1} \tag{3}$$

$$Y = \mathrm{Concat}\big([\,Y',\ Y' * F_{dp}\,]\big) \tag{4}$$

where X ∈ R^(C_in×H×W) and Y′ ∈ R^(C′_out×H×W); F_{1×1} represents pointwise convolution; F_dp represents depth-wise convolution; and C′_out ≤ C_out. Unlike conventional standard convolution, at the beginning of the implementation of the cheap operation, only pointwise convolution is considered, to obtain a feature map smaller than the actual output standard in proportion (one-half by default); then, depth-wise convolution acts on these feature maps to achieve a linear transformation process. Finally, the feature maps of the two steps are spliced to obtain the output result. This processing method significantly reduces the parameter cost and computational cost by reusing features and discards redundant information that may exist in conventional convolutions. However, the drawbacks of doing so are also obvious: pointwise convolution loses the interaction process with other pixels in space, which results in only the feature map obtained using depth-wise convolution capturing spatial information. The representation of spatial information is therefore significantly weakened, which affects the detection accuracy of the model. In addition, the convolutional structure can only focus on local information, while a self-attention mechanism that can focus on global information easily increases the complexity of the model.

The DFC attention mechanism can address the above problems well. The core idea is to directly use a structurally simple depth-wise separable process to obtain an attention map with global information. The specific calculation process is shown in Equations (5) and (6):

$$\alpha'_{h\omega} = \sum_{h'=1}^{H} F_{h,h'\omega}\, X_{h'\omega}, \quad h = 1, 2, \cdots, H,\ \omega = 1, 2, \cdots, W \tag{5}$$

$$\alpha_{h\omega} = \sum_{\omega'=1}^{W} F_{\omega,h\omega'}\, \alpha'_{h\omega'}, \quad h = 1, 2, \cdots, H,\ \omega = 1, 2, \cdots, W \tag{6}$$

where X ∈ R^(C×H×W), which is consistent with the input in Equation (3); F is a depth-wise separable convolution process divided into horizontal (K_W × 1) and vertical (1 × K_H) directions; α′ is the attention map in the vertical direction; and α is the attention map based on α′ in the horizontal direction. The decoupling of the two directions greatly simplifies the process of extracting the global information of features. At the same time, due to the use of depth-wise separable structures such as 1 × K_H and K_W × 1, the complexity of the DFC is greatly reduced (full connection: O(H²W + HW²); DFC: O(K_H·HW + K_W·HW)). Ghostblock combines the cheap operation with DFC, which greatly reduces the complexity of the model while taking the global information of features into account. Its structure is shown in Figure 3.
Among them, the prediction category loss is essentially the cross entropy loss, and the
expression is:
$$f_{BCEL} = weight[class]\Big(-x[class] + \log\Big(\sum_{j}\exp(x[j])\Big)\Big) \tag{8}$$
where class denotes the target category index; weight[class] denotes the weight assigned to each class;
and x is the probability value after sigmoid activation. DFL is an optimization of the focal
loss function, which generalizes the discrete results of classification into continuous results
through integration. The expression is:

$$f_{DFL} = -\big((y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1})\big) \tag{9}$$

where y_i and y_{i+1} represent the values on the left and right sides of the continuous label y,
satisfying y_i < y < y_{i+1} and y = ∑_{i=0}^{n} P(y_i)·y_i; in the equation, P can be implemented
through a softmax layer, so that P(y_i) is written as S_i.
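To make Equations (8) and (9) concrete, the short snippet below evaluates the weighted cross-entropy classification term and the DFL term for a toy batch; the tensor shapes and values are illustrative only, not taken from the experiments.

```python
import torch
import torch.nn.functional as F

# Eq. (8): the classification loss is weighted cross entropy over class logits x.
x = torch.randn(4, 10)                # logits for 4 predictions over 10 classes
cls = torch.tensor([1, 3, 3, 7])      # target class indices
w = torch.ones(10)                    # per-class weights, "weight[class]"
f_bcel = F.cross_entropy(x, cls, weight=w)

# Eq. (9): DFL spreads a continuous regression target y over its two neighbouring
# integer bins y_i and y_{i+1}, weighting their log-probabilities by proximity.
bins = torch.randn(4, 16)             # per-box logits over n+1 discrete bins
y = torch.tensor([3.3, 7.9, 0.2, 11.6])
yi = y.floor().long()
S = bins.softmax(dim=1)
f_dfl = -((yi + 1 - y) * S.gather(1, yi[:, None]).squeeze(1).log()
          + (y - yi) * S.gather(1, (yi + 1)[:, None]).squeeze(1).log()).mean()
```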
Different from the CIoU loss used in YOLOv8, the Wise-IoU loss function is used here
as the bounding box regression loss [34]. On the one hand, when the labeling quality of
the training data is low, the loss function combines a dynamic nonmonotonic focusing
mechanism to evaluate the quality of the anchor frame by using the “outlier” to avoid
excessive penalties for geometric factors (such as distance and aspect ratio) to the model.
On the other hand, when the prediction box has a high degree of coincidence with the target
box, the loss function makes the model obtain better generalization ability with less training
intervention by weakening the penalty of geometric factors. Based on this, this paper uses
Wise-IoU v3 with a two-layer attention mechanism and a dynamic nonmonotonic FM
mechanism. Its expression is as follows:
$$f_{BBRL} = \Big(1 - \frac{W_i H_i}{S_u}\Big)\exp\Big(\frac{(x_p - x_{gt})^2 + (y_p - y_{gt})^2}{(W_g^2 + H_g^2)^*}\Big)\gamma \tag{10}$$
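The following is a minimal sketch of how a Wise-IoU v3 style bounding-box loss with the dynamic nonmonotonic focusing mechanism can be computed, following the description above and reference [34]; the function name, the running-mean argument, and the gain formula with hyperparameters α and δ are assumptions of this sketch rather than the authors' exact code.

```python
import torch

def wise_iou_v3(pred, target, iou_mean, alpha=1.9, delta=3.0):
    """Sketch of a Wise-IoU v3 style loss for boxes in (x1, y1, x2, y2) format.
    `iou_mean` is a running mean of the IoU loss used to normalise the outlier degree."""
    # Intersection and union areas (W_i H_i and S_u in Eq. (10)).
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    loss_iou = 1 - inter / union

    # Distance-based attention term from Eq. (10); the enclosing-box size is detached
    # so that this factor does not produce its own gradients (the "*" in the equation).
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    enclose_lt = torch.min(pred[:, :2], target[:, :2])
    enclose_rb = torch.max(pred[:, 2:], target[:, 2:])
    wg = (enclose_rb[:, 0] - enclose_lt[:, 0]).detach()
    hg = (enclose_rb[:, 1] - enclose_lt[:, 1]).detach()
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2) / (wg ** 2 + hg ** 2))

    # Dynamic nonmonotonic focusing: the "outlier" degree beta rates anchor-box quality,
    # and the gain down-weights both very high and very low quality anchor boxes.
    beta = loss_iou.detach() / iou_mean
    gain = beta / (delta * alpha ** (beta - delta))
    return (gain * r_wiou * loss_iou).mean()
```

In training, iou_mean would be maintained as an exponential moving average of loss_iou across batches, which is what makes the focusing mechanism dynamic.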
Figure 4. Schematic diagram of the Wise-IoU solution.

To date, the improved aerial image detection model based on YOLOv8 is shown in Figure 5. Compared with the original YOLOv8, the neck, backbone, and loss functions have been improved. The specific changes are located in the graphic labels in the figure.
Figure 6. Sample images of the VisDrone dataset: (g) very small target; (h) complex background; (i) night + small target + occlusion.
There are 10 categories of objects in the dataset. This paper divides the entire dataset into a training set (6471 samples), validation set (548 samples), and test set (1610 samples) according to the dataset division method of the VisDrone 2019 challenge. Considering that the sample images contain a large number of small targets, to make the detection process take into account the requirements of real-time performance and accuracy, the sample size was normalized to 640 × 640. Such a size can make the model truly deployable to edge devices without destroying too much of the useful information of the image. In terms of hardware
and software, we used an Intel(R) Core(TM) i9-12900K processor (16 cores and 24 threads, main frequency of 3.19 GHz), 32 GB of running memory, a GeForce RTX 3090Ti graphics processor, and 24 GB of video memory; the deep learning framework used PyTorch 1.9.1 and Torchvision 0.10.1; YOLOv8's benchmark version was Ultralytics 8.0.25. To ensure the
fairness and comparability of the model effects, all ablation experiments and various model
training processes in the comparison experiments did not use any pretraining weights. In
addition, considering that only edge devices can be used to realize real-time target detection
and reasoning on UAVs, such limitations require a small number of model parameters, less
memory occupation, and a short inference time. Therefore, YOLOv8-s was used as the
benchmark model for improvement and promotion. This model follows all the ideas of
the v8 series and is only scaled in network width and depth. The important parameters
of the training process were set as shown in Table 1. In the table, image scale, image
flip left-right, mosaic, and image translation are all data enhancement methods, and the
following parameters indicate the probability of their occurrence.
Table 1. Important parameters of the training process.

Parameter | Setup
Epochs | 150
Batch size | 8
Optimizer | SGD
NMS IoU | 0.7
Initial learning rate | 1 × 10−2
Final learning rate | 1 × 10−4
Momentum | 0.937
Weight decay | 5 × 10−4
Image scale | 0.5
Image flip left-right | 0.5
Mosaic | 1.0
Image translation | 0.1
α (Wise-IoU) | 1.9
δ (Wise-IoU) | 3
Close mosaic | Last 10 epochs
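For readers who want to reproduce a comparable setup, the sketch below shows how hyperparameters of this kind are typically passed to an Ultralytics 8.x style training call; the argument names reflect our reading of that interface, the dataset file is a placeholder, and none of this is the authors' actual script.

```python
from ultralytics import YOLO

# Hypothetical reproduction of the Table 1 settings on a YOLOv8-s baseline.
model = YOLO("yolov8s.yaml")      # build from scratch: no pretraining weights are used
model.train(
    data="VisDrone.yaml",         # placeholder dataset description file
    epochs=150,
    batch=8,
    imgsz=640,
    optimizer="SGD",
    lr0=1e-2,                     # initial learning rate
    lrf=1e-2,                     # final LR factor, so lr0 * lrf = 1e-4
    momentum=0.937,
    weight_decay=5e-4,
    iou=0.7,                      # NMS IoU threshold
    scale=0.5,                    # image scale augmentation
    fliplr=0.5,                   # image flip left-right probability
    mosaic=1.0,
    translate=0.1,                # image translation
    close_mosaic=10,              # disable mosaic for the last 10 epochs
)
```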
Summarizing the subcategory results and overall results in Tables 2 and 3 and Figure 7, the following conclusions can be drawn:
1. The A model (i.e., the benchmark model) performed poorly. Its accuracy indicators were in the lowest position, but its FPS was in first place, reaching 182/f.s−1. This indicates that even though the number of model parameters was reduced (only 9.659 million) in the improved model, it still increased the number of network layers and some inference time. The FPS index of the improved model reached 167/f.s−1, which can also ensure real-time requirements in actual deployment.
2. After integrating the B, C, and D structures, the model improved performance in
several aspects: by focusing on small target features, multiplexing multiscale features,
suppressing information loss during long-range feature transmission, and taking
into account anchor boxes of different qualities, feature engineering was significantly
strengthened. This can be seen from each single-category indicator. In most cases,
whenever a structure was added, the performance of the P, R, and AP indicators of
the model were improved to a certain extent. However, after incorporating the D
structure, the model’s indicator data in some categories were not as good as before.
That is, in some cases, A + B + C was better than A + B + C + D, but this did not affect
the overall trend.
3. On the whole, the sequential improvement of the three structures made the model improve its accuracy on the VisDrone dataset continuously, the number of parameters gradually decreased, and the final model size was only 19.157 MB. This shows that the improvement to the baseline model is feasible and effective, taking into account the accuracy and speed of edge device detection scenarios. The detection effect of some scenes is shown in Figure 8.
The detection samples selected in Figure 8 are all test set samples. Regardless of the
scenario, the constructed model had strong detection ability, and the robustness met the
actual engineering needs. However, in the detection tasks of small targets and dense targets,
the model inevitably missed detection and false detection. For example, due to the high
similarity between the truck class and the bus class in the high-altitude perspective, there
were many misdetected targets in the detection process, and if car-class and pedestrian-class targets were too small, they could be treated by the model as background and missed.
By comprehensively analyzing the relevant data in Table 4 and Figure 9, the perfor-
mance and comparison results of each model can be summarized as follows:
1. MobileNetv2-SSD had the worst overall performance. This model had the lowest number of parameters, only 3.94 million. In the target detection task, its R index was the lowest in both the validation set and the test set, which means that the model had a large number of missed detections. However, its P value showed high performance, and it can be seen that the objects detected by the model were easier to identify, except for the missed objects. The above situation is mainly because the VisDrone dataset has high requirements for the target detection model in terms of shooting angle, target size, and environmental complexity. MobileNetv2-SSD often has high applicability in simpler tasks, but it is not suitable for this type of complex task.
2. YOLOX-s also had the above problems. The R value of the model on this dataset was relatively low, and the missed detection rate was high. The P value achieved the best results in both the validation set and the test set, which made the YOLOX-s model achieve better results (better than those of YOLOv4-s). However, the model had the worst FPS performance. The performance of YOLOv4-s was relatively mediocre, and it was only better than MobileNetv2-SSD in the detection task, but the R value of this model was greatly improved compared with the previous two models, and the missed detection rate decreased.
3. The two lightweight models, YOLOv5-s and YOLOv7-tiny, both achieved excellent performance in the test set. Especially for YOLOv5-s, after the official iteration of multiple versions, the overall performance was greatly improved. The P and R indicators of the two models were in a relatively balanced state, and the detection rate and the correct rate were relatively coordinated. YOLOv7-tiny was the smallest model. The performance of YOLOv5-s and YOLOv7-tiny in the test set was second only to the lightweight model proposed in this paper, and they were also suitable for target detection tasks in UAV aerial images.
4. The MAP index of the model proposed in this paper was optimal in both the validation set and the test set. Compared with the non-lightweight model YOLOv5-m, the P, R, and MAP metrics all performed comparably or better. From the three indicators of FPS, parameters, and model size, the model performance was in the first echelon, which shows that the model had the best comprehensive performance. In the UAV aerial image target detection task involved in this paper, the proposed model met the needs of actual production scenarios in terms of detection accuracy and deployment difficulty and had considerable robustness and practicability.

Figure 9. Comparison of the normalization effect of experimental indicators (comparison experiment).
4.4. Interpretability Experiments
Deep learning is often referred to as the “black box”. Although deep learning models
are widely used in various types of engineering fields, due to the lack of interpretability of
algorithms, deep learning has not made great progress in some high-tech fields. Therefore,
deep learning interpretability is the mainstream direction of artificial intelligence research.
UAVs play a pivotal role in intelligent agriculture, the military, and other fields, and
the in-depth discussion of interpretability is a key link in the establishment of their deep
models. The experiment selected lightweight deep models that performed well in
Section 4.3 as validation objects (YOLOv5-s, YOLOv7-tiny, and the model in this article).
After fully discussing their performances in the confusion matrix of the VisDrone dataset,
we used Gradient-weighted Class Activation Mapping (Grad-CAM) to visually analyze
the attention areas of the three models [41]. Figure 10 shows the confusion between the
categories of the three models.
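For reference, the snippet below is a minimal sketch of the Grad-CAM computation [41] used to produce attention maps like those in Figure 11: a class confidence score is backpropagated to a chosen backbone layer, the channel-wise mean gradients become weights, and the weighted activations are collapsed into a heatmap. The choice of layer and the scalar score function are simplifying assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, image, score_fn):
    """Minimal Grad-CAM: `layer` is the backbone module to inspect and `score_fn`
    reduces the model output to a scalar class-confidence score."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    try:
        score = score_fn(model(image))
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)        # gradient-derived channel weights
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)                            # normalised heatmap in [0, 1]
```

A typical call would pass the last backbone stage as `layer` and the maximum class confidence for the category of interest as `score_fn`.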
It can be seen that all three models had a large missed detection rate (that is, each category was identified as a background category). The detailed analysis shows that the recognition of cars and buses was relatively good; the difficult-to-recognize categories included bicycles and people; trucks, tricycles, and awning-tricycles had the worst recognition effect. In view of the above characteristics, Grad-CAM was selected in the experiment to show the attention for some special categories, and the reasons for the performance of the three models were explained from the perspective of interpretability. Grad-CAM is based on the gradients calculated by backpropagation of the class confidence scores and generates corresponding weights. Since the weights contain category information, they have great positive significance for the final detection performance. Specifically, we focus on the display of the output layer effect of the backbone part of each model and, based on this, analyze the attention areas in truck, people, and bicycle detection. The experimental results are shown in Figure 11.

Figure 10. Confusion matrix diagram of the three models. (a) YOLOv5-s confusion matrix diagram. (b) YOLOv7-tiny confusion matrix diagram. (c) Confusion matrix of the model in this paper.

As shown in Figure 11, when detecting the same class in the same image, YOLOv7-tiny, YOLOv5-s, and the model in this paper showed an evolution of attention from “narrow” to “wide”. YOLOv7-tiny, which had the worst performance among the three models, had a certain area of concern for similar targets but could not cover a large number of targets. This was especially true when there were occlusions and tiny types of targets. YOLOv5-s had a greater improvement than YOLOv7-tiny: the focus area in YOLOv5-s was significantly improved, which played a crucial role in comprehensively improving the detection accuracy of the various categories. The Grad-CAM map of the model in this paper is the best, and the dark red parts (focus areas) are the same type of targets. In the visual interface of the bicycle class, it was the only model among the three that fully considered the bicycle class on both sides of the street. This is in line with the original intention of designing a model to detect tiny objects. In summary, the model in this paper achieves the best results in terms of interpretability.

Figure 11. Grad-CAM visualization. (a) Original image. (b) YOLOv7-tiny Grad-CAM map. (c) YOLOv5-s Grad-CAM map. (d) Our model Grad-CAM diagram.

Figure 12. DJI Mavic 3 drone shooting live scenes.
There was verification in Section 4.3 that MobileNetv2-SSD is not suitable for complex
tasks such as drone multitarget detection. Therefore, referring to Section 4.3, YOLOv4-
s, YOLOv5-s, YOLOX-s, YOLOv7-tiny, and YOLOv8-s were selected as the comparison
objects in this section. The paper names YOLOv4-s, YOLOv5-s, YOLOX-s, YOLOv7-tiny,
YOLOv8-s, and the proposed model as A, B, C, D, E, and F. The comparison indicators
are consistent with those in Table 3 of the ablation experiment, and all models do not
use pretraining weights for training. The experimental results are shown in Table 6 and
Figure 14.
Onboard memory (GB): 8
Shot: Hasselblad, telephoto camera
Angle of view (°): 84, 15
Equivalent focal length (mm): 24, 162
Aperture: f/2.8–f/11, f/4.4
Pixel (w): 2000, 1200
Figure 14. Comparison of the normalization effect of experimental indicators (self-built dataset).

Based on the comprehensive analysis of the relevant data in Table 6 and Figure 14, the performance effects and comparison results of each model can be summarized as follows:
1. YOLOv4-s and YOLOv7-tiny performed similarly on the self-built dataset, both obtaining relatively low MAP values on the test set. Although YOLOv7-tiny had the lowest model size and number of parameters, its universal performance was not excellent. However, these two models can still be used in occasions where precision requirements are not critical. Both had more than 150/f.s−1 in FPS and are capable of being deployed in edge devices;
2. YOLOv5-s, YOLOX-s, and YOLOv8-s all achieved excellent results, approximately at the same level of detection accuracy, but YOLOv8-s had the best accuracy. In terms of FPS, YOLOv5-s and YOLOv8-s both exceeded 300/f.s−1, but YOLOX-s did not reach 100 in this indicator, indicating that the former two have considerable advantages in detection accuracy and speed;
3. The model in this article led the original YOLOv8-s by nearly two percentage points in the MAP of the test set, and led the worst-performing YOLOv7-tiny by 18.4 percentage points. At the same time, the FPS reached 294/f.s−1, achieving a good balance between detection accuracy and speed. This also indicates that the model in this paper achieved the best detection performance in various scenarios and datasets, and has strong universality. The partial detection performance of this model on the test set is shown in Figure 15. It can be seen that for small targets, there is basically no missed detection phenomenon in the model. However, in some cases, redundant detection boxes may appear, and in a few cases, similar backgrounds may be mistaken for targets.
5. Conclusions
This paper proposes an aerial image detection model based on YOLOv8-s, which
can accurately detect aerial image targets in real time under the premise of satisfying
the deployment of edge devices. This model overcomes the negative effects of shooting
angle, light, background, and other factors on the detection task. Specifically: First, in
view of the common problem that small targets in aerial images are prone to misdetection
and missed detection, the idea of Bi-PAN-FPN is introduced to improve the neck part in
YOLOv8-s. By fully considering and reusing multiscale features, a more advanced and
complete feature fusion process is achieved while maintaining the parameter cost as much
as possible. Second, the GhostblockV2 structure is used in the backbone of the benchmark
model to replace part of the C2f module, which suppresses information loss during long-
distance feature transmission while significantly reducing the number of model parameters;
finally, WiseIoU loss is used as bounding box regression loss, combined with a dynamic
nonmonotonic focusing mechanism, and the quality of anchor boxes is evaluated by using
“outlier” so that the detector takes into account different quality anchor boxes to improve
the overall performance of the detection task. In this paper, the authoritative dataset
VisDrone in the field of international drone vision is used as the experimental verification
object; ablation experiments, comparison experiments, and interpretability experiments
are designed; and the feasibility and effectiveness of the proposed method are expounded
from multiple perspectives. The results show that the proposed improved method does
play an obvious role in aerial image detection. Compared with the baseline model, the
MAP performance of our method on the test set is improved by 9.06%, and the number of
parameters is reduced by 13.21%. Compared with the other six comparison algorithms, the
method in this paper achieved the best performance in terms of accuracy. The performance
of this method has strong interpretability. In addition, the model also achieved the optimal
detection accuracy (MAP: 91.7%) on the self-built dataset, with an FPS of up to 293/f.s−1.
In general, the proposed method is suitable for deployment in complex working conditions,
and also has considerable universality and robustness.
However, a problem was also exposed during the experiment: from the ablation
experiments, the model in this paper cannot achieve better results than other structures
in all small categories. For example, the performance of tricycle and bus is not as good as
that of A + B + C, and the performance of van is not as good as that of A + B. In future
research, we will focus on the above problems, combined with customized detection tasks,
and explore the adaptive adjustment of the model structure from the perspectives of model
hyperparameters and network composition. In addition, training deep learning networks
often requires a large number of labeled images, which is often unrealistic in aerial image
detection tasks. In future research, we will focus on using unsupervised theory to act on
public datasets and self-built datasets by reducing the data distribution differences between
the source domains and target domains, reducing the dependence of deep learning on
labeled data.
Author Contributions: Conceptualization, Y.L. and Q.F.; methodology, Y.L.; software, Y.L.; validation,
Y.L., Q.F. and Z.H.; formal analysis, Q.G.; investigation, Q.G.; resources, H.H.; data curation, Y.L.;
writing—original draft preparation, Y.L.; writing—review and editing, Y.L.; visualization, Y.L.;
supervision, Q.F.; project administration, Y.L. and H.H.; funding acquisition, Y.L. All authors have
read and agreed to the published version of the manuscript.
Funding: This research was funded by the Youth Science and Technology Talent Growth Project of
Guizhou Provincial Department of Education (No. KY [2022] 199), the Research Fund of Guizhou
University of Finance and Economics (No. 2021KYYB08), the National Natural Science Foundation
of China (No. 52165063), the Guizhou Provincial Science and Technology Plan Project (No. ZK
[2021]337), the Open Fund Project supported by the Key Laboratory of Advanced Manufacturing
Technology Ministry of Education, China (No. QianJiaoJi [2022]436), the Guizhou Province Graduate
Research Fund (YJSCXJH [2021] 068), and Guizhou Provincial Science and Technology Plan Project
(ZK[2023]029).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Adaimi, G.; Kreiss, S.; Alahi, A. Perceiving Traffic from Aerial Images. arXiv 2020, arXiv:2009.07611.
2. Bouguettaya, A.; Zarzour, H.; Kechida, A.; Taberkit, A.M. Vehicle Detection from UAV Imagery with Deep Learning: A Review.
IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6047–6067. [CrossRef] [PubMed]
3. Byun, S.; Shin, I.-K.; Moon, J.; Kang, J.; Choi, S.-I. Road Traffic Monitoring from UAV Images Using Deep Learning Networks.
Remote Sens. 2021, 13, 4027. [CrossRef]
4. Chang, Y.-C.; Chen, H.-T.; Chuang, J.-H.; Liao, I.-C. Pedestrian Detection in Aerial Images Using Vanishing Point Transformation
and Deep Learning. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece,
7–10 October 2018; pp. 1917–1921.
5. Božić-Štulić, D.; Marušić, Ž.; Gotovac, S. Deep Learning Approach in Aerial Imagery for Supporting Land Search and Rescue
Missions. Int. J. Comput. Vis. 2019, 127, 1256–1278. [CrossRef]
6. Chen, C.; Zhang, Y.; Lv, Q.; Wei, S.; Wang, X.; Sun, X.; Dong, J. Rrnet: A Hybrid Detector for Object Detection in Drone-Captured
Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea,
27–28 October 2019; pp. 100–108.
7. Chen, Y.; Lee, W.S.; Gan, H.; Peres, N.; Fraisse, C.; Zhang, Y.; He, Y. Strawberry Yield Prediction Based on a Deep Neural Network
Using High-Resolution Aerial Orthoimages. Remote Sens. 2019, 11, 1584. [CrossRef]
8. Chen, Y.; Li, J.; Niu, Y.; He, J. Small Object Detection Networks Based on Classification-Oriented Super-Resolution GAN for UAV
Aerial Imagery. In Proceedings of the 2019 Chinese Control and Decision Conference (CCDC), Nanchang, China, 3–5 June 2019;
pp. 4610–4615.
9. Cai, W.; Wei, Z. Remote Sensing Image Classification Based on a Cross-Attention Mechanism and Graph Convolution. IEEE
Geosci. Remote Sens. Lett. 2020, 19, 1–5. [CrossRef]
10. Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A Global-Local Self-Adaptive Network for Drone-View Object
Detection. IEEE Trans. Image Process. 2020, 30, 1556–1569. [CrossRef]
11. Domozi, Z.; Stojcsics, D.; Benhamida, A.; Kozlovszky, M.; Molnar, A. Real Time Object Detection for Aerial Search and Rescue
Missions for Missing Persons. In Proceedings of the 2020 IEEE 15th International Conference of System of Systems Engineering
(SoSE), Budapest, Hungary, 2–4 June 2020; pp. 519–524.
12. Hong, S.; Kang, S.; Cho, D. Patch-Level Augmentation for Object Detection in Aerial Images. In Proceedings of the IEEE/CVF
International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 127–134.
13. Dong, J.; Ota, K.; Dong, M. UAV-Based Real-Time Survivor Detection System in Post-Disaster Search and Rescue Operations.
IEEE J. Miniat. Air Space Syst. 2021, 2, 209–219. [CrossRef]
14. Hsieh, M.-R.; Lin, Y.-L.; Hsu, W.H. Drone-Based Object Counting by Spatially Regularized Regional Proposal Network. In
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4145–4153.
15. Liao, J.; Piao, Y.; Su, J.; Cai, G.; Huang, X.; Chen, L.; Huang, Z.; Wu, Y. Unsupervised Cluster Guided Object Detection in Aerial
Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11204–11216. [CrossRef]
16. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward Accurate and Efficient Object Detection on Drone Imagery. In Proceedings of
the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 1026–1033.
17. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-Resolution Detection Network for Small Objects. In Proceedings of the 2021 IEEE
International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
18. Qingyun, F.; Dapeng, H.; Zhaokui, W. Cross-Modality Fusion Transformer for Multispectral Object Detection. arXiv 2021,
arXiv:2111.00273.
19. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered Object Detection in Aerial Images. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320.
20. Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small Object Detection in Unmanned Aerial Vehicle Images Using Feature Fusion
and Scaling-Based Single Shot Detector with Spatial Context Analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1758–1770.
[CrossRef]
21. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A Survey of Deep Learning-Based Object Detection. IEEE Access 2019, 7,
128837–128868. [CrossRef]
22. Cai, W.; Ning, X.; Zhou, G.; Bai, X.; Jiang, Y.; Li, W.; Qian, P. A Novel Hyperspectral Image Classification Model Using Bole
Convolution With Three-Direction Attention Mechanism: Small Sample and Unbalanced Learning. IEEE Trans. Geosci. Remote
Sens. 2023, 61, 1–17. [CrossRef]
23. Li, J.; Li, B.; Jiang, Y.; Tian, L.; Cai, W. MrFDDGAN: Multireceptive Field Feature Transfer and Dual Discriminator-Driven
Generative Adversarial Network for Infrared and Color Visible Image Fusion. IEEE Trans. Instrum. Meas. 2023, 72, 1–28.
[CrossRef]
24. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2019, 111, 257–276. [CrossRef]
25. Chen, Y.; Li, R.; Li, R. HRCP: High-Ratio Channel Pruning for Real-Time Object Detection on Resource-Limited Platform.
Neurocomputing 2021, 463, 155–167. [CrossRef]
26. Ringwald, T.; Sommer, L.; Schumann, A.; Beyerer, J.; Stiefelhagen, R. UAV-Net: A Fast Aerial Vehicle Detector for Mobile
Platforms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach,
CA, USA, 16–17 June 2019; pp. 544–552.
27. Li, Y.; Yuan, H.; Wang, Y.; Xiao, C. GGT-YOLO: A Novel Object Detection Algorithm for Drone-Based Maritime Cruising. Drones
2022, 6, 335. [CrossRef]
28. Deng, F.; Xie, Z.; Mao, W.; Li, B.; Shan, Y.; Wei, B.; Zeng, H. Research on Edge Intelligent Recognition Method Oriented to
Transmission Line Insulator Fault Detection. Int. J. Electr. Power Energy Syst. 2022, 139, 108054. [CrossRef]
29. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
30. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
31. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
32. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More Features from Cheap Operations. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
33. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance Cheap Operation with Long-Range Attention. arXiv
2022, arXiv:2211.12905.
34. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023,
arXiv:2301.10051.
35. Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van Gool, L.; Han, J. VisDrone-DET2021: The Vision
Meets Drone Object Detection Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
Montreal, BC, Canada, 11–17 October 2021; pp. 2847–2854.
36. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
37. Fang, Y.; Guo, X.; Chen, K.; Zhou, Z.; Ye, Q. Accurate and Automated Detection of Surface Knots on Sawn Timbers Using
YOLO-V5 Model. BioResources 2021, 16, 5390–5406. [CrossRef]
38. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding Yolo Series in 2021. arXiv 2021, arXiv:2107.08430.
39. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object
Detectors. arXiv 2022, arXiv:2207.02696.
40. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching
for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea,
27 October–2 November 2019; pp. 1314–1324.
41. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-Cam: Visual Explanations from Deep Networks
via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy,
22–29 October 2017; pp. 618–626.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.