
UAV Target Detection Algorithm Based On Improved YOLOv8

The document presents a UAV target detection algorithm based on an improved YOLOv8 model. The model embeds a small target connection layer to better capture semantic information of small targets. It also introduces a global attention mechanism in the backbone to utilize feature information from different dimensions and improve detection performance. Experimental results on the VisDrone2021 dataset show the modified model increases mAP for small target detection by 4.4% compared to the baseline, outperforming methods like SSD and YOLO.


Received 21 September 2023, accepted 13 October 2023, date of publication 18 October 2023, date of current version 25 October 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3325677

UAV Target Detection Algorithm Based on Improved YOLOv8
FENG WANG, HONGYUAN WANG, ZHIYONG QIN, AND JIAYING TANG
School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213000, China
Corresponding author: Hongyuan Wang ([email protected])
This work was supported in part by the National Natural Science Foundation of China under Grant 61976028, and in part by the
Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant SJCX22_1431.

ABSTRACT UAVs usually fly at high altitudes, so small targets make up a large proportion of the objects in their images, which challenges current target detection algorithms. In addition, the high-speed flight of UAVs blurs the detected objects, which makes target features difficult to extract. To address these two problems, we propose a UAV target detection algorithm based on an improved YOLOv8. First, a small target detection structure (STC) is embedded in the network. It acts as a bridge between shallow and deep features, improving the collection of semantic information about small targets and enhancing detection accuracy. Second, because targets in UAV imagery depend on global information, the global attention mechanism (GAM) is introduced at the bottom layer of YOLOv8m's backbone. It prevents the loss of image feature information during sampling and increases detection performance by feeding back feature information from different dimensions. Experimental results on the VisDrone2021 dataset show that the modified model reaches an mAP of 39.3%, which is 4.4% higher than the baseline approach, and outperforms mainstream algorithms such as SSD and the YOLO series, effectively increasing the detection performance of UAVs for small targets.

INDEX TERMS UAV target detection, global attention mechanism, small target detection.

The associate editor coordinating the review of this manuscript and approving it for publication was M. Venkateshkumar.
2023 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 11, 2023 116534
F. Wang et al.: UAV Target Detection Algorithm Based on Improved YOLOv8

I. INTRODUCTION
UAVs are widely used in remote sensing imagery, agricultural rescue, disaster relief, video shooting, disaster surveillance, and industrial target detection [1], so the detection of small targets in UAV capture scenes has recently become a popular task. Images captured by UAVs contain a large percentage of small targets at limited resolution [2]. On the one hand, real capture conditions pose a great challenge to target recognition: targets occlude and overlap one another, and the high-speed, low-altitude flight of UAVs produces motion blur, which leaves fewer usable features in the images. On the other hand, existing neural networks are limited in their ability to capture the effective features of small targets, resulting in an imbalance between positive and negative samples in the data used to train the network model.

Current target detection algorithms are generally classified into candidate region-based algorithms and regression-based algorithms. Region-based Convolutional Neural Networks (R-CNN) is an object detection method that combines deep convolutional neural networks with region proposals; such methods are also called two-stage target detection algorithms. The first stage proposes the region boxes, and the second stage recognizes the target in each region box. The main examples are R-CNN [3], D2Det [4], etc. At present, Faster R-CNN [5] is one of the more widely used object detection methods based on deep convolutional neural networks. However, due to the enormous computational cost of its network structure and parameters, it has a sluggish detection speed and so cannot match the requirements for real-time detection in

several application areas. Especially for embedded systems, the required computation time is too long. Conversely, many methods sacrifice detection accuracy for detection speed. To achieve both accuracy and speed, methods such as YOLO [6] and SSD [7] have emerged. These regression-based methods directly regress the coordinates of the bounding box and the object class at multiple locations of the input image. The YOLO series algorithms have been popular across target detection research, and the latest YOLOv8 [8] series shows excellent performance in the field. This work provides an enhanced YOLOv8 algorithm for the industrial UAV target detection task and obtains good results on the UAV dataset VisDrone2021. This paper's primary contributions are as follows:
1) A small target connection (STC) layer is added to stitch the shallower feature map with the deeper feature map. This improves the fusion of features from different parts of the network, reduces the loss of semantic information during sampling, and improves detection performance on UAV-captured images.
2) The global attention mechanism (GAM) is introduced in the Backbone module to enable the algorithm to capture feature information in multiple dimensions, which fully exploits the visual representation of the receptive field in each dimension and improves performance in real acquisition scenarios.

II. RELATED WORK

A. SMALL TARGET DETECTION
With the rapid growth of deep learning technologies in recent years, image-based small target detection algorithms have been gradually improved by the unremitting efforts of researchers at home and abroad. Wang et al. [9] introduced a feature fusion module based on the Single Shot MultiBox Detector (SSD), which increased tiny target identification accuracy. Pang et al. [10] suggested a tiny target identification approach based on Faster-RCNN with multiscale fusion to enhance detection accuracy by enriching the target data acquired by UAVs, but this resulted in a complex model, a high number of parameters, and a slower detection time. Lim et al. [11] proposed a context- and attention-based small target detection algorithm, which focuses more on small targets in the acquired images and improves small target detection under certain conditions, but the required conditions are harsh and difficult to meet in realistic industrial UAV scenarios. Qing et al. [12] added a Transformer structure to the YOLOv5 backbone network, combined spatial attention and channel attention, and finally performed multi-scale feature fusion to improve detection accuracy. Zhu et al. [13] integrated Transformer Prediction Heads (TPH) into YOLOv5 for accurate object localization in high-density scenes. The integrated model can effectively enhance detection accuracy, but the addition of TPH adds a huge number of parameters, which affects the network's computing speed. Xia et al. [14] developed a way to convert small UAV detection into the prediction of residual images by integrating residual blocks into U-Net and fusing features of different scales for image reconstruction, which enhances detection performance on UAV-collected data. Fang et al. [15] refined this approach by learning a nonlinear mapping from the input image to the residual image and introduced multiscale feature fusion for comprehensive aggregation to enhance the detection capability.

FIGURE 1. As shown above, three convolutional structures and n BottleNeck modules are used in the C3 module structure.

B. ATTENTION MECHANISM
Xu et al. [16] pioneered the merging of computer vision and attention mechanisms with their proposed visual attention theory. Later, Hu et al. [17] proposed the SE module, which performs dynamic channel feature calibration on the network side to improve the representation. Hu et al. [18] proposed a spatial attention that uses parametric Gather-Excite operators to aggregate contextual information between feature map neighbors and adjusts the feature map based on the aggregated information. Inspired by the above methods, a series of studies such as CBAM [19], CoordAttention [20], SCSE [21], and others fused channel attention with spatial attention to achieve better results. The number of attention parameters increases significantly after fusion, so some studies simplify the model to keep it lightweight: GCNet [22] proposes a lightweight global context (GC) module, and ECA-Net [23] cuts the number of model parameters by introducing one-dimensional convolution. Across existing deep learning detection models, both the accuracy and the speed of detection are improving. Liao et al. [24] proposed a novel differentiated attention guidance network that adaptively enhances the discriminative features between UAV targets and complex backgrounds, and optimizes the real infrared UAV detection capability by introducing a new spatially-aware channel attention SCA


to integrate the model. To address the shortcomings of the above literature, this paper adopts the latest YOLOv8 as the baseline and proposes a small target detection algorithm for the UAV perspective based on an improved YOLOv8, tailored to the characteristics of small target detection tasks. To limit the rise in the number of parameters caused by global attention, only one attention layer is added in the backbone module, where it connects to the spatial pyramid pooling structure.

FIGURE 2. As shown in the figure above is the C2f module, which is designed with reference to the C3 module as well as the idea of ELAN, so that YOLOv8 can obtain richer gradient flow information while ensuring lightweight.

III. METHOD
YOLOv8 is the most recent version of Ultralytics' YOLO object recognition and image segmentation model. To keep the model lightweight, YOLOv8 builds on YOLOv5 and replaces the C3 module with a C2f module. The C3 module is mainly designed around the split-flow idea of CSPNet combined with a residual structure. The main-branch gradient module of this so-called C3 block is the BottleNeck module, and the number of stacked BottleNeck modules is controlled by the parameter n. As shown in Fig.1, three convolutional structures (Conv+BN+SiLU) and n BottleNeck modules are used in the C3 block. The C2f module combines the advantages of the C3 module and the ELAN [25] module in YOLOv7 [26] by adding more branching cross-layer links, allowing it to obtain richer gradient flow information while remaining lightweight. The structure of the C2f module is shown in Fig.2.

YOLOv8 introduces the Decoupled-Head [27] structure to extract the target location and category information separately, learn them through different network branches, and then fuse them. This module reduces the number of parameters and the computational complexity while improving the model's generalization and robustness. The Decoupled-Head structure is shown in Fig.3. In the Head module, the Anchor Free mechanism of YOLOX is also introduced to directly predict the edges of small targets. It filters out noisy interference terms in the labels, reduces the model's hyperparameters to lower network complexity, and keeps the computational speed in line with the high speed expected of the YOLO series.

UAV aerial photography scenes have large backgrounds and small targets, and the targets often occlude one another. This work therefore provides an enhanced YOLOv8 method focused on small targets to reduce false and missed detections throughout the detection phase. The improved YOLOv8 framework consists of three main modules: Backbone, Neck, and Head. It uses adaptive anchor frames to compute the best anchor frame values for different training sets after the image data is preprocessed with Mosaic [28] data enhancement. The Backbone module includes a CBS structure and a C2f structure for extracting feature information from the input image. The Neck module combines the feature pyramid network FPN [29] with the path aggregation network PAN [30]. The FPN passes strong semantic features down from the higher levels to strengthen the semantic information, and the PAN complements the FPN from the bottom up by passing feature information up from the lower levels. The combined PAN-FPN module enhances the network's feature integration and outputs the obtained image features to the Head module. The Head module outputs the classification (cls) and regression (reg) results through two separate heads and finally predicts the category and bounding box of each target from the image features. The structure of the modified network is shown in Fig.4.

A. YOLOv8 BASED ON GAM ATTENTION MECHANISM
The feature extraction module extracts the local features of the input image. Because the targets in UAV images are usually small, relying on local features makes detection more difficult. To improve the model's detection accuracy, the Global Attention Mechanism (GAM) [31] is introduced in the Backbone module. It captures crucial information in all three dimensions and significantly reduces the difficulty of detecting small targets in UAV scenes.

The GAM attention can amplify global cross-dimension interaction features while reducing the loss of critical information. The GAM model uses a novel channel-spatial attention to replace the sub-modules of the original CBAM model. The process of extracting features is shown in Fig.5, and the following equations define the intermediate state and the output for a given input feature map:

$$F_2 = M_c(F_1) \otimes F_1 \tag{1}$$
$$F_3 = M_s(F_2) \otimes F_2 \tag{2}$$

where $M_c$ and $M_s$ denote the channel attention module and the spatial attention module, respectively; $\otimes$ denotes element-wise multiplication; and $F_1$, $F_2$, and $F_3$ denote the input features, intermediate features, and output features, respectively.
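The sequential structure of Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration of how the two gates are chained, not the GAM implementation: the functions `channel_gate` and `spatial_gate` are hypothetical stand-ins (simple sigmoid gates over means) for the MLP-based channel submodule and the convolution-based spatial submodule, and feature maps are plain nested lists in C x H x W layout.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(f):
    """Hypothetical stand-in for M_c: one gate per channel from the channel mean.

    The paper's channel submodule uses a permutation and a two-layer MLP instead.
    """
    gates = [sigmoid(sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))) for ch in f]
    # Broadcast each channel gate over H x W so the map has the same shape as f.
    return [[[g for _ in row] for row in ch] for g, ch in zip(gates, f)]


def spatial_gate(f):
    """Hypothetical stand-in for M_s: one gate per pixel from the mean over channels.

    The paper's spatial submodule uses two convolutional layers instead.
    """
    c, h, w = len(f), len(f[0]), len(f[0][0])
    gate = [[sigmoid(sum(f[k][i][j] for k in range(c)) / c) for j in range(w)]
            for i in range(h)]
    return [gate for _ in range(c)]  # the same H x W gate is applied to every channel


def hadamard(a, b):
    """Element-wise product of two C x H x W nested lists (the operator in Eqs. 1-2)."""
    return [[[x * y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def sequential_attention(f1):
    f2 = hadamard(channel_gate(f1), f1)  # Eq. (1): F2 = Mc(F1) * F1
    f3 = hadamard(spatial_gate(f2), f2)  # Eq. (2): F3 = Ms(F2) * F2
    return f2, f3


# Toy input with C=2, H=2, W=2; the second channel is all zeros.
f1 = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
f2, f3 = sequential_attention(f1)
print(f2[0][0][0], f3[0][0][0])
```

The point of the sketch is the ordering: the spatial gate is computed from the channel-refined features F2 rather than from the raw input F1, and both stages preserve the shape of the feature map.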


FIGURE 3. The structure of the decoupled head is shown above. The computation of the model structure leads to the
classification and regression branches, where the regression branch is represented using the integral form of the
distribution focus loss.

FIGURE 4. The improved structure is mainly composed of three main modules, including Backbone, Neck, and Head, and we add GAM attention to its
backbone layer to amplify the global dimensional target interaction features while reducing the loss of key information. In the Neck structure,
we improve the downsampling of the original framework and make the model more focused on small target detection by adding a small target
detection layer.

1) CHANNEL ATTENTION SUBMODULE
The channel attention submodule utilizes a three-dimensional arrangement to retain information from three dimensions, and then a two-level Multilayer Perceptron (MLP) structured module is used to expand the dependencies of channel-space


FIGURE 5. Shown above is the structural diagram of GAM attention, which borrows the sequential channel spatial
attention mechanism from CBAM attention and redesigns the sub-modules to output the final result for a given
feature map using channel module and spatial module processing respectively.

elements in different dimensions. The module structure is shown in Fig.6.

2) SPATIAL ATTENTION SUBMODULE
In the spatial attention sub-module, to focus on more spatial information, the model uses two convolutional layers to fuse information in space. The structure removes the pooling operation to further protect the local feature information, because max pooling loses some of the spatial information. Although this may increase the number of parameters, it collects spatial information more completely and is less likely to ignore part of the feature map. The module structure is shown in Fig.7.

In UAV target detection scenes, long imaging distances and the motion blur produced by high-speed, low-altitude flight result in small target objects whose features are difficult to capture. A large share of the training is therefore spent on invalid areas, which in turn reduces the training efficiency of the network. The network structure is shown in Fig.8. After the global attention GAM is added to the feature extraction layer of the YOLOv8 target detection model, feature information from different dimensions is collected and fed back to reduce the loss of image feature information during sampling, and the visual representation of the receptive field in each dimension is fully utilized, which improves performance in real acquisition scenarios.

B. STC MODULE IN NECK MULTISCALE FEATURE FUSION
The VisDrone2021 [32] dataset used for the experiments in this paper has a large proportion of small targets. The width-height distribution of the targets in the dataset is shown in Fig.9, where the horizontal and vertical axes are width and height, respectively. The points near the coordinate origin are the densest and darkest in color, which indicates that small targets are the most numerous in the data and fits the research problem of this paper. In UAV aerial photography scenes, most detection algorithms perform better on large and medium targets than on small targets. Therefore, too many small targets in the dataset will prevent the detection algorithm from giving full play to its performance.

In the process of target detection, the shallow network has a small receptive field and weak semantic information, but strong detail representation ability. As the network model becomes larger and deeper, its detail representation ability weakens, so a model that is either too deep or too shallow will reduce the accuracy of target detection. The detection effect of the YOLO series algorithms is reduced by the small size of small target samples, and because the upsampling multiple of YOLOv8 is relatively large, it is difficult for the deeper feature maps to learn the feature information of small targets. In this research, we therefore propose adding a small target detection layer, STC, that connects the shallow feature map with the deep feature map to increase the gathering of semantic information about small targets. With the STC module connecting the deep and shallow networks, the network pays greater attention to small target detection, which can significantly improve the algorithm's detection performance on UAV-captured photos. The STC module is shown in Fig.10. It is derived from the original YOLOv8 structure and is used to solve the loss of feature information caused by the large upsampling multiples of the original YOLOv8. The original model has three detection heads by default, which correspond to feature map sizes of 80*80, 40*40, and 20*20, respectively, and the Head part outputs feature maps at a total of 6 scales for classification and regression. The category prediction branches and box prediction branches of the 3 different scales are spliced and dimensionally transformed. The head of the original YOLOv8 has three detection output layers; we add a fourth upsampling layer and connect it to the output layer to improve the accuracy of small target detection, which corresponds to a detection scale of 160*160. Structurally, the STC structure consists of a C2f module,


FIGURE 6. The channel attention submodule uses a three-dimensional arrangement to preserve three-dimensional
information. It then amplifies the channel-space dependencies across dimensions with a two-layer multilayer
perceptron (MLP).

FIGURE 7. In the spatial attention sub-module, two convolutional layers are used for spatial information fusion to
focus on spatial information. Pooling operations were removed from the sub-module to further preserve feature
mapping and to prevent a significant increase in parameters, group convolution with channel blending was used.

FIGURE 8. As shown in the figure, we added a layer of GAM global attention to the feature extraction layer of the backbone module of YOLOv8 to reduce the loss of image feature information during the sampling process.

FIGURE 9. The width-height distribution of the targets in the dataset is shown in the figure above, and the parameters of the horizontal and vertical axes are width and height, respectively. The dense number of points in the region close to the origin of the coordinates indicates that the majority of targets in the dataset are small.

an upsampling layer, and a general convolutional layer Conv. The 2-fold upsampling changes the feature map size from 80*80 to 160*160, then the concat operation is performed on the channel dimension, and finally the output is processed through the decoupled head. After adding the STC detection layer to the Neck module, the detection accuracy for small targets was greatly improved, although the computational effort of the model increased.

IV. EXPERIMENT
The experimental environment is the Ubuntu 16.04 LTS operating system, and an NVIDIA RTX 2080 Ti GPU with 11GB of video memory is used to run the experiments. Python 3.8, PyTorch 1.11.0 [33], and torchvision 0.12.0 are used as the experimental deep learning framework.

A. DATASET
The VisDrone2021 dataset is utilized in this paper to train, validate, and test the experiments. The AISKYEYE team


gathered the data at Tianjin University's Machine Learning and Data Mining Laboratory in China. The images were collected by drones at different locations and in different environments, with varying objects and densities, and cover a rich variety of categories. The training set contains 6471 images, the validation set 548, and the test set 1610. The 10 categories are pedestrians, bicycles, people, cars, trucks, open tricycles, tricycles, vans, motorcycles, and buses, among which small target objects are predominant.
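As a quick check of the split sizes quoted above, the proportions of the three subsets can be worked out directly. The snippet below uses only the three image counts stated in the text; the dictionary keys are illustrative names.

```python
# Image counts per subset, as stated in the text.
splits = {"train": 6471, "val": 548, "test": 1610}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(total)
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

The training set accounts for roughly three quarters of the images, with the remainder split unevenly between validation and test.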

TABLE 1. Comparison of the results of different specifications of YOLOv8.

FIGURE 10. As shown in the figure, the STC structure consists of two C2f
modules, an upsampling layer, and a general convolutional layer, and the
input image size is changed from 80*80 to 160*160 after 2-fold
upsampling, and then the concat operation is performed in the channel
dimension, and finally, the output of the image is processed through the
decoupling header.
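The shape bookkeeping behind the STC connection (2-fold upsampling from 80*80 to 160*160, followed by concatenation on the channel dimension) can be illustrated with a minimal pure-Python sketch. It assumes nearest-neighbour upsampling and uses small, hypothetical channel counts (4 and 2); the text does not specify the interpolation mode or the channel widths, and the C2f and Conv layers are omitted.

```python
def shape(fmap):
    """Return (C, H, W) of a nested-list feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling: every value is repeated in a 2 x 2 block."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # double the width
            rows.append(wide)
            rows.append(list(wide))                    # double the height
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenate two feature maps along the channel dimension."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


# Deep feature map at 80 x 80 and shallow feature map at 160 x 160.
deep = [[[float(c)] * 80 for _ in range(80)] for c in range(4)]
shallow = [[[0.0] * 160 for _ in range(160)] for _ in range(2)]

fused = concat_channels(upsample2x(deep), shallow)
print(shape(fused))  # (6, 160, 160)
```

Concatenation only works once the deep map has been brought to the spatial size of the shallow map, which is why the upsampling step precedes it.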

B. PARAMETER SETTING AND EVALUATION INDEX
The baseline model we use for improvement is the YOLOv8m model. The data is augmented with Mosaic, SGD is selected as the optimizer, and the performance evaluation metrics of the model are mAP50, Precision, Recall, etc. In the UAV detection scenario, detection accuracy has always been the measure of how well a model performs. In multi-class target detection, the mAP can be calculated from the obtained P-R curve, and the precision and recall are calculated as shown in the following equations:

$$P = \frac{TP}{TP + FP} \tag{3}$$
$$R = \frac{TP}{TP + FN} \tag{4}$$

where TP denotes the number of positive samples that the model predicts as positive, FP denotes the number of negative samples that the model predicts as positive, and FN denotes the number of positive samples that the model predicts as negative. The area enclosed by the plotted P-R curve and the horizontal and vertical axes is the average precision (AP), and the mAP is calculated as shown below:

$$AP = \int_0^1 P(R)\,dR \tag{5}$$
$$mAP = \sum_{i=1}^{N} \frac{AP(i)}{N} \tag{6}$$

where N denotes the total number of categories and AP(i) is the AP value of the ith category.

C. ANALYSIS OF EXPERIMENTAL RESULTS
1) NETWORK MODEL SELECTION
In this paper, the most suitable model is selected by experimentally comparing the five model specifications of YOLOv8, from large to small: X, L, M, S, and N. For the same network structure, the depth and width of each network model grow with model size, and the experimental findings are displayed in Table 1. Params denotes the number of parameters of the model; Depth and Width correspond to the depth and width of the model; GFLOPs is the number of floating point operations required by the model, in billions; mAP is the mean detection precision over all target categories; and FPS denotes the number of images that the network model can process per second and is used to measure the detection speed of the model. The comparison of the experimental results shows that the mAP of the model gradually increases as the depth and width of the network increase, but the number of model parameters and the GFLOPs also increase, and the FPS of the model gradually decreases, slowing down the detection speed of the algorithm. We therefore conclude that the depth and width of the network determine its size, and that as the size increases, the complexity of the model increases, leading to worse real-time performance and longer detection time. YOLOv8n has the smallest network depth and width and the fastest detection speed, but at the cost of detection accuracy (mAP), while YOLOv8x, the version with the highest mAP, has the largest width and depth and the slowest detection speed. Compared with the other model specifications, YOLOv8m balances detection speed and accuracy well. Therefore, this paper selects the balanced YOLOv8m model as the baseline network to improve and to measure the detection effect.

2) ABLATION EXPERIMENTS
The experiments in this paper use the current mainstream dataset for UAV target detection: VisDrone2021. Multiple


TABLE 2. Results of ablation experiments on the VisDrone2021 validation set.

sets of experiments were done in the same experimental the detection of individual component modules and is not a
context to evaluate the impact of the enhanced module on the simple accumulation in accuracy.
baseline model, and the experimental results are detailed in In addition, to verify the location of adding the attention
Table 2. mechanism in the experimental phase and the effect of
Performance of the global attention mechanism. In Table 2, adding multiple layers of attention on the model detection
with the addition of the global attention GAM, the mAP performance. The attention was added after the first layer
improves by 0.6% and the accuracy improves by 2%, and the C2f module, after the first layer plus the second layer
accuracy of the algorithm is improved with the addition of a C2f module, after the first layer and the third layer C2f
small number of parameters to the model. The effectiveness module, as well as before the SPPF module for the method
of global attention on top of the model is verified.

Performance analysis of the small target detection (STC) layer. Comparing the data in Table 2 after adding the STC module with the baseline model shows that the STC module yields a clear overall improvement: accuracy and recall improve by 3.9% and 2.9%, respectively. Because the STC module connects the deep and shallow networks, the network pays more attention to small target recognition, which effectively improves detection performance on UAV-captured photos. The overall mAP improves by 4.1%, and the car category improves by 5.1%. Combining the experimental results in Table 2, we consider the model improvement effective.

Performance analysis of the combined modules. Since each of the above modules brings an improvement, we combined the global attention module and the small target detection module and compared the combined model with the baseline model. mAP, accuracy, and recall improve by 4.4%, 4.2%, and 2.8% over the baseline model, respectively; by 3.8%, 2.2%, and 2.9% over adding the global attention module alone; and by 0.3% and 0.4% over adding the small target detection module alone. Compared with the previous experiments on separate modules, all accuracy increased, successfully verifying that the combined module outperforms

of this paper, respectively; and named G1, G2, G3, G4 in the order of the experiments. The experimental results are shown in Table 3. Comparing the data in the table, the best result is obtained when the attention mechanism is added in front of the SPPF structure, as in this paper. Each added layer of GAM attention increases the number of parameters. The experimental accuracy still shows a small increase, but adding attention to two C2f layers and one SPPF layer gives slightly lower accuracy than adding attention to the SPPF layer alone. The experiments therefore support the attention placement used in this paper.

TABLE 3. Comparison of experimental results for multilayer attention addition.

3) COMPARISON EXPERIMENT
To verify the effectiveness of the algorithm in this paper on the VisDrone2021 dataset, other classical state-of-the-art (SOTA) models were selected for comparison, including the classical YOLO series networks, the anchor-free algorithm CenterNet, the two-stage algorithm Faster-RCNN, and some recently improved algorithms. The algorithms are evaluated by the accuracy of each category and the overall mAP, and the comparison results are shown in Table 4. The proposed algorithm achieves higher detection accuracy than the current SOTA models: its mAP is higher by 15.46%, 12.02%, and 8.19% than that of the classical YOLOv3, YOLOv4, and YOLOX models, respectively, and by 1.98% than that of the popular TPH-YOLOv5. The proposed algorithm achieves 78.5% and 62.5% accuracy in the car and bus categories, respectively, which far exceeds the other models in the same categories. The improved model achieves 39.3% mAP on the VisDrone2021 dataset, which indicates that the network structure performs well on UAV target detection tasks.
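The comparison figures above are stated as improvements "by x%", which can be read either as absolute percentage points or as relative gains. The following minimal sketch assumes the first reading (absolute points on top of the reported 39.3% mAP) and shows what that would imply. The function names are hypothetical, and the implied values are derived under this assumption only; they are not taken from Table 4.

```python
# Sketch: interpret the reported mAP gains as absolute percentage points.
# ASSUMPTION: "improved by 15.46%" means ours_mAP - other_mAP = 15.46 points.
# The implied values below are arithmetic consequences, not Table 4 entries.

def implied_baseline(ours_map: float, gain_points: float) -> float:
    """mAP (%) of the compared model if the gain is an absolute difference."""
    return ours_map - gain_points


def relative_gain(ours_map: float, gain_points: float) -> float:
    """The same gain expressed as a percentage of the compared model's mAP."""
    return 100.0 * gain_points / implied_baseline(ours_map, gain_points)


OURS = 39.3  # mAP (%) of the improved model, as reported in the text
REPORTED_GAINS = {  # gains quoted in the text, in the order given there
    "YOLOv3": 15.46,
    "YOLOv4": 12.02,
    "YOLOX": 8.19,
    "TPH-YOLOv5": 1.98,
}

for name, gain in REPORTED_GAINS.items():
    print(
        f"{name}: implied mAP {implied_baseline(OURS, gain):.2f}%, "
        f"relative gain {relative_gain(OURS, gain):.1f}%"
    )
```

Under the absolute-points reading, the relative gain over a weak baseline is much larger than the quoted number, which is why it helps to state which convention a comparison uses.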

TABLE 4. Comparative analysis of different algorithms on the VisDrone validation set.

FIGURE 11. Visualization of evaluation parameters.

D. VISUALIZATION RESULTS ANALYSIS


Figure 11 plots the evaluation metrics of the improved model on the VisDrone2021 dataset. From left to right and top to bottom, the panels show the precision, the recall, the P-R curve, and the harmonic mean of precision and recall (F1). The analysis shows that the improved model achieves the best detection accuracy while maintaining a comparably high recall, so its predictions are more accurate in comparison.
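As a reminder of how the quantities plotted in Figure 11 relate to each other, the following minimal sketch computes precision, recall, and their harmonic mean from detection counts. The function name and the example counts are hypothetical and are not taken from the experiments in this paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    # Precision: fraction of predicted boxes that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of ground-truth targets that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct detections, 2 false alarms, 8 missed targets.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Sweeping the confidence threshold and recomputing these counts at each setting traces out the P-R curve; the harmonic mean penalizes a model that trades away recall for precision or vice versa.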
FIGURE 12. Visualization results of the improved model for detection on the VisDrone2021 dataset.

As shown in Figure 12, combining global attention with the STC module in a real-world UAV detection scenario gives the model a wider range of attention and a more complete attention region where targets are concentrated. The visualization results show that the upgraded network can capture global interdependencies and increase the model’s feature extraction capability.
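The wider attention comes at a parameter cost. The following rough sketch counts the learnable parameters of one global attention (GAM) block, assuming the commonly used layout described in the GAM reference [31]: a two-layer channel MLP, followed by two 7×7 convolutions each with batch normalization, all sharing a reduction rate. The function name, the channel widths, and the rate of 4 are assumptions for illustration; the configuration used in this paper may differ.

```python
def gam_param_count(channels: int, rate: int = 4) -> int:
    """Approximate learnable parameters of one GAM block (assumed layout).

    ASSUMPTION: input and output widths are equal, hidden width is
    channels // rate, convolutions are ungrouped 7x7 with bias, and each
    convolution is followed by batch normalization (weight + bias).
    """
    hidden = channels // rate

    # Channel attention: Linear(C -> C/r) then Linear(C/r -> C), with biases.
    channel_mlp = (channels * hidden + hidden) + (hidden * channels + channels)

    # Spatial attention: Conv7x7(C -> C/r) + BN, then Conv7x7(C/r -> C) + BN.
    conv1 = 7 * 7 * channels * hidden + hidden
    bn1 = 2 * hidden
    conv2 = 7 * 7 * hidden * channels + channels
    bn2 = 2 * channels

    return channel_mlp + conv1 + bn1 + conv2 + bn2


# Hypothetical feature widths, chosen only to show how the cost scales.
for c in (128, 256, 512):
    print(f"channels={c}: ~{gam_param_count(c):,} parameters")
```

Because the dominant term grows with the square of the channel width, placing the block on a wide, late-stage feature map costs far more than placing it on an early one. This is consistent with the trade-off between accuracy and model size discussed in the text.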


V. CONCLUSION
In this paper, a UAV detection algorithm based on small target detection and an attention mechanism was proposed to address the problem that the small size of targets in UAV inspection imaging leads to unsatisfactory feature extraction and thus degrades the detection of small targets. The algorithm first introduces the global attention mechanism GAM into the backbone network to capture image features in multiple dimensions and ease the difficulty of detecting small targets in UAV scenes. The model then compensates for the semantic information lost during sampling by adding the STC module to connect the shallow and deep structures of the network, which captures global and rich contextual information and effectively improves the detection of small targets. Experimental results show that the mAP, recall, and accuracy of the algorithm in this paper are all improved over the baseline model, and the improved model achieves 39.3% mAP on the VisDrone2021 dataset. Although the method improves the detection of small targets in UAV photography, the number of parameters of the improved model rises substantially, and there is still room to reduce false detections and missed detections of small targets. The next step is to study how to balance accuracy and model size, improving the detection performance on small targets while keeping the model lightweight, in order to better serve real-time industrial UAV inspection and respond effectively to changing scenes.

REFERENCES
[1] X. Zihao, W. Hongyuan, Q. Pengyu, D. Weidong, Z. Ji, and C. Fuhua, “Printed surface defect detection model based on positive samples,” Comput., Mater. Continua, vol. 72, no. 3, pp. 5925–5938, 2022.
[2] B. Jiang, R. Qu, Y. Li, and C. Li, “Survey of object detection in UAV imagery based on deep learning,” Acta Aeronautica et Astronautica Sinica, vol. 42, no. 4, pp. 137–151, 2021.
[3] M. M. Fernandez-Carrobles, O. Deniz, and F. Maroto, “Gun and knife detection based on faster R-CNN for video surveillance,” in Pattern Recognition and Image Analysis. Madrid, Spain: Springer, 2019, pp. 441–452.
[4] J. Cao, H. Cholakkal, R. M. Anwer, F. S. Khan, Y. Pang, and L. Shao, “D2Det: Towards high quality object detection and instance segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11482–11491.
[5] U. Mittal, P. Chawla, and R. Tiwari, “EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models,” Neural Comput. Appl., vol. 35, no. 6, pp. 4755–4774, Feb. 2023.
[6] P. Jiang, D. Ergu, F. Liu, Y. Cai, and B. Ma, “A review of YOLO algorithm developments,” Proc. Comput. Sci., vol. 199, pp. 1066–1073, Jan. 2022.
[7] K. R. Akshatha, A. K. Karunakar, S. B. Shenoy, A. K. Pai, N. H. Nagaraj, and S. S. Rohatgi, “Human detection in aerial thermal images using faster R-CNN and SSD algorithms,” Electronics, vol. 11, no. 7, p. 1151, Apr. 2022.
[8] J.-H. Kim, N. Kim, and C. S. Won, “High-speed drone detection based on YOLO-V8,” in Proc. IEEE Int. Conf. Acoustics, Speech Signal Process. (ICASSP), 2023, pp. 1–2.
[9] Q. Wang, H. Zhang, X. Hong, and Q. Zhou, “Small object detection based on modified FSSD and model compression,” in Proc. IEEE 6th Int. Conf. Signal Image Process. (ICSIP), Oct. 2021, pp. 88–92.
[10] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, “Libra R-CNN: Towards balanced learning for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 821–830.
[11] J.-S. Lim, M. Astrid, H.-J. Yoon, and S.-I. Lee, “Small object detection using context and attention,” in Proc. Int. Conf. Artif. Intell. Inf. Commun. (ICAIIC), Apr. 2021, pp. 181–186.
[12] Z. Feng, Z. Xie, Z. Bao, and K. Chen, “Real time dense small target detection algorithm for UAV based on improved YOLOv5,” Acta Aeronaut. Sin, pp. 1–15, 2022.
[13] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, “TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2778–2788.
[14] H. Fang, M. Xia, G. Zhou, Y. Chang, and L. Yan, “Infrared small UAV target detection based on residual image prediction via global and local dilated residual networks,” IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[15] H. Fang, L. Ding, L. Wang, Y. Chang, L. Yan, and J. Han, “Infrared small UAV target detection based on depthwise separable residual dense network and multiscale feature fusion,” IEEE Trans. Instrum. Meas., vol. 71, pp. 1–20, 2022.
[16] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[17] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[18] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi, “Gather-excite: Exploiting feature context in convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 1–11.
[19] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[20] Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for efficient mobile network design,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13708–13717.
[21] A. G. Roy, N. Navab, and C. Wachinger, “Recalibrating fully convolutional networks with spatial and channel ‘squeeze & excitation’ blocks,” IEEE Trans. Med. Imag., vol. 38, no. 2, pp. 540–549, Feb. 2019.
[22] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “GCNet: Non-local networks meet squeeze-excitation networks and beyond,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1971–1980.
[23] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net: Efficient channel attention for deep convolutional neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11531–11539.
[24] H. Fang, Z. Liao, X. Wang, Y. Chang, and L. Yan, “Differentiated attention guided network over hierarchical and aggregated features for intelligent UAV surveillance,” IEEE Trans. Ind. Informat., vol. 19, no. 9, pp. 9909–9920, Sep. 2023.
[25] Y. Wang, H. Wang, and Z. Xin, “Efficient detection model of steel strip surface defects based on YOLO-V7,” IEEE Access, vol. 10, pp. 133936–133944, 2022.
[26] C.-Y. Wang, A. Bochkovskiy, and H.-Y. Mark Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” 2022, arXiv:2207.02696.
[27] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO series in 2021,” 2021, arXiv:2107.08430.
[28] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, “YOLOv4: Optimal speed and accuracy of object detection,” 2020, arXiv:2004.10934.
[29] B. Pu, Y. Lu, J. Chen, S. Li, N. Zhu, W. Wei, and K. Li, “MobileUNet-FPN: A semantic segmentation model for fetal ultrasound four-chamber segmentation in edge computing environments,” IEEE J. Biomed. Health Informat., vol. 26, no. 11, pp. 5540–5550, Nov. 2022.
[30] G. Wan, H. Fang, D. Wang, J. Yan, and B. Xie, “Ceramic tile surface defect detection based on deep learning,” Ceram. Int., vol. 48, no. 8, pp. 11085–11093, Apr. 2022.
[31] Y. Liu, Z. Shao, and N. Hoffmann, “Global attention mechanism: Retain information to enhance channel-spatial interactions,” 2021, arXiv:2112.05561.


[32] W. Xu, C. Zhang, Q. Wang, and P. Dai, “FEA-swin: Foreground enhancement attention Swin transformer network for accurate UAV-based dense object detection,” Sensors, vol. 22, no. 18, p. 6993, Sep. 2022.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–12.

FENG WANG received the Bachelor of Science degree from Hubei University, in 2019. He is currently pursuing the master’s degree in engineering with Changzhou University. His research interests include computer vision, defect detection, and object detection.

HONGYUAN WANG received the Ph.D. degree in computer science from the Nanjing University of Science and Technology. He is currently a Professor, a Ph.D. Supervisor, and a Senior Member of CCF. His research interests include computer vision, pattern recognition, and intelligent systems.

ZHIYONG QIN received the B.E. degree from the Huaide College, Changzhou University, in 2022, where he is currently pursuing the M.E. degree. His research interests include computer vision and object detection.

JIAYING TANG received the B.E. degree from Tianjin Chengjian University, in 2022. She is currently pursuing the M.E. degree with Changzhou University. Her research interests include computer vision and image segmentation.
