1 Department of Engineering and Technology, Texas A&M University-Commerce, Commerce, TX 75428, USA
2 Department of Computer Science and Information Systems, Texas A&M University-Commerce,
Commerce, TX 75428, USA
* Correspondence: [email protected]; Tel.: +1-903-886-5174
Abstract: The rapidly increasing number of drones in the national airspace, including those for
recreational and commercial applications, has raised concerns regarding misuse. Autonomous drone
detection systems offer a possible solution to the issue of potential drone misuse, such
as drug smuggling, violation of people’s privacy, etc. Detecting drones can be difficult due to similar
objects in the sky, such as airplanes and birds. In addition, automated drone detection systems
need to be trained with ample amounts of data to provide high accuracy. Real-time detection is also
necessary, but this requires highly configured devices such as a graphical processing unit (GPU). The
present study sought to overcome these challenges using a one-shot detector, You Only
Look Once version 5 (YOLOv5); the proposed model was trained with pre-trained weights and
data augmentation. The trained model was evaluated using mean average precision (mAP) and recall
measures. The model achieved a 90.40% mAP, a 21.57% improvement over our previous model that
used You Only Look Once version 4 (YOLOv4) and was tested on the same dataset.
Keywords: YOLOv5; autonomous drone detection; image recognition; machine learning; mAP;
unmanned aerial vehicle (UAV)
1. Introduction
Drones are becoming increasingly popular. Most are inexpensive, flexible, and
lightweight [1]. They are utilized in a variety of industries, including the military, construction,
agriculture, real estate, manufacturing, photogrammetry, sports, and photography [2,3].
There were 865,505 drones registered as of 3 October 2022, with 538,172 of them being
recreational [4]. Drones can take off and land autonomously, intelligently adapt to any
environment, fly to great heights, and provide quick hovering ability and flexibility [5].
Increased usage of drones, on the other hand, poses a threat to public safety; for example,
their capacity to carry explosives may be used to strike public locations, such as governmental
and historical monuments [6]. Drones can also be used by drug smugglers and terrorists.
Moreover, the increasing number of hobbyist drone pilots could result in interference with
activities such as firefighting, disaster response efforts, and so on [7]. A list of threats that
drones currently pose and a discussion of how drones are being weaponized are offered
in [8]. For instance, in April 2021, two police officers in Aguililla, Michoacan, Mexico were
assaulted by armed drones (drones artillados) carrying explosive devices, resulting in multiple
injuries [9]. Thirteen tiny drones attacked Russian soldiers in Syria, causing substantial
damage [10]. Considering the possibility of drones being used as lethal weapons [11],
authorities shut down London Gatwick airport for 18 hours due to a serious drone intrusion,
causing 760 flights with over 120,000 passengers to be delayed [12].
Detecting drones may be difficult due to the presence of similar objects in the sky, such
as aircraft, birds, and so forth. The authors of [13] used a dataset made up of drones and
birds. To create the dataset, they gathered drone and bird videos and extracted images using
the MATLAB image processing tool. After gathering 712 photos to train the algorithms,
they utilized an 80:20 train:test split to randomly choose the training and testing images.
They examined the accuracies of three different object detectors utilizing an Intel Core
i5-4200M (2.5 GHz) CPU, 2 GB of DDR3L memory, and a 1 TB HDD, reaching 93%, 88%, and 80%
accuracy using a CNN, an SVM, and a KNN, respectively. The suggested technique
included drone-like objects, i.e., birds, in the dataset; however, it required 14 minutes and
28 seconds to attain 93% accuracy for just 80 epochs using the CNN methodology. As a
result, their proposed approach was not feasible for real-time implementation.
Our previously proposed technique using fine-tuned YOLOv4 [14] overcame the
speed, accuracy, and model overfitting issues. In that study, we collected 2395 images of
birds and drones from public sources, such as Google, Kaggle, and others. We labeled
the images and divided them into two categories: drones and birds. The YOLOv4 model
was then trained on the Tesla K80 GPU using the Google deep learning VM. To test the
detection speed, we recorded two videos of our own drones at three different heights.
The trained model obtained an FPS of 20.5 and 19.0. The mAP was 74.36%. In terms of
speed and accuracy, YOLOv5 surpassed prior versions of YOLO [1]. In this study, we
evaluated the performance gain of fine-tuned YOLOv5 on the same dataset used in [14]
for drone detection with fine-tuned YOLOv4. YOLOv5 recently demonstrated
improved performance in identifying drones. The authors of [1] presented a method
for detecting drones flying in prohibited or restricted zones. Their deep learning-based
technique outperformed earlier deep learning-based methodologies in terms of precision
and recall.
Our key contributions in this study are the addition of a data augmentation technique
to artificially overcome data scarcity, the prevention of overfitting through a random
70:30 train:test split, the fine-tuning of the original YOLOv5 on our customized dataset,
and the testing of the model on a wide variety of backgrounds (dark, sunny) and on
different views of images. The model was also tested on our own videos of two drones
(a DJI Mavic Pro and a DJI Phantom); the videos were taken at three common altitudes:
60 ft, 40 ft, and 20 ft.
Paper Organization
The rest of the research study is structured as follows. Section 2 provides background
for our research. Section 3 addresses the research materials and methodologies. Section 4
covers the findings of this study. Section 5 discusses the model’s complexity and uncertainty.
Section 6 presents the performance improvement and provides a discussion.
Section 7 brings our paper to a conclusion.
2. Background
In the past, various techniques, such as radar, were used to detect drones [15]. How-
ever, it is very difficult for radar to do so, due to the low levels of electromagnetic signals
that drones transmit [16]. Similarly, other techniques, such as acoustic and radio frequency-
based drone detection, are costly and inaccurate [17]. Recently, machine learning-based
drone detectors, such as SVM and artificial neural network classifiers, have been used to
detect drones, achieving better success than radar and acoustic drone detection systems [18].
The YOLO algorithm has outperformed competitor algorithms, such as the R-CNN and
SSD algorithms, due to its complex feature-learning capability with fast detection [18]. In
fact, the YOLO algorithm is now instrumental in object detection tasks [19]. Many com-
puter vision tasks use YOLO due to its faster detection with high accuracy, which makes
the algorithm feasible for real-time implementation [20]. One of the latest developments,
YOLOv5, has greatly improved the algorithm’s performance, offering a 90% improvement
over YOLOv4 [21]. In the present research, we used YOLOv5 to build an automated drone
detection system and compared the results against our previous system with the YOLOv4.
UAV detection systems are designed using various techniques. We have reviewed
only those studies closely related to our methodology. UAV detection can be treated as an
object detection problem in deep learning. Deep learning-based object detection techniques
can be divided into one-stage and two-stage detection algorithms [22]. An example of a
two-stage object detection technique is R-CNN [23]; examples of one-stage object detection
techniques are YOLO [24], SSD [25], etc. The authors of [26] explained the mechanism of
how object detectors work in general. Two-stage detectors use candidate object techniques,
while one-stage detectors employ the sliding window technique. Thus, one-stage detectors
are fast and operate in real-time [27]. YOLO is easy to train, faster, more accurate than its
competitors, and can immediately train an entire image. Thus, YOLO is the most frequently
used and reliable object detection algorithm [28]. It first divides an image into S×S grids
and assigns a class probability with bounding boxes around the object [28]. It then uses
a single convolutional network to perform the entire prediction. Conversely, R-CNNs
begin by generating a large number of region proposals using a selective search method.
Then, from each region proposal, a CNN is utilized to extract features. Finally, the R-CNN
classifies and defines bounding boxes for distinct classes [28].
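To make the grid-based prediction described above concrete, the sketch below decodes a generic YOLO-style output tensor of shape S×S×(B·5 + C). The grid size, box count, class count, score threshold, and variable names are illustrative assumptions for demonstration only and do not correspond to the exact YOLOv5 head.

```python
import numpy as np

# Illustrative YOLO-style grid decoding (not the exact YOLOv5 head).
S, B, C = 7, 2, 2                        # grid size, boxes per cell, classes (e.g., bird, drone)
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for a network output on one image

boxes = []
for gx in range(S):                      # grid column
    for gy in range(S):                  # grid row
        cell = pred[gy, gx]
        class_probs = cell[B * 5:]       # class probabilities shared by the cell
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
            # x, y are offsets inside the cell; convert to image-relative coordinates
            cx, cy = (gx + x) / S, (gy + y) / S
            score = conf * class_probs.max()          # confidence times class probability
            boxes.append((cx, cy, w, h, score, int(class_probs.argmax())))

# Keep only reasonably confident boxes (non-maximum suppression would follow in a full pipeline).
detections = [bx for bx in boxes if bx[4] > 0.5]
```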
The authors of [28] used YOLOv2 to detect drones and birds, and achieved precision
and recall scores above 90. The authors of [27] proposed a drone detection pipeline with
three different models: faster R-CNN with ResNet–101, faster R-CNN with Inceptionv2, and
SSD. After 60,000 iterations, they achieved mAP values of 0.49, 0.35, and 0.15, respectively.
One example of an SSD object detector is MobileNet. MobileNetV2 was used as a classifier
in [29]; the authors proposed a drone detection model where the methodology consisted of
a moving object detector and a drone-bird-background classifier. The researchers trained
the drone-vs-bird challenge dataset on the NVIDIA GeForce GT 1030 2GB GPU with a
learning rate of 0.05. At an IoU of 0.5, their highest precision, recall, and F1 scores were
0.786, 0.910, and 0.801, respectively, after testing on three videos. The authors of [30] used
YOLOv3 to detect and classify drones. The authors of [30] collected different types of drone
images from the internet and videos to build a dataset. Images were annotated in the YOLO
format in order to train a YOLOv3 model. An NVIDIA GeForce GTX 1050 Ti GPU was used
to train the dataset with chosen parameter values, such as a learning rate of 0.0001, batch
size of 64, and 150 total epochs. The best mAP value was 0.74. PyTorch, an open-source
machine learning framework, was used to train and test the YOLOv3 model.
The authors of [31] used YOLOv4 to automatically detect drones in order to integrate
a trained model into a CCTV camera, thus reducing the need for manual monitoring.
The authors collected their dataset from public resources such as Google images, open-
source websites, etc. The images were converted into the YOLO format using free and
paid image annotation tools. They fine-tuned the YOLOv4 architecture by customizing
filters, max batches, subdivisions, batches, etc. After training the YOLOv4 model for
1300 iterations, the researchers achieved a mAP of 0.99. Though their mAP value was very
high, they trained on only 53 images and did not address model overfitting, leaving
considerable room for improvement.
The authors of [1] presented an approach based on YOLOv5. They utilized a dataset
of 1359 drone images obtained from Kaggle. They fine-tuned the model on a local system
with an 8 GB NVIDIA RTX 2070 GPU, 16 GB of RAM, and a 1.9 GHz CPU. They employed a
60:20:20 split of the dataset for training, testing, and validation. They trained the model on
top of COCO pre-trained weights and obtained a precision of 94.70%, a recall of 92.50%,
and a mAP of 94.1%.
In Equation (1), $x$ and $y$ denote the $y$th bounding box of the $x$th grid. $\cup_{x}^{y}$ is the probability
score for the $y$th bounding box of the $x$th grid. $\mathcal{P}_{x,y}$ equals 1 when there is a target
and 0 when there is no target in the $y$th bounding box. $IoU_{pred}^{truth}$ is the IoU between
the ground truth and the predicted class. Higher IoUs mean more accurately predicted
bounding boxes.
The loss function of YOLOv5 is the combination of loss functions for the bounding
box, classification, and confidence. Equation (2) represents the overall loss function of
YOLOv5 [32]:

$loss_{YOLOv5} = loss_{bounding\ box} + loss_{classification} + loss_{confidence}$   (2)

$loss_{bounding\ box}$ is calculated using Equation (3):

$loss_{bounding\ box} = \lambda_{if} \sum_{a=0}^{b^2} \sum_{c=0}^{d} E_{a,c}^{hg} \, (2 - K_a \times n_a) \left[ (x_a - x'_a)^2 + (y_a - y'_a)^2 + (w_a - w'_a)^2 + (h_a - h'_a)^2 \right]$   (3)

In Equation (3), the width and height of the target object are denoted by $h'$ and $w'$.
$x_a$ and $y_a$ indicate the coordinates of the target object in an image. Finally, the indicator
function ($\lambda_{if}$) shows whether the bounding box contains the target object.
$loss_{classification} = \lambda_{classification} \sum_{a=0}^{b^2} \sum_{c=0}^{d} E_{a,c}^{g} \sum_{C \in c} L_a(c) \log(\hat{L}_a(c))$   (4)

$loss_{confidence} = \lambda_{confidence} \sum_{a=0}^{b^2} \sum_{c=0}^{d} E_{a,c}^{hg} (c_i - c_l)^2 + \lambda_{confidence}^{g} \sum_{a=0}^{b^2} \sum_{c=0}^{d} E_{a,c}^{g} (c_i - c_l)^2$   (5)

In Equations (4) and (5), $\lambda_{confidence}$ indicates the category loss coefficient, $\lambda_{classification}$
the classification loss coefficient, $c_l$ the class, and $c$ the confidence score.
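To illustrate how the total loss in Equation (2) is assembled during training, a minimal sketch is given below. The component weights are placeholders modeled loosely on the gain hyperparameters exposed by the public YOLOv5 implementation; they are assumptions for demonstration, not the exact values used in this study.

```python
import torch

def total_loss(loss_box: torch.Tensor,
               loss_cls: torch.Tensor,
               loss_conf: torch.Tensor,
               w_box: float = 0.05,
               w_cls: float = 0.5,
               w_conf: float = 1.0) -> torch.Tensor:
    """Combine the three loss components as in Equation (2).

    The weights are illustrative; the public YOLOv5 hyperparameter files expose
    similar gains ('box', 'cls', 'obj'), but the values here are assumptions.
    """
    return w_box * loss_box + w_cls * loss_cls + w_conf * loss_conf

# Example usage with dummy per-batch loss values
loss = total_loss(torch.tensor(0.12), torch.tensor(0.04), torch.tensor(0.08))
print(float(loss))
```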
shortage issue. We set the parameters as follows: flip: horizontal, hue: between −25 degrees
and +25 degrees, cutout: 3 boxes with 10% size each, and mosaic: 1. Data augmentation
ensured data variability, artificially generating 5749 total images, and the entire dataset was
randomly split into a 70:30 train:test split. Figure 2 shows the augmented dataset. We trained
the model for 4000 iterations and saved the best weight to test the model using the testing
images and videos. During training, we used %tensorboard to log the runs, which
autogenerated the learning curves in order to evaluate the model’s performance beyond
the evaluation metrics. Figure 3 shows a flowchart of the overall conducted experiment.
Figure 2. Augmented dataset.
Figure 3. Overall conducted experiment flowchart.
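The augmentation above was configured through the Roboflow interface. As a rough, non-authoritative approximation, the sketch below reproduces comparable transforms with Albumentations and performs a random 70:30 train:test split; the parameter mapping (hue shift of ±25, three roughly 10%-sized cutout boxes, the hole sizes in pixels) is an assumption, and mosaic is normally enabled inside YOLOv5's own training hyperparameters (mosaic: 1.0) rather than applied offline.

```python
import random
import albumentations as A

# Approximate equivalents of the Roboflow settings used above (assumed mapping):
# horizontal flip, hue shift of +/-25, and three cutout boxes of roughly 10% size each.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.HueSaturationValue(hue_shift_limit=25, sat_shift_limit=0, val_shift_limit=0, p=0.5),
        A.CoarseDropout(max_holes=3, max_height=64, max_width=64, p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
# Usage: augment(image=img, bboxes=yolo_boxes, class_labels=labels)

def split_dataset(image_paths, train_ratio=0.7, seed=42):
    """Randomly split a list of image paths into train and test subsets (70:30 here)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```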
4. Results
We evaluated the trained model using the mAP, precision, recall, and F1-scores. We
used FPS as the evaluation metric to evaluate the speed of detection in the videos. Table 1
shows the mAP, precision, recall, and F1-scores. The model was evaluated on a testing
dataset from a random train:test split. The testing images had data variabilities in terms
of different backgrounds (e.g., bright, dark, blur, etc.) and weather conditions (e.g., cloudy,
sunny, foggy, etc.), as well as images with multiple classes. To track the evaluation metrics,
we plotted the values across iterations. Figure 4 shows the overall training summary of
the model. The loss curves indicate a downward trend, meaning that during training, the
losses were minimized both for training and validation. The metrics curves show upward
trends, meaning the performance of the model improved over the iterations during training.
We plotted the precision-recall curve to evaluate the model’s prediction preciseness (see
Figure 5). The curve tended towards the right top corner, meaning that the values were
mostly close to one (i.e., the rate of misclassification was very low when using this model).
Table 1. Overall and individual evaluation metrics results.

Class     Precision     Recall     mAP50
All       0.918         0.875      0.904
Bird      0.860         0.766      0.820
Drone     0.975         0.985      0.987
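The F1 scores reported elsewhere in the paper follow from the precision and recall values in Table 1 as their harmonic mean; a minimal check:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values taken from Table 1
for name, p, r in [("All", 0.918, 0.875), ("Bird", 0.860, 0.766), ("Drone", 0.975, 0.985)]:
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
# The "All" row gives approximately 0.896, matching the F1 score reported in the Discussion.
```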
Figure 6. Drone predictions for test images.
Figure 7. At 20 ft, DJI Mavic Pro.
Figure 8. At 40 ft, DJI Mavic Pro.
Figure 9. At 60 ft, DJI Mavic Pro.
Figure 11. At 40 ft, DJI Phantom III.
Figure 12. At 60 ft, DJI Phantom III.
Appendix A contains more predictions based on the trained YOLOv5 model. While
the model worked well on the majority of the test images, there were a few instances of
misclassification. Figures A7 and A8 show two misclassifications in which the model
misidentified certain drone-like objects as drones alongside correct predictions in these
images. Blurred photos might be one of the causes of such misclassification. We can address
this problem by employing more training photos, which is outside the scope of this study.
The prediction confidence scores of these misclassifications were poor, hovering around 10%.
We could establish a confidence score threshold to avoid such misclassification while increasing
the number of training images. There were just a few “bird” classes. Figure A3 depicts an
example of correct “drone” and “bird” predictions. However, because of the uncertainty of
both classes in a single video frame, we were unable to do any drone and bird detection
in videos.
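One mitigation noted above is a confidence score threshold. A minimal sketch of such post-filtering over generic (class, confidence, box) detections is shown below; the 0.25 threshold and the detection format are illustrative assumptions, not settings used in this study.

```python
from typing import List, Tuple

# (class name, confidence, box as x, y, w, h) -- an assumed detection format
Detection = Tuple[str, float, Tuple[float, float, float, float]]

def filter_by_confidence(detections: List[Detection], threshold: float = 0.25) -> List[Detection]:
    """Drop low-confidence detections, such as the ~10% misclassifications noted above."""
    return [d for d in detections if d[1] >= threshold]

# Example: a spurious 0.10-confidence "drone" is removed, a 0.92 detection is kept.
dets = [("drone", 0.92, (0.5, 0.5, 0.2, 0.1)), ("drone", 0.10, (0.1, 0.1, 0.05, 0.05))]
print(filter_by_confidence(dets))
```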
5. Model Complexity and Parameter Uncertainty
To enable quicker prediction, we employed YOLOv5, which mainly relies on GPU
implementation. GPU implementation complicates CPU deployments. Data augmentation
techniques such as rotation and flipping were used to artificially supplement the dataset
for improved training and performance. The parameter uncertainty in our experiment
included sampling errors, overfitting, and so forth. Too many samples from one class may
create sampling error, whereas training a smaller number of images with a higher number
of parameters may result in overfitting. We used pre-trained model weights that were trained
on the COCO dataset, and we trained our fine-tuned model on top of the pre-trained weights.
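For context, fine-tuning on top of COCO pre-trained weights with the public ultralytics/yolov5 repository is typically launched as in the sketch below. The dataset YAML name, image size, batch size, and epoch count are placeholders rather than the exact settings of this study.

```python
import subprocess

# Launch YOLOv5 fine-tuning from a clone of the ultralytics/yolov5 repository.
# 'drone_bird.yaml' is a hypothetical dataset file listing the two classes.
subprocess.run(
    [
        "python", "train.py",
        "--img", "640",
        "--batch", "16",
        "--epochs", "100",
        "--data", "drone_bird.yaml",
        "--weights", "yolov5s.pt",   # COCO pre-trained checkpoint
    ],
    check=True,
)
```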
6. Discussion
Using deep learning for the detection of drones has become a common topic in the
research community, due to the substantial importance of restricting drones in unauthorized
regions; however, improvement is still needed. The authors of [30] proposed a drone
detection methodology using deep learning, employing YOLOv3 to detect and classify
drones. More than 10,000 images of different categories of drones were used to train the
algorithm, and a mAP of 0.74 was achieved at the 150th epoch. Though they used a YOLO-based
approach, their study did not consider testing the model using videos, different weather
conditions, and backgrounds; most importantly, they did not test their model using images
of drone-like objects. The authors of [33] used deep learning-based techniques and Faster
R-CNN on a dataset created from videos collected by the researchers. The following image
augmentation techniques were employed: geometric transformation, illumination variation,
and image quality. The researchers did not calculate the mAP values and instead plotted a
precision-recall (AUC) curve to evaluate the performance. Using a synthetic dataset, their
model achieved an overall AUC score of 0.93; for a real-world dataset, their model achieved
an overall AUC score of 0.58. The dataset was trimmed from video sequences, and thus
had no objects much of the time. In our previous research, we analyzed the performance of
our proposed methodology using YOLOv4 and showed that the proposed methodology
outperformed existing methodologies in terms of mAP, precision, recall, and F-1 scores.
Using YOLOv4, we were able to achieve a mAP of 0.7436, precision of 0.95, and recall
of 0.68. Most importantly, we included another evaluation metric, FPS, to evaluate the
performance, achieving an average FPS of 20.5 for the DJI Phantom III videos and 19.0 FPS
for the DJI Mavic Pro videos, all at three different high altitudes (i.e., 20 ft, 40 ft, and
60 ft). We tested the model using a highly variable dataset with different backgrounds (e.g.,
sunny, cloudy, dark, etc.), various drone angles (e.g., side view, top view, etc.), long-range
drone images, and multiple objects in a single image. Our previous methodology achieved
such an improvement due to the real-time detection capability of YOLOv4 acting as a
single-stage detection process, and the various new features of YOLOv4 (e.g., CSP, CmBN,
mish activation, etc.), which sped up detection. Furthermore, the default MOSAIC = 1 flag
automatically performed the data augmentation. In this research, we employed Google
CoLab and Google Deep Learning VM for parts of the training and testing. Compared with
YOLOv4, YOLOv5 showed a further performance improvement, as shown in [1]. They obtained
a precision of 0.9470, a recall of 0.9250, and a mAP of 0.9410. Although their evaluation
metrics were higher than ours, our dataset was bigger. Furthermore, we had binary classes,
whereas they just had a “drone” class. They did not employ data augmentation, whereas
we used a data augmentation technique to build a collection of over 5700 images. As a result
of the variability in the dataset and the addition of new classes with data augmentation,
our suggested technique is resilient and scalable in real-world scenarios.
Our results for the present research outperformed our previous methodology, achiev-
ing a mAP of 0.904. Because of the lightweight design, YOLOv5 recognized objects faster
than YOLOv4. YOLOv4 was created using darknet architecture; however, YOLOv5 is built
with a PyTorch framework rather than a darknet framework. This is one of the reasons
we obtained more accuracy and speed than earlier methodologies. In addition to the
architecture itself, we fine-tuned the last layers of the original YOLOv5 architecture so that
it performed better on our customized dataset. Other than the layer tuning, we customized
the default values of the learning rate, momentum, batch size, etc. We trained the model for
100 iterations since we trained the custom dataset on top of the transferred weights for the
COCO dataset. In addition to mAP, we achieved a precision of 0.918, recall of 0.875, and F-1
score of 0.896. In terms of F-1 score and recall, we also outperformed the previous model.
We further tested the new model on two videos, using a Tesla T4 GPU. For the DJI Mavic
Pro, we achieved a maximum of 23.9 FPS, and for the DJI Phantom III, a maximum of
31.2 FPS. Thus, in terms of inference speed, we also outperformed the previous
model’s performance. We achieved this improvement due to the new feature additions
included in YOLOv5, such as the CSPDarknet53 backbone, which resolved the gradient
issue using fewer parameters, and thus was more lightweight. Other helpful feature ad-
ditions included the fine-tuning of YOLOv5 for our custom dataset, data augmentation
performed to artificially increase the number of images, and data preprocessing to make
training the model smoother and faster. The evaluation metric F1 score is the harmonic mean
of precision and recall. Precision is the accuracy of positive class predictions, whereas recall
is the proportion of true positive classes that are detected. The greater the F1 score, the better
the model in general. The correctness of the bounding boxes around objects is measured by mAP,
and the greater the value, the better. The speed of object detection is measured in frames per
second (FPS). Table 2 compares the performance of the previous and proposed models in terms of
five evaluation metrics: precision, recall, F1 score, mAP50, and FPS.
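FPS can be estimated by timing per-frame inference on a test video. The sketch below loads a trained checkpoint through torch.hub and times detection with OpenCV; the checkpoint path and video filename are hypothetical placeholders.

```python
import time
import cv2
import torch

# Load a trained YOLOv5 checkpoint via torch.hub (path is hypothetical).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

cap = cv2.VideoCapture("dji_mavic_pro_60ft.mp4")  # hypothetical test video
frames, start = 0, time.time()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    model(frame[:, :, ::-1])   # convert BGR to RGB and run detection on one frame
    frames += 1
cap.release()

elapsed = time.time() - start
print(f"Average FPS: {frames / elapsed:.1f}")
```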
7. Conclusions
In this research, we compared the performance of one of the latest versions of YOLO,
YOLOv5, to our previously proposed drone detection methodology that used YOLOv4. To
make a fair comparison, we employed the same dataset and the same computing configu-
rations (e.g., GPU). We first fine-tuned the original YOLOv5, as per our customized dataset
that had two classes: bird and drone. We further tuned the values of the hyperparameters
(e.g., learning rate, momentum, and decay) to improve the detection accuracy. In order to
speed up the training, we used transfer learning, implementing the pre-trained weights
provided with the original YOLOv5. The weights were trained on a popular and commonly
used dataset called MS COCO. To address data scarcity and overfitting issues, we used data
augmentation via Roboflow API and included data preprocessing techniques to smoothly
train the model. To evaluate the model’s performance, we calculated the evaluation metrics
on a testing dataset. We used precision, recall, F-1 score, and mAP, achieving 0.918, 0.875,
0.896, and 0.904 values, respectively. We outperformed the previous model’s performance
by achieving higher recall, F-1 score, and mAP values (a 21.57% improvement in mAP).
Furthermore, we tested the speed of detection on videos of two different drone models,
the DJI Mavic Pro and the DJI Phantom III. We achieved maximum FPS values of 23.9
and 31.2, respectively, using an NVIDIA Tesla T4 GPU. The videos were taken at three
altitudes—20 ft, 40 ft, and 60 ft—to test the capability of the detector for objects at high
altitudes. In future work, we will use different versions of YOLO and larger datasets. In
addition, other algorithms for object detection will be included to compare the performance.
Various drone-like objects such as airplanes will be added as classes alongside birds to
improve the model’s ability to distinguish among similar objects.
Author Contributions: Conceptualization, S.S. and B.A.; methodology, B.A. and S.S.; software, S.S.;
validation, B.A. and S.S.; formal analysis, S.S and B.A.; investigation, B.A. and S.S.; resources, S.S.;
data curation, S.S.; writing—original draft preparation, S.S.; writing—review and editing, B.A.;
visualization, S.S.; supervision, B.A.; project administration, S.S. and B.A.; funding acquisition, B.A.
All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets used or analyzed during the current study are available
from the corresponding author upon reasonable request.
Conflicts of Interest: We declare that there is no conflict of interest.
Abbreviations
AUC Area under the ROC Curve
CCTV Closed-Circuit Television
CLAHE Contrast Limited Adaptive Histogram Equalization
CNN Convolutional Neural Network
COCO Common Objects in Context
CPU Central Processing Unit
CSPNet Cross-Stage Partial Network
Appendix A
In a variety of photos, our classifier effectively identified drone and bird objects. We
evaluated images with intricate backgrounds and various climatic conditions. Here we present
the detection results, where the images are displayed together with their corresponding
class names and class probabilities. YOLOv5 generated the predictions in batches. Thus,
predictions are shown all in one figure. Additionally, we tested images that have “drone”
and “bird” in one image. In the augmented training images, 0 refers to “bird” and 1 refers
to “drone”.
Figure A1.
Figure A1. First
First batch
batch prediction
prediction by
byYOLOv5.
YOLOv5.
Eng 2023,44,
Eng2023, 4, 429
15
Eng 2023, 15
Eng 2023, 4, 15
FigureA2.
Figure
Figure A2.Second
Secondbatch
batchprediction
predictionby
byYOLOv5.
YOLOv5.
Figure A2.
A2. Second
Second batch
batch prediction
prediction by
by YOLOv5.
YOLOv5.
Figure A3.
Figure A3. Bird
Bird and
and drone
drone in
in images
images predicted
predicted by
by YOLOv5.
YOLOv5.
Figure A3.Bird
FigureA3. Birdand
anddrone
dronein
inimages
imagespredicted
predictedby
byYOLOv5.
YOLOv5.
Figure A4.
Figure A4. First
First batch
batch of
of augmented
augmented training
training image
image predicted
predicted by
by YOLOv5.
YOLOv5.
FigureA4.
Figure A4.First
Firstbatch
batchof
ofaugmented
augmentedtraining
trainingimage
imagepredicted
predictedby
byYOLOv5.
YOLOv5.
Eng 2023,44,
Eng2023, 430
16
Eng 2023, 4, 16
FigureA5.
Figure A5.Second
Secondbatch
batchof
ofaugmented
augmentedtraining
trainingimage
imagepredicted
predictedby
byYOLOv5.
YOLOv5.
Figure A5. Second batch of augmented training image predicted by YOLOv5.
FigureA7.
Figure A7.Instance1
Instance1of
ofmisclassified
misclassifiedimage
imagepredicted
predictedby
byYOLOv5.
YOLOv5.
Figure A7. Instance1 of misclassified image predicted by YOLOv5.
11. Laksham, K.B. Unmanned aerial vehicle (drones) in public health: A SWOT analysis. J. Fam. Med. Prim. Care 2019, 8, 342–346.
[CrossRef] [PubMed]
12. Xun, D.T.W.; Lim, Y.L.; Srigrarom, S. Drone detection using YOLOv3 with transfer learning on NVIDIA Jetson TX2. In Proceedings
of the 2021 Second International Symposium on Instrumentation, Control, Artificial Intelligence, and Robotics (ICA-SYMP),
Bangkok, Thailand, 20–22 January 2021; pp. 1–6.
13. Mahdavi, F.; Rajabi, R. Drone Detection Using Convolutional Neural Networks. In Proceedings of the 2020 6th Iranian Conference
on Signal Processing and Intelligent Systems (ICSPIS), Mashhad, Iran, 23–24 December 2020; pp. 1–5. [CrossRef]
14. Singha, S.; Aydin, B. Automated Drone Detection Using YOLOv4. Drones 2021, 5, 95. [CrossRef]
15. Jian, M.; Lu, Z.; Chen, V.C. Drone detection and tracking based on phase-interferometric Doppler radar. In Proceedings of the
2018 IEEE Radar Conference (RadarConf18), Oklahoma City, OK, USA, 23–27 April 2018; pp. 1146–1149. [CrossRef]
16. Elsayed, M.; Reda, M.; Mashaly, A.S.; Amein, A.S. Review on Real-Time Drone Detection Based on Visual Band Electro-Optical
(EO) Sensor. In Proceedings of the 2021 Tenth International Conference on Intelligent Computing and Information Systems
(ICICIS), Cairo, Egypt, 5–7 December 2021; pp. 57–65. [CrossRef]
17. Shi, Q.; Li, J. Objects Detection of UAV for Anti-UAV Based on YOLOv4. In Proceedings of the 2020 IEEE 2nd International
Conference on Civil Aviation Safety and Information Technology (ICCASIT), Weihai, China, 14–16 October 2020; pp. 1048–1052.
[CrossRef]
18. Taha, B.; Shoufan, A. Machine Learning-Based Drone Detection and Classification: State-of-the-Art in Research. IEEE Access 2019,
7, 138669–138682. [CrossRef]
19. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for
Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Com-
puter Vision (ICCV) Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. Available online: https:
//openaccess.thecvf.com/content/ICCV2021W/VisDrone/html/Zhu_TPH-YOLOv5_Improved_YOLOv5_Based_on_
Transformer_Prediction_Head_for_Object_ICCVW_2021_paper.html (accessed on 30 November 2022).
20. Quoc, H.N.; Hoang, V.T. Real-Time Human Ear Detection Based on the Joint of Yolo and RetinaFace. Complexity 2021,
2021, e7918165. [CrossRef]
21. Karthi, M.; Muthulakshmi, V.; Priscilla, R.; Praveen, P.; Vanisri, K. Evolution of YOLO-V5 Algorithm for Object Detection:
Automated Detection of Library Books and Performace validation of Dataset. In Proceedings of the 2021 International Conference
on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 24–25 September
2021; pp. 1–6. [CrossRef]
22. Liu, L.; Ke, C.; Lin, H.; Xu, H. Research on Pedestrian Detection Algorithm Based on MobileNet-YoLo. Comput. Intell. Neurosci.
2022, 2022, e8924027. [CrossRef] [PubMed]
23. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp.
580–587. Available online: https://openaccess.thecvf.com/content_cvpr_2014/html/Girshick_Rich_Feature_Hierarchies_2014
_CVPR_paper.html (accessed on 30 November 2022).
24. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. Available online: https://openaccess.thecvf.
com/content_cvpr_2017/html/Redmon_YOLO9000_Better_Faster_CVPR_2017_paper.html (accessed on 30 November 2022).
25. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of
the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M.,
Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
26. Saqib, M.; Khan, S.D.; Sharma, N.; Blumenstein, M. A study on detecting drones using deep convolutional neural networks. In
Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce,
Italy, 29 August–1 September 2017; pp. 1–5. [CrossRef]
27. Nalamati, M.; Kapoor, A.; Saqib, M.; Sharma, N.; Blumenstein, M. Drone Detection in Long-Range Surveillance Videos. In
Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei,
Taiwan, 18–21 September 2019; pp. 1–6. [CrossRef]
28. Aker, C.; Kalkan, S. Using deep networks for drone detection. In Proceedings of the 2017 14th IEEE International Conference on
Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [CrossRef]
29. Seidaliyeva, U.; Akhmetov, D.; Ilipbayeva, L.; Matson, E.T. Real-Time and Accurate Drone Detection in a Video with a Static
Background. Sensors 2020, 20, 3856. [CrossRef] [PubMed]
30. Behera, D.K.; Raj, A.B. Drone Detection and Classification using Deep Learning. In Proceedings of the 2020 4th International
Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 13–15 May 2020; pp. 1012–1016. [CrossRef]
31. Mishra, A.; Panda, S. Drone Detection using YOLOV4 on Images and Videos. In Proceedings of the 2022 IEEE 7th International
conference for Convergence in Technology (I2CT), Mumbai, India, 7–9 April 2022; pp. 1–4. [CrossRef]
Eng 2023, 4 433
32. Xu, Q.; Zhu, Z.; Ge, H.; Zhang, Z.; Zang, X. Effective Face Detector Based on YOLOv5 and Superresolution Reconstruction.
Comput. Math. Methods Med. 2021, 2021, e7748350. [CrossRef] [PubMed]
33. Chen, Y.; Aggarwal, P.; Choi, J.; Kuo, C.-C.J. A deep learning approach to drone monitoring. In Proceedings of the 2017 Asia-Pacific
Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15
December 2017; pp. 686–691. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.