Research of Effective UAV Detection Using Acoustic Data Recognition
Supervisors
Candidate of Technical Sciences, Associate Professor
L.B. Ilipbayeva
PhD, Professor
E.T. Matson
(Purdue University)
Republic of Kazakhstan
Almaty, 2023
CONTENTS
NORMATIVE REFERENCES……………………………………………. 4
SYMBOLS AND ABBREVIATIONS…………………………………….. 5
INTRODUCTION…………………………………………………………… 6
1 STATE OF THE ART: UAV DETECTION WITH ACOUSTIC DATA 10
1.1 UAV detection systems………………………………………………. 10
1.2 Acoustic data-based UAV detection…………………………………. 12
1.2.1 The Role of Classification for UAV Acoustic Data Recognition……. 14
1.3 Related works on UAV sound detection and classification methods… 15
1.3.1 Pre-processing methods of the UAV acoustic data recognition system 18
1.3.2 Machine Learning algorithms for UAV acoustic data recognition…… 20
1.3.3 Deep learning algorithms for acoustic data recognition……………… 21
1.4 Problem Statement: The protection system for strategic areas from
unidentified UAVs based on acoustic recognition…………………… 27
1.4.1 Suspicious UAVs with high-risk cases: Loaded and Unloaded UAVs 27
1.4.2 UAV distance Identification………………………………….……… 28
1.4.3 Multiple model UAV recognition……………………………………. 28
2 UAV ACOUSTIC DATA PREPARATION…………………………….. 29
2.1 UAV sound recording in different positions and models…………… 29
3 MATHEMATICAL VIEW ON THE SIGNAL PRE-ANALYSIS STEP
IN TIME AND FREQUENCY DOMAINS……………………….………. 32
3.1 Foundational principle of sound data representation……………….. 32
3.2 Acoustic Data in Time Domain……………………………………… 34
3.3 Short-Time Fourier-Transform (STFT)……………………………… 36
3.4 Mel-Scale Spectrograms…………………………………………….. 39
3.5 An efficient signal processing proposal: the KAPRE method………. 40
4 DEEP LEARNING METHODS FOR UAV ACOUSTIC DATA
RECOGNITION……………………………………………………………. 42
4.1 Convolutional Neural Networks (CNNs) in Sound Recognition
Problems…………………………………………………………….. 43
4.2 Recurrent Neural Networks (RNNs) in Sound Recognition………... 45
4.2.1 Simple Recurrent Neural Networks (RNNs) in Sound Recognition... 45
4.2.2 Long Short-Term Memory (LSTM) for sound recognition…..… 46
4.2.2.1 Bidirectional Long Short-Term Memory (LSTM)……………….…. 49
4.2.3 Gated Recurrent Neural Networks (GRU) for Sound Recognition..... 50
5 REAL-TIME UAV ACOUSTIC DATA RECOGNITION AND
CLASSIFICATION SYSTEM……………………………………………… 52
5.1 The proposed real-time Drone Sound recognition system…………. 52
5.1.1 Adaptation of UAV Sound Recordings for Real Time System……. 52
5.1.2 Processing of UAV acoustic signals using the KAPRE method:
Melspectrogram……………………………………………...……… 56
5.1.3 Real-time and RNN network-based UAV sound recognition
architecture…………………………………………………….……. 58
5.2 Results and discussion of the Proposed System…………………….. 62
CONCLUSION………………………………………………………………. 71
REFERENCES……………………………………………………….……… 72
APPENDIX A – Model and layers of the CNN algorithm based on the
publication……………………….……………………………………………. 78
APPENDIX B – Visualization of the Stacked BiLSTM-CNN model
presented in publication.………………… ………………………………….. 79
APPENDIX C – Composition of the initial dataset of the study……………. 80
APPENDIX D – Studying the sounds of background objects and UAVs with
6 classes in the Time domain…………………...……………………………. 81
APPENDIX E – Experimental studies at the stage of audio data adaptation... 82
APPENDIX F – Plot of the power level of UAV sound signals……………….….. 83
APPENDIX G – Investigation of spectrograms with various
hyperparameters during the experiment ……………………………………… 84
APPENDIX H – Implementation of the proposed system in the Python
program……………………………………………………………………….. 85
APPENDIX I – Confusion matrix from an experiment on recognizing UAVs
at close range and a certain state………………………………...……………. 86
APPENDIX K – Conducting experimental studies at international research
institutions…………………………………………………………..………… 87
APPENDIX L – Publication of experimental studies at the conference…….. 90
APPENDIX M – Minutes on the acceptance of a scientific project under
"Zhas Galym 2022-2024"……………………………………………………... 91
NORMATIVE REFERENCES
SYMBOLS AND ABBREVIATIONS
INTRODUCTION
Methods of the research. The research of this thesis was carried out on the
basis of a combination of analytical and empirical methods. In particular, the
experimental approach was employed to collect UAV sounds for the study's first
objective. Additionally, Fast Fourier analysis, the Short-Time Fourier Transform, and Mel
spectrogram filters were employed to analyze the gathered audio signals. Moreover,
Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) deep
learning methods were extensively used to achieve the last objective.
The scientific novelty of the work.
The novelty of this study is the development of an architecture for a UAV
acoustic data recognition system with the integration of a modified Mel spectrogram.
The theoretical and practical significance of the work. In this dissertation
work, types of recurrent neural networks for recognition of UAV acoustic data were
extensively investigated. The proposed system is recommended for national security
systems, in particular the security of people, densely populated areas, airports,
government buildings, kindergartens, schools, universities, national borders, customs
and strategic places.
Research publications:
1. Multi-label UAV sound classification using Stacked Bidirectional LSTM //
2020 Fourth IEEE International Conference on Robotic Computing (IRC), (Taichung,
2020. – P. 453-458).
2. Stacked BiLSTM - CNN for Multiple label UAV sound classification //
2020 Fourth IEEE International Conference on Robotic Computing (IRC), (Taichung,
2020. – P. 470-474).
3. Effectiveness of the System of Unmanned Aerial Vehicles Detection on the
Basis of Acoustic Signature // Vestnik KazNRTU. – 2020. –
Vol. 4, Issue 140. – P. 300-307 (ISSN 1680-9211).
4. Investigation of Acoustic Signals in UAV Detection Tasks for Various
Models (2021-08-17).
5. Survey on Different Drone Detection methods in the Restricted Flight Areas
// Vestnik KazNRTU. – 2019 (ISSN 1680-9211).
6. Practical Study of Recurrent Neural Networks for Efficient Real-Time
Drone Sound Detection: A Review // Drones. – 2023. – № 7. – P. 26.
Acknowledgments. I would like to thank all who supported me with this
research and the writing of this research thesis, especially Professors L. Ilipbayeva
and E. Matson, who guided me toward a systematic approach to experimentation,
Professor John S. Gallagher, who originally provided the UAV audio data, and my
lab-mate U. Seidaliyeva, who worked on a related project in Vision-based UAV
detection and gave moral support during the study. I would like to express my deep
gratitude to my family and the memory of my mother Gulzhamila Utebayeva, who
has always inspired me and provided the basis for ambitious goals in science.
Structure and scope of the thesis.
This dissertation consists of 5 parts: "State of the Art: UAV detection with
acoustic data", "UAV acoustic data preparation", "Mathematical view on the signal
pre-analysis step in time and frequency domains", "Deep Learning methods for UAV
acoustic data recognition" and "Real-time UAV acoustic data recognition and
classification system".
1 STATE OF THE ART: UAV DETECTION WITH ACOUSTIC DATA
a – A loaded drone in the city area; b – A loaded drone on a path of power lines
As the examples above show, unmanned aerial vehicles (UAVs) have been used
widely in hostilities in recent years. The work [29] provides an extended list of drone
incidents in military and other situations. Non-military UAVs have also frequently
been implicated in incidents endangering aircraft as well as people or property on the
ground. A drone ingested into an aircraft's engine can quickly damage it, which has
raised safety concerns. Several confirmed collisions and hazards have also involved
amateur drone operators who flew in violation of air safety laws. These observations
suggest that the identification and categorization of UAVs will remain of paramount
importance, and the acoustic sensor method can be an effective solution to the
problem of drone detection and classification. Because multifunctional technologies
now allow drone users to build their own drones, and monitoring them is nearly
impossible, other methods are often impractical.
The military can identify drones with very sophisticated radar systems, but these
systems are expensive, and their practical design is not suitable for urban
environments. In addition, there are a number of integrated commercial solutions that
use various complex sensor systems such as radar, RF, cameras, and thermal sensors
[3, p. 149-160; 6, p. 138682; 7, p. 3856-3-3856-7; 20, p. 26-20-25]. But the drone
incidents mentioned above require the definition of models or types, distances of
drones to objects, and their loads. And the acoustic sensor method is suitable for the
optimal solution of these problems from a technical point of view. That is, if drones
are studied by their sound signatures, then it is technically possible to determine their
model, state, and position. This is because different drone models have different
motors that make different humming sounds, which in turn produce different
frequency responses. As a result, enough data can be collected for processing using
deep learning methods in artificial intelligence. Also, if drones are loaded with an
additional mass, the sound data will change even for the same model, due to the extra
weight on the motors. Summing up these factors and possibilities, recognizing drones
by their sound signatures is an effective solution. The use of deep learning and
machine learning methods, modern and productive branches of artificial intelligence,
is considered the most reliable approach for processing such collected data.
High-accuracy recognition of these objects can be achieved by collecting a
sufficiently large number of sound patterns from each object and training neural
networks on those patterns.
models. Analyses of scientific papers on acoustic sensors based on machine learning
and deep learning were carried out (table 1), and research on the individual methods
of each direction is discussed in the following sections.
The literature review focused on research done with acoustic sensors using
machine learning and deep learning methods. As can be seen from the table, the
results of the discussion show that deep learning methods provide high performance,
and among them only the CNN network has been studied in depth. Very little research
has been done by other authors on RNNs, and mostly with binary classification. A
scientific study was carried out on the LSTM network, a type of RNN, and published
in [15, p. 456-457; 16, p. 473]. From this we can see that a complete study of RNNs
has not been carried out by other authors, although RNNs have been used successfully
for audio signals. However, to apply these algorithms, drone sounds must go through
preparatory processes that can be taken into account during training. Namely, the
acoustic UAV detection system consists of the following main parts: data preparation,
pre-processing, and classification. Data preparation is associated with the collection of
acoustic data from various types of UAVs using acoustic sensors, i.e., microphones.
Pre-processing covers preparing the audio data for the network by extracting features
from audio representations. The classification task concerns training on the datasets
using machine learning or deep learning methods [15, p. 454]. The following sections
discuss the literature related to these parts: pre-processing methods that prepare input
acoustic data for neural networks, machine learning methods, and deep learning methods.
Figure 3 – Types of features during preprocessing of audio signals
Figure 4 – The most studied networks in deep learning for UAV detection
Figure 5 – Accuracy and loss models for the initial database of training using CNN
This demonstrates that as the database grows, the identification accuracy of the
neural network recognition system reaches an optimal performance level and the
model remains trainable, as shown by the second graph.
Figure 6 – Accuracy and loss models for the modified database of training using
CNN
At this stage, a feature vector of randomly selected items in the test data, i.e.
audio image data values, is presented in table 2 to accurately represent the prediction
results. The main goal is to test the system created as in works [1, p. 863; 2, p. 243]
against a database of many different drone models, to study its reliability in a real-
time system, and to identify the main problems.
The study in this publication required adding several more CNN layers and
many more epochs, using data from a newly collected database to test the new target.
This showed that the study with the CNN model is limited by factors such as a large
number of trainable parameters and the excess time required to train them.
Considering the above studies, it was noticeable that with the help of an acoustic
signature it is possible to perform a binary classification of a UAV as well as
determine the drone's load. In the course of this work, a research task with two targets
was carried out in the form of a publication: the first goal was to create a
classification system covering many models, and the second goal was to research the
LSTM, one of the types of RNN, that is, to develop a multi-level bidirectional long
short-term memory (BiLSTM) with two hidden layers to categorize the sounds of
multiple UAVs. The data used in the research in [16, p. 471] was gathered for three
primary classes of multiple UAV detection: background noise, the sound of unloaded
drones of different models in the scene, and loaded drones in the scene (figure 7).
This research project's primary objective was to create a multi-label
classification system. It is a classification challenge for many labels due to the
dataset's architecture (figure 8). A frequency of 44100 Hz and a microphone bit depth
with a resolution of 16 bits were used to record UAV sounds (DJI phantom I, DJI
phantom II, Syma x20, 6 axis Gyro, tarantula, etc.).
Uncompressed WAV files have been used to store the audio recordings.
Additionally, the set of data included 3 primary classes for all audio data: "loaded,"
"unloaded," and "no drone". Modeling clay weighing 0.5 KG is carried as a
supplementary payload by "loaded" class UAVs. The "P1" class of UAVs was
designated as the "unloaded" class so the testing findings indicated that they are too
fragile to support any sizeable load.
This research's secondary objective was to construct a recurrent neural
network. The model has an input layer, four input dense layers wrapped in
TimeDistributed layers, two stacked bidirectional LSTM layers, a dropout layer, a
flatten layer, and a dense layer. Bidirectional wrappers are used to enclose the hidden
LSTM layers. Each hidden LSTM layer employed 128 memory cells [49-52].
The bidirectional wrapper doubles this computation, effectively adding another
layer. The "categorical crossentropy" loss function is tailored to the multi-label
classification problem in the implementation of the suggested model. The weights are
optimized using the "Adam" gradient descent implementation, and the classification
"accuracy" is computed during model training and validation. According to the
model's evaluation, the network's predictive skill on the training dataset was 94.02%.
The best precision, however, was only attained at epoch 49, with an accuracy of
94.09% in 57 s, which presents a challenge for the creation of real-time systems
[15, p. 457]. As a follow-up to the work from
[15, p. 457-458], the construction of LSTM architecture with a mixture of
Convolutional Neural Networks (CNN) for UAV sound categorization challenges
was examined further in [16, p. 471]. As a result, utilizing the LSTM-CNN
architecture, the study [16, p. 472] tackled the issue of categorizing a UAV's sound
with several labels.
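For illustration, the stacked BiLSTM classifier summarized above might be sketched in Keras roughly as follows. The two bidirectional LSTM layers with 128 memory cells, the TimeDistributed dense layers, the dropout, flatten and dense output, the categorical cross-entropy loss, and the Adam optimizer follow the description in [15]; the input dimensions and the dense-layer widths are illustrative assumptions only.

```python
# A hedged Keras sketch of the stacked BiLSTM classifier described in [15].
# The input shape (100 frames x 128 features) and the dense-layer widths are
# illustrative assumptions; the layer types and their order follow the text above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100, 128)),
    # four dense feature layers applied independently to every time frame
    layers.TimeDistributed(layers.Dense(64, activation='relu')),
    layers.TimeDistributed(layers.Dense(64, activation='relu')),
    layers.TimeDistributed(layers.Dense(64, activation='relu')),
    layers.TimeDistributed(layers.Dense(64, activation='relu')),
    # two stacked bidirectional LSTM layers, 128 memory cells each
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(3, activation='softmax'),   # "loaded", "unloaded", "no drone"
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```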
1.4 Problem Statement: The protection system for strategic areas from
unidentified UAVs based on acoustic recognition
As a result of their many recreational uses, delivery systems, military strikes,
reconnaissance, and cross-border political objectives, UAVs are gradually becoming
more significant. Additionally, there are terrorist operations and criminal smuggling,
including the smuggling of items through borders, restricted locations, and prisons.
The issue of drones being used widely and illegally to take pictures or videos from
unusual perspectives [1, p. 863; 2, p. 243; 3, p. 149; 6, p. 138670; 7, p. 3856-2] is
also brought up. As a consequence, it is critical to identify drones that are loaded.
Based on the results of the literature review, it can be concluded that the drone
recognition systems investigated thus far have attempted to address the issues of
binary classification, classification that distinguishes between various models, and
classification that establishes the load of only one model. Continued reports of drone
incidents call for a serious investigation to determine the drones' states, positions, and
distances for different models in real time.
1.4.1 Suspicious UAVs with high-risk cases: Loaded and Unloaded UAVs
Cases such as a loaded drone falling onto high-voltage power lines, Amazon
delivery drones suffering technical faults during transportation, the incident in China
in which drones launched to release fireworks fell on people during a holiday, and the
frequency of suspicious drones launched in various countries for military intelligence
show that suspicious uses of drones are becoming more frequent. This motivates their
timely identification in protected areas where human life is at stake.
harmful loads in the transportation of contraband goods poses a great danger. Drones,
which are used as additional weapons or for special reconnaissance, are considered a
factor that raises suspicions, especially for densely populated areas and strategic
territories. At the same time, drones used as a hobby are often loaded with additional
power banks for long flights and high-resolution cameras for taking high-quality
pictures and videos. In such situations there is a risk of clumsy piloting: a drone
falling into an occupied or hazardous area due to inflexible control can result in
significant losses. As a result of these incidents, the problem of recognizing both
loaded and unloaded drones is very important, and in turn, there is a need to develop a
system for recognizing both.
2 UAV ACOUSTIC DATA PREPARATION
a – Flight of a loaded drone over a field; b – Loaded drone parking; c – Microphone
placement
Figure 10 – UAV audio recording from DJI Phantom series with payload flying over
an arable field
The data was collected over several different seasons. A freight train,
motorcycles, cars, Gator trucks, and background noises with human voices were heard
passing nearby while some of the UAVs were launched. Wind, canopy rustling, and
other ambient noises were also heard during the testing period, and their data was
also collected so that UAV sounds could be distinguished from them and false
detections reduced. During recording, the DJI Phantom 2 was used as the loaded
UAV, carrying a 0.5 kg payload of sculpting clay. The Syma X20 UAV model, which
is often used for leisure activities, was flown with a 0.425 kg metal power pack in the
loaded case and without it in the unloaded case. When assessing the load on these
recreational or amateur drones, the possibility of harm from a control error was taken
into account. In addition, other reasonably priced UAV models, including the
Tarantula x6 and Syma x5c, were tested for the unloaded UAV case at distances
ranging from 1 to 40 meters. Other UAV models, including DJI Phantom 1, 2, 3, 4,
DJI Phantom 4 Pro, Mavic Pro, and Qazdrone, were also launched with the
parameters in table 3, and their noises were added to the dataset.
All of these UAVs were recorded using 16-bit microphone depth resolution at
44,100 Hz, moving up and down, forward and backward at varying speeds depending
on their technical characteristics, starting fairly close to the microphone and to a
nearby parking lot. The remaining information was gathered from free and open
resources. The process of collecting audio files from public sources such as
"www.zaplast.com" and "www.sound-ideas.com" required much more effort. This is
because our prediction system was based on an acoustic sensor listening at a frequency
of 44,100 Hz and a depth of 16 bits, while the sounds of loaded UAVs were found
only in amateur videos and had to be processed with a special converter to 44,100 Hz
and a bit depth of 16 bits. The rest of the data from open sources was likewise
converted from various formats to a sampling rate of 44,100 Hz with 16-bit depth
resolution and "mono" microphone mode with the ".wav" extension, since our model
was created to receive audio data in the WAV format. The DJI Phantom 2 and its
loaded states were the only UAV model considered in the earlier studies [1, p. 864-867;
2, p. 243-244], which was a limitation of that work. This study aims to investigate
how the acoustic data of different UAV models influence UAV load recognition
across different models and weights. In general, all UAV recordings were collected
and divided into three categories: "Unloaded", "Loaded", and "Background noise".
The three folders include all of these recorded
and collected sounds. Recorded drone noises ranged in duration from a few seconds
to more than five minutes. Table 4 provides a general overview of the duration of the
sounds collected for each class, in seconds.
UAV sounds from open sources included several sounds in "stereo" mode.
During the experiment, some of the sounds emitted by Qazdrone, DJI Phantom 2, DJI
Phantom 4 and DJI Phantom 4 Pro drones were recorded using microphones from
Apple products such as the Apple iPhone 13 and iPad Air 2020. Using a specially
created filter, all data files were changed to 44,100 Hz and "mono" mode.
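A minimal sketch of the kind of conversion filter described here is given below; it resamples any input recording to 44,100 Hz, mixes it down to mono, and stores it as a 16-bit WAV file. The file paths are placeholders, and the librosa and soundfile libraries are assumed to be installed; the thesis's own filter implementation may differ.

```python
# A hedged sketch of the conversion filter: resample to 44,100 Hz, mix to mono,
# and write 16-bit PCM WAV. Paths are placeholders, not files from the dataset.
import librosa
import soundfile as sf

def convert_to_mono_wav(src_path, dst_path, target_sr=44100):
    # librosa loads as floating point, resamples to target_sr and mixes to mono
    audio, _ = librosa.load(src_path, sr=target_sr, mono=True)
    # store as 16-bit PCM WAV, matching the recording setup used for the dataset
    sf.write(dst_path, audio, target_sr, subtype='PCM_16')

convert_to_mono_wav('raw/example_recording.wav', 'dataset/loaded/example_recording.wav')
```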
The next two sections are devoted to theoretical solutions for recognizing this
initial acoustic database, and the last section explores the development of the system
itself and its practical solution.
3 MATHEMATICAL VIEW ON THE SIGNAL PRE-ANALYSIS STEP
IN TIME AND FREQUENCY DOMAINS
Figure 11 – The process of sound formation and the human perception system
So, sound is produced when a vibrating object creates a pressure wave. The
pressure wave induces vibrational motion in the particles of the surrounding medium
(air, water, or a solid), and as adjacent particles move in turn, the sound is transmitted
further through the medium. Vibrating air particles cause tiny components of the
human ear to vibrate, which allows the ear to detect sound waves [57]. The trajectory
of these particles is similar to a sine wave. Because of this physical phenomenon, the
study of sound in the form of waves is generally accepted, and sound waves are often
described as simplified sinusoidal waves with common properties such as frequency,
wavelength, amplitude, and sound pressure or intensity [58-60].
Figure 13 – Representation of the audio signals as a function 𝑥(𝑡)
There is one strict rule to keep in mind when conceptualizing the audio signal
processing. The sampling rate must be calculated using the Nyquist-Shannon theorem
when signals with sampled values are considered even in the time domain. The
Nyquist-Shannon theorem is a crucial link between continuous-time signals and
discrete-time signals in the field of signal processing. It creates a necessary condition
for a sample rate that allows a discrete series of samples to fully capture the
information from a continuous-time signal with a finite bandwidth. The Nyquist-
Shannon sampling theorem offers a requirement for discretizing an analog signal into
uniformly spaced samples, making it possible to reconstruct the analog signal from a
discrete signal. It also includes removal of aliasing's effect. The process of aliasing
blends together several signals. The sampling theorem states that the sampling
frequency 𝐹𝑠 should be more than twice the maximum frequency component, where
𝑓𝑚𝑎𝑥 is the maximum frequency component of the analog signal, equation (2).
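For reference, the sampling condition referred to as equation (2) is the standard Nyquist criterion, restated here from the description above:

$F_s > 2 f_{max}$ (2)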
In digital devices with software for automatic audio signal processing, this law
is respected when choosing the sampling rate. It is likewise necessary to respect the
Nyquist condition in machine learning programming environments when recovering
information from recorded audio files and processing it.
Since the time domain only provides information about amplitude over time, it
does not offer more extensive information. Therefore, considering the signals in the
frequency domain yields additional information needed for sound recognition. The
next subsection considers the study of audio signals in the frequency domain.
aforementioned DFT transform, which helps to move from the time domain to the
frequency domain, is defined by the following equation (3):

$X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i k n / N}$ (3)
Figure 14 – The fundamental basis of the STFT calculation process for audio signals
Moreover, windows are employed to produce brief sound fragments that last
only a few milliseconds. Any finite sound with a beginning and an end can be
thought of as a windowed piece of time in general. Numerous window shapes are
possible, such as trapezoidal, triangular, polynomial, and "sine" windows; the DFT
often employs Hann and Hamming windows [66]. That is, small segments of time, or
frames, are deliberately taken with a certain duration, called the frame length. The
frame length is kept constant for all subsequent segments. The very first frame is
taken from the zero point of the time axis with this frame length. As stated above, the
frame length is between 20-40 ms. The second frame does not start from the end of
the first frame; it is offset by a certain time step relative to the starting point, called
the "hop" step.
Typically, the "hop" length is 10 milliseconds; this is also called the step size of
the frame. Thus, subsequent frames are calculated by this rule relative to the previous
frames, sliding until the end of the given signal (figure 15) in [67]. Figure 15 shows
that the frames overlap during the calculation:
$\dot{X}_i \in \mathbb{C}^K$ (5)
The discrete Fourier transform of each frame is quickly calculated with the formula
below, as in equation (6):

$\dot{X}_i(k) = \sum_{n=1}^{N} X_i(n)\, g(n)\, e^{-j 2\pi k n / N}, \quad k = 1, \ldots, K$ (6)
Here, N is the size of the frame signal and K is the number of Fourier transforms to
be executed over the entire signal. These results are obtained with frequency
indicators, that is, they show spectra. The changing spectra are then often plotted as a
function of time using a tool called a spectrogram. The values obtained according to
formula (6) are complex numbers, so their absolute values are taken, which gives real
numbers (equation (7)). The results obtained are called periodograms. Thus, the basis
of audio signal processing using the STFT, while preserving the time and frequency
indicators of the original audio signal, has been considered. All these measured
quantities are mathematical descriptions of the signal. However, deep learning
approaches to acoustic recognition are often designed around the way the human
auditory system perceives sound, so it is useful to map the frequency scale of the
studied signal onto a perceptual scale. This scale is called the Mel scale [20, p. 13].
The next section deals with the theory of Mel spectrogram processing.
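As a preview of the feature extraction used later, the STFT processing described in this section and the Mel mapping introduced next can be computed offline, for example with the librosa library. The sketch below is purely illustrative: the file path is a placeholder, and the 25 ms frame, 10 ms hop, and 128 Mel bands are typical values consistent with the text rather than parameters fixed by the thesis.

```python
# A minimal sketch of STFT and Mel spectrogram extraction with librosa.
import numpy as np
import librosa

sr = 16000                       # sampling rate after downsampling
n_fft = int(0.025 * sr)          # 25 ms frame length
hop_length = int(0.010 * sr)     # 10 ms hop step

y, _ = librosa.load('uav_clip.wav', sr=sr, mono=True)

# Short-Time Fourier Transform: complex spectra for overlapping frames
stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window='hann')
periodogram = np.abs(stft) ** 2  # real-valued power spectra (periodograms)

# Map the linear-frequency spectra onto 128 Mel bands
mel = librosa.feature.melspectrogram(S=periodogram, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel)  # log-compressed Mel spectrogram
print(mel_db.shape)                # (n_mels, n_frames)
```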
4 DEEP LEARNING METHODS FOR UAV ACOUSTIC DATA
RECOGNITION
Sound and speech recognition using neural networks has a long history. Neural
networks, a subfield of machine learning and the basis of deep learning algorithms,
are sometimes referred to as artificial neural networks (ANNs) [69]. The initial
studies of sound recognition show that machine learning methods, such as support
vector machines (SVM), the KNN classification algorithm, K-means, and the random
forest algorithm, were studied at a significant pace, as is noticeable from the
literature review subsection. More recent studies have widely used deep learning
methods to ensure effective results. The name "deep learning" derives from the depth,
i.e. the number, of hidden layers in neural networks. In difficult prediction problems,
traditional machine learning techniques are now dominated by convolutional neural
networks (CNN), deep feedforward neural networks (DFN), and recurrent neural
networks (RNN) [20, p. 2-3].
In recent years, deep feedforward networks have attracted attention for the
significant improvements they provide in acoustic recognition systems. Given that
sound is inherently a dynamic process, it seems natural to consider recurrent neural
networks (RNNs) as an effective model; among neural networks, recurrent neural
networks are effective models for sequential data [70]. Due to the consecutive
occurrence of its connected data points, an audio waveform is a kind of sequential
data. Recurrent Neural Networks (RNNs) are able to learn characteristics and
long-term dependencies from sequences and time-series data. In a comparative
analysis, the feature-learning ability of CNNs has allowed them to gain popularity in
applications such as audio processing [42, p. 412-414], speech recognition, machine
vision, and image and video captioning. However, audio signals constantly change
over time. The consistent and time-varying nature of sounds makes RNN networks an
ideal model for studying their features. Since an RNN has a recurrent hidden state,
whose activation at each step depends on that of the preceding step, it can handle
consecutive inputs, unlike a feedforward neural network [49].
Taking into account the factors discussed above, this thesis aims to explore
RNNs in more depth. Before studying the RNN networks, it was also planned to
consider the recognition of drone data using a CNN network for comparison on a
practical basis. The results provided by the CNN architecture explored in previous
research in this area were compared with the study of RNN network architectures,
although theoretical knowledge about the recognition of audio signals suggested in
advance that RNN networks would be effective. So, this section briefly outlines the
theoretical foundations of CNNs and provides a detailed mathematical description of
RNN network architectures.
4.1 Convolutional Neural Networks (CNNs) in Sound Recognition
Problems
Convolutional Neural Networks (CNN) are one of the Deep Learning networks
used in various fields such as Object Recognition, Computer Vision, Audio
Recognition and natural language processing (NLP) [31, p. 302]. The primary
structural characteristic of a CNN is a standard neural network extended with a
subsampling layer and numerous convolutional layers. Convolutional neural
networks were mostly developed for two-dimensional, feature-based image
recognition; their input can employ feature layering to accomplish learning and
representation using 2D images. So, there can be many layers in a convolutional neural network, and each
layer will learn to recognize different aspects of the image. Each training image is
subjected to filters at various resolutions, and the result of each convolved image is
utilized as the input to the following layer. Beginning with relatively basic properties
like brightness and borders, the filters can get more complicated until they reach
characteristics that specifically identify the object. It is very capable of learning,
requires little signal processing, and has been used successfully for handwriting
recognition, object recognition, face recognition, and sound recognition [71].
A CNN architecture comprises three parts: an input layer, a group of
hidden layers, and an output layer (figure 16) [72, 73]. It has three common layer
types: convolution, activation, and pooling. The foundational component of the CNN
is the convolution layer. It carries the majority of the computational load on the
network. With convolution, a series of convolutional filters are applied to the input
images, each of which activates different aspects of the images. The next layer type is
activation. With the matching of negative data to zero and the preservation of positive
values, activation enables quicker and more effective training. “Relu”, “sigmoid”,
“softmax” and “tanh” functions are the most popular types. Mostly, Activation
function “Relu” is accompanied by Convolution in CNNs (figure 16). So, any
intermediary layers in a feed-forward neural network are known as hidden layers
because the final convolution and activation function conceal their inputs and
outputs. The convolutional layers in a convolutional neural network are hidden
layers. A convolutional layer extracts the image into a feature map, also known as an
activation map. Convolutional layers transmit their output to the following layer
after convolving the input. This resembles how a visual cortex neuron would
react to a particular stimulus. Every convolutional neuron only processes information
for its particular “receptive field”. Although fully connected feedforward neural
networks can be used to learn features and classify data, this architecture is typically
impractical for larger inputs (for example, high-resolution images), where it would be
necessary to use enormous numbers of neurons because each pixel is a significant
input feature. As well, regularized weights across fewer parameters help prevent the
disappearing gradients and exploding gradients issues that were present during
backpropagation in early neural networks.
Figure 16 – General architecture of the network
The pooling layer comes next. Using nonlinear downsampling, pooling
reduces the number of parameters the network needs to learn while still simplifying
the output. Convolutional networks may also have standard convolutional layers and
local or global pooling layers. By merging the outputs of neuron clusters at one layer
into a single neuron at the next, a technique known as pooling layers reduces the
dimensionality of data. Little clusters are combined using local pooling, which
regularly uses tiling sizes of 2x2. Each neuron of the feature map is affected by
global pooling. Max and average are the two most widely used types of pooling.
When comparing local clusters of neurons in the feature map, max pooling utilizes
the largest value whereas average pooling uses the average. The structure of a CNN
switches to classification after learning features in numerous layers. The next-to-last
layer is a fully connected layer that generates a vector of “N” dimensions (“N” is the
maximum number of classes that may be predicted) and contains the possibilities for
each class that a target image belongs to. All of the neurons in one layer
communicate with all of the neurons in the other layer through fully connected layers.
It is equivalent to a conventional multilayer perceptron neural network (MLP). To
identify the images, the flattened matrix passes through a layer that is fully
connected. The final output of the final classification is provided by a classification
layer in the last layer of the CNN architecture [73]. Various types of CNN models
have evolved throughout the evolution of the object recognition problem. These
include LeNet, AlexNet, ResNet, GoogleNet / Inception, MobileNetV1, ZfNet and
Depth based CNNs. And when studying the problem of recognizing sound signals,
simple types of convolutional layers created by several layers were used a lot [1,
p. 864-866; 3, p. 170; 16, p. 472-473; 17, p. 2-3; 74-78]. The CNN infrastructure is
adaptable for image data due to the description of the network and their functionality.
Therefore, the next subsection discusses the theoretical foundations of recurrent
neural networks, which are considered effective for time-varying signals such as
sound [79-83].
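To make the above layer types concrete, a generic Keras CNN for spectrogram inputs might look as follows. This is a hedged sketch only, not the exact architecture of [1] (whose layers are listed in Appendix A); the 128x100 single-channel input, the filter counts, and the dense width are illustrative assumptions.

```python
# A generic Keras CNN sketch for spectrogram classification, assuming 128x100
# single-channel Mel spectrogram inputs and three output classes.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 100, 1)),
    layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),                    # local pooling over 2x2 clusters
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),            # fully connected layer
    layers.Dense(3, activation='softmax'),          # classification layer
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```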
4.2 Recurrent Neural Networks (RNNs) in Sound Recognition
The initial and most basic design of an artificial neural network was a
feedforward neural network. In this network, data only travels forward from the input
nodes, via any hidden nodes present, and onto the output nodes. The network
contains no loops or cycles. Feedforward neural networks have proven attractive for
sound recognition tasks in many studies. Moreover, a feedforward network
has become popular for solving prediction problems like image recognition, computer
vision, speech recognition, sound detection [84-86] and others since it employs
multiple hidden layers to maximize learning from the input data [75, p. 229-233; 76,
p. 8-10; 77, p. 87-90]. Overfitting is the primary issue with merely utilizing one
hidden NN layer. By increasing the number of hidden layers, overfitting can be
decreased and generalization can be enhanced. As NNs increase layers, they become
Deep FNNs. Deep FF neural networks also have a drawback in that adding more
layers exponentially lengthens training time, making FF quite impractical [20, p. 26].
RNN networks appeared to address these functional shortcomings of
feedforward networks. Recurrent Neural Networks (RNNs) are derived from
Feedforward Neural Networks (FF) as a subset. RNNs can extract long-term
dependencies and features from sequential and time-series data. The input received
by each neuron in an RNN's hidden layers is delayed in time. Current iterations in
recurrent neural networks need access to historical data: for instance, one needs to be
aware of the words that came before the word being predicted in a sentence. An RNN
can process inputs of any length, sharing its weights over time; its computations take
historical data into account, and the model size is independent of the volume of input
data. The slow processing speed of this neural network is its weakness [78]. Based on
solutions to this
computational cells of RNNs such as simple RNN, LSTM, BiLSTM and GRU are
popular for prediction. The following subsections provide a theoretical basis for these
4 different RNN networks.
Figure 18 – LSTM architecture
The output data from the previous cell $h_{t-1}$ is mixed with the feature extraction
sequence data $x_t$. This combination of input data passes through the input gate $i_t$
(10) and the forget gate $f_t$ (9). Both gates have sigmoid activation functions that
output values between 0 and 1, equations (9)-(14).
$h_t = O_t \times \tanh(C_t)$ (14)
As a result, the input gate (10) determines which input values to update, while
the forget gate (9) determines what data to delete from the cell. Moreover, the tanh
layer, $\tilde{C}_t$ (11), compresses that mixture.
Here $\omega_f$, $\omega_i$, $\omega_C$ are the weights of the corresponding gate neurons, and $b_f$, $b_i$,
$b_C$ are the biases of the corresponding gates. LSTM cells have an inner loop (cell
state) with a variable $C_t$ (12) called the constant error carousel (CEC). The old cell
state $C_{t-1}$ is combined with the input to form an efficient recurrent loop. The
compressed combination $\tilde{C}_t$ is multiplied element-wise ($\times$) by the output of the input gate $i_t$ (figure 19).
A forget gate, which chooses which data should be kept or deleted from the
network, is in charge of controlling this recurring loop.
Using addition ($\oplus$) here instead of multiplication lowers the chance of the
gradient vanishing. The system then uses the tanh function to push the values
between -1 and 1 and multiplies that result by the output of the sigmoid gate to set
the cell state (12).
So, the output gate (13) chooses which values should be output from the cell as
$h_t$. In general, updating the internal state is done via the input gate (10) and the
forget gate (9). Because of their many memory cells, LSTM networks have more
complicated computations and greater memory requirements than simple RNNs. The
LSTM differs from the traditional RNN in that it copes much better with vanishing
gradients and long-term dependencies.
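For completeness, the standard LSTM gate equations that the numbering (9)-(14) in the text refers to can be written in their conventional form; they are reconstructed here from the textual description, not copied from the original figures:

$f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)$ (9)
$i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)$ (10)
$\tilde{C}_t = \tanh(\omega_C [h_{t-1}, x_t] + b_C)$ (11)
$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$ (12)
$O_t = \sigma(\omega_O [h_{t-1}, x_t] + b_O)$ (13)
$h_t = O_t \times \tanh(C_t)$ (14)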
In the course of the experiment, vanilla LSTM (figure 20) and stacked LSTM
(figure 21) models were studied; as a result, a single-layer model was adopted to
spend less time on computation.
By placing two layers side by side, feeding the input sequence as-is to the first
layer and a reversed copy of the input sequence to the second layer, this architecture
effectively duplicates the first recurrent level in the network. This extra context
speeds up and improves learning [35, p. 2527-2528; 36, p. 403-405]. As a result, the
BiLSTM network uses two different hidden layers to process the sequence data $x_t$ in
both the forward and reverse directions, and their hidden layers are joined by a single
output layer, as shown in figure 22. Similar to the LSTM layer, the final output of the
Bidirectional LSTM layer is a vector $y_t = [y_{t-1}, \ldots, y_{t+1}]$, the last element of which
is the predicted value for the following time step $y_{t+1}$. Due to its increased
computational complexity over LSTM, resulting from forward and backward learning,
BiLSTM has a drawback. Its key benefit is that, compared to LSTM networks, it more
accurately reflects the input sequence's past and future contexts [20, p. 7].
4.2.3 Gated Recurrent Neural Networks (GRU) for Sound Recognition
The LSTM network has been shown to be a practical solution for keeping
gradients from dissipating or exploding; however, because of the many memory
locations in its architecture, it requires more memory [49]. To address this issue, the
authors of [53, p. 1724-1733] created the GRU network, which requires less learning
time than the LSTM structure and still achieves high accuracy. The output gate of
GRU networks is absent, in contrast to LSTM networks. The structure of the GRU is
shown in figure 23. Two input quantities, the previous output vector $h_{t-1}$ and the
input vector $x_t$, are found in the structure of GRU networks at each instant of time.
Moreover, the input of each gate can undergo a logical operation and a non-linear
transformation before being used as the output. The relation between inputs and
outputs can be described by equations (15)-(18).
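The standard GRU update and reset gate equations, to which this numbering presumably refers, have the following conventional form; as with the LSTM, they are reconstructed here rather than copied from the original figures:

$z_t = \sigma(\omega_z [h_{t-1}, x_t])$ (15)
$r_t = \sigma(\omega_r [h_{t-1}, x_t])$ (16)
$\tilde{h}_t = \tanh(\omega [r_t \times h_{t-1}, x_t])$ (17)
$h_t = (1 - z_t) \times h_{t-1} + z_t \times \tilde{h}_t$ (18)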
5 REAL-TIME UAV ACOUSTIC DATA RECOGNITION AND
CLASSIFICATION SYSTEM
This "filtering block" converts all previously recorded UAV audio files into files of
1-second length. It is important to emphasize that the study was carried out using
supervised learning. The filtering unit receives audio data of different lengths and
applies two functions. One of them is the envelope function (figure 25), with the
threshold value taken as "0". The envelope function was utilized because the
envelope of an oscillating signal is a smooth curve outlining its extremes (figure 26).
Since the sounds of the drone are superimposed on background sounds and there are
sounds from various motorized objects, the "0" threshold was effective.
Figure 26 – Signal Envelope method with the threshold “0”
All initial files of different lengths, shown in (Appendix E), were cut into
1-second audio files and stored in folders classified according to the initial classes
"Loaded UAV", "Unloaded UAV", and "Background noise" (figure 27).
a – Splitting long files into one-second files; b – Saving the received files by class
Figure 27 – Audio filter saving audio files of one-second length
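A minimal sketch of this envelope-based filtering and 1-second splitting step is given below. The rolling-window envelope, the interpretation of the "0" threshold, the sampling rate parameter, and the folder layout are illustrative assumptions; only the 1-second segment length and the three class folders follow the text.

```python
# A hedged sketch of envelope filtering and splitting recordings into 1-second clips.
import os
import numpy as np
import librosa
import soundfile as sf

def envelope(signal, sr, window_s=0.1, threshold=0.0):
    """Rolling maximum of |signal|; keep samples whose envelope exceeds threshold."""
    abs_sig = np.abs(signal)
    win = max(1, int(sr * window_s))
    env = np.array([abs_sig[i:i + win].max() for i in range(0, len(abs_sig), win)])
    mask = np.repeat(env > threshold, win)[:len(signal)]
    return mask

def split_to_seconds(src_path, dst_dir, label, sr=16000):
    y, _ = librosa.load(src_path, sr=sr, mono=True)
    y = y[envelope(y, sr)]                       # drop regions below the threshold
    os.makedirs(os.path.join(dst_dir, label), exist_ok=True)
    for i in range(len(y) // sr):                # one file per full second of audio
        clip = y[i * sr:(i + 1) * sr]
        out = os.path.join(dst_dir, label, f'{os.path.basename(src_path)}_{i}.wav')
        sf.write(out, clip, sr, subtype='PCM_16')
```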
The acoustic data of the UAVs were adapted before the UAV sound
recognition stage, based on an analysis of their frequency ranges. The acoustic data
was studied in the time domain first. Our background noise class consists of the
sounds of many motorized objects, collected so that they would not be confused with
UAVs and cause false detections. Because of the large number of types of background
noise objects, the background noise class was temporarily expanded at this adaptation
step (figures in Appendix D and figures 28, 29). The frequency range of the extended
classes was then studied to preliminarily determine the range of object spectra before
the KAPRE layers in the model. It is therefore necessary to analyze the sounds of the
various objects in the frequency domain in their natural form (figures 28, 29).
Figure 28 – Temporary extension of background noises
The spectra for each class in the frequency range for our signals were obtained
using the Fast Fourier Transform (FFT) to perform this fundamental analytical work.
They were separated in the time and frequency domains only temporarily, during the
adaptation analysis stage; in the neural model itself, all of these extended classes were
processed together as background noise.
This class extension analysis method helped to determine the frequency ranges
of the desired objects at the level of 16000 Hz, since the informative parts were
visible only in this region (figure 29). That is, the informative component of the
sounds of the UAV and the background noise we need is reflected only up to 16000
Hz, which can be seen in the frequency domain (figure 29).
Figure 29 – UAV signal analysis in the frequency domain using extended six classes
The UAV dataset was then adapted by downsampling, based on this study of the
objects' frequency ranges. A specially created downsampling filter unit was
constructed to perform these tasks (figure 30).
Passing the audio data through this block gave a database of audio files of
1-second duration with the sampling frequency set to 16,000 Hz (figure 31). In
addition, cutting off the audio spectrum above the 16,000 Hz region saved
unnecessary computing time.
At this stage of the temporary expansion of classes, the characteristics of the
UAV audio signals are not yet processed, only adapted. The spectra of the signals
successfully adapted by the above procedure using the filter bank are shown in the
figures in (Appendix F, G). Feature extraction from the audio signals has been
incorporated into the deep learning model itself, which is discussed in the next
section. The next subsection discusses building the first layer of the basic RNN
recognition model, i.e. the signal processing layer, using the Keras libraries and the
KAPRE method.
These feature vectors can be produced by examining the frequency spectrum of drone
sounds extensively. In general, the processing of drone sound data in the frequency
domain is called "feature extraction". An efficient feature extraction scheme for a
real-time UAV sound recognition system was found in the course of empirical studies
published earlier [15, p. 457-458; 16, p. 473-474; 20, p. 26-24-26-25]: the Mel
spectrogram layers [20, p. 26-25]. Table 5 displays the hyperparameter ranges and
chosen values for the Melspectrogram feature layer. The Python programming
environment was used to perform the fast calculations needed to obtain this
Melspectrogram layer.
Thus, it is suggested that Mel-scale vectors be extracted from the UAV sound
data while keeping the time and frequency information provided by the STFT. In
many investigations, the Librosa and Essentia libraries are primarily used to
implement temporal and frequency features based on conventional approaches. This
study instead implements the KAPRE method, built as Keras layers in Python. The
adjustment of the acoustic processing parameters is the main benefit of the KAPRE
approach. The hyperparameters of this layer, from a programmatic point of view, are
given in (figure 32a).
a – Programmatically feeding the hyperparameters of the Melspectrogram layer; b – Layers
of processed signals based on the Melspectrogram layer during training
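As an illustration, such a Melspectrogram front-end layer might be instantiated roughly as follows, assuming the kapre 0.3.x API (kapre.composed.get_melspectrogram_layer); the STFT sizes shown are placeholders, and the values actually tuned in this work are those listed in table 5 and figure 32a.

```python
# A hedged sketch of a KAPRE Melspectrogram front-end (kapre 0.3.x API assumed).
from kapre.composed import get_melspectrogram_layer

SR = 16000          # sampling rate of the adapted 1-second clips
DT = 1.0            # clip duration in seconds

mel_layer = get_melspectrogram_layer(
    input_shape=(int(SR * DT), 1),   # (samples, channels) for a mono clip
    sample_rate=SR,
    n_fft=512, hop_length=160,       # placeholder STFT frame and hop sizes
    n_mels=128,                      # 128 Mel bands, matching the 128x100 features
    return_decibel=True,             # log-scaled (decibel) Mel spectrogram
    input_data_format='channels_last',
    output_data_format='channels_last',
    name='melspectrogram')

mel_layer.summary()  # shows the stacked STFT/filterbank sub-layers created at build time
```

With a 160-sample hop at 16,000 Hz, a 1-second clip yields roughly 100 time frames, which is consistent with the 128-by-100 feature dimensions described in the next subsection.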
Figure 34 – The proposed RNN based Framework for Real-time UAV sound
recognition
It can be seen from figure 34 that, in the proposed structure, the block on the
left is a 1-second file-type adaptor, which filters the audio data as explained in the
previous subsections. This block can be thought of as the drone sound preparation
stage. The main large block in the middle of figure 34 is a deep learning structure
with a modified Melspectrogram that allows drone sounds to be recognized.
The input layer of this main block is the Melspectrogram layer, which
processes drone sounds on the CPU. The Melspectrogram is processed with the help
of the STFT and FFT calculations in real time according to the respective steps
explained in the theoretical part. Therefore, during training this layer consists of
several sub-layers, while in the code it is declared as a single layer. This layer
calculates features of the UAV acoustic data with a dimension of 128 frequency
feature vectors by 100 time feature vectors. The KAPRE libraries in Keras, which
allow this layer to be processed, are pre-installed and imported according to the
programming requirements. The tuned hyperparameters of the proposed UAV
acoustic data recognition architecture are given in table 6.
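To make the proposed architecture concrete, a hedged Keras sketch is given below. It combines a Melspectrogram front-end like the one sketched in the previous subsection with a single recurrent layer; the TimeDistributed dense width, the dropout rate, and the reshaping details are assumptions, while the tanh dense layer before the RNN, the 64-cell GRU, the relu dense layer with L2 activity regularization of 0.00001 after it, and the 3-class softmax output follow the description in this chapter and table 6.

```python
# A hedged sketch of the proposed real-time recognition model: Melspectrogram
# front-end followed by a single GRU layer. Widths not stated in the text are
# illustrative assumptions.
from tensorflow.keras import layers, models, regularizers
from kapre.composed import get_melspectrogram_layer

SR, DT, N_CLASSES = 16000, 1.0, 3

def build_gru_model(n_cells=64):
    inp = get_melspectrogram_layer(
        input_shape=(int(SR * DT), 1), sample_rate=SR,
        n_fft=512, hop_length=160, n_mels=128, return_decibel=True,
        input_data_format='channels_last', output_data_format='channels_last')
    x = layers.TimeDistributed(layers.Flatten())(inp.output)  # (time, 128) sequence
    x = layers.TimeDistributed(layers.Dense(64, activation='tanh'))(x)
    x = layers.GRU(n_cells)(x)                                # 64 GRU memory cells
    x = layers.Dense(64, activation='relu',
                     activity_regularizer=regularizers.l2(0.00001))(x)
    x = layers.Dropout(0.2)(x)                                # guard against overtraining
    out = layers.Dense(N_CLASSES, activation='softmax')(x)
    model = models.Model(inputs=inp.input, outputs=out)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```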
Figure 36 – The architecture obtained during the compilation of a simple RNN
network
Figure 38 – The architecture obtained during the compilation of a simple RNN
network
Table 7 below shows the average recognition results of the first RNN networks,
each with 32 cells. The plot of the recognition accuracy obtained in each epoch of
this training is given in figure 36. Table 7 also gives the recognition accuracy
obtained by the CNN model.
Table 7 – Comparison of Model Accuracy of the SimpleRNN, LSTM, BiLSTM,
GRU, and CNN models on 128-100 dimensional Melspectrograms
Trained models Accuracies in %
SimpleRNN 98
LSTM 97
Bidirectional LSTM (BiLSTM) 97
GRU with 32 cells 98
CNN structure as in [1] 94
GRU with 64 cells 98
Accuracy plots of the models were created from their values over 25 epochs.
Here, the solid lines represent the training curves, and the dotted lines represent the
test curves (figure 40).
After the initial stages of training, the results obtained were analyzed. The
recognition accuracy plots (train and test) were "non-representative" in the
SimpleRNN model plot, figure 40, despite the fact that the average recognition
accuracy scores were similar. In addition, the CNN network showed a lower
recognition rate than the other models, which suggests that the recognition
performance of CNN models for the sounds of UAVs and other objects is limited in
this setting. The CNN can reach high recognition capability if more CNN layers are
added; here two CNN layers were used because the CNN structure was based on
previous work [1, p. 864]. However, compared to the single-layer RNN models, the
recognition of the CNN model was significantly lower, as seen in (table 7). At the
same time, each experiment was repeated at least 2-3 times, because a model trained
only once may owe its predictions to chance. The GRU and SimpleRNN models were
found to be noticeably more accurate than the LSTM and BiLSTM models. The
recognition history plot of the SimpleRNN network was found to be unrepresentative,
so it was not continued in further studies. The next step of the study involved
increasing the number of GRU cells to 64 and continuing training with 25 epochs.
However, the value of the "activity regularization" function in the penultimate layer
had to be sought from a different interval due to the increase in the number of cells.
In the GRU model with 64 cells, the values of the "activity regularization" function were
taken equal to L2 = 0.00001, which provided a "good fit" in the recognition accuracy
plot. Figure 41 shows the overall accuracy of the CNN and GRU models.
Figure 41 – Model Accuracy plots of the GRU model and CNN model
The proposed GRU model with 64 cells therefore provides relatively good
recognition ability, as illustrated in figure 41 above, and displayed a "good fit" model
accuracy plot. The CNN model exhibits an unrepresentative gap between training and
testing accuracy, as well as a lower recognition capability than the one-layer RNN
architecture. In addition, a 4-class dataset was also used during model building and
testing; it was checked against drone sounds recorded in the immediate area and was
based on the first, smaller database. A series of results from this study is presented as
a confusion matrix in (Appendix I).
In general, the capabilities of a recognition system cannot be demonstrated
accurately from the average recognition accuracy alone. Therefore, by presenting the
recognition results in an extended form, with recognition accuracy characteristics for
each class, the predictive power of the models can be assessed with confidence. The
performance and robustness of a model on multi-class classification problems is
usually assessed using the classifier confusion matrix. Sensitivity (recall), specificity
and accuracy can be calculated from the components of this matrix. Several
performance indicators [7, p. 3856(17-19)], including Precision, Recall and F1, were
used to evaluate our approach.
Precision is determined as the ratio of true positive (TP) predictions to all positive
predictions, i.e. true positives plus false positives (FP), equation (19):

$Precision = \frac{T_p}{T_p + F_p}$ (19)
Equation (20) was used to evaluate recall by comparing true positive (TP)
predictions with false negative (FN) predictions:

$Recall = \frac{T_p}{T_p + F_n}$ (20)
The F1 score, which combines precision and recall, was calculated from equations
(19) and (20), because neither precision nor recall alone properly assesses system
predictability, equation (21):

$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (21)
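For reference, these per-class metrics and the confusion matrix can be computed directly from the predicted and true labels, for example with scikit-learn; the label lists below are placeholders standing in for the 300-file prediction set, and only the class names mirror the dataset.

```python
# A small sketch of computing the per-class metrics used above with scikit-learn.
from sklearn.metrics import classification_report, confusion_matrix

classes = ['Background noise', 'Loaded UAV', 'Unloaded UAV']
y_true = [0, 1, 2, 1, 0, 2]   # placeholder ground-truth labels
y_pred = [0, 1, 2, 2, 0, 2]   # placeholder model predictions

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=classes))
```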
The prediction dataset (300 files) contained 100 files per class so that the results
of these scoring methods could be inspected clearly. The prediction results are shown
in table 8. A confusion matrix was also computed for the five basic models studied
and for the sixth, later-developed GRU model. The confusion matrix, which shows
how each class is identified and predicted, provided sufficient information to judge
the reliability of the given models (table 8).
As a result, when classifying the background noise class, almost all types of
RNN models have a very high identifying ability. The CNN model performed slightly
worse than the RNN models overall, but achieved its best accuracy on the background
noise class. This demonstrates that while CNN models are capable of solving common
recognition problems, they are less efficient than RNNs when dealing with the sounds
of the same objects in different states.
Moreover, almost all RNN cells show strong recognition ability with a single
layer and a great ability to identify engine-based objects. Along with the capabilities
of the RNN cells, the structure of the model also plays a special role here. Table 6
shows that the "tanh" activation function was used for
the dense layer prior to the RNN layer, while the dense layers after the RNN layer
use the "relu" activation.
To avoid overtraining, a Dropout layer was also included. In addition, the
dataset was upgraded from its previous version. In this way, near-ideal recognition
results were achieved. In both loaded and unloaded UAV situations, the simple RNN,
LSTM, and BiLSTM failed to demonstrate consistently high sound recognition rates,
as evidenced by the study of the RNN models in table 8: their true-positive
recognition of loaded and unloaded UAVs decreased by 4-5%.
Also, the GRU with 32 cells showed the best performance for the main target class of
loaded UAVs. And by expanding to 64 cells, the best results were achieved for all
classes. This leads to the conclusion that GRU cells well and reliably recognize
different noise states of the same object. And CNN models have proven to be
effective in processing binary classification with a large number of levels. However,
all varieties of RNNs have outperformed CNNs on binary and multiple sound
classification problems due to their stable recognition capabilities.
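A minimal Keras sketch of the layer ordering described above is given below: a "tanh" dense layer before the recurrent layer, "relu" dense layers after it, and a Dropout layer. Only the 64 GRU cells and the three output classes follow the text; the input shape, dense layer sizes and dropout rate are illustrative assumptions rather than the exact values of the dissertation's model.

```python
# Sketch of the described layer order; sizes other than the 64 GRU cells and the
# 3 output classes are assumptions for illustration, not the dissertation's values.
from tensorflow.keras import layers, models

TIME_STEPS, N_MELS, N_CLASSES = 128, 40, 3   # assumed Mel-spectrogram input shape

model = models.Sequential([
    layers.Input(shape=(TIME_STEPS, N_MELS)),
    layers.Dense(64, activation="tanh"),            # dense layer before the RNN block
    layers.GRU(64),                                  # recurrent layer (64 GRU cells)
    layers.Dense(32, activation="relu"),             # dense layer after the RNN block
    layers.Dropout(0.3),                             # regularization against overfitting
    layers.Dense(N_CLASSES, activation="softmax"),   # loaded / unloaded / background
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```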
This study concludes that the GRU model is a useful tool for recognizing UAV
acoustic data in different states. The confusion matrix obtained with the 64-cell GRU model
is shown in Figure 42.
Figure 42 – The confusion matrix produced with the 64-cell GRU model
An additional evaluation plot of the model is presented in Figure 43, where the
background noise class is labeled as ambient noise. As a result, the classifier received a
dataset divided into three classes. To evaluate the predictability of the chosen classifier and
to ensure that each distinct class is scored correctly, the correctly predicted audio files and
the false negatives of the classifier were mapped in a confusion matrix using a balanced set
of 100 audio recordings per class. The tests with the 64-cell GRU model clearly showed that
the recognition capability based on the provided database was stable.
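The sketch below shows how such a balanced three-class prediction set can be scored with a confusion matrix and a per-class report; the labels and predictions here are random placeholders rather than the actual outputs of the trained GRU model.

```python
# Sketch of scoring a balanced 3-class prediction set (100 files per class);
# y_true / y_pred are placeholders, not outputs of the trained GRU model.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

classes = ["loaded UAV", "unloaded UAV", "ambient noise"]
rng = np.random.default_rng(0)
y_true = np.repeat([0, 1, 2], 100)                 # 100 recordings per class
y_pred = np.where(rng.random(300) < 0.95,          # mostly correct predictions
                  y_true, rng.integers(0, 3, 300))

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=classes, digits=3))
```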
CONCLUSION
The main goal of this dissertation study was to solve the problem of UAV
acoustic data recognition. In the course of realizing this goal, the SimpleRNN,
LSTM, bidirectional LSTM and GRU architecture models have been explored in
depth for the real-time UAV acoustic data recognition system. This work paid particular
attention to whether the UAVs were loaded or unloaded. The three main classes were
unloaded UAVs, loaded UAVs, and background noises such as the sounds of other
engine-based objects.
efficient method for recognizing UAV acoustic data was determined. During the
experiments (Appendix I, K, L), the accuracy of the UAV recognition system was
evaluated using all the metrics from numerous class classification problems. As a
result, the GRU architecture (64) was found to be an efficient model with a high level
of predictability on the given dataset. In addition, this model can identify loaded and
unloaded UAVs with 98% accuracy, as well as background noise with 99% accuracy.
This evaluation confirms the reliability of the UAV audio recognition system and
proposes to build a network of acoustic sensors using the proposed GRU model (64).
Moreover, various RNN network architectures are robust to binary and multi-class
classification problems because they are better at content-based recognition than CNN
models. To sum up, the SimpleRNN, LSTM, BiLSTM, and GRU networks with the
proposed architecture can be used in the task of UAV load detection. The CNN model
showed a somewhat lower level of multi-class sound classification than the RNN models,
although the experimental studies showed that it recognized binary classification instances
better.
A limitation of this work is the smaller amount of acoustic data from loaded
UAVs. However, this study showed that it is possible to recognize and evaluate UAV
loads in real-time mode. A further continuation of the study took the direction of a
bimodal method for detecting UAVs using software-defined radio (SDR). As a
scientific continuation of this work, project “AP14971907” is being implemented,
combining acoustic sensor and SDR methods, Appendix M. The system, in this
study, is proposed as a scientific solution for small territorial-strategic areas and a
bimodal method for ensuring national security. And for strategic areas with a large
area, this acoustic sensor can be repeatedly placed at several points or nodes and
carry out protection measures with centralized control.
REFERENCES
1 Li S., Kim H., Lee S.D. et al. Convolutional Neural Networks for Analyzing
Unmanned Aerial Vehicles Sound // Procced. 18th internat. conf. on Control,
Automation, and Systems (ICCAS). – Daegwallyeong, 2018. – P. 862-866.
2 Lim D., Kim H., Hong S. et al. Practically Classifying Unmanned Aerial
Vehicles Sound Using Convolutional Neural Networks // Procced. 2nd IEEE internat.
conf. on Robotic Computing (IRC). – Laguna Hills, 2018. – P. 242-245.
3 Vemula H.C. Multiple Drone Detection and Acoustic Scene Classification
with Deep Learning. – Dayton: Wright State University, 2018. – 149 p.
4 Kim J., Park C., Ahn J. et al. Real-time UAV sound detection and analysis
system // Procced. IEEE Sensors Applications sympos. (SAS). – Glassboro, 2017. –
P. 1-5.
5 Park S., Shin S., Kim Y. et al. Combination of Radar and Audio Sensors for
Identification of Rotor-type Unmanned Aerial Vehicles (UAVs) // Procced. 2015
IEEE Sensors. – Busan, 2015. – P. 1-4.
6 Taha B., Shoufan A. Machine Learning-Based Drone Detection and
Classification: State-of-the-Art in Research // Procced. IEEE Access. – 2019. –
Vol. 7. – P. 138669-138682.
7 Seidaliyeva U., Akhmetov D., Ilipbayeva L. et al. Real-Time and Accurate
Drone Detection in a Video with a Static Background // Sensors. – 2020. – Vol. 20, Issue 14.
– P. 3856-1-3856-18.
8 Kosenov A. Kazakhstan confirmed the intrusion of an Uzbek unmanned aerial
vehicle into its territory // https://tengrinews.kz/events/kazahstan-
podtverdil-proniknovenie-uzbekskogo-bespilotnika. 20.12.2022.
9 A drone was captured over the building of the Ministry of Defense of Kazakhstan //
https://tengrinews.kz/kazakhstan_news/bespilotnik-zahvachen-nad. 20.12.2022.
10 An unauthorized quadcopter flight over the Ministry of Defense building was
stopped // https://www.egemen.kz/article/201579-presechen. 20.12.2022.
11 The military intercepted a drone over Arys // https://ru.sputnik.kz/20190722
/voennye-perekhvatili-dron-nad-arysyu-11014334.html. 20.12.2022.
12 Houthi drone crashes into Saudi school in Asir province //
https://thearabweekly.com/houthi-drone-crashes-saudi-school-asir-province. 20.12.2022.
13 Mircea Cr. Light Show in China May Have Been Sabotaged, Dozens of
Drones Fell From the Sky // https://www.autoevolution.com/news/light. 20.11.2020.
14 Why Drones Are Becoming More Popular Each Day //
https://www.entrepreneurshipinabox.com/22657/why-drones-are-becoming. 20.11.2020.
15 Utebayeva D., Almagambetov A., Alduraibi M. et al. Multi-label UAV
sound classification using Stacked Bidirectional LSTM // Procced. 4th internat. conf.
on Robotic Computing (IRC). – Taichung, 2020. – P. 453-458.
16 Utebayeva D., Alduraibi M., Ilipbayeva L. et al. Stacked BiLSTM - CNN
for Multiple label UAV sound classification // Procced. 4th internat. conf. on Robotic
Computing (IRC). - Taichung, 2020. – P. 470-474.
17 McFarland M. Airports Scramble to Handle Drone Incidents //
https://www.cnn.com/2019/03/05/tech/airports-drones/index.html. 15.08.2021.
18 Li J., Dai W., Metze F. et al. A comparison of Deep Learning methods for
environmental sound detection // 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). – New Orleans, 2017. – P. 126-130.
19 Ranft R. Natural sound archives: past, present and future // Anais da
Academia Brasileira de Ciências. – 2004. – Vol. 76, Issue 2. – P. 455-465.
20 Utebayeva D., Ilipbayeva L., Matson E.T. Practical Study of Recurrent
Neural Networks for Efficient Real-Time Drone Sound Detection: A Review //
Drones. – 2023. – Vol. 7, Issue 1. – P. 26-1-26-25.
21 Drone Crash Database / https://dronewars.net/drone-crash. 04.02.2021.
22 Koslowski R., Schulzke M. Drones along Borders: Border Security UAVs
in the United States and the European Union // https://www.albany.edu. 04.02.2021.
23 Shi W., Arabadjis G., Bishop B. et al. Detecting, Tracking, and Identifying
Airborne Threats with Netted Sensor Fence // In book: Sensor Fusion Foundation and
Applications: In Tech, 2011. – 238 p.
24 Samaras S., Diamantidou E., Ataloglou D. et al. Deep Learning on Multi
Sensor Data for Counter UAV Applications – A Systematic Review // Sensors. –
2019. – Vol. 19. – P. 4837-1-4837-35.
25 Ezuma M., Erden F., Anjinappa C.K. et al. Micro-UAV Detection and
Classification from RF Fingerprints Using Machine Learning Techniques // Proceed.
of the IEEE AERO. – Big Sky, MT, USA, 2019. – P. 1-13.
26 Jeon S., Shin J.W., Lee Y.J. et al. Empirical Study of Drone Sound
Detection in Real-Life Environment with Deep Neural Networks // Procced. 25th European
Signal Processing conf. (EUSIPCO). – Kos, 2018. – P. 1858-1862.
27 Delivery drone crashes into power lines, causes outage //
https://www.theregister.com/2022/09/30/delivery_drone_crashes_into. 25.08.2021.
28 When Amazon drones crashed, the company told the FAA to go fly a kite //
https://www.businessinsider.com/amazon-prime-air-faa-regulators. 25.08.2021.
29 List of unmanned aerial vehicles-related incidents //
https://en.wikipedia.org/wiki/List_of_unmanned_aerial_vehicles-related. 25.08.2021.
30 Amazon Drone Project Layoffs // https://www.idtechex.com/en/research-
article/amazon-drone-project-layoffs/25951. 25.08.2021.
31 Utebayeva D. Effectiveness of the system of unmanned aerial vehicles
detection on the basis of acoustic signature // Vestnik KazNRTU. – 2020. – Vol. 4,
Issue 140. – P. 300-307.
32 Al-Emadi S. et al. Audio Based Drone Detection and Identification using
Deep Learning // Procced. 15th internat. Wireless Communications & Mobile
Computing conf. (IWCMC). – Tangier, 2019. – P. 459-464.
33 Shi L., Ahmad I., He Y.J. et al. Hidden Markov Model based Drone Sound
Recognition using MFCC Technique in Practical Noisy Environments // Journal of
Communications and Networks. – 2018. – Vol. 20, Issue 5. – P. 509-518.
34 Siriphun N., Kashihara S. et al. Distinguishing Drone Types Based on
Acoustic Wave by IoT Device // Procced, 22nd internat. Computer Science and
Engineering conf. (ICSEC). – Chiang Mai, 2018. – P. 1-4.
35 Anwar M.Z., Kaleem Z., Jamalipour A. Machine Learning Inspired Sound-
Based Amateur Drone Detection for Public Safety Applications // IEEE Transactions
on Vehicular Technology. – 2019. – Vol. 68, Issue 3. – P. 2526-2534.
36 Liu H., Wei Zh., Chen Y. et al. Drone Detection based on An Audio-
assisted Camera Array // Procced. IEEE 3rd internat. conf. on Multimedia Big Data. –
Laguna Hills, CA, USA, 2017. – P. 402-406.
37 Bernardini A., Mangiatordi F., Pallotti E. et al. Drone detection by acoustic
signature identification // Electronic Imaging. – 2017. – Issue 10. – P. 60-64.
38 Shaikh F. Getting Started with Audio Data Analysis using Deep Learning //
https://www.analyticsvidhya.com/blog/2017/08/audio-voice-processing. 14.03.2021.
39 Sahidullah M., Saha G. Design, analysis and experimental evaluation of
block based transformation in MFCC computation for speaker recognition // Speech
Communication. – 2012. – Vol. 54, Issue 4. – P. 543-565.
40 Wang Y., Fagian F.E., Ho K.E. et al. A Feature Engineering Focused
System for Acoustic UAV Detection // Procced. 5th IEEE internat. conf. on Robotic
Computing (IRC). – Taichung, 2021. – P. 125-130.
41 Kim B., Jang B., Lee D. et al. CNN-based UAV Detection with Short Time
Fourier Transformed Acoustic Features // Procced. internat. conf. on Electronics,
Information, and Communication (ICEIC). – Barcelona, 2020. – P. 1-3.
42 Shi Z., Zheng L., Zhang X. et al. CNN-Based Electronic Camouflage Audio
Restoration Mechanism // Procced. 5th internat. conf. on Systems and Informatics (ICSAI).
– Nanjing, 2018. – P. 412-416.
43 Choi K., Joo D., Kim J. Kapre: On-GPU Audio Preprocessing Layers for a
Quick Implementation of Deep Neural Network Models with Keras //
https://doi.org/10.48550/arXiv.1706.05781. 10.04.2021.
44 Nijim M., Mantrawadi N. Drone classification and identification system by
phenome analysis using data mining techniques // Procced. IEEE sympos. on
Technologies for Homeland Security (HST). – Waltham, MA, 2016. – P. 1-5.
45 Yue X., Liu Y., Wang J. et al. Software defined radio and wireless acoustic
networking for amateur drone surveillance // IEEE Commun. Mag. – 2018. – Vol. 56,
Issue 4. – P. 90-97.
46 Seo Y., Jang B., Im S. Drone detection using convolutional neural networks
with acoustic STFT features // Procced. IEEE internat. conf. on Advanced Video and
Signal Based Surveillance (AVSS). – Auckland, 2018. – P. 1-6.
47 Matson E., Yang B., Smith A. et al. UAV detection system with multiple
acoustic nodes using machine learning models // Procced. 3rd IEEE internat. conf. on
Robotic Computing (IRC). – Naples, 2019. – P. 493-498.
48 Utebayeva D., Ilipbayeva L. Study of acoustic signals in the problems of
detecting unmanned aerial vehicles for different models // Vestnik AUES. – 2020. –
№3(50). – P. 38-46.
49 Salehinejad H., Sankar S., Barfett J. et al. Recent Advances in Recurrent
Neural Networks // https://arxiv.org/pdf/1801.01078.pdf. 22.02.2018.
50 Sak H., Senior A.W., Beaufays F. Long Short-Term Memory Recurrent
Neural Network Architectures for Large Scale Acoustic Modeling // Procced. conf. of
the internat. Speech Communication Association. – Thailand and Penang (Malaysia),
2014. – P. 338-342.
51 Brownlee J. The Long Short-Term Memory Network // In book: Long
Short-Term Memory Networks with Python. – USA, 2019. – Р. 10-12.
52 Thakur D. LSTM and its equations // https://medium.com/@divyanshu132/
lstm-and-its-equations-5ee9246d04af. 10.11.2019.
53 Cho K., Van Merrienboer B., Gulcehre C. et al. Learning phrase
representations using rnn encoder-decoder for statistical machine translation //
Proceed. of the conf. on Empirical Methods in Natural Language Processing
(EMNLP). – Doha, 2014. – P. 1724-1734.
54 Audio signal // https://en.wikipedia.org/wiki/Audio_signal. 10.05.2019.
55 Magalhäes E., Jacob J., Nilsson N. et al. Physics-based Concatenative
Sound Synthesis of Photogrammetric models for Aural and Haptic Feedback in
Virtual Environments // Procced. IEEE conf. on Virtual Reality and 3D User
Interfaces Abstracts and Workshops (VRW). – Atlanta, GA, 2020. – P. 376-379.
56 Sound Waves // https://www.pasco.com/products/guides/sound. 14.04.2019.
57 What is a Sound Wave in Physics? // https://www.pasco.com. 14.04.2019.
58 Sound // https://en.wikipedia.org/wiki/Sound. 14.04.2019.
59 Sine Wave: Definition, What It's Used For, Example, and Causes //
https://www.investopedia.com/terms/s/sinewave.asp. 04.04.2022.
60 Sine wave // https://en.wikipedia.org/wiki/Sine_wave. 14.04.2019.
61 Discrete time and continuous time // https://en.wikipedia.org/wiki. 14.04.2019.
62 Introduction to Audio Signal Processing // https://www.coursera.org
/learn/audio-signal-processing/lecture/fHha1/introduction-to-audio. 14.04.2019.
63 Sampling (signal processing) // https://en.wikipedia.org/wiki. 14.04.2019.
64 Wu Y., Krishnan S. Classification of knee-joint vibroarthrographic signals
using time-domain and time-frequency domain features and least-squares support
vector machine // Procced. 16th internat. conf. on Digital Signal Processing. –
Santorini, Greece, 2009. – P. 1-6.
65 Time series // https://en.wikipedia.org/wiki/Time_series. 25.08.2019.
66 Window function // https://en.wikipedia.org/wiki/Window. 25.08.2019.
67 Bahoura M. FPGA Implementation of Blue Whale Calls Classifier Using
High-Level Programming Tool // Electronics. – 2016. – Vol. 5. – P. 8-1-8-19.
68 time_frequency // https://kapre.readthedocs.io/en/latest/time. 25.08.2019.
69 Types of Neural Networks: Applications, Pros, and Cons //
https://www.knowledgehut.com/blog/data-science/types-of-neural. 25.08.2019.
70 Graves A., Mohamed A., Hinton G. Speech recognition with deep recurrent
neural networks // Procced. IEEE internat. conf. on Acoustics, Speech and Signal
Processing. – Vancouver, BC, 2013. – P. 6645-6649.
71 Jie Zhao. Anomalous Sound Detection Based on Convolutional Neural
Network and Mixed Features // Journal of Physics: Conference Series. – 2020. –
Vol. 1621. – P. 1-8.
72 Afridi T.H., Alam A., Khan N. A Multimodal Memes Classification: A
Survey and Open Research Issues // In book: Innovations in Smart Cities
Applications. Cham: Springer, 2020. – Vol. 4. – P. 1451-1466.
73 What Is a Convolutional Neural Network? // https://www.mathworks.com/
discovery/convolutional-neural-network-matlab.html. 10.02.2019.
74 Shu H., Song Y., Zhou H. Time-frequency Performance Study on Urban
Sound Classification with Convolutional Neural Network // Procced. IEEE Region 10
conf. (Tencon 2018). – Jeju (Korea), 2018. – P. 1713-1717.
75 Momo N., Abdullah, Uddin J. Speech Recognition Using Feed Forward
Neural Network and Principle Component Analysis // Advances in Intelligent
Systems and Computing: procced. internat. sympos. on Signal Processing and
Intelligent Recognition Systems. – Manipal, 2018. – P. 228-239.
76 Segarceanu S., Suciu G., Gavat I. Neural Networks for Automatic
Environmental Sound Recognition // Procced. internat. conf. on Speech Technology
and Human-Computer Dialogue (SpeD). – Bucharest, 2021. – P. 7-12.
77 Shamsuddin N., Mustafa M.N., Husin S. et al. Classification of heart sounds
using a multilayer feed-forward neural network // Procced. Asian conf. on Sensors
and the internat. conf. on New Techniques in Pharmaceutical and Biomedical
Research. – Kuala Lumpur, 2005. – P. 87-90.
78 Main Types of Neural Networks and Its Applications – Tutorial. 13 July
2020 // https://towardsai.net/p/machine-learning/main-types-of-neural. 20.12.2022.
79 Mahyub M., Souza L.S., Batalo B. et al. Environmental Sound
Classification Based on CNN Latent Subspaces // Procced. internat. Workshop on
Acoustic Signal Enhancement (IWAENC). – Bamberg, 2022. – P. 1-5.
80 Wu Y., Lee T. Enhancing Sound Texture in CNN-based Acoustic Scene
Classification // Procced. IEEE internat. conf. on Acoustics, Speech and Signal
Processing (ICASSP-2019). – Brighton, 2019. – P. 815-819.
81 Wang Y., Chu Z., Ku I. et al. A Large-Scale UAV Audio Dataset and
Audio-Based UAV Classification Using CNN // Procced. 6th IEEE internat. conf. on
Robotic Computing (IRC). – Milan, 2022. – P. 186-189.
82 Bubashait M., Hewahi N. Urban Sound Classification Using DNN, CNN &
LSTM a Comparative Approach // Procced. internat. conf. on Innovation and
Intelligence for Informatics, Computing, and Technologies (3ICT). – Zallaq, Bahrain,
2021. – P. 46-50.
83 Noman F., Ting C.-M., Salleh S.-H. et al. Short-segment Heart Sound
Classification Using an Ensemble of Deep Convolutional Neural Networks //
Procced. IEEE internat. conf. on Acoustics, Speech and Signal Processing (ICASSP-
2019). – Brighton, 2019. – P. 1318-1322.
84 Lu R., Duan Z., Zhang C. Multi-Scale Recurrent Neural Network for Sound
Event Detection // Procced. IEEE internat. conf. on Acoustics, Speech and Signal
Processing (ICASSP). – Calgary, AB, 2018. – P. 131-135.
85 Parascandolo G., Huttunen H., Virtanen T. Recurrent neural networks for
polyphonic sound event detection in real life recordings // Procced. IEEE internat.
conf. on Acoustics, Speech and Signal Processing (ICASSP). – Shanghai, 2016. –
P. 6440-6444.
86 Semmad A., Bahoura M. Long Short Term Memory Based Recurrent
Neural Network for Wheezing Detection in Pulmonary Sounds // Procced. IEEE
internat. Midwest sympos. Circuits and Systems (MWSCAS). – Lansing, MI, 2021. –
P. 412-415.
87 Kamepalli S., Rao B.S. et al. Multi-Class Classification and Prediction of
Heart Sounds Using Stacked LSTM to Detect Heart Sound Abnormalities // Procced.
3rd internat. conf. for Emerging Technology (INCET). – Belgaum, 2022. – P. 1-6.
88 Feng Y., Liu Z.J., Ling Y. et al. A Two-Stage LSTM Based Approach for
Voice Activity Detection with Sound Event Classification // Procced. IEEE internat.
conf. on Consumer Electronics (ICCE). – Las Vegas, NV, 2022. – P. 1-6.
89 Wang Y., Liao W., Chang Y. Gated Recurrent Unit Network-Based Short-
Term Photovoltaic Forecasting // Energies. – 2018. – Vol. 11. – P. 2163-1-2163-14.
90 Fan T., Zhu J., Cheng Y. et al. A New Direct Heart Sound Segmentation
Approach using Bi-directional GRU // Procced. 24th internat. conf. on Automation
and Computing (ICAC). – Newcastle Upon Tyne, UK, 2018. – P. 1-5.
91 Peng N. et al. Environment Sound Classification Based on Visual Multi-
Feature Fusion and GRU-AWS // IEEE Access. – 2020. – Vol. 8. – P. 191100-
191114.
92 Tsalera E., Papadakis A., Samarakou M. Comparison of Pre-Trained
CNNs for Audio Classification Using Transfer Learning // J. Sens. Actuator Netw. –
2021. – Vol. 10. – P. 72-1-72-22.
APPENDIX A
Table 1 – Model and layers of the CNN algorithm based on the publication

Model        | Layer                           | Quantity | Value
3_CNN_by_[1] | Flatten                         |          |
             | Fully Connected Neural Network  | 1        | 100
             | Activation function: Relu       |          |
             | Dropout                         | 1        | 0.7
             | Fully Connected Neural Network  | 1        | 3
             | Activation function: softmax    |          |
2_CNN_by_[2] | Flatten                         |          |
             | Fully Connected Neural Network  | 1        | 10
             | Activation function: Sigmoid    |          |
             | Dropout                         | 1        | 0.1
             | Fully Connected Neural Network  | 1        | 3
             | Activation function: softmax    |          |

Note – Compiled according to the source [48, p. 40]
APPENDIX B
APPENDIX C
APPENDIX D
Figure D.1 – Studying the sounds of background objects and UAVs with 6 classes in
the Time domain
APPENDIX E
a – audio recordings of various original lengths; b – the process of studying the
filter bank, Melspectrogram and MFCC coefficients with various hyperparameters
APPENDIX F
APPENDIX G
a – filter bank coefficients for the 6-class UAV sounds and ambient noises with dimension
40 by 100; b – decibel spectrograms for UAV sound signals of the proposed system
APPENDIX H
APPENDIX I
a – 4-class database recognition experiment; b – 3-class database recognition experiment
APPENDIX K
APPENDIX L
APPENDIX M