Classification of Marine Vessels Using Sonar-Data Neural Network
Abstract
A submarine navigator has to keep track of surrounding ships in order to avoid
collision and to gain a tactical advantage. This is currently done manually by
a sonar operator, trained to listen through the water and identify ship types by
the sound they emit.
This project presents a review and implementation of different solutions to
the problem of audio classification. The goal of this work was to build a system
capable of helping submarine navigators identify surrounding obstacles and ships
based on the sound recorded by the sonar on board the submerged submarine.
The research aims to uncover the best combination of techniques that can be
used for this classification task. This project concerns both the field of signal
analysis and the field of artificial intelligence, as the system comprises two parts.
The first is a method of extracting informative features from sonar data captured
by the submarine. The second feeds the processed data into a neural
network (NN), which provides a classification of the ship's type.
In this project a system has been developed in order to experiment with
a variety of feature extraction techniques and neural network structures to find
a solution suitable for the submarine sonar classification problem. The system
placed 97.3% of the ships in the correct category when using the
highest scoring combination of a feature extraction technique and a neural
network. The best combination found was the Mel Frequency Cepstral Coefficients
feature extraction technique and a standard feed-forward neural network.
Summary (Sammendrag)
A submarine must always keep track of all ships in its vicinity in order to avoid
collision and to maintain a tactical advantage from the deep. This is currently done
manually by a sonar operator who is trained to listen for sounds in the water and
identify different types of vessels based on the sound they make.
This project presents different methods for classifying audio data.
The goal of this work is to build a system that can assist the navigator on
board a submarine in identifying surrounding obstacles and ships based on sonar
data.
The research aims to uncover the best combination of techniques
that can solve this classification task. The project concerns both the field of
signal analysis and the field of artificial intelligence, as the system is built from
two parts. The first is a method for extracting informative features from the raw
data from the sonar on the submarine. The second part of the system is a neural
network that is fed the features found in the first part. Once trained, this network
can be used to classify different ship types.
The system is designed to test a range of combinations of
preprocessing and neural networks in order to find a solution suited to classifying
sonar data into ship types. The system matches the correct sound to the correct
vessel in 97.3% of cases when run with the configuration that achieved the
highest classification accuracy. The best combination found was a
preprocessing technique called Mel Frequency Cepstral Coefficients and a standard
neural network.
Preface
This thesis was written for the Department of Computer and Information Science
(IDI) at the Norwegian University of Science and Technology (NTNU).
The project has been planned in cooperation with Kongsberg Defence &
Aerospace AS (KDA), and is a continuation of the preliminary study concerning
the same problem. KDA have developed the sonar system currently on board the
Norwegian Navy’s submarines, and have access to simulation software that can
be used to create training data.
The problem to be solved was thought of by the author after spending the
compulsory military service on a submarine as a sonar-operator.
The readers of this study are assumed to have a background in computer
science.
Acknowledgment
I would like to acknowledge the support from the faculty IDI at NTNU, in
particular my supervisor Helge Langseth, who helped me make this project possible
by being an invaluable resource person in the field of artificial intelligence.
Kongsberg Defence & Aerospace AS and my contact there, Simen Tronrud,
have helped me solve the data gathering challenge and granted me access to their
simulation software the following semester. Without this prospect of actual data
this project would not have been possible.
Håkon Gimse
Trondheim, March 26, 2017
Contents
1 Introduction
1.1 Background and Motivation
1.2 Goals and Research Questions
1.3 Research Method
1.4 Thesis Structure
2 Background Theory
2.1 Sonar Data
2.1.1 Cavitation
2.2 Signal Processing
2.2.1 Time-frequency domain
2.2.2 Short-time Fourier Transform
2.2.3 Wavelet Transform
2.2.4 Mel-frequency Cepstral Coefficients
2.2.5 Spectral Density Estimation
2.3 Artificial Neural Networks
2.3.1 Activation Functions
2.3.2 Training
2.3.3 Overfitting
2.3.4 Standard Feed-forward Neural Network
2.3.5 Convolutional Neural Networks
2.3.6 Recurrent Neural Networks
A Acronyms
Bibliography
List of Figures
3.1 Architecture
3.2 Spectrogram of a ferry sample.
3.3 Spectrogram of a destroyer sample.
Chapter 1
Introduction
This work explores how to use artificial intelligence to recognize patterns in sound.
The goal was to create a prototype of a system that could be used on board a
submarine to assist the sonar operator. In order to build such a system, both the
field of artificial intelligence and the field of signal processing are studied, as
the system has to use state-of-the-art techniques from both fields. This
chapter discusses the background and motivation for the work, followed
by a presentation of the goals of the project and a brief introduction to the
scientific approach taken. Finally, the structure of the rest of the thesis is
outlined.
only be able to listen in one direction at a time, whereas the proposed system
could observe every angle at once. It will also be able to alert the operator when
it believes it has classified a nearby ship correctly. The operator could then either
confirm or refute this classification, and the system could use this feedback to
learn the different classes further.
The solution to this problem is a system that takes raw acoustic data as
input, performs processing and feature extraction, and finally returns a class.
This ideal solution can be divided into two sub-systems, where the output of
the first is the input of the second, as illustrated in Figure 1.1. With this
modular approach, a solution to one of the problems can be implemented
independently of the solution used for the other, and the two can still work together.
Figure 1.1: A figure showing how the problem can be divided into two sub-
problems
The first problem, the preprocessing and feature extraction from raw audio
data, aims to find a more informative representation of the data, either by
extracting key elements in the data or by finding new information through analysis.
Often the main goal of this task is to find a smaller data set to consider during
classification. This considerably decreases the complexity of the classification task.
There is a large community devoted to this field, and countless papers have been
published on the subject. The data this project addresses is different from what most
studies of sound processing tend to use. The most obvious difference is
that sound travels at a different speed in water than in air. This study aims to
answer whether the techniques used to extract features from environmental
sound on land can also be used on data sampled under water.
The preprocessing techniques also have to be robust to noise. The ocean
contains an abundance of biological noise from whales, dolphins and lesser life forms.
It is also a known issue that the ocean between the sound-emitting ship and the
submarine often contains different layers of water with different salt concentration,
1.2 Goals and Research Questions
RQ1 What are the existing solutions for classification of continuous audio-data
using neural networks?
RQ2 How are the solutions found in RQ1 processing the data before feeding it
to the neural network?
RQ3 How can the findings be used when creating a classification-system in a
new environment?
RQ4 How will different solutions, or combinations of solutions, affect the overall
performance of the system?
Chapter 2
Background Theory
This chapter provides insight into the fields of audio analysis and neural networks,
and the techniques this project concerns. It also contains a more
technical description of the sonar system, the data sampled and the domain this
project focuses on, namely underwater environmental sounds.
the cavitation frequency of a nearby vessel. This is unique for most ships and
can yield information about the size, speed and type of the ship. Intermittent
sound sources can also be heard, e.g. a flushing toilet or a wrench hitting the
hull of the ship. A passive sonar will be used in this project, as this is the type
the Norwegian Navy is using.
There are several performance factors to take into account when using a sonar
for navigation. The speed of sound is the most important, as this approach is
based on the propagation of sound. Sound travels more slowly in freshwater
than in saltwater, and even in saltwater there are great differences determined
by the water's bulk modulus and mass density. Bulk modulus is a measure of
a substance's resistance to uniform compression; here it is determined by the
temperature, pressure and salinity. Salinity is the saltiness of a body of water.
The effect of the mass density of water is relatively small compared to that
of the bulk modulus. The speed of sound in air is approximately 340 m/s,
while in saltwater it is approximately 1500 m/s. The formula used to calculate the
precise speed of sound in water is presented by The National Physical Laboratory
[2005] in Equation 2.1, where D is the depth in kilometers, S is the salinity in
parts per thousand and t = T/10, where T is the temperature in degrees Celsius.
c(D, S, t) is the speed of sound at depth D, and c(0, S, t) is the speed of sound
at the surface, given by Equation 2.2.
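As a rough illustration of how depth, salinity and temperature enter the calculation, the sketch below assumes that Equations 2.1 and 2.2 are the Coppens (1981) equations as listed by the National Physical Laboratory. The coefficients are quoted from that formulation rather than from this text, so they should be checked against the cited source. The function name `speed_of_sound` is hypothetical.

```python
def speed_of_sound(depth_km: float, salinity_ppt: float, temp_c: float) -> float:
    """Speed of sound in seawater (m/s), ASSUMING the Coppens (1981) equations."""
    t = temp_c / 10.0          # t = T / 10, as defined in the text
    s = salinity_ppt - 35.0    # salinity offset from 35 parts per thousand
    d = depth_km               # D, depth in kilometers

    # Assumed form of Equation 2.2: speed of sound at the surface, c(0, S, t).
    c_surface = (1449.05 + 45.7 * t - 5.21 * t**2 + 0.23 * t**3
                 + (1.333 - 0.126 * t + 0.009 * t**2) * s)

    # Assumed form of Equation 2.1: depth correction giving c(D, S, t).
    return (c_surface
            + (16.23 + 0.253 * t) * d
            + (0.213 - 0.1 * t) * d**2
            + (0.016 + 0.0002 * s) * s * t * d)


surface_salt = speed_of_sound(0.0, 35.0, 10.0)    # saltwater at the surface
surface_fresh = speed_of_sound(0.0, 0.0, 10.0)    # freshwater at the surface
deep_salt = speed_of_sound(1.0, 35.0, 10.0)       # saltwater at 1 km depth
```

Under these assumptions the freshwater value comes out lower than the saltwater value, in line with the statement above, and the speed increases with depth.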
2.1.1 Cavitation
A propeller absorbs the torque from the ship's motor at given revolutions. In
turn, the propeller converts this to thrust, driving the ship through the water.
The International Institute of Marine Surveying [2015] states that according to
Bernoulli's law the passage of a hydrofoil (propeller blade section) through the
water causes a positive pressure on the face of the blade and a negative pressure on
its back. It is the resolution of these pressures that results in the torque requirement
and the thrust development of the propeller. The negative pressure causes any
gas in solution in the water to evolve into bubbles, and when these collapse a sound
is emitted. The repetition of such sounds creates an acoustic signature unique to
most ships.
$f_s > 2 f_n$ (2.3)
We use signal processing to manipulate a signal to change its characteristics
Figure 2.2: National Instruments [2015] demonstrates how a low sampling rate
will lead to an inaccurate digital approximation of a signal.
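The effect of too low a sampling rate can be reproduced numerically. The sketch below reads Equation 2.3 with f_s as the sampling rate and f_n as the highest frequency present in the signal; the surrounding definitions are missing here, so this reading is an assumption. It samples a synthetic 300 Hz tone at two rates and reports the strongest frequency found by an FFT. The function name and the chosen rates are illustrative only.

```python
import numpy as np


def dominant_frequency(tone_hz: float, sample_rate: int, seconds: float = 1.0) -> float:
    """Return the strongest frequency (Hz) observed after sampling a pure tone."""
    n = int(sample_rate * seconds)
    times = np.arange(n) / sample_rate
    samples = np.sin(2 * np.pi * tone_hz * times)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


# 1024 Hz is more than twice 300 Hz, so the criterion is satisfied.
well_sampled = dominant_frequency(300, 1024)
# 400 Hz is less than twice 300 Hz, so the tone appears at a false (aliased) frequency.
under_sampled = dominant_frequency(300, 400)
```

With the criterion satisfied the tone is found at 300 Hz; when it is violated the same tone appears at 100 Hz, which is the kind of inaccurate approximation the figure describes.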
wavelet found, namely the scale and the translation of the wavelet. The translation
decides where in time the wavelet is positioned as it moves through the signal, and the
scale decides how much the wavelet is stretched or compressed. The advantage of using
wavelets over sinusoidal functions is that they are short, and therefore more suitable
when a longer signal is divided into short segments, as more wavelets can be used in each
segment. The Fourier transform, which uses continuous sinusoidal signals, always
has to cut these short at the cost of resolution in the approximation.
1. Transform the signal into the frequency domain. This is typically done with a
Fourier transform on segmented data, as in the STFT.
2. Map the frequency domain representation found in the first step onto the
mel scale described below.
The mel scale is a scale created to measure the perceived pitch instead of
the actual pitch. It was introduced to account for the human ear's inability to
differentiate correctly between pitches in the highest bands of hearing. The mel scale
describes the human auditory system on a linear scale, on which we can express the
difference between the pitches we perceive. The relationship between the mel scale and
the frequency scale is shown in Figure 2.5.
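To make the mapping concrete, the sketch below converts between hertz and mels using one widely used formula, m = 2595 log10(1 + f/700). The text does not state which mel formula was used, so this choice is an assumption, and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed 2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers far fewer mels at high frequencies than at low ones.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(8100.0) - hz_to_mel(8000.0)
```

With this formula 1000 Hz maps to roughly 1000 mels, and a fixed step in hertz shrinks on the mel scale as the frequency rises, which reflects the reduced pitch resolution in the highest bands.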
A spectrogram is a visualization of the spectrum of frequencies in a signal as
they vary over time. An example of this is given in Figure 2.4.
Figure 2.5: The mel scale. A scale based on the perceived pitch.
2.3 Artificial Neural Networks
weight matrix and an input matrix for a speed advantage as matrix multiplication
can be computationally inexpensive.
$y = \sum_i w_i x_i$ (2.4)
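The weighted sum in Equation 2.4 can be computed for every node in a layer at once with a single matrix product. The following is a minimal sketch using NumPy; the function name, the shapes and the numbers are made up for illustration and are not taken from the system described here.

```python
import numpy as np


def layer_input(inputs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted sums for all nodes of a layer in one matrix product.

    inputs  has shape (batch, n_in)
    weights has shape (n_in, n_out); column j holds the weights of node j
    """
    return inputs @ weights


x = np.array([[1.0, 2.0, 3.0]])        # one example with three input values
w = np.array([[0.5, -1.0],
              [0.25, 0.0],
              [0.0, 2.0]])             # three inputs feeding two nodes
sums = layer_input(x, w)

# The same result written as the explicit sum over i from Equation 2.4.
explicit = [sum(w[i, j] * x[0, i] for i in range(3)) for j in range(2)]
```

Both forms give the same values; the matrix form simply lets an optimized linear algebra routine do the work for all nodes and all examples in a batch.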
Name                                    Equation
TanH                                    $f(x) = \frac{2}{1 + e^{-2x}} - 1$
Sigmoid (a.k.a. logistic or soft step)  $f(x) = \frac{1}{1 + e^{-x}}$
Rectified Linear Unit (ReLU)            $f(x) = 0$ for $x < 0$; $f(x) = x$ for $x \ge 0$
Sinusoid                                $f(x) = \sin(x)$
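The four activation functions in the table translate directly into code. The sketch below is a plain Python rendering of the equations as listed; it is meant for illustration and says nothing about how the functions are implemented in the system itself.

```python
import math


def tanh(x: float) -> float:
    """TanH written in the form used in the table."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def sigmoid(x: float) -> float:
    """Sigmoid, also known as the logistic function or soft step."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    """Rectified Linear Unit: zero for negative input, identity otherwise."""
    return x if x >= 0 else 0.0


def sinusoid(x: float) -> float:
    """Sinusoid activation."""
    return math.sin(x)
```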
2.3.2 Training
To be able to learn a concept, the neural network has to be trained to do so.
Training a neural network means tweaking parameters until it yields the desired
results. The parameters that can be changed depend on the type of network,
but mostly we talk about changing the weights of the connections in a network.
Doing so changes the influence one node has on the node on the other side
of the connection. The goal is for the network to learn which
connections usually supply it with information that corresponds with the class,
and to prioritize these. This way the neural network is able to learn patterns the
programmer did not explicitly introduce.
An example of an algorithm for training a neural network is the backpropagation
algorithm. This algorithm calculates the classification error for each
training example by comparing the prediction given by the network with the label
of the example. Then the gradient of a loss function is calculated with respect to
the weights in the network. This error gradient is then propagated back through
the network and the weights are updated in order to minimize the loss.
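The update rule can be illustrated with the smallest possible case: a single sigmoid neuron trained by gradient descent on a squared-error loss. In a multi-layer network the same chain-rule step is repeated layer by layer. This is a generic sketch and not the training code used in this work; the learning rate, the number of epochs and the toy data (the logical OR function) are assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(examples, labels, learning_rate=0.5, epochs=2000):
    """Fit a single sigmoid neuron with gradient descent on squared error."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(examples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = sigmoid(z)
            error = prediction - target                     # gradient of 0.5 * error**2
            delta = error * prediction * (1 - prediction)   # chain rule through the sigmoid
            # Move every weight against its gradient to reduce the loss.
            weights = [w - learning_rate * delta * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * delta
    return weights, bias


def predict(weights, bias, x) -> float:
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


examples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 1]                                       # logical OR
weights, bias = train_neuron(examples, labels)
```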
2.3.3 Overfitting
Overfitting is a term used to describe a network that has become too specifically
fitted to the training data. The aim of a predictive model like a neural network
is to train on a diverse data set from each class in order to gain a general concept
of each class. When this is done right, the model can correctly predict data it has never
seen before, even if it varies from the training data in the same class. If
the model fits the training data too well to be able to generalize, it is said to be
overfitting.
node to the next layer. This filter is used iteratively through the incoming nodes,
by changing the start- and end-point of the connections, to create new signals.
This is shown in Figure 2.7, presented by Nielsen [2015]. The configuration of
these filters is trained to capture new informative features.
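The sliding of one shared filter across an input layer can be sketched in a few lines. The example below works on a one-dimensional layer for brevity; the filter width of 3, the filter values and the input values are made up for illustration.

```python
def slide_filter(inputs, filter_weights):
    """Apply one shared filter at every position of a 1-D input layer."""
    width = len(filter_weights)
    return [
        sum(w * x for w, x in zip(filter_weights, inputs[start:start + width]))
        for start in range(len(inputs) - width + 1)
    ]


input_layer = [float(i % 4) for i in range(28)]   # 28 made-up input activations
edge_filter = [1.0, 0.0, -1.0]                    # hypothetical filter of width 3
hidden_layer = slide_filter(input_layer, edge_filter)
```

Because the same three weights are reused at every position, a layer of 28 inputs yields 26 hidden nodes while only three weights have to be trained.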
It can also be useful to use several different filters between the same layers,
as more features can be identified. Using several filters together with the sliding
technique often leads to a huge expansion in the number of nodes in the layer
following these filters. This is illustrated in Figure 2.8, where three different
filters are used. Too many nodes in a layer can be a problem, as the computational cost
of training increases. To solve this problem we use a technique called pooling to reduce
the dimensionality of the data. This is done by pooling together a number of nodes in
the incoming layer, hence the name, and extracting a single value from them
to pass on. A popular example is max-pooling. Here the highest activation value
in the pool is simply passed on and all the other values are disregarded. This is
illustrated by Britz [2015], here shown in Figure 2.9, where each number on the
left is the activation level of a node, and the number of the corresponding color on
the right is the resulting activation after the pooling layer.
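Max-pooling is simple enough to write out directly. The following sketch pools non-overlapping 2 by 2 blocks of a small grid of activations; the pool size and the numbers are chosen for illustration and are not taken from Figure 2.9.

```python
def max_pool(grid, size=2):
    """Keep only the largest activation in each non-overlapping size x size pool."""
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, len(grid[0]) - size + 1, size)
        ]
        for r in range(0, len(grid) - size + 1, size)
    ]


activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 6, 4, 8],
]
pooled = max_pool(activations)
```

The 16 activations are reduced to 4, one per pool, so the following layer has a quarter as many incoming nodes.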
The convolutional network methodology has recently proved very useful for
complex classification tasks such as image recognition. A testament to this is the
ImageNet Large Scale Visual Recognition Challenge, held every year. It is a
competition in image recognition where several teams train their algorithms on 1.2
million images in 1,000 categories. An algorithm has to have the correct class
among its top five predictions for a prediction to count as correct. Russakovsky et al.
[2015] present results where almost every high-ranking algorithm is based on
convolutional neural networks. They also show that the performance of the best
algorithm increases considerably every year (by between 3 and 5 percent). This shows
that this is a fast-moving field.
Figure 2.7: A single matrix of weights, or filter, is used to map several input
nodes to a single hidden node. Here the filter is moved one step to the right. It
will continue using the same weights throughout the entire input layer.
Figure 2.8: The expansion of nodes becomes apparent when several filters are
used. Here three different filters are used on a layer with 28 input nodes.
Figure 2.9: A layer where features are condensed based on the pooling function,
here max-pooling
Figure 2.10: A recurrent neural network. The connections that create the recur-
rence are denoted as red arrows.
Chapter 3
Architecture and Implementation
3.1 Architecture
An illustration of the system's architecture is presented in Figure 3.1. The system
has a module-based architecture, where several independent modules form the
system. This architecture facilitates further work on the system, as several different
techniques can be developed and used with the system without having to rewrite
any of the existing modules.
Figure 3.1: The module-based architecture of the system developed in this work.
Arrows denote the data flow and require a specific format. A new module would
have to comply with this format to be used with the rest of the system.
to and used by the next module. The data loader module could easily be replaced
by a similar module where continuous audio data is used instead of sound files.
If the system is deployed on board a submarine this would be more natural, as sound
is recorded continuously.
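The idea of interchangeable modules can be sketched as three functions that only agree on the format of the data passed between them. All names below (`run_pipeline`, `file_loader`, `mean_extractor`, `threshold_classifier`) are hypothetical stand-ins and do not correspond to modules in the actual system.

```python
from typing import Callable, Iterable, List, Sequence

Signal = Sequence[float]      # raw audio samples from one segment
Features = List[float]        # feature vector handed to the classifier


def run_pipeline(
    load: Callable[[], Iterable[Signal]],
    extract: Callable[[Signal], Features],
    classify: Callable[[Features], int],
) -> List[int]:
    """Chain three independent modules; each only sees the previous module's output."""
    return [classify(extract(signal)) for signal in load()]


# Stand-in modules showing that anything with the right input and output
# format can be swapped in without touching the other two.
def file_loader() -> Iterable[Signal]:
    return [[0.0, 0.1, 0.2], [0.9, 0.8, 0.7]]


def mean_extractor(signal: Signal) -> Features:
    return [sum(signal) / len(signal)]


def threshold_classifier(features: Features) -> int:
    return int(features[0] > 0.5)


predictions = run_pipeline(file_loader, mean_extractor, threshold_classifier)
```

Replacing `file_loader` with a loader that reads from a continuous audio stream would require no change to the extractor or the classifier, which is the property described above.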
3.2.1 Data
Raw audio files are the base form of the data used in this work. This data has
been gathered using a sonar simulation system developed by KDA. This acoustical
tool allowed the creation of a variety of ships with customizable engine configurations.
Ships could be placed anywhere around the submarine, and the sound of each ship
relative to the submarine was generated. A total of 85 sound files of 10 seconds
duration each were gathered and used in this project. A full overview of the
different configurations used in the data set can be found here:
http://folk.ntnu.no/haakongi/Sonar_Data_TSUS_Public.xlsx. As seen there,
7 different ship classes were selected and generated in different scenarios.
3.3 Feature Extractor Implementation
Wavelet Transformation
To compute the wavelet transformation of a signal, the library PyWavelets is
used. The Discrete Wavelet Transform is computed and the approximation and
detail coefficients for each sample are extracted as the resulting features.
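To show what approximation and detail coefficients look like, the sketch below computes one level of the Discrete Wavelet Transform by hand using the Haar wavelet. This is a pure Python illustration and not the PyWavelets call used in the system; the choice of the Haar wavelet and of a single decomposition level are assumptions, as the text does not state which wavelet or level was used.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approximation = [(a + b) / scale for a, b in pairs]   # smoothed, low-frequency part
    detail = [(a - b) / scale for a, b in pairs]          # local differences, high-frequency part
    return approximation, detail


approx, detail = haar_dwt([4.0, 6.0, 10.0, 12.0])
features = approx + detail    # concatenated coefficients used as the feature vector
```

The transform keeps the number of values and the energy of the signal unchanged, but separates the slowly varying trend from the rapid fluctuations.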
Spectrogram
A spectrogram is presented in image form, or as a 2-dimensional array. This
makes it suitable for image recognition techniques such as convolutional neural
networks. The spectrogram extractor produces an image from the 1-dimensional
data sample, and this image is used by the neural network. The
library Matplotlib is used to do the necessary computation. The system also
incorporates an option to save spectrograms as image files. This was implemented
in order to see if humans were able to distinguish between the 7 classes when given
only the spectrogram.
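The step from a 1-dimensional sample to a 2-dimensional image can be sketched with NumPy alone. The code below is a stand-in for the Matplotlib computation, not a reproduction of it; the window length, hop size, Hann taper and the synthetic 128 Hz tone are all assumptions made for illustration.

```python
import numpy as np


def spectrogram(signal: np.ndarray, window: int = 64, hop: int = 32) -> np.ndarray:
    """Magnitude spectrogram with shape (frequency bins, time frames)."""
    taper = np.hanning(window)
    frames = [
        signal[start:start + window] * taper
        for start in range(0, len(signal) - window + 1, hop)
    ]
    # One FFT per frame; after the transpose each row is a frequency bin.
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T


sample_rate = 1024                                   # Hz, assumed for this example
t = np.arange(sample_rate) / sample_rate             # one second of audio
image = spectrogram(np.sin(2 * np.pi * 128 * t))     # synthetic 128 Hz tone
```

Each column of `image` is the spectrum of one short segment, so a steady tone appears as a horizontal line at the row corresponding to its frequency.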
run in the GUI. This ensures that the network is not entirely dependent on the
value of a few neurons to be able to classify. When each neuron has a chance
of being dropped out, the network has to adapt to use more neurons to classify,
thereby making the system more robust. The dropout scheme is only used during
training and not when evaluating the accuracy of the system. Dropout is used to
prevent overfitting while training, but when testing, the system should perform
optimally for classification. The dropout rate is thus set to 0 when testing the
system.
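The behaviour described above, random dropping during training and a rate of 0 during testing, can be sketched as follows. The rescaling of the surviving activations ("inverted dropout") is an assumption, since the text does not say whether or how the remaining values are scaled; the function name and the numbers are illustrative.

```python
import random


def dropout(activations, rate, rng=random):
    """Zero each activation with probability `rate`; a rate of 0 leaves the layer untouched."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Survivors are scaled by 1 / keep so the expected activation is unchanged
    # (an assumption, not something stated in the text).
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


layer = [0.2, 0.9, 0.5, 0.7]
training_output = dropout(layer, rate=0.5, rng=random.Random(0))
testing_output = dropout(layer, rate=0.0)
```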
A second technique implemented in all the networks is a bias option. When
calculating the output of a hidden layer, the previous layer's output values
are first multiplied by the weights leading to the next hidden layer. Then the
activation function is applied to the resulting matrix. A bias is implemented by
creating a matrix of constant values for each layer and adding these values to the
resulting matrix just before the activation function is applied. The values of the
bias matrix are not changed during training of the network. This
bias option allows the network to learn even more complex models.
of the memory. The total number of samples is less as there is no overlap in these
samples.
At the very end of the network a fully connected standard feed-forward layer
is used to reduce the dimensionality of the data before the final classification in
the output layer. The size of this layer can be set in the GUI.
The network has been implemented to receive data in the form of images,
namely spectrograms as presented in Figures 3.2 and 3.3. The resolution of these
images has to be fixed in order to set the right size of each layer after convolution
and pooling, as described in Chapter 2. The resolution used in this project was
set to 512 data points. This was suitable with a sampling rate of 1024 Hz and sample
lengths of 1 second, as the resulting spectrogram will be a 32 by 16 pixel image.
The system allows for some change as long as the resulting resolution stays the
same, e.g. a sampling rate of 512 Hz with a 2 second long sample is also allowed.
Figure 3.2: A spectrogram image of a sample taken from a ferry. Only this
information is passed to the CNN.
Figure 3.3: A spectrogram image of a sample taken from a destroyer. The
destroyer class usually has more activity in the lower frequency range than the
ferry class.
sample. Each of these seven values corresponds to a vessel class, and if the value
for the correct class is the highest of the seven, the system has classified the sample
correctly. It is highly important to separate test data from training data in
order to prevent the system from being trained only to classify data used during
training correctly. The ratio of test data to training data can be set in the
configuration file.
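The scoring rule described above amounts to taking the index of the highest of the seven output values and comparing it with the true class. The sketch below uses made-up output values and labels; the function name is hypothetical.

```python
def classification_accuracy(outputs, labels):
    """Share of samples whose highest-scoring output matches the true class index."""
    correct = 0
    for scores, label in zip(outputs, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == label
    return correct / len(labels)


# Three made-up network outputs, one score per vessel class (seven classes).
outputs = [
    [0.1, 0.7, 0.0, 0.1, 0.0, 0.1, 0.0],   # highest value at class 1
    [0.3, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1],   # highest value at class 0
    [0.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.5],   # highest value at class 6
]
labels = [1, 2, 6]
accuracy = classification_accuracy(outputs, labels)
```

In this toy case two of the three samples are classified correctly, giving an accuracy of two thirds.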
In this project a single sound recording of a specific vessel has been divided
into several samples. The samples derived from the same recording are unique,
but similar to the other samples from the same recording. Using samples very
similar to those trained on as test data could lead to a less robust system than
expected when confronted with new, less similar data. Three different approaches
to separating the entire data set into training and testing data have been
implemented. The first is to select random sound recordings and place all samples
related to each recording either in the training or the test set. The second
approach is to use only the last samples from each recording as test data. When
using this approach, a test sample is never surrounded by similar training
samples, which is important to the quality of the test. The third and last approach
is simply to select random samples from random recordings and use these as the
test data. This last method has been abandoned in the experimentation in favor
of the second, as the data in the test set often has very similar samples in the
training set.
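The first two approaches can be sketched on a toy data set where every sample is identified by its recording and its position within that recording. The function names, the fixed random seed and the data set of 5 recordings with 10 samples each are assumptions made for illustration.

```python
import random


def split_by_recording(samples, test_ratio, seed=0):
    """Approach 1: every sample of a randomly chosen recording goes to the same set."""
    recordings = sorted({rec for rec, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(recordings)
    n_test = max(1, round(len(recordings) * test_ratio))
    test_recordings = set(recordings[:n_test])
    train = [s for s in samples if s[0] not in test_recordings]
    test = [s for s in samples if s[0] in test_recordings]
    return train, test


def split_by_tail(samples, test_ratio):
    """Approach 2: the last samples of each recording become test data."""
    train, test = [], []
    for rec in sorted({rec for rec, _ in samples}):
        indices = sorted(i for r, i in samples if r == rec)
        n_test = max(1, round(len(indices) * test_ratio))
        train += [(rec, i) for i in indices[:-n_test]]
        test += [(rec, i) for i in indices[-n_test:]]
    return train, test


# Hypothetical data set: 5 recordings, each cut into 10 consecutive samples.
samples = [(rec, i) for rec in range(5) for i in range(10)]
train_a, test_a = split_by_recording(samples, test_ratio=0.2)
train_b, test_b = split_by_tail(samples, test_ratio=0.2)
```

With the first split no recording contributes to both sets, while the second split keeps every recording in the training set but reserves its final samples for testing.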
3.5 External Libraries
Librosa
Librosa is a Python library for audio and music analysis. This is the library
responsible for loading the raw audio data from memory and making it possible
to process it as a variable.
Chapter 4
Experiments and Results
The goal of this research was to prove that the sonar classification problem could
be solved using a neural network, and improved by a preprocessing technique.
Since several techniques have been implemented, experimentation is needed in or-
der to find the highest performing solution. The experimentation in this project
aims to find an approximation to the optimal parameter configuration in the sys-
tem. In the first section in this chapter the experimental plan will be presented,
followed by the experimental setup and finally the results from the experimenta-
tion. A conclusion based on these results will be given in the next chapter.
RQ4 How will different solutions, or combinations of solutions, affect the overall
performance of the system?
during development. The aim was to select values that generally worked well in
all the possible combinations.
When all the possible combinations have been tested with the default
values and the results have been recorded, it is time to start tweaking the
parameters. The focus of this experimentation is to uncover the best possible
network configuration; therefore only parameters 10 through 19 are tweaked in
the experiments, as these influence it more than the excluded parameters.
The experimentation is conducted by changing a single parameter value
at a time, first increasing the value and then decreasing it. The results suggest
whether the value should be increased or decreased further or stay in between
the tested values. This search for optimal values continues until the
improvement converges. The most promising value is kept and used when
experimenting with the next parameter.
The downside of this parameter search is that all the parameters in a network
configuration affect each other. The optimal value of a parameter is only found in
relation to the others, meaning that the experimentation with the first parameter,
the learning rate, will yield an optimal value given the other default values. This
value will cascade through the remaining experimentation and all values found
will be affected by it. The sequence of the experimentation is therefore important
in order to reproduce the results. The experiments start with the parameter with ID
10 and continue in order through ID 19.
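The search procedure can be sketched as a loop that tunes one parameter at a time and carries the best value found into the search for the next parameter. This is a simplified version that tries a fixed list of candidate values per parameter instead of stepping up and down until convergence. The parameter names, the candidate values and the `toy_accuracy` function, which stands in for training a network and measuring its accuracy, are all hypothetical.

```python
def one_at_a_time_search(defaults, candidates, evaluate):
    """Tune parameters in a fixed order, keeping the best value found for each."""
    config = dict(defaults)
    for name in candidates:                        # dicts keep insertion order
        best_value, best_score = config[name], evaluate(config)
        for value in candidates[name]:
            score = evaluate({**config, name: value})
            if score > best_score:
                best_value, best_score = value, score
        config[name] = best_value                  # carried into later searches
    return config


# Toy stand-in for "train the network and return the test accuracy".
def toy_accuracy(config):
    return (1.0
            - abs(config["learning_rate"] - 0.01) * 10
            - abs(config["hidden_layers"] - 3) * 0.05)


defaults = {"learning_rate": 0.1, "hidden_layers": 1}
candidates = {"learning_rate": [0.001, 0.01, 0.1], "hidden_layers": [1, 2, 3, 4]}
best = one_at_a_time_search(defaults, candidates, toy_accuracy)
```

Because each parameter is tuned with the earlier choices fixed, the outcome depends on the order of `candidates`, which is the cascading effect described above.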
All experiments conducted are measured by the resulting classification
accuracy of the system. The accuracy is measured twice for each
experiment, once for each of the different training-test data separations described
in Section 3.4.1. The parameter value yielding the highest score in combination
with any feature extraction technique is kept and used in the remaining
experiments. The optimal parameter values found for each network type are
summarized in the final section of this chapter. The final accuracy of each
combination is also recorded when executed using these parameter values.
and used throughout the remaining experiments. Not all the results from this
experimentation are presented here, but the best configurations found for each
neural network and the results when using these configurations are presented in the
next section.
The experimentation environment used was a personal computer where the
GPU computations were done using an NVIDIA Gigabyte GTX 770 graphics card.
Table 4.1 presents the default parameters used in the experiments.
Table 4.1: All configurable parameters and their default values. Parameter 4 has
been set to 0.1 for the recurrent network combinations only, as this requires shorter
related samples.
forward and convolutional network, and the entire file for the recurrent network.
This initial value is used to demonstrate how the system is improving as a better
configuration is found.
Table 4.2: Initial test of classification accuracy with default parameters. X means
that the combination is incompatible and no experimentation has been done
here.
The results from the parameter search are presented in Table 4.3. Here only the
10 parameters experimented with are shown, and the parameter IDs correspond
to the IDs given in Table 4.1. The results show that several of the parameters in
the final configuration are the same in all the network types. The results also show
that the system performs better when using a relatively shallow 5-layer
network, including the input and output layers, than with deeper, more complex
network structures. This observation will be discussed further in the next chapter.
Table 4.3: In this table the best found configuration for each neural network is
shown. These configurations are used in combination with the FE techniques
and the accuracy of the combinations with this configuration can be reviewed in
Table 4.5.
Table 4.4: The results when the system is run with the optimal configuration. Entire
files are used as test data.
Table 4.5: The classification accuracy of each combination run with the best
found configuration for each network type. The highest scoring accuracy found
was with the use of MFCC and a standard feed-forward network at 97.29%. The
end of each file is used as test data.
The configuration found and used here achieved 97.29% classification accuracy
when combining a standard feed-forward network and the MFCC feature
extraction technique. This is exactly 2% better than the same combination
used with default parameters, showing the importance of the parameter search.
This configuration did not outperform the default configuration when used in all
FE-NN combinations, but achieved the highest classification accuracy in a single
combination.
The recurrent neural network is only tested with the whole files as test data,
as the sequence of the samples is used for training. Testing on random samples
or only the end of the files would result in a system trained for sequential
classification and tested without utilizing this information. The other two network
types have also been tested with this separation in order to compare them with the
results from the recurrent network.
Chapter 5
Evaluation and Conclusion
This chapter presents an evaluation of the research and the system developed.
The conclusions drawn from the results presented in Chapter 4 are then discussed
in Section 5.2, followed by a summary of the contributions of this thesis in
Section 5.3. Lastly, recommendations for further work are given in Section 5.4.
5.1 Evaluation
In Chapter 1, the goal of this project was defined as follows:
Goal Develop a system where a signal preprocessor and a neural network can
work together to classify marine vessels based on their acoustic signature.
40 CHAPTER 5. EVALUATION AND CONCLUSION
technique with the best found network configuration improved the classification
accuracy by 28.07% in comparison with no feature extraction technique.
The solutions found to the classification problem were more diverse. They
differed in the complexity of the data to be classified and in the computational
cost of training. The networks have been implemented to handle the output data
from the preprocessing techniques without knowledge of how the data has been
processed. This is a great advantage of neural networks: it is possible to feed
them unstructured data from the feature extractor, and the network itself will
be able to recognize the identifying patterns.
Combining the two components has proven less problematic than anticipated,
as neural networks were found to be very flexible with regard to the kind of
data they can process and recognize features in. The project has been evaluated
through the experimentation presented in Chapter 4. The goal was to find
combinations suitable for the task, and the experimentation has shown that the
results vary widely between different combinations. The variation in
classification rate suggests that some combinations are better suited than others.
The combination yielding the highest classification accuracy was an MFCC
preprocessor with a standard feed-forward neural network with three hidden
layers. The final classification accuracy was 97.29%.
5.2 Discussion
As a result of the research done in this project, a system has been developed
that is able to differentiate between 7 different vessel types based only on the
sound they emit in water. The system has been used to systematically search for
the best combinations of feature extraction technique and neural network
configuration. It was made as a prototype in order to test the feasibility of
applying these techniques to sonar data. The prototype has been tested to the
extent that it is possible to identify some techniques as more suitable than others.
The combination with the highest accuracy is not the most complex one, as it
uses a standard feed-forward network with only three hidden layers. This
could be due to the complexity of the classification task. The complexity needed
in a network is typically lowered when the data is preprocessed. The aim of the
preprocessing is to extract only the informative features from a larger data set
and to remove redundant information, leaving the network with less data to
process.
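To make the size of such a network concrete, the following is a minimal sketch
of the forward pass of a feed-forward network with three hidden layers. Only the
number of hidden layers and the 7 output classes come from this thesis. The
input size of 13 features, the 32 units per hidden layer, the ReLU and softmax
activations and the random weights are assumptions made for illustration.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    """Weighted sum plus bias for every unit in the layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    """Turn raw output scores into class probabilities."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """Hidden layers use ReLU; the output layer uses softmax."""
    for layer in layers[:-1]:
        x = relu(dense(x, layer))
    return softmax(dense(x, layers[-1]))

rng = random.Random(0)
# Input layer, three hidden layers, output layer (sizes are assumptions
# except for the 7 vessel classes).
sizes = [13, 32, 32, 32, 7]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

features = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]  # stand-in feature vector
probs = forward(features, layers)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

With untrained weights the predicted class is meaningless; the sketch only shows
how little computation one classification requires once the feature vector is
short.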
The experimentation also shows a great difference in accuracy between the
different training-test data separation schemes. When entire files are used as
test samples, the accuracy drops significantly in all tests. This could indicate
that the data set is too small and that the difference between recordings is too
large. Given access to more raw sonar data, the system would be able to use this
for training and become more robust and capable of classifying further classes.
The feature extraction techniques implemented were found to be state-of-the-art
solutions for environmental sound processing. The techniques were not
specialized to handle sonar data recorded in deep water, but were selected for
their ability to extract features from sound recorded in a variety of different
scenarios. The results presented in Chapter 4 point to the conclusion that the
techniques implemented are suitable for use with the sonar data, as the results
are improved by 28.07% when using the best network configuration found.
The MFCC feature extraction technique was discovered in the structured
literature review mainly through its applications in speech recognition systems.
It is suitable for that task, as it was developed as a technique specializing in
the extraction of features from sound rich in modulation, such as speech. The
technique was kept after reviewing Uzkent et al. [2012] and Cowling and Sitte
[2003], where it was successfully used to analyze environmental sounds. The data
used in this project is similar to environmental sounds in that continuous noise
is present, but the success of the MFCC method suggests that the identifying
characteristic of each vessel type lies in the modulation of the sound.
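The steps usually associated with MFCC extraction (windowing, power spectrum,
mel-spaced triangular filters, logarithm, discrete cosine transform) can be
sketched for a single frame as follows. This is not the implementation used in
this project: the frame length of 256 samples, the sample rate of 8000 Hz, the
20 filters, the 13 coefficients and the synthetic test tone are all assumptions,
and a naive DFT is used in place of an FFT to keep the sketch dependency-free.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-windowed power spectrum of one frame via a naive DFT."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):      # rising edge
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge, peak at centre
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank

def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(n_coeffs)]

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
    log_energies = [math.log(max(e, 1e-10)) for e in energies]  # avoid log(0)
    return dct_ii(log_energies, n_coeffs)

sample_rate = 8000  # assumed sample rate in Hz
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
```

Each frame of 256 samples is reduced to 13 numbers, which illustrates how the
preprocessing shrinks the amount of data the network has to handle.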
Of the three neural network types used in this project, the feed-forward
network was the most commonly recommended in the literature review. This could
be because many of the studies found were not primarily in the field of AI, but
in the field of audio analysis. This network implementation is the least
advanced of the three, but it still achieves the highest scores in the
experimentation. The recurrent network achieved a classification accuracy of
77.14% with its best configuration. When files are divided into smaller sequenced
samples, this is the highest score achieved, suggesting the potential of the
recurrent neural network. Interestingly, this is also in combination with the
MFCC method, once again indicating that this is the most suitable feature
extraction technique for sonar data.
The most complex neural network model implemented was the convolutional
neural network. To utilize this kind of network, the data fed into it must have
a structure in which data points close to each other are related. Images are a
typical example of such a structure. Using a spectrogram as the input to this
network proved successful, with an achieved classification accuracy of 85.17%.
In an on-board scenario, the spectrogram resolution could be set much higher,
as data is recorded continuously and can be processed into a spectrogram with
the desired resolution in time. A spectrogram of higher resolution could
require a more complex convolutional network structure in order to be interpreted
more accurately.
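How the time resolution of a spectrogram can be chosen is illustrated by the
minimal short-time Fourier transform sketch below. The frame length, the hop
sizes, the Hann window, the sample rate and the synthetic 500 Hz tone are
assumptions for illustration and do not reflect the settings used in the
experiments.

```python
import cmath
import math

def spectrogram(signal, frame_len, hop):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]  # Hann window
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        rows.append([
            abs(sum(s * cmath.exp(-2j * math.pi * k * i / frame_len)
                    for i, s in enumerate(frame)))
            for k in range(frame_len // 2 + 1)
        ])
    return rows

sample_rate = 4000  # assumed sample rate in Hz
signal = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(1024)]

# Same frame length, different hop: a smaller hop gives more time frames.
coarse = spectrogram(signal, frame_len=64, hop=64)
fine = spectrogram(signal, frame_len=64, hop=16)
```

For the same 1024 samples, the smaller hop produces 61 time frames instead of 16,
so the image handed to a convolutional network becomes wider while the number of
frequency bins stays at 33.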
The system developed was intended to assist the sonar operator by searching
every direction at once and quickly suggesting the class of the surrounding
vessels. This has proven to be feasible, as the system is able to suggest a
class, given some input data, with very high accuracy in a fraction of a second.
The time used to train the system depends on the size of the training data set,
but training is only required once before the system becomes capable of
classification. The operational system on board could then make classifications
continuously in all directions.
5.3 Contributions
Based on our research goals and questions presented in Chapter 1, the
contributions made in this project can be summarized in the following way:
The main contribution is the system developed, which is able to classify 7
different maritime vessel types based on their emitted sound with an accuracy of
97.29%. This system is also able to test a variety of configurations for research
purposes. Three different neural network types have been implemented along with
five signal-processing feature extraction techniques. Dropout, bias, noise and
data loading are also made available in the system.
In order to develop a system specialized in underwater environmental sound,
both the field of signal processing and that of artificial intelligence were
explored, and the background information collected is summarized in Chapter 2.
The background information was gathered through a structured literature review
performed by the author of this thesis, Gimse [2016]. This research is the second
contribution of this project.
Acronyms
AI Artificial Intelligence
NN Neural Network
FE Feature Extraction
STFT Short-time Fourier Transformation
MFCC Mel-frequency Cepstral Coefficients
Bibliography
Welch, P. (1967). The use of the fast Fourier transform for the estimation of power
spectra: A method based on time averaging over short, modified periodograms.
IEEE Transactions on Audio and Electroacoustics, 15:70–73.