Lund University
Mathematical Statistics
Military vehicles typically send out radar signals in order to detect their surroundings. Using an
Electronic Support Measures receiver, these signals can be detected, and it is of interest to be able
to classify them in order to identify the vehicle and the type of radar. Although multiple methods
to do this already exist, it is of interest to automate and speed up the classification process as
much as possible. Artificial neural networks are a form of machine learning that has proven
successful in classifying sequential data from a large variety of sources. The purpose of this
thesis is therefore to investigate how well artificial neural networks can classify different types
of radar signals when the data is given as a sequence of arrival times of radar pulses.
Both feed-forward and recurrent neural networks of different types are considered, and methods
to apply them to the specific radar data are developed. The results show that
artificial neural networks are capable of classifying radar signals of this form with an accuracy of
up to 98 percent. In addition, this could be done with only a couple of seconds' worth of data,
using relatively simple models.
Acknowledgements
First of all, we would like to thank our supervisor at FOI, Gunnar Hillerström, for his guidance
throughout the project.
We would also like to send our thanks to the members of the unit Radar Electronic Warfare
Systems at FOI, for their helpful comments along the way.
We are also grateful that FOI provided us with all the needed resources to carry out the project.
Finally, we would like to thank our supervisor at Lund University, Andreas Jakobsson,
for assisting us during this project.
Acronyms
Contents

1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Data set
  1.4 Plan and structure

2 Theory
  2.1 Radar signals
    2.1.1 Visualization and representation
  2.2 Artificial Neural Networks
    2.2.1 Feed-Forward Neural Networks
    2.2.2 Recurrent Neural Networks
    2.2.3 One-hot encoding
    2.2.4 Overfitting
    2.2.5 Keras
  2.3 The k-nearest neighbors classifier
  2.4 FLD ratio

4 Neural Networks
  4.1 Pre-processing and implementation
    4.1.1 Test, train, validation split
    4.1.2 Slicing and the windowing method
    4.1.3 Constructing the data sets
  4.2 Comparing the networks: feed-forward and recurrent models
    4.2.1 Results
    4.2.2 Discussion
  4.3 Evaluation on sub-sequences of varying lengths
    4.3.1 Results
    4.3.2 Discussion

5 Applications on whole observations
  5.1 Variation of sub-sequence sizes between observations
  5.2 Method
    5.2.1 Motivation and explanation
    5.2.2 Simple Method
    5.2.3 Majority layer
    5.2.4 Expanding to varying subsequence size
  5.3 Results
    5.3.1 Performance of the different models
    5.3.2 Discussion
  5.4 Handling the trash observations

Appendices
Chapter 1
Introduction
1.1 Background
Radar is a type of detection system that uses pulses of radio waves in order to determine the location,
angle and velocity of objects relative to the source of the radar. It is commonly used in airplanes,
ships, meteorology, geology and many other fields. Typically, a radar consists of a transmitter that
sends out radar signals and a receiver that registers them as they return. Because of their different
applications, radar signals vary in characteristics such as frequency, amplitude and time between
pulses. This means that it is possible to detect and differentiate between different types of
radar signals if one can measure these characteristics, which in turn allows classification of the
transmitter of the radar signals. This is of particular use in the military, where it is of interest to
determine what type of planes, ships, etc. one is dealing with. [20] [14]
A common way to classify types of radar is to look only at the times between pulses. Radar
transmitters usually send out pulses with varying time intervals according to some pre-decided
sequence. Using an Electronic Support Measures (ESM) receiver to measure when the radar pulses
arrive, this sequence can be decoded and the type of radar classified. However, this is usually
complicated by the transmitter rotating in order to send out radar signals in every direction.
This means that the sequence is regularly interrupted and the ESM receiver rarely receives the full
sequence in one transmission. Instead, it usually receives parts of the whole sequence, starting and
ending at unknown positions. The task of classification therefore consists of using a combination
of these parts and matching them with sequences of known radar types. [20] [14]
Artificial neural networks are a type of machine learning that has yielded impressive results in
recent years, in a large variety of fields. In particular, they have performed very well on classification
of sequences, with applications such as speech recognition, translation and decision making for
autonomous vehicles. [22]
1.3 Data set
The given data consisted of measured arrival times of radar pulses at an ESM receiver, from a number
of different observations of ten minutes each. More precisely, in each measurement an ESM receiver
had been switched on for ten minutes, listening for signals, and each time a pulse was received the
arrival time was registered. The data consisted only of these arrival times; no information was
available about amplitude, frequency etc. The reason for this was that the recorded pulses were
assumed to come from transmitters of such a varying nature that a division into coarser radar
types was sufficient for the purpose of the task, and for this division the time intervals between
radar pulses were enough.
In addition to this, a first sorting algorithm had roughly sorted the data into 17 classes that had been
limited to 1,000 observations each. There were thus a total of 17,000 observations of the form
described in the previous paragraph. This division was only a rough first step. More precisely,
the data was given as 17 Matlab data files with 1,000 cells in each, where each cell contained an
observation representing a ten minute measurement. These observations were of varying size but
averaged around 2,000-3,000 arrival times per observation, regardless of class. This is shown in
Figure 1.1.
Finally, it was assumed that the majority of observations consisted only of receptions from one type
of radar, i.e. there were no regular disturbances, nor several radar transmitters being measured
simultaneously; as a rule, the ESM receiver only listens to one transmitter during a ten minute
interception. Observations that nevertheless contained such disturbances were considered trash
and outliers. This means that an entire ten minute measurement can be classified as one class and
does not need to be divided into several different classes.
Figure 1.1: Histograms of sizes of the observations for each of the given 17 classes
In Chapters 4 and 5, methods are developed to deal with the issue of the interruptions in the
sequences. In short, the broad problem of constructing a network that can classify an entire se-
quence of radar data is split up into smaller parts. Neural Networks will then be constructed,
trained, tested and evaluated for these smaller problems. Finally, the chosen neural network mod-
els will be put back into the larger issue of classifying an entire observation. In addition to this,
the ability to classify data marked as trash along the way will also be studied.
Chapter 2
Theory
Different types of radar differ in the time intervals between their pulses. A radar
sends out pulses at intervals according to a sequence that defines that particular type of radar. This
enables an ESM receiver to determine the type of transmitter it is dealing with by identifying this
sequence. In addition to this, the sequence also depends on which mode the transmitter is in. The
radar switches modes based on the distance at which it wants to detect objects. For longer distances
the transmitter must allow more time between the pulses in order to avoid interference. However, for
the type of radar studied in this thesis, the underlying pattern remains the same in each mode,
just scaled by a factor depending on the mode the radar is in. [20] [14]
All this means that it is not possible to represent this type of data as a traditional time series,
where the same index in two different observations corresponds to the same point in the pattern.
Instead, this type of radar data is usually represented in a different way. First, the arrival times
are converted into differences, because it is the intervals between the pulses that define the type of
transmitter. Then the values are normalized, in order to ensure that observations from the same
radar type but in different modes look the same. Next, values that are abnormally large, corresponding
to the time when the transmitter is turned away from the ESM receiver, and values larger than 1.25
times or smaller than 0.75 times the median, corresponding to missed or double-registered pulses,
are removed. Finally, this new difference vector is plotted against itself shifted one step in time.
Two examples of this are shown in Figure 2.1, where two whole ten minute sequences from two
different classes are visualized according to the described method. [13]
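As a summary of this representation, the following is a minimal Python/NumPy sketch of the steps described above: differencing, normalization, removal of deviating values and construction of the points for the time-shifted plot. The normalization by the median and the exact thresholds are assumptions made for illustration, not the exact implementation used in the thesis.

```python
import numpy as np

def lag_plot_points(arrival_times, low=0.75, high=1.25):
    """Turn pulse arrival times into the 2D points of the time-shifted plot.

    The thresholds and the exact order of the steps are illustrative assumptions.
    """
    d = np.diff(np.asarray(arrival_times, dtype=float))  # intervals between pulses
    d = d / np.median(d)                                  # normalize: different modes of the
                                                          # same radar type look the same
    # remove abnormally large gaps (transmitter turned away) as well as values
    # outside [low, high] of the median (missed or double-registered pulses)
    d = d[(d >= low) & (d <= high)]
    # plot the difference vector against itself shifted one step in time
    return np.column_stack([d[:-1], d[1:]])
```

Plotting the two columns of the returned array against each other gives an image of the kind shown in Figure 2.1.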
(a) The first observation from class 2 (b) The first observation from class 13
Figure 2.1: Visualization of two radar signals from different classes. The entire ten minute sequence
is presented.
In Figure 2.1 it can be seen that the two different types of transmitters result in two distinct
patterns. The reason why this type of visualization works well is that it does not place the same
importance on order as a regular representation. As long as the same pair of consecutive time
intervals occurs somewhere in two different measurements, the same point will be plotted. This
means that two sequences that are the same will form approximately the same images, regardless of
when the interruptions occur, while measurements that are different will usually also form different
images. There may be measurements that are different but still give rise to similar images, if the
same intervals occur at different places in the sequences, but this is assumed to be unusual for a
division between widely different classes, which is the purpose of this work. All this means that one
method for determining whether two observations should belong to the same class is to visualize
them and pair them if they give rise to similar patterns. This is a traditional method used in radar
classification. [13]
An alternative way of handling this type of radar data is to keep the original sequence of normalized
differences as it is, instead of creating the time-shifted plot. This allows more than two dimensions
to be considered, but it also makes the data complicated to visualize and handle for a human being
or for simpler algorithms. However, such sequential data can potentially be handled by more
complex algorithms.
The network takes an input of some form, lets it propagate through the nodes and connections,
and produces an output. The purpose is then to adjust the weights in order to produce as good an
output as possible based on the input. This is usually done through supervised learning, that is,
by feeding the network a large amount of training data consisting of different types of input and
the matching output. Using different types of algorithms and methods, the weights are then
adjusted and tuned as the network learns to recognize different patterns in the input in order to
match the correct output. [28]
There are many different types of networks used for different types of tasks. In this thesis, we
will focus on feed-forward networks as well as recurrent neural networks, both commonly used for
classification. Feed-forward networks are the most common type of network and are traditionally
used for classification [24], while recurrent neural networks are designed to handle sequential
data [25]. Below follow brief summaries of feed-forward and recurrent neural networks; for a
further explanation, consult Deep Learning with Python [4].
A feed-forward neural network is a neural network in which no connection forms a cycle [29]. In
its simplest form it consists of a so-called single-layer perceptron. This perceptron takes the
input nodes, multiplies each with a weight, sums them together and sends the result through an
activation function that outputs a value based on the summed input. There are different types of
activation functions that can be used, depending on the traits the designer wants the network to
have. What these activation functions have in common, however, is that they introduce non-linearity
to the model [16]. The weights can then be updated through training with labeled data, using
gradient descent and a loss function [9] [21].
In practice, a single-layer perceptron is not capable of handling complex problems. Therefore the
multilayer perceptron (MLP) is introduced; henceforth, the feed-forward networks used will be
referred to as MLP networks. An MLP functions in a similar manner to the single-layer perceptron,
but with multiple layers placed after each other. This means that the activation function sends
its output as input to the next layer, and the pattern is repeated until the final layer. Usually
there is also more than one node in each layer, resulting in a full grid of nodes. How this can
look in practice is visualized in Figure 2.2. The network is called feed-forward because each node
is only connected to nodes in the layer right after it, meaning there are no connections backwards
or between nodes in the same layer, nor are there any connections skipping a layer. Usually a
feed-forward neural network is also fully connected, which means that each node is connected to
every node in the layer in front of it. [15]
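As an illustration of these computations, the following is a minimal NumPy sketch of a forward pass through a small fully connected MLP with two hidden layers. The layer sizes, the ReLU activation and the softmax output are illustrative choices and do not correspond to a specific network used later in the thesis.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)              # a common activation; introduces non-linearity

def softmax(x):
    e = np.exp(x - np.max(x))              # numerically stable
    return e / np.sum(e)

rng = np.random.default_rng(0)
# untrained weights and biases for a 4 -> 8 -> 8 -> 3 fully connected network
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(3, 8)), np.zeros(3)

x = rng.normal(size=4)                     # one input sample with four features
h1 = relu(W1 @ x + b1)                     # each node: weighted sum of inputs + activation
h2 = relu(W2 @ h1 + b2)                    # the output of one layer is the input to the next
y = softmax(W3 @ h2 + b3)                  # "probabilities" over three output classes
print(y, y.sum())                          # the outputs sum to 1
```

Training then consists of adjusting the weight matrices and bias vectors so that the output matches the labeled targets, as described next.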
In order to deal with the large number of weights in a fully connected MLP network, a method
called backpropagation is used to calculate the gradient of the loss function with respect to the
weights. In very broad terms, backpropagation utilizes dynamic programming to calculate the
partial derivatives and then multiply them together according to the chain rule. [12] In practice,
pure backpropagation is rarely used; instead, optimization algorithms such as ADAM [19] that build
on backpropagation are used. They have the added benefit of making the weight updates more
stable and reliable.
When used in practice, the network functions so that each node in the input layer represents
a feature of the data that is to be handled and each output node represents a dimension of the
output space. Typically, feed-forward neural networks handle either classification problems, where
the output is a class, or regression problems, where the output is a value. Some common
hyperparameters that have to be adjusted during training are the optimization algorithm, the loss
and activation functions, the number of hidden layers and the number of nodes within the hidden
layers. In addition to this, the number of epochs the network is trained for has to be considered,
both in order to deal with overfitting [26] and for precision. [30]
The specific details of the network shown in Figure 2.3 are outside the scope of this report. It
is sufficient to point out that the architecture has different gates allowing it to remember and
forget features as one moves deeper into the sequence. This allows much longer sequences of data
to be handled. The number of layers is a hyperparameter that allows for detection of more
complicated structures; however, it should be noted that using more layers increases the complexity
of the model by a large amount. The number of nodes represents the number of patterns the
network can keep stored. For an LSTM network the activation function cannot be chosen freely. [10]
2.2.3 One-hot encoding
Typically, when dealing with classification problems that have more than one target class, a technique
called one-hot encoding is used. It works by letting the target and the output be vectors of
size N, where N is the number of possible classes the input can be classified as. The target vector
has a 1 at the position of the correct class and 0 at the other positions. The output vector is also
of size N and consists of estimated probabilities for each possible class. In order for the outputs to
be valid probabilities between 0 and 1 that sum to one, a softmax activation function is used in the
last layer [11]. In order to do this when classifying sequential data using an LSTM, a dense layer
must therefore be added at the end. [2]
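As a small illustration, the sketch below shows how integer class labels can be turned into one-hot target vectors with the Keras utility to_categorical, and how a softmax output vector is interpreted. The class count of 18 corresponds to the final number of classes used in this work; the label values themselves are arbitrary examples.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 2, 17, 5])                 # integer class labels
targets = to_categorical(labels, num_classes=18) # shape (4, 18), a single 1 per row

output = np.array([0.05, 0.80, 0.15])            # example softmax output for 3 classes
predicted_class = int(np.argmax(output))         # the most probable class, here 1
print(targets.shape, predicted_class)
```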
2.2.4 Overfitting
A very common problem with neural networks is overfitting, meaning that the model becomes very
good at fitting the data from the training set but does worse on unseen data from the same
distribution. The cause of this is that the network learns to fit the training data so closely that it
also includes natural noise or randomness, and therefore misses the underlying real trend. [6]
An example of how overfitting can look is visualized in Figure 2.4. Here the model has been
trained to perfectly separate the two classes in the data set, shown in green. However, it is very
unlikely that this is a natural separation, which should rather look more similar to the black curve.
Usually one has to introduce measures to deal with the problem of overfitting in order to get
as good a performance on unseen data as possible. A typical way to detect overfitting is to separate
the data set into training and validation data. The model is then trained on the training data
and tested at regular intervals on the validation data. Since no training is done with the validation
data, the model cannot be perfectly fit to it, so generalization can be monitored. For a sufficiently
flexible model (cf. the universal approximation theorem [5]), the accuracy on the training data
approaches 100% as the model is trained for longer and learns to fit the data more and more closely,
while the accuracy on the validation data starts going down when generalization is lost. Therefore
it is possible to plot the accuracy for the training and validation data next to each other, and once
the validation accuracy starts getting lower, one can terminate the training. This method is called
early stopping and is a powerful way to handle overfitting [26]. Another possible method is to force
generality into the model by introducing some regularization method that punishes large weights,
as large weights contribute to fitting strange patterns [26]. One can also introduce a dropout
layer [17], which removes a random subset of the nodes in each round of training. This forces the
model not to specialize on overly specific patterns and thus keeps generality.
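A minimal Keras sketch of these two countermeasures, early stopping and dropout, is shown below. The layer sizes, the dropout rate and the patience are illustrative, and the data set names in the commented call are placeholders.

```python
from tensorflow.keras import callbacks, layers, models

# a small classifier; layer sizes, dropout rate and patience are illustrative
model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(10,)),
    layers.Dropout(0.2),                  # randomly drops 20% of the nodes each update
    layers.Dense(18, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = callbacks.EarlyStopping(
    monitor="val_accuracy",               # stop once the validation accuracy stops improving
    patience=5,
    restore_best_weights=True,
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=128, callbacks=[early_stop])
```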
Figure 2.4: A visualization of overfitting during a classification problem in a 2D-space with 2
classes
Source: Wikimedia
2.2.5 Keras
Keras is an open source software library that works as a Python interface for artificial neural
networks. It is an interface for the TensorFlow library that allows the user to implement deep
neural networks. [18]
Figure 2.5: Example of k-NN classification. The green point is to be classified.
Source: Wikimedia
The Fisher Discriminant Ratio (FDR) between two classes is defined as

$$\mathrm{FDR} = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \qquad (2.1)$$

where $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of the elements in class $i$, respectively.
The value increases the greater the difference between the means and the smaller the variances are,
which was the goal described above. This ratio can also be generalized to more than two classes if
the classes are assumed to be independent. It is then written in the form
$$\mathrm{FDR} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} w_i \, \frac{(\mu_i - \mu_j)^2}{\sigma_i^2 + \sigma_j^2},
\qquad \sum_i w_i = 1 \qquad (2.2)$$
where $N$ is the number of classes and the weight $w_i$ of each class is the proportion of all
samples belonging to that class. [7] [27]
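The multi-class FDR in equation 2.2 can be computed directly from per-class means, variances and sample counts; the following NumPy sketch shows one way to do it. The function name and input format are illustrative, and the thesis itself computed the FDR in Matlab.

```python
import numpy as np

def fisher_discriminant_ratio(means, variances, counts):
    """Multi-class FDR as in equation 2.2 (input format is an assumption)."""
    mu = np.asarray(means, dtype=float)
    var = np.asarray(variances, dtype=float)
    w = np.asarray(counts, dtype=float) / np.sum(counts)  # w_i: proportion of all samples
    N = len(mu)
    total = 0.0
    for i in range(N):
        for j in range(N):
            # the i = j terms contribute zero since the numerator vanishes
            total += w[i] * (mu[i] - mu[j]) ** 2 / (var[i] + var[j])
    return total / N

# Example with three classes
print(fisher_discriminant_ratio([0.0, 1.0, 3.0], [0.5, 0.4, 0.6], [100, 80, 120]))
```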
Chapter 3
The original 17 classes were the result of a quick K-means algorithm applied to unsorted data.
They were suspected to contain many misclassified observations, mixed-up classes and classes that
should be merged. The first step was therefore some visualization and examination of the given
classes, in order to determine precisely which issues existed. After this was done, a method was
developed for reclassification of the data.
In the developed method, each observation is initially treated as unlabeled and the original
classification is disregarded. However, the original 17 classes were still used as a reference point
for how fine-grained the classification should be: differences of largely the same magnitude as those
seen there were what we wanted in our final division of the data. Based on manual inspection of
the data set, it was therefore assumed that there were a total of around 15-25 classes, and not
many more or fewer. The purpose was then to develop a method that was consistent in how it
divided classes. Simply looking at each observation and sorting by hand was to be avoided, as that
would introduce an unwanted bias. In short, a class was to be split, or two classes merged, so that
the overall difference between classes was roughly similar, and this similarity was of the degree of
precision present in the original data set.
Figure 3.1: First observations from the initially given 17 classes
Figure 3.2: Zooming into an observation from Class 1.
A problem with the graphs in Figure 3.1 was that they only showed the first observation from
each class; no information was available about the variation between the observations. Therefore,
several observations were visualized simultaneously, from both the same and different classes, both
on top of each other and side by side, in order to see similarities and differences between them. In
addition, parts of the observations were also plotted to see how quickly they converge towards the
image that is visible for the entire observation. An example of such a visualization can be seen
in Figure 3.3, where five randomly selected observations from each class are placed on top of each
other. This type of visualization was used throughout the work in order to study the behaviour of
the classifications.
Figure 3.3: Visualization of five observations from each class, chosen at random and plotted above
each other in the same graph
From this visualization, some conclusions could immediately be drawn just by looking at the plots.
Class 1 and class 7 showed great similarities, as did class 10 and class 13. Class 2 and class 4
seemed to be the same, but with some differences. Other classes, most notably class 12 and
class 9, obviously consisted of several different types and needed to be further subdivided. In
addition to this, a k-NN algorithm was tested to see how well a simple algorithm could classify data
based on this first division. This was assumed partly to provide a kind of performance measure of
how well divided the data is, and partly to provide information about which classes are difficult to
classify.
The results from examining the visualizations of the data and from the k-NN algorithm supported
the thesis that further division and re-sorting needed to be done. It was assumed that some classes
would be merged while others would be further divided. Many observations were also considered
to be outliers and needed to be cleared out. To do this, the original division was used to set a level
of precision, but the data was treated as unsorted. An algorithm was then created to objectively
classify, merge and divide observations based on this precision. The purpose was not only to make
an assessment by eye, but to create an algorithm that can also take the internal structures into
account.
3.2 Method for Reclassification
The task was now to create a method for reclassification of the original data. The data was
treated as unlabeled, but the precision of the original division was still used as a reference point,
as explained in the introduction of the chapter. First, a method to quantify a comparison between
observations was introduced. Then, an algorithm consisting of three steps was constructed in
order to create a new division of the data. In the first step, the observations were split into multiple
classes based on high similarity. In the second step, those classes were merged in order to get the
correct level of precision. In the third step, remaining outliers and smaller classes that could not
be merged were thrown out. Finally, a last manual sorting was made, using professional judgement
to fine tune the division.
To compare the quality of the reclassification, two performance measures were used. First, the
k-NN algorithm was applied to the different classifications; the reasoning was that the clearer a
classification is, the better the k-NN result should be. The second performance measure was the
Fisher Discriminant Ratio, which punishes large internal variation within classes and small variation
between classes. These performance measures were applied to the initial division of the data as
well as to the new divisions, in order to determine how well the reclassification had worked.
The method for determining how close two observations are is based on this representation. Assume
we have two observations A and B. If these observations are very similar, they will almost certainly
overlap, so each point in one observation should have a corresponding point in the other observation.
For each point in A, compute the distance to the closest point in B. The average of all these
distances will then be very small if the two observations are similar and hence belong to the same
class. More formally, the method works as follows. Start from two observations A and B in the
xy plane, where the first has m points and the second has n. For each point in the first observation,
all distances to the n points in the second observation are computed. These distances are entered
into a matrix of size m × n. For performance reasons, the squares of the distances are used. To
determine how close A is to B, the average of the smallest value in each row is taken. Similarly,
the distance from B to A can be determined by considering the columns instead of the rows.
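A minimal NumPy sketch of this distance measure is given below; the function names are illustrative and the inputs are assumed to be the 2D point sets obtained from the representation in section 2.1.1.

```python
import numpy as np

def directed_distance(A, B):
    """Mean of the squared distances from each point in A to its closest point in B.

    A is an (m, 2) array and B an (n, 2) array of points in the xy plane.
    """
    diff = A[:, None, :] - B[None, :, :]        # (m, n, 2) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)        # (m, n) matrix of squared distances
    return sq_dist.min(axis=1).mean()           # smallest value of each row, then the average

def observation_distance(A, B):
    # the comparison is made in both directions and the largest value is used
    return max(directed_distance(A, B), directed_distance(B, A))
```

Computing the row-wise minima of the full m × n matrix corresponds exactly to the description above; for very large observations a spatial index could be used instead, but that is an implementation detail.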
As previously mentioned, all data was treated as unsorted. The algorithm begins by taking a
list of observations that have not yet been sorted. Initially, this list contains all observations,
and they are read one at a time. The very first observation forms a new class, the new class 1.
The first observation in each new class is here called the representative of the class. The algorithm
continues by taking the rest of the observations, one at a time, and comparing them with the
representatives of the new classes. This comparison is made according to the method outlined in
the previous paragraph; in this case, however, the comparison is done in both directions and the
largest of the two distances is used. If an observation is sufficiently similar to one of the
representatives, it is placed in the corresponding class. If all comparisons with the existing classes
are greater than a given tolerance, the observation forms a new class. This allows the unsorted
data to form several new classes of varying sizes. The process is performed a total of three times,
where at the end of each iteration the observations belonging to classes with fewer than five
observations are collected. These are then reinserted into the algorithm, but now with an increased
tolerance. The reason for this is that in the initial step, a low tolerance is needed to be able to say
with certainty that observations within the same class belong together. However, some observations
are only correctly classified when the tolerance is increased by a factor of 2 or 4, which is why it
is gradually increased. Observations that truly do not belong to any class usually give a distance
greater than 10 times the initial tolerance. The observations that after all three iterations belong
to classes with fewer than 5 elements are assigned to a new class, labeled trash. The core of the
algorithm is sketched below.
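The original pseudo-code listing is not reproduced here; instead, the following is a minimal Python reconstruction of the procedure described above, reusing the observation_distance measure from the previous sketch. The minimum class size of five and the three passes follow the description; the initial tolerance and the factor by which it grows between iterations are assumptions.

```python
def sort_into_classes(observations, tolerance, n_iterations=3, growth=2.0, min_size=5):
    """Greedy sorting of unlabeled observations into classes (reconstruction).

    observations: list of (m, 2) point arrays (the 2D representation).
    Returns the kept classes and the observations labeled as trash.
    """
    remaining = list(observations)
    classes = []                                  # each class is a list of observations;
                                                  # its first element is the representative
    for _ in range(n_iterations):
        for obs in remaining:
            for cls in classes:
                if observation_distance(obs, cls[0]) < tolerance:
                    cls.append(obs)               # close enough to an existing representative
                    break
            else:
                classes.append([obs])             # no class was close enough: form a new one
        # dissolve classes that are still too small and retry them with a larger tolerance
        small = [cls for cls in classes if len(cls) < min_size]
        classes = [cls for cls in classes if len(cls) >= min_size]
        remaining = [obs for cls in small for obs in cls]
        tolerance *= growth
    return classes, remaining                     # what remains after three passes is trash
```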
A total of 49 new classes were obtained, while 57 observations were labeled as trash after this
first step.
The first thing that was noticed was that certain classes were much more sensitive to internal
variation. Looking at Figure 3.3, one can see that some classes follow a more clear pattern of points
(referred to as point classes) while others appear more as clusters (referred to as cluster classes).
It made sense to handle these two types separately, as there was unlikely to be any mix-up between
them anyway. Some further study suggested that there were at most ten point classes and at most
20 cluster classes. The algorithm then proceeded as below, with the two types handled individually.
Algorithm 2: Calculating the similarity of two classes
Result: P
N = number of classes;
Let the first observation in each class be its representative;
Create three N x N matrices, M, V and P;
M is for the means, V for the variances and P for the probabilities;
for each class c1 do
    for each class c2 do
        for each observation o in c2 do
            Calculate the distance between o and the representative of c1;
            Add it to the vector of distances;
        end
        Calculate the mean and variance of the distances;
        Insert them at position (c1, c2) in M and V;
    end
end
for each class c1 do
    for each class c2 do
        P(c1, c2) = (M(c1, c2) - M(c1, c1)) / (V(c1, c2) + V(c1, c1));
    end
end
This resulted in a 46 × 46 matrix of probabilities that two classes should be merged. The probability
is larger for classes that are more similar and smaller for classes with a small internal variation. As
the goal was to be consistent in merging classes, one wants to merge classes based on similar
probabilities. That is, one could not simply go through the matrix and merge the largest
probabilities, because it would be unclear where to stop. For example, one could theoretically merge
all of them into one class and achieve perfect results, but obviously this was not the goal. Instead,
a limit was chosen, initially high, and all classes with probabilities higher than that were merged.
This limit was then lowered by a scaling factor and the process repeated. This was done until fewer
than 10 point classes and 20 cluster classes remained. The goal was therefore to choose a scaling
factor that was not so large that all classes were merged at once, but also not so small that only one
pair at a time was merged; instead, it should be chosen so that approximately equal probabilities
were merged at the same time. This was done a few times while the scaling factor was tuned, and
finally a result of 28 classes was achieved.
assistance was taken from Tommy Hult at FOI in Linköping, who is experienced in studying radar
signals, and especially the data set in question. Together with Hult, all observations from the four
classes were studied and assigned to new classes. After this process, the observations in the four
previous classes had been divided into 18 new classes.
The Fisher Discriminant Ratio was computed in Matlab according to equations 2.1 and 2.2.
3.4 Results
The performance metrics used on the partitions were the Fisher Discriminant Ratio and the accuracy
from the k-NN classifier. To begin with, the Fisher Discriminant Ratio, described in section 2.4,
was used to obtain one value for each partition. This was done for the initially given partition
as well as for the final partition; the respective FDR values were 13.97 and 19.00. The accuracies
for the k-NN method were 72 percent and 82 percent, respectively. The final 18 classes are shown
in Figure 3.4 and the results for the k-NN algorithm are visualized in Figure 3.5.
Figure 3.4: Final 18 classes, visualized by one observation in each plot.
(a) Original partition (b) New partition
Figure 3.5: Graphs showing the original and the new partition of the data being evaluated using
the k-NN classifier for varying values of k. The blue line shows the accuracy of validation data
while the red star indicates the accuracy for the test data.
3.5 Discussion
Looking back, we are surprised at how well the algorithm for sorting the data worked. The measure
by which two observations were compared was designed from scratch, as were the algorithms and
methods used to form the final partition. Although there were some problems along the way, such
as having to separate the classes into point classes and cluster classes, we were able to achieve a
working method for sorting the data. An important note, however, is that this method depends
on a few parameters, and for this reason our algorithm might not perform as well on other data
sets. The parameters were also adjusted based on the output, which introduces bias. However,
some assumptions about the data always have to be made, especially when no specification of the
data is given beforehand.
The measure used to compare two observations has a few flaws. First, it does not take into account
how many occurrences there are at each point; two observations that look similar when visualized
in 2D might not be the same if the distributions of the points differ. The second disadvantage of
the measure appears when taking the mean of the distances: if a few points in an observation
deviate heavily, this is "suppressed" when taking the mean, and hence an observation containing
outliers might not be correctly sorted.
In the second and third parts of the algorithm, the sorting is made using statistical models.
However, this is based on the distances between observations, rather than on a property of each
observation. Having a bad representative for a class, or simply adding an observation to a class
where it does not belong, could heavily impact the results. An alternative approach would have
been to examine the classes further to distinguish features that are specific to each class. The
lengths of the observations, along with how the 2D points are distributed, are only two examples
of features that could be considered.
Chapter 4
Neural Networks
This chapter is split into three main parts. The first part concerns the methods used to pre-process
the data, preparing the appropriate data sets needed for training and evaluating neural networks.
In the second part, different types of networks, with varying hyper-parameters, will be tested
and compared against each other. The model that provides the highest accuracy, along with a
reasonable model complexity, will then be chosen to be used in the final part of this chapter. In
this third part, it will be examined how the chosen network model performs for sub-sequences of
different lengths.
Once the data sets required for training a neural network have been obtained, the properties of
the neural networks that will later be evaluated are described. Since most of the theory behind
neural networks has been covered in section 2.2, it is assumed that the reader is familiar with the
fundamentals of how a neural network operates and with the common terms used in this field. All
of the methods described were applied to the time difference vector of an observation, as outlined
in section 2.1.1, and not to the actual time of arrival data.
combining this with larger and more complex networks, the network will be too complex to train
efficiently. More important however, is the fact that the data consists of continuous arrival times,
along with both regular and irregular interruptions. Since a neural network learns to recognize
patterns in different nodes, or sets of nodes, training it on data that is not aligned will result in
poor accuracy.
Instead of taking an entire observation as input, a better approach is to slice the data at the break
points. One observation then results in many smaller slices of varying lengths. By doing this, one
can be certain that all data within a slice is in the correct order. It was important to know how
common slices of different lengths were. Therefore, a histogram of the lengths of the slices, obtained
from the 1,800 observations, was computed and is shown in Figure 4.1a.

As can be seen in Figure 4.1a, almost all of the subsequences are shorter than 30 time steps. More
interesting, however, is the total number of subsequences with lengths greater than or equal to a
certain threshold n, which is illustrated in Figure 4.1b. As can be seen, this is a rapidly declining
curve. Longer slices contain more information and should result in higher accuracy when training
networks. However, this comes at a cost, as the amount of remaining data decreases as the slice
size is increased. For instance, when n = 12 about half of the data remains, whilst when n = 20
only about 10 percent of the data is available. To avoid discarding too much of the available data,
only models with input length 12 or shorter were considered.
Whether or not to normalize the data before slicing was an early question. When un-normalized
data was used for the networks, about 5.55% = 1/18 of the sub-sequences were classified correctly,
which is precisely as good as a random guess among the 18 classes. Therefore, all data used in the
rest of the chapter was normalized.
Since the number of available slices declines rapidly as n increases, we wanted to extract as much
data as possible from each slice. If a network requires an input of size n, there are two ways to
prepare slices of size m where m > n. First, one could simply extract either the n first or the n
last values from each slice. However, this method would discard most of the data and, more
importantly, the inner sub-sequences would be lost. Since the data is full of irregular interruptions,
it is important to consider every possible sequence in order to obtain a well-trained network. A
solution is to let a "sliding window" run over the slices. In this way, one slice of length m results
in m − n + 1 new slices, all of length n, so that as much data as possible can be extracted from a
sequence. This method is what is meant when the term windowing is used later on.
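A minimal NumPy sketch of this slicing and windowing is shown below. The windowing itself follows the description above; the way break points are identified here, as unusually large values in the normalized difference vector, is an illustrative assumption.

```python
import numpy as np

def window_slices(values, n):
    """All length-n windows of a slice: a slice of length m gives m - n + 1 windows."""
    values = np.asarray(values, dtype=float)
    if len(values) < n:
        return np.empty((0, n))
    return np.stack([values[i:i + n] for i in range(len(values) - n + 1)])

def slice_at_breaks(diffs, break_threshold=10.0):
    """Split a normalized difference vector at break points (illustrative criterion)."""
    diffs = np.asarray(diffs, dtype=float)
    break_idx = np.where(diffs > break_threshold)[0]
    pieces = np.split(diffs, break_idx)
    # drop the break values themselves and any empty pieces
    pieces = [p[p <= break_threshold] for p in pieces]
    return [p for p in pieces if len(p) > 0]

# Example: every slice of one observation contributes all of its length-n windows
# samples = np.vstack([window_slices(s, n=10) for s in slice_at_breaks(obs_diffs)])
```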
4.1.3 Constructing the data sets
As touched upon earlier, a 70-15-15 split was used when constructing the training, test and
validation sets. Constructing the final data sets used for training and evaluation was done in the
following way. First, an observation was split into slices. If the model used inputs of length n,
only slices of length n or greater were considered. The longer slices were then split up into slices
of the appropriate length using the previously described windowing method. However, if windowing
was not used, only the first n elements from each slice were considered. The extracted slices were
then added to the correct data set (training, testing or validation) and labeled. The labeling was
done in the one-hot encoding format described in section 2.2.3.
Depending on the class considered, each observation generates a different number of slices. To
avoid bias from some classes containing more slices than others during training, we made sure to
cut the number of data samples from each class to the same size. In this way, the slices from each
class in x_train, the set used to train the network, were all of the same amount. It was also
important that the extracted data from all classes be of equal size in the test and validation sets,
because some classes might be easier for a network to classify than others, and a bias in the number
of samples from each class could therefore affect the final accuracy.
The table above shows the number of sub-sequences in each data set for n = 10. Notice how the
70-15-15 partition split will not be maintained exactly, since the number of sub-sequences extracted
varies for each observation. Also notice how windowing greatly increases the sizes of the sets.
For the MLP models, there were mainly three properties that were interesting to investigate.
The first was whether the use of windowing when creating the data sets would improve training
and hence result in a higher accuracy. The two remaining properties relate to the structure of the
network, namely its complexity and the number of layers. One would assume that the accuracy
increases as the number of nodes in each layer and the number of layers increase. Since training
and evaluating neural networks is a time consuming process, it was not possible to try as many
combinations as desired. However, by carefully considering a few combinations to evaluate, one
was able to get a good understanding of how different properties of the network affect the accuracy.
The number of epochs (30) and the batch size (128) used during training were kept constant
throughout the runs. Although these values may not result in the optimal model, this was of lesser
concern; at this stage, gaining an understanding of how different feed-forward networks compare
was more crucial. In total, 24 different combinations were run and evaluated. The number of inner
layers varied from one to three, with the number of nodes in each layer ranging from 128 to 1,024.
This was all done for the data sets with and without windowing. Table 4.2 displays the accuracy
on the validation data for the different network models.
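As an illustration, the following Keras sketch defines one of the MLP configurations from this comparison: three hidden layers of 256 nodes each, a softmax output over the 18 classes, 30 epochs and a batch size of 128. The ReLU activation is an assumption; ADAM is the optimizer discussed in section 2.2.1, and the data set names are placeholders.

```python
from tensorflow.keras import layers, models

n = 10                                                       # length of the input sub-sequences
model = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(n,)),  # each sample: n time differences
    layers.Dense(256, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(18, activation="softmax"),                   # one-hot output over the 18 classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=30, batch_size=128, validation_data=(x_val, y_val))
```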
As touched upon earlier, a feed-forward neural network is well suited for cases where there is a
pattern in the features of the input data. In our case, the data being used consists of consecutive
time steps, where one would assume a strong time dependency between the values of the input
nodes. In this regard, the relative differences between the values of the input data are of greater
importance than the actual values at exact positions. As outlined in section 2.2.2, a recurrent
neural network, such as an LSTM, is very well suited for these types of problems. In Keras, the
overhead when changing from a feed-forward network to an LSTM-based one is minimal, with the
main difference being how the input data is passed. In the case of the MLP models, each sample in
the input data was passed as a vector of length n. When using a recurrent network, the input data
is instead passed as a matrix of size n × 1. This is interpreted as data from n time steps with one
feature at each step, just as desired. Hence, the shape of the input layer changed, but there was
no need to re-compute the input data.
The MLP models considered had either one, two or three hidden layers. These were now all replaced
by a single LSTM layer, with the number of nodes being varied. Besides the output layer, which
remained the same as before, there were no additional layers. There were two reasons for this.
First, the number of trainable parameters increases greatly as the number of nodes is increased,
due to the recurrent nature of the LSTM layer. Adding more layers might therefore result in a
network with too many parameters, making it difficult to train efficiently. More important, though,
is that the input data was simple enough that a single LSTM layer was assumed to be able to yield
satisfying results. As long as the number of nodes in the LSTM layer was high enough, it was likely
that the majority of the patterns in the input data would be learnt by the network.

As in the case of the feed-forward network models, a number of variations were tried and evaluated.
The number of nodes in the LSTM layer ranged between 25 and 500. As in the previous case, this
was done with and without windowing. The results are shown in Table 4.3.
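A corresponding Keras sketch of the LSTM-based models is shown below: a single LSTM layer, here with 100 nodes, followed by the same dense softmax output layer as before, with the input shaped as n time steps of one feature each. The optimizer and loss choices mirror the earlier sketch and are assumptions.

```python
from tensorflow.keras import layers, models

n = 10
model = models.Sequential([
    layers.LSTM(100, input_shape=(n, 1)),        # n time steps with one feature per step
    layers.Dense(18, activation="softmax"),      # same output layer as for the MLP models
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# the same data sets can be reused; only the shape of each sample changes:
# x_train_rnn = x_train.reshape(-1, n, 1)
```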
4.2.1 Results
Table 4.2: Feed-forward neural network models for n = 10. Accuracy for validation data at the
end of training
Table 4.3: LSTM-based network models for n = 10. Accuracy for validation data at
the end of training
4.2.2 Discussion
In the first part of this section, where MLP networks were compared against LSTM-based recurrent
networks, there were plenty of interesting results. In Table 4.2 one immediately notices how
windowing greatly improves the accuracy. There appears to be an increase of at least five percentage
points for the networks with one hidden layer. For the networks with two and three hidden layers,
the increase is not as large, but still noticeable and consistent for every model. One also notices
how the accuracy increases as the number of nodes per layer is increased. The accuracy usually
reaches its maximum between 256 and 512 nodes, and then declines for more complex models.

In terms of the best model, there were two clear candidates. The first was the model consisting of
two inner layers with 512 nodes each, while the second had three inner layers with 256 nodes in
each layer. With similar accuracies, we have to consider the complexities of the models. Since the
second model has roughly half as many trainable parameters as the first, it is to be preferred.
With most of the values being as expected, there was one value in the table that stood out. When
training the model with three layers of 1,024 nodes each, the resulting accuracy (without windowing)
was only 4.6%. This is marked in the table by an asterisk (*). It is not fully clear why this happened.
Since not using windowing results in much smaller data sets, a hypothesis is that the ratio between
the number of trainable parameters and the number of training samples is too large; the optimizer
may simply not be able to efficiently compute all the weights required to increase the accuracy.
This is supported by a closer examination of the training history: as seen in appendix A.1, the
network clearly struggles to optimize the parameters. The same problem does not seem to appear
with a larger training set, as when windowing is turned on.
When looking at the results from the recurrent networks, we again notice a drastic improvement
when using windowing. This increase is, once again, more pronounced for simpler models with
fewer parameters. One also notices how LSTM-based networks tend to perform better than
traditional feed-forward networks when the number of trainable parameters is roughly the same.
We also notice how LSTM networks provide much higher accuracies despite having far fewer
trainable parameters; for instance, the LSTM(100) network with 42,618 parameters outperforms
even the most complex MLP models. Just as for the networks in Table 4.2, the accuracy tends to
decrease as the complexity of the network increases beyond a certain point. Among the networks
in Table 4.3, we find that the network with an LSTM(300) layer gives the highest accuracy.
However, this increase in accuracy comes at a price: compared with LSTM(100), the increase in
accuracy is about 0.2 percentage points, but with 8.6 times the number of trainable parameters.
We therefore decided to use both network models in the next section, to see how they compare for
sub-sequences of different lengths. Since windowing always seems to yield better results, it will
from now on always be used.
The process of implementation was quite straightforward. Since the sub-sequences of different
lengths being evaluated all need a different number of nodes in the input layer, each test for
sub-sequences of length n requires a new model to be trained. And since the three data sets
(training, validation and test) depend on the chosen slice size, the test data were different for each
test. Imagine we have a sub-sequence of length ten: when the data sets for sub-sequences of length
three are computed, this single original sub-sequence creates a total of eight new sub-sequences.
The data sets thus not only grow in size, but also in the number of different sub-sequences, as n
is decreased. As the lengths of the sub-sequences become shorter, the sequences from different
classes become harder to discern, since they contain less information. This makes it harder for the
network to find patterns, which increases the number of epochs required to reach a good result.
For these reasons, each model was trained for a total of 100 epochs. When examining the accuracy
plots for the validation data, with the intent of finding after how many epochs the accuracy leveled
out, it became clear that 100 epochs was enough for all the models. A concern the reader might
have is the risk of overfitting when increasing the number of epochs during training; this will be
addressed in the following discussion.
4.3.1 Results
Table 4.4: LSTM(100), accuracy for sub-sequences of length n

n          3       4       5       6       7       8       9       10      11      12
Accuracy   85.3 %  91.4 %  93.4 %  95.1 %  95.5 %  96.6 %  97.1 %  97.3 %  97.4 %  97.8 %

Table 4.5: LSTM(300), accuracy for sub-sequences of length n

n          3       4       5       6       7       8       9       10      11      12
Accuracy   85.9 %  92.0 %  94.0 %  95.5 %  96.3 %  97.3 %  97.6 %  98.1 %  97.8 %  97.8 %
Table 4.7: LSTM(100), confusion matrix for n = 12
4.3.2 Discussion
As can be seen in Table 4.4 and Table 4.5, the accuracy gets worse as n is decreased, just as
expected. What was not expected, however, was how small the decline would be. Even for
sub-sequences of length four, the accuracy was still above 90 percent. The largest drop in accuracy
was noted between n = 3 and n = 4, indicating that it becomes harder for the network to find
patterns at this stage. The lowest accuracy, 85.3 percent for n = 3 with the LSTM(100) model, is
still better than the result from the k-NN classifier, which was 82 percent. It is also interesting to
note how the accuracy seems to stagnate for n ≥ 10, indicating that the optimum for how well this
type of network can perform on the original data set has been reached.
Overall, the LSTM(100) model performed worse than or equal to LSTM(300) for every n. However,
the loss in performance was usually only about half a percentage point. This small increase in
performance comes at the cost of a much more complex model: the LSTM(300) model consists of
more than 8 times as many parameters as the LSTM(100) model. Due to this, the more complex
model took nearly 6 times as long to train, and roughly 3.5 times as long to evaluate, compared to
the simpler model. We therefore decided to use the model based on LSTM(100).
In Table 4.6 and Table 4.7 the confusion matrices are shown for n = 3 and n = 12, respectively,
when using the LSTM(100) model. Since n = 3 resulted in the lowest accuracy and n = 12
performed the best, the two corresponding confusion matrices were included in the report. By
considering Table 4.6, we immediately notice how the accuracy differs between the so-called point
classes and the cluster classes. Since the point classes consist of simpler, more predictable patterns,
this is very reasonable. The only point classes that the network seems to struggle with are classes 3
and 5, which are sometimes confused. This is very likely due to them having a similar appearance,
as visualized in 2D in Figure 3.4.

Table 4.7 shows a clear improvement. The problem with the cluster classes being harder to classify
appears to be gone. Just as before, class 3 is often classified as class 5, although to a lesser extent
than for n = 3. However, class 5 is sometimes (in 9 percent of the cases) classified as class 13.
Unfortunately, we do not have a good explanation for this. It could, for instance, be a result of the
network trying to increase the total accuracy by improving the accuracy for other classes; the cost
of gaining an overall higher accuracy might be that the accuracy for class 5 decreases.
The issue of overfitting was also considered. For the LSTM(300) models, the accuracy plots for the
training and validation data were plotted next to each other in order to study this; an example is
shown in appendix B.1. The training accuracy seems to continue to grow at a very slow pace while
the validation accuracy plateaus. Therefore, no overfitting seems to be taking place, at least not
within the number of epochs the model is trained for. This was also attempted with a very long
training time of 1,000 epochs, and the same results were obtained.
The reason for this is probably the very large amount of training data in combination with the
relatively short sequences used in the models. For overfitting to take place, the weights would
have to be adjusted so that the output always matches, for example, class 3 for a specific sequence
belonging to class 3. However, since there are so many different sequences of class 3 being
considered in each epoch, this adjustment keeps getting tuned to fit new sequences and thus never
becomes a perfect fit for any single one. Usually, an overfit model still manages to fit all the training
sequences in a way that makes it lose generality, but naturally this becomes harder as the diversity
of the sequences goes up. In addition, a shorter input sequence means there are fewer patterns to
fit perfectly, which strengthens the effect of a large data set. Therefore, it was concluded that for
this specific implementation, overfitting was not a concern. However, since there was still the
possibility that some generalization could be lost, dropout was also attempted. This yielded worse
results, however, and the dropout layers were therefore removed.
Chapter 5
Applications on whole
observations
In Chapter 4, neural networks dealing with sub-sequences were established, trained and tested. In
this chapter, methods are developed to apply these networks to whole, full-length observations
instead of isolated sub-sequences. The reasoning is that a longer observation can be seen as a chain
of sub-sequences, and therefore it should be possible to apply networks able to handle sub-sequences
in tandem to classify a longer sequence. In order to do this, the LSTM(100) networks from Chapter
4 are considered final, and the focus of this part is to vary the parameter n and adjust it to different
methods in order to see how they perform for longer sequences. These different neural network
models will often be referred to as "Model n" for a specific value of n in order to simplify the text.
Henceforth, "Model 5" will therefore refer to an LSTM(100) network with an input size of five.
To determine how well a method and network performs, four performance measures were considered:
the simplicity of the model, its accuracy, its speed, and whether an observation is classifiable or
not. The simplicity of the model refers to how few parameters need to be stored and how few
calculations are needed to make a decision. The accuracy of the model means how often an
observation is classified correctly. The speed of the model refers to how much time the model
requires to classify an observation. How often a model can classify an observation refers to how
often sub-sequences of the required size can be found, so that the network models are able to make
a classification at all. These performance measures were gathered by applying the models to the
test data, that is, the final 15 observations from each class that had remained untouched during
the training and validation process. This was done to avoid most of the bias and give as general a
result as possible. The results were first acquired for the developed methods using the fixed-size
networks introduced in Chapter 4, and then for more complicated models able to handle varying
input sizes.
In addition to this, the same networks and methods were also applied to trash data in order
to study their behaviour for observations that did not belong to any class, or belonged to a class
but were heavily disturbed. Optimally, the model should be able to determine not only which class a
sequence is from, but also if it is not from any class in the distribution at all, or if it belongs to a class
that has been disturbed. One example could be a sequence that largely belongs to one class
but where certain subsequences are erroneous, causing it to be marked as trash in the steps described in
Chapter 3. Another could be a sequence in which two different radar types are detected in tandem.
This is something the model should optimally be able to detect, allowing the supervisor to draw
as much information as possible from the sequence.
5.1 Variation of sub-sequence sizes between observations
While it was made sure that there were an equal amount of subsequences from each class in Chap-
ter 4, no importance was placed on from which observations they came. This means that it is
possible that a network can perform very well on all the subsequences it is tested on but if all of
these sequences only come from a few observations, that is you can not find sequences of that size
in most observations, the model will still be very bad in practice. This is because it will not be
able to make a classification at all for most observations. In addition to this it could also be that
subsequences of certain sizes are very rare, meaning it could take a long time for the detector to
find such subsequences, making the model very slow. This was investigated in order to confirm
the suspicions. The results are visualized in Figure 5.1 where the sizes of subsequences for each
observation in each class have been studied. The table shows the maximum slice size one can have
for the model so that X percent of observations from class C will be able to be classified.
Looking at Table 5.1, one can see that in order to be certain that any observation can be
classified, the input size cannot be larger than six. For 90 percent of the observations that number
is instead eight, and for 75 percent it is nine. However, for the majority of classes, larger models can
still classify all observations, and there is a large variation between classes. Recalling the results
from Chapter 4, this adds another dimension to choosing the input size for the model. While larger
sizes performed better on isolated subsequences, they will more rarely be able to make a classification
at all for certain classes, yielding a worse result in practice. The large variation also offers some
flexibility depending on the purpose of the task. If the supervisor is only interested in classifying a
certain subset of classes, a larger input size can be used, which also increases the accuracy.
Table 5.1: The maximum sub-sequence sizes one can use for each class so that a certain percentage
of the observations can be classified.
Class 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
75 % 12 13 15 18 15 9 9 13 19 13 15 20 17 15 13 20 16 12
90 % 9 11 11 14 15 8 8 12 14 9 11 18 16 10 10 20 13 9
100 % 7 7 9 9 11 7 6 10 9 6 8 11 14 9 8 11 8 7
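Thresholds of this kind can be computed directly from the lengths of the uninterrupted subsequences found in the observations. The sketch below assumes that the longest uninterrupted subsequence length of each observation has already been extracted; the function name and the way the lengths are summarized are assumptions made for illustration.

```python
import numpy as np

def max_size_for_coverage(longest_lengths, coverage):
    """Largest input size n such that at least `coverage` (e.g. 0.90) of the
    observations contain an uninterrupted subsequence of length >= n.

    `longest_lengths` holds, for each observation of one class, the length of
    its longest uninterrupted subsequence (an assumed pre-computed summary).
    """
    longest = np.asarray(longest_lengths)
    for n in range(int(longest.max()), 0, -1):   # try sizes from largest to smallest
        if np.mean(longest >= n) >= coverage:    # fraction of observations covered
            return n
    return 0

# Hypothetical usage, one call per class:
#   max_size_for_coverage(longest_per_observation_class_c, 0.90)
```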
5.2 Method
5.2.1 Motivation and explanation
In Chapter 4, neural networks for inputs consisting of uninterrupted sequences of a pre-fixed size
were developed. In this part, these models are extended to longer sequences with regular and
irregular interruptions that divide them into subsequences of varying sizes. Once again, these longer
sequences are normalized, similar to how the subsequences were normalized before. The method
should also be able to process the observations in real time; that is, the data points should be fed
to the model in order and a classification should be made as quickly as possible. This introduces a
couple of problems when it comes to normalization, since in practice one can only normalize using
the points seen so far. For simplicity this is ignored in this chapter, but the issue is addressed
in Chapter 6. The classification should then be possible to update as time progresses and more
data becomes available. This means that the whole ten minute sequence will rarely be
needed, and optimally a decision and all the relevant information should be obtainable much
quicker. Therefore, the full ten minute sequence will often not be plotted, as one can get a full
picture from a much shorter period of time. This also saves time, as running
neural networks over long sequences takes considerably longer. However, all the results presented
hold equally for the full sequences as well.
Two different methods were considered, where the second builds on the first. This is
because there are different scopes of interest. In some cases, one is interested in examining how
the classification varies over time, for example in order to detect types of disturbance or interference
between different radars. In other cases, one is more interested in getting a single classifi-
cation of the observation as quickly as possible in order to make a decision. The method dealing
with the first case is presented first. These methods are first explained for the simpler
fixed-input-size neural network models, but with a slight modification they can also be used for
the more complex varied-input-size models that were attempted later on.
A more precise description of how the algorithm works follows. The input is once again in the shape of a
time-of-arrival vector. This is transformed into a difference vector, and differences larger or smaller
than certain thresholds are replaced by NaN-values in order to represent interruptions. A sub-
sequence size N and a step size K are then defined and the corresponding neural network model
for size N is imported. The difference vector is padded with N − 1 NaN-values in the beginning,
so that the first window can have the first value as its last element. A window of size N
is then allowed to iterate over the data, either over all of it if the data already exists, or
over it as more arrives if the model is applied in real time. At each step the window takes a batch, checks
if it includes any NaN-values (if so it outputs NaN) and if it does not, sends it to the neural
network model and takes the single output of the network as its output. The output
is placed in a larger output vector that grows iteratively as more data is processed.
If a step size of one is used, the padding in the beginning ensures that index i of the
output vector corresponds to index i + 1 in the original time-of-arrival vector. Thus, one can map
each classification to a certain time point, which is especially useful in real time. For example, the
first N − 1 elements in the output vector will always be NaN, since no
classification can be made until at least N data points exist. If a larger step size is
used, one simply makes K copies of the most recent output in the output vector in order to
keep the same correspondence to the time-of-arrival vector.
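A sketch of this two-layer (sliding-window) method is given below. The interruption thresholds, the input shape (N, 1) and the call to model.predict are assumptions about the Keras models of Chapter 4, so the snippet should be read as an illustration of the described procedure rather than the exact code used.

```python
import numpy as np

def classify_stream(toa, model, n, k=1, min_diff=0.0, max_diff=1.0):
    """Apply a fixed-input-size network to a time-of-arrival vector `toa`.

    Returns an output vector aligned with the difference vector, where each
    element is a predicted class or NaN when no classification can be made.
    `min_diff` and `max_diff` are illustrative thresholds for interruptions.
    """
    diff = np.diff(np.asarray(toa, dtype=float))
    diff[(diff < min_diff) | (diff > max_diff)] = np.nan      # mark interruptions as NaN
    padded = np.concatenate([np.full(n - 1, np.nan), diff])   # pad so the first window ends at diff[0]

    output = []
    for start in range(0, len(diff), k):
        window = padded[start:start + n]
        if np.isnan(window).any():
            output.extend([np.nan] * k)                        # no classification possible here
        else:
            probs = model.predict(window.reshape(1, n, 1), verbose=0)
            output.extend([int(np.argmax(probs))] * k)         # repeat K times to keep time alignment
    return np.array(output[:len(diff)])
```

With this alignment, output index i corresponds to index i + 1 in the time-of-arrival vector, as described above.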
In Figure 5.1, an example of this is visualized. The first layer is a difference vector with
arbitrary values and a single interruption. N is set to three and K is set to one. The
difference vector is therefore padded with two NaN-values in the beginning, shown in the second
layer. The window slides over the vector one step at a time and, if no NaN-values are present in
the window, takes it as input to the imported neural network model and outputs a classification.
This classification, or a NaN-value, is added to the output vector at the corresponding index, shown in
the third layer. Here, there are two NaN-values in the beginning since the first two windows include
a NaN-value. Two classifications are then made, as there are four cells with non-NaN values,
after which a NaN-value is included again, resulting in yet another NaN-output.
Figure 5.2: The method applied on the first 250 data points of observation 75 from class 13, using
the neural network model with fixed input size seven
An example of how this can look in practice is shown in Figure 5.2, where the method has
been applied to an observation from the data set. The neural network used here has input
size seven. Time is on the x-axis and the classifications are on the y-axis, and the output has
been plotted for the first 250 data points, roughly corresponding to the first 35 seconds of the
observation. When no classification is made it is represented by an empty value, resulting in the
large gaps seen in the plot when the transmitter is turned away from the receiver.
The same observation is plotted for 20 and 1500 data points as well in Figure 5.3, corresponding
to about 20 milliseconds and four minutes respectively. There it can be seen that there are multiple
points hidden in each marking in Figure 5.2, meaning the method rather quickly makes many
predictions before the longer interruptions. One can also see that the behaviour does not really
change over longer time periods. The method mostly classifies the observation correctly as Class 13,
but there are occasional faulty classifications. In these examples a subsequence size of seven has
been used, but both smaller and larger models were also tested and the results studied. In general,
smaller sizes resulted in more frequent classifications but also more faulty ones, as expected.
Figure 5.3: The same model and observation as before, but visualized for the first 20 data points
(a) and the first 1500 data points (b)
The method gives rise to some interesting patterns and offers an alternative way to visualize the radar
signals compared to the previously used 2D-pattern. Poorly classified subsequences, which in the
2D-pattern might have given rise to strange shapes distorting the classification process, can now be
isolated and studied to see what went wrong. One could also remove the data points corresponding
to those subsequences and plot the 2D-pattern again to see if it looks different. However, it is also
possible that the reason certain subsequences are classified incorrectly simply comes down to
shortcomings of the network model, especially for those with a smaller slice size. This type of analysis
will be explored further in a coming section.
The second method adds a majority layer on top of the method just described. The first two layers
work exactly the same as before. (Note that there should actually be four layers, but the padding
step has been excluded from the visualization for clarity.) In addition, the third layer takes the
output of the previous layers and changes the value at index i to be the majority of all values at
positions smaller than i in the previous layer. This means that the output is always the majority
vote over what the network model has previously predicted the observation to be, regardless of
what the current subsequence is classified as. The motivation is that even if the model sometimes
classifies incorrectly, it is assumed to be correct most of the time. Probabilistically, this means the
majority layer will converge to the correct classification as more data becomes available, since the
correct classifications are likely to outweigh the incorrect ones.
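A possible implementation of the majority layer, applied on top of the output vector of the two-layer method, is sketched below; it is an illustration of the described idea rather than the exact code used in this work.

```python
import numpy as np
from collections import Counter

def majority_layer(window_output):
    """Third layer: at each index i, output the most common non-NaN classification
    made at positions smaller than i (a running majority vote over past outputs)."""
    out = np.asarray(window_output, dtype=float)
    counts = Counter()
    majority = np.full(len(out), np.nan)
    for i, c in enumerate(out):
        if counts:                                       # majority of everything seen before i
            majority[i] = counts.most_common(1)[0][0]
        if not np.isnan(c):                              # then register the current classification
            counts[int(c)] += 1
    return majority
```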
In Figure 5.5, the majority layer method is visualized for 1500 data points, once again using
the same observation and neural network as in the visualization of the simpler method. It can be
seen that it converges to the right class quite rapidly. Comparing it to Figure 5.3 (b), one sees that
there are no occasional errors after the initial misclassification and that the method converges to
the correct value very quickly. There are also no large gaps in the classifications. Another example
is shown in Figure 5.6, where the majority layer has been used with two different neural network
models on the same observation, one with input size three and the other with input size nine.
Here, one can see how the smaller neural network makes a couple of very quick but faulty
classifications before it converges to the correct class. The larger network, on the other hand,
takes longer before making its first classifications but then converges to the correct decision immediately.
Figure 5.6: Majority method used on observation 66 from class 17 with two LSTM(100) neural
networks with different input sizes. In (a) an input size of three was used and in (b) an input size
of nine.
Due to the nature of the majority layer, it almost always converges to some classification, whether
right or wrong. It therefore opens up a new scope of interest besides accuracy, namely how
quickly the model converges. As seen in Figure 5.6, the behaviour differs greatly between the two
neural networks even though both classify the observation correctly in the end. It is
reasonable to assume that the speed with which the classification converges largely varies depending on
which model is used. If two models have an equal level of accuracy, the quicker one is clearly
preferable when working in real time. The majority layer will therefore be used to compare the different
types of models, where the convergence speed is used as a performance measure alongside the accuracy.
There were two possible ways of handling subsequences of varying sizes. The first was to use a
so-called padded model, that is, a new network model trained on subsequences of different sizes
that are padded to a pre-decided size, allowing it to use inputs of the same size every time. Such a
model was created and trained with a size of ten in order to keep the consistency from Chapter 4,
allowing it to be trained on the same data, for the same amount of time, and with the same type of
network as the fixed-size models.
The second idea was to use all the trained models for sizes up to ten and apply the largest model
possible to each subsequence. This meant that a classification could be made for every subsequence
while the largest model was always chosen. A drawback was that it would require
about eight times as many parameters to be stored, but if the results were significantly
better, it could potentially be worth the trade-off. The reason a maximum model size of ten was
chosen was that the larger network models simply did not perform better, so including any larger
models would be a waste of memory. The padded model also performed at its best for a size of ten,
meaning that ten was chosen for that model as well.
For both models, a slight adjustment to the method had to be made. Instead of using a fixed
subsequence size to determine N, it was simply always set to three, so that subsequences of at
least size three could always be classified. Smaller subsequence sizes were not considered, since
they were deemed too simple to contain much interesting information. In addition, the window did
not have a fixed size taking fixed steps, but was instead allowed to grow until it reached size ten
or included a NaN, after which it behaved as before.
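The growing window and the two ways of handling its varying length can be sketched as follows. The dictionary of fixed-size models, the zero left-padding and the input shapes are assumptions based on the description above, not the exact implementation used.

```python
import numpy as np

def next_subsequence(diff, start, min_size=3, max_size=10):
    """Grow a window from `start` until it reaches `max_size` or hits a NaN.
    Returns the subsequence (length >= min_size) or None if too few points."""
    window = []
    for x in diff[start:start + max_size]:
        if np.isnan(x):
            break
        window.append(x)
    return np.array(window) if len(window) >= min_size else None

def classify_combined(window, models):
    """Combined model: use the largest fixed-size network that fits the window.
    `models` is assumed to map input sizes 3..10 to the trained networks."""
    n = len(window)
    probs = models[n].predict(window.reshape(1, n, 1), verbose=0)
    return int(np.argmax(probs))

def classify_padded(window, padded_model, max_size=10):
    """Padded model: left-pad the window up to `max_size` (zero padding is an
    assumption) and use the single network trained on padded inputs."""
    padded = np.concatenate([np.zeros(max_size - len(window)), window])
    probs = padded_model.predict(padded.reshape(1, max_size, 1), verbose=0)
    return int(np.argmax(probs))
```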
5.3 Results
With the methods to apply the different models in practice established, the models were ready to
be tested. The eight fixed-size models along with the two varied-size models were tried on
the entire test data set, consisting of 18 × 15 observations, and the results were summarized. Four
performance measures were considered: the accuracy of a model, the speed with which
it converged, how often it could classify at all, and how simple the model was.
data was marked as unclassified by the model, meaning the output only consisted of NaN-values.
The reasoning behind separating unclassified from wrongly classified observations was that it was
considered worse to make a faulty classification than to simply be unable to classify an obser-
vation.
The third performance measure was how quickly the model converged to a decision, regardless
of whether that decision was correct or not. As stated in the previous paragraph, the model was
considered converged if 100 consecutive values in the output vector were the same, and it was
considered to have converged at the first time point of such a sequence of 100. In addition to this,
the simplicity of the model was also a factor. However, all the models were of the type LSTM(100)
and thus had equally many parameters to store. The one exception was the model using a
combination of all the fixed-size models, which therefore used eight times as many parameters.
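The convergence criterion can be checked with a short function like the following sketch; the run length of 100 comes from the criterion above, while the handling of NaN-values is an assumption about how unclassified points are represented.

```python
import numpy as np

def convergence_index(output, run_length=100):
    """Index of the first element starting a run of `run_length` identical,
    non-NaN classifications; None if the model never converges."""
    out = np.asarray(output, dtype=float)
    run_start, count = 0, 0
    for i, value in enumerate(out):
        if count > 0 and value == out[i - 1]:     # run of identical classifications continues
            count += 1
        else:                                      # NaN or a new class starts a new run
            run_start, count = i, 0 if np.isnan(value) else 1
        if count >= run_length:
            return run_start
    return None
```

The convergence time in seconds is then obtained from the arrival time corresponding to that index of the output vector.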
To begin with, the test data set was extracted, consisting of 15 observations for each class, or
270 observations in total. The accuracy and convergence speed for the fixed-size models were then
acquired, and the results are summarized in Table 5.2 and Table 5.3, respectively. In Table 5.2
each column represents one of the neural network models and each row a specific class. The cells
contain the number of observations from that class that were correctly classified by that
model. The final two rows show, for each neural network model, the total percentage of observations
that were classified correctly and that could not be classified, respectively.
In the convergence speed table, the columns once again represent the different models and the
rows the different classes, but here the cells show the average convergence speed for that
specific class and model. The final row shows the average convergence time over all classes for
each model. In addition, all the models have been run on the first 1000 data points
of each observation. Due to the large variation in length between observations, shown in Figure
1.1, this was chosen to guarantee that no observation would be unclassifiable because of a lack
of data points. On average, 1000 data points represent about a third of the total ten minute time
span, or around three to four minutes of listening. It was also confirmed that the results generally
did not change when including more than a couple of minutes of data, so the whole ten minute
sequence was seen as redundant. This is further supported by the convergence times shown in
Table 5.3, which are never on the order of hundreds of seconds.
The same performance measures were then acquired for the two models that utilize a varying
input size, described in the previous section. The results for these two are shown in Table 5.4 and
Table 5.5. Note that the number of unclassifiable observations is omitted here, as it is zero for
both models.
Table 5.2: Accuracy for the fixed-size neural network models
Class \ n   3   4   5   6   7   8   9   10
1 15 15 15 15 15 14 13 12
2 15 15 15 15 15 15 14 14
3 12 13 13 13 13 12 12 11
4 15 15 15 15 15 15 15 15
5 15 15 15 14 13 11 11 11
6 15 15 15 13 12 10 9 8
7 15 15 15 13 9 7 6 6
8 15 15 15 15 15 15 15 15
9 15 15 15 15 15 15 15 14
10 15 15 15 14 14 14 13 12
11 15 15 15 15 15 15 14 14
12 15 15 15 15 15 15 15 15
13 15 15 15 15 15 15 15 15
14 7 9 10 12 14 15 15 14
15 14 14 14 15 15 14 13 10
16 15 14 14 14 14 14 14 14
17 14 14 14 15 15 15 15 15
18 15 15 15 15 14 13 12 12
Accuracy 95.2% 95.9% 96.3% 95.6% 93.7% 90.4% 87.4% 84.1%
Unclassifiable 0% 0% 0% 1.1% 4.1% 6.7% 10.0% 13.0%
Table 5.3: Convergence speed (in seconds) for the fixed-size neural network models
Class \ n   3   4   5   6   7   8   9   10
1 0.003 0.472 0.962 5.373 8.258 35.482 43.986 47.763
2 0.155 0.156 0.691 0.998 6.924 15.791 18.243 29.165
3 3.638 7.421 7.422 8.743 15.948 45.921 50.748 58.978
4 0.003 2.615 2.616 2.618 2.619 2.620 5.233 5.234
5 1.008 0.174 2.478 0.009 0.011 0.012 0.013 0.243
6 1.843 2.845 5.534 24.952 38.135 43.001 37.695 27.971
7 0.787 5.828 31.443 34.029 28.089 0.419 4.022 13.946
8 0.347 0.348 5.418 5.600 6.144 6.326 27.853 35.873
9 0.703 2.877 8.425 8.425 8.426 8.711 18.131 20.531
10 6.689 9.407 9.679 12.307 17.613 19.672 25.621 8.508
11 0.343 0.189 2.611 2.612 7.362 32.014 9.657 18.911
12 16.584 17.982 18.507 20.657 20.658 21.514 26.483 35.504
13 1.381 3.723 4.129 4.098 4.918 15.964 16.766 35.619
14 2.849 1.508 1.133 6.557 0.793 4.815 4.630 2.497
15 6.400 6.913 9.656 9.149 22.537 8.988 20.263 8.467
16 2.672 2.125 2.125 2.707 2.708 3.289 3.471 3.472
17 1.426 2.804 3.454 5.236 5.389 8.973 8.507 9.751
18 6.066 5.199 6.582 13.196 7.846 14.334 15.385 28.735
Average (s) 3.629 4.033 6.826 9.302 11.354 15.989 18.704 21.709
Table 5.4 and Table 5.5: Accuracy and convergence speed per class for the padded model (left) and the combined model (right)

              Padded model              Combined model
Class    Accuracy / 15  Speed (s)   Accuracy / 15  Speed (s)
1             15          0.003          15          0.003
2             15          0.155          15          0.155
3             13          3.358          13          3.359
4             15          2.617          15          0.003
5             13         11.356          14          7.830
6             15          1.843          15          1.843
7             15          0.787          15          0.787
8             15          0.347          15          0.347
9             15          6.251          15          0.702
10            15          9.388          15          6.689
11            15         16.099          15          0.180
12            15         23.559          15         15.029
13            15         37.935          15          3.727
14             9         12.585          15         19.397
15            15          5.005          15          5.647
16            14          5.007          14          2.124
17            14         31.647          15          2.251
18            15          5.199          15          5.199
Mean        90.0%        9.619 s       98.5%        4.182 s
5.3.2 Discussion
Beginning with the fixed-size models, the highest accuracy achieved was 96.3 percent, for Model 5.
Model 3 was the quickest at converging, with a 3.629 second convergence
time on average. For larger models the speed generally dropped as expected, but surprisingly the
accuracy also started dropping quickly. The lack of subsequences of matching size,
described in Section 5.1, seems to be the cause, as the number of observations
that could not be classified at all increases quite rapidly for the larger models. Meanwhile, the
accuracy remained fairly high for n as small as three. Comparing Model 3 and Model 10, one
sees a difference in accuracy of over ten percentage points in favor of Model 3. This stands in
contrast to the results acquired in Chapter 4, where Model 3 had an accuracy of 85.3 percent
and Model 10 one of 97.3 percent. This confirmed the suspicions described in Section 5.1 and showed
the importance of investigating performance on full observations rather than just on isolated
subsequences. Weighing accuracy, speed and simplicity, the model of size three seems to perform the
best, but for Class 14 it performs significantly worse than the larger models, making Model
5 the strongest contender for the best model. The larger models can be rejected based on their lower
accuracy, slower convergence and more complicated structure. Therefore it was concluded
that an LSTM(100) network with input size five was the best network for a fixed input size.
Looking instead at the varied-size models, one sees a worse accuracy for the padded model and
a slightly better accuracy for the combined model, 90.0 percent and 98.5 percent respectively.
The average convergence speeds for the two models were 9.62 seconds and 4.18 seconds respectively.
The first thing that stands out is that the padded model performs very poorly, as it is less
accurate, takes longer to converge and is more complex. It can therefore safely be rejected. If
one instead compares Model 5 to the combined model, the convergence speeds are approximately the
same, but the combined model has a two percentage point accuracy advantage. One therefore has
to weigh the benefit of that extra precision against the cost of storing about eight times as many
parameters in memory and the added complexity of constantly switching models. It should be
noted, though, that both models have a 100 percent accuracy for many classes, and that for the
classes where there is a difference in performance, the difference is larger than the overall two
percentage points suggest. Looking at Class 14, Model 3 is only correct 46 percent of the time and
Model 5 66 percent of the time, while the combined model is always correct. Thus, if the emphasis
is on obtaining a more reliable classifier, the combined model is preferable.
Moving on to examine how the models performed for specific classes, the results were mostly
in line with what was discovered from the confusion matrices in Chapter 4. It is interesting to
note that the general performance is not much better than for the isolated subsequences, once
the unclassifiable observations are disregarded. This suggests that problematic subsequences tend
to group together rather than being evenly spread out; if they had been evenly spread out, the
majority layer would have handled them, yielding a higher total accuracy. It was also noted that
the performance for certain classes, in particular Class 6 and Class 7, dropped heavily for the
larger models while others, such as Class 13, remained stable. This is in line with the issues
discussed in Section 5.1 about certain classes being unclassifiable more often for larger models.
The two classes causing the most problems for the smaller models were Class 3 and Class 14,
which were also the classes hardest for the networks to classify. As discovered in Chapter 4, this
is once again because observations from Class 3 are classified as Class 5 and observations
from Class 14 as Class 15. Looking at Figure 3.4 this makes sense, as those pairs resemble each
other. Class 3 is clearly a subsequence of Class 5, and Class 14 and Class 15 both resemble two
diagonal lines, with the difference that Class 15 is a bit more jagged and Class 14 has a regularly
appearing point outside the diagonals. In addition, a single observation from Class 16 could
never be classified except by the smallest fixed-size model. This was odd, but it was dismissed as a
strange outlier.
It was seen that the performance on observations from Class 14 increases with larger models,
suggesting that a larger input allows them to be classified better. This is probably because the
point lying slightly outside the main shape in observations from Class 14 is included more
frequently with a larger input. If it is excluded, the pattern in an observation from Class 14 will
most likely be identical to the pattern in Class 15. However, the same does not hold for Class 3,
which performs poorly for the larger models as well. Looking at the pattern of Class 3 in Figure 3.4,
one sees that Class 3 consists of a 16 data point long pattern that is entirely contained in the larger
35 point pattern of Class 5. This means that one would need at least 17 consecutive data points to
determine with certainty whether a subsequence belongs to Class 5 rather than Class 3. Subsequences
are rarely of this length, and never in our models, so the classes are mixed up. It is noted that, with the
exception of the varied input size models, Class 5 is classified correctly. This is probably just due
to it having the longer pattern, meaning there is a smaller risk that a subsequence of it is similar
to the subsequences typical for Class 3. The reason the varied input size models behave oddly with
regard to Class 5 and Class 14 is probably that training with a padded model causes some issues.
Finally, a note on the accuracy. Since only 15 test observations exist for each class, the results
are not entirely reliable, and using a larger test set would likely have led to more precise accuracy
estimates. For example, it is unlikely that any class can be classified correctly 100 percent of
the time in practice, but with only 15 test observations any true probability above 96.7 percent will
on average be rounded up to 100. However, the main purpose here is to compare models in order
to find the best one, and since no bias has been introduced, as the test data has not been used in
choosing or training the models, the comparison between them likely still holds. Also, while the
exact numbers might be somewhat off, it is unlikely that the results would be so much worse that the
main conclusion would not hold. A top accuracy of 95-99 percent can probably be assumed, which means
that classifying radar signals using neural networks is definitely possible. More test data would
have been welcome for the smaller classes, but due to the issues described in Chapter 3, those types
were relatively rare and quite hard to acquire. Collecting more data could, however, be a project
for future studies.
5.4 Handling the trash observations
In this part, the purpose is to study how the models behave for the observations in the trash class.
Since a single classification is not really of interest here, but rather how the observations are classified
over time, the simpler two-layer method is used.
Many of the observations in the trash class were presumed to be observations from one of
the 18 classes, but disturbed by one or a few outlier points. These outliers caused the whole
difference vector to be slightly shifted after normalization. The hope was therefore that the model
would be able to handle this and still classify the subsequences outside the outlier points correctly.
In addition, it was suspected that many of the trash observations were the
result of two or more types of radar being detected simultaneously, resulting in an odd plot in
the 2D-perspective. If this was the case, it was considered unlikely that the two radar signals would
interfere during the exact same time frame, and thus each subsequence should still belong to just one
class. In theory, this should be detectable in the real-time plots as classifications of two or more
classes in intervals matching when the corresponding signal is detectable by the receiver.
In order to visualize this, a combination of the original 2D-perspective from Chapter 3 and the
method developed earlier in this chapter was used. First, the two-layer method using
Model 5 from the previous section was applied to each observation from the trash class. The
resulting time series plot was then placed next to the matching 2D-plot. Thus, one could easily
look at the classes appearing in the time series plot and compare their corresponding 2D-plots, seen
in Figure 3.4, to the observation's 2D-plot to see what conclusions could be drawn. Note that only
the first 250 points were plotted in order to save time. This means that certain outlier subsequences
might not be visible, as it is impossible to determine from what part of the sequence they come
just by looking at the 2D-plot. However, the relevant information can still be gathered from
a shorter part of the whole sequence.
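A sketch of how such a side-by-side view could be produced is shown below. It plots the classification-over-time output next to the 2D-pattern of time shifted differences; the axis choices and plotting conventions are assumptions and do not reproduce the exact thesis figures.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_observation(toa, output, n_points=250):
    """Plot the classification over time (left) next to the 2D-pattern of
    time shifted differences (right) for the first `n_points` data points.
    `output` is assumed to be aligned with the difference vector of `toa`."""
    toa = np.asarray(toa[:n_points], dtype=float)
    out = np.asarray(output[:n_points - 1], dtype=float)
    diff = np.diff(toa)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(toa[1:], out, ".")           # one classification per time difference
    ax1.set_xlabel("time (s)")
    ax1.set_ylabel("classified class")
    ax2.plot(diff[:-1], diff[1:], ".")    # consecutive differences plotted against each other
    ax2.set_xlabel("difference i")
    ax2.set_ylabel("difference i + 1")
    plt.tight_layout()
    plt.show()
```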
Figure 5.7: An observation that seems to be from Class 3 with four to five outlier points
The first observation this was done for appears to be from Class 3, but slightly disturbed by
four to five outliers. The results are shown in Figure 5.7, where one can see that it is indeed
classified as Class 3. This makes sense because, even though outliers exist, most subsequences will
still be clear examples of Class 3. It also indicates that the slight skew from the normalization did
not affect the results. This was further confirmed by testing other observations that look like they
belong to one of the existing classes but contain a couple of outliers.
These results show that even if an observation has a couple of outliers, it can still be classified
correctly. This was of interest because the test set in the previous section only contains cleaned
observations, so no information on how well the models handle observations with outliers had been
obtained. Since outliers can appear quite frequently, this means that one does not necessarily need
any outlier-removing preprocessing before sending an observation through the model in order to
achieve results of the quality seen in the previous section.
The second situation to be tested was an observation that seemed to be the result of two signals
interfering with each other.
Figure 5.8: An observation that seems to be a mix of two different observations from two different
classes
In Figure 5.8, one can see an observation that appears to be a combination of Class 14 or Class 15
and Class 3. However, it is for the most part classified as Class 13 and Class 18. This seemed odd
at first, since the individual subsequences were presumed to still belong to their respective classes.
It can also be noticed that, due to the normalization, their positions are quite skewed compared
with how individual instances of Class 3 and Class 14/Class 15 look. Since the differences appear
spread out and do not really resemble anything, the network classifies them as Class 13 and Class 18,
which are the most spread out classes available. Looking at other observations with similar
behaviour, one could see the same pattern of classifications as Class 13 and Class 18, supporting
this hypothesis. While this is quite hard to get around, it could potentially be solved by normalizing
each subsequence individually instead of the whole vector at once, but that has the drawback of
making undisturbed observations more prone to noise and more spread out. Due to time constraints
this was not explored further, but it is perhaps something that could be attempted in the future.
Figure 5.9: Visualization of an observation that does not seem to resemble any other observation
Finally, observations with a more chaotic pattern, not really resembling any established class,
were examined. These were assumed either to belong to one class but be heavily disturbed by
noise, or to belong to two or more classes whose subsequences interfere with each other, resulting
in strange patterns. An example of this is shown in Figure 5.9, which is classified mostly as Class
13 and Class 12. It is known from looking at other plots that both chaotic and mixed-up plots tend
to be classified as Class 13, due to Class 13 being very spread out. If one decides to ignore the
classifications as Class 13, one could therefore potentially see this as a heavily disturbed observation
from Class 12. However, due to the issues pointed out in the previous paragraph, this is not
reliable. It is clear that further work is needed in order to make any clearer predictions for
observations such as these.
Chapter 6
Looking back at the problem formulation, we believe we can conclude that artificial neural
networks can classify different types of radar signals. In addition, this can be done with a
very high accuracy after just a couple of seconds. The exact speed and accuracy are not certain
due to the limited test data, but the accuracy is likely to exceed 95 percent within
a couple of seconds.
An interesting comparison between the traditional 2D-perspective used when classifying radar sig-
nals and the neural networks used here concerns the ability to classify small subsequences. It is
unlikely that plotting just four to five data points would allow a person to make a classification
with over 95 percent certainty by looking at the plot. The neural network models, however,
managed to do this. This is not only an indication of the power of neural networks but also re-
veals that there is potentially much more information stored in a subsequence than what is seen
in a 2D-plot of the time shifted differences, meaning that more than two dimensions contain useful
information.
A few comments on issues and what could have been handled differently are in order. The clearest
drawback is that an LSTM(100) network has been used for every input size, both fixed and varied.
It is very unlikely that the same model performs best for all input sizes, and it is therefore likely
that even better results could have been achieved if the optimal model had been found for
each individual input size instead. In addition, it is possible that a less accurate but much
simpler model for isolated subsequences could still classify with comparable accuracy once applied
to the full sequences. The reason this was not fully investigated was mostly time constraints. Since
training, testing and evaluating different neural networks take a lot of time, and it would have
required eight times as much testing to find the optimal model for every input size, it was deemed
infeasible. Furthermore, the results were already sufficiently good and it was unlikely that a much
higher accuracy could be reached. The biggest room for improvement was therefore concluded to be
in model simplicity. Training for longer time periods, tuning parameters and testing on full
observations could potentially lead to a less complex model performing at similar levels.
Another issue was how the normalization of the data was handled. The networks found in Chap-
ter 4 were trained and tested on normalized data, which did not present any problems. However,
when it came to classifying observations in real time, as explored in Chapter 5, this became a
problem. Data arriving in real time is not normalized, and therefore it must first be normalized
before it can be handled by the network models. This can be done by continuously normalizing
incoming data using the data points seen so far, but early on, when only a few data points are
available, this results in a normalization likely to differ from the one used for training. This will
in turn increase the risk of poor classifications during the early stages of the signal.
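A running normalization of this kind could, for example, maintain the statistics of the differences seen so far, as in the sketch below. The choice of min-max scaling is an assumption for illustration; the actual normalization should match the one applied to the training data.

```python
import numpy as np

class RunningNormalizer:
    """Normalize each new time difference using only the data seen so far.
    Min-max scaling is used as an illustrative assumption; early on, when few
    points have been seen, the scaling will differ from the final one, which is
    exactly the problem discussed in the text."""

    def __init__(self):
        self.low = np.inf
        self.high = -np.inf

    def update(self, d):
        """Add a new difference and return its normalized value."""
        self.low = min(self.low, d)
        self.high = max(self.high, d)
        if self.high == self.low:      # too little data to scale yet
            return 0.0
        return (d - self.low) / (self.high - self.low)
```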
For simplicity, this perspective has been largely ignored in this report and the observations
have instead been used as pre-normalized. There is a risk that this could cause some accuracy
problems during the early stages of the observations if the models are used in real time. However,
since the observations come from the same source, the normalization will become increasingly
similar as more data is acquired. This was investigated and it was found that after a couple of
seconds of data this no longer presented a problem. Since the results acquired were said to converge
after about four seconds, this was assumed not to affect the results and the conclusions of this thesis
in a significant manner. An attempt was still made to redo the models on unnormalized data to
get around the problem, but this gave very poor results and the idea was abandoned. It should
also be noted that most of the normalization deals with handling the mode the transmitter is in.
If one has some prior knowledge about the position of the object one is trying to classify, it is
possible that the mode can be determined beforehand, so that the normalization can be done
quicker even with a smaller number of data points.
As mentioned among the possible issues, it is likely that simpler models could be obtained with
a more thorough investigation. In particular, it is likely that training feed-forward networks for a
longer time period could yield comparable results. In addition, models that perform worse on
specific subsequences could still potentially perform comparably well on longer observa-
tions. It could therefore be of interest to try to find the overall best model, reaching the best
result on all the performance measures.
Finally, another possible field of future work could be to obtain more data from (to the model)
unknown radar types and see how they are classified in real time. This would also be of interest
for known but heavily disturbed types, as was attempted in Section 5.4, with the difference
that one uses observations with a known cause of disturbance or error. This could potentially
allow the observer to connect the model's behaviour to a certain type of error or disturbance.
Appendix A
Figure A.1: Loss and accuracy plots for a network with 3 hidden layers, all with 1,024 nodes each.
Appendix B
Figure B.1: Loss and accuracy plots for a network with one hidden LSTM layer with 300 nodes.
CENTRE FOR MATHEMATICAL SCIENCES | FACULTY OF ENGINEERING, LUND UNIVERSITY | PRESENTED 2021-04-14