NAVAL POSTGRADUATE SCHOOL
MONTEREY, CALIFORNIA
THESIS
by
John E. Geldmacher
June 2020
Approved for public release. Distribution is unlimited.
John E. Geldmacher
Captain, United States Marine Corps
BS, U.S. Naval Academy, 2012
from the
NAVAL POSTGRADUATE SCHOOL
Approved by: Ying Zhao
Advisor

Walter A. Kendall
Co-Advisor
Christopher Yerkes
Second Reader
Thomas J. Housel
Chair, Department of Information Sciences
ABSTRACT
Advances in the development of deep neural networks and other machine learning
(ML) algorithms, combined with ever more powerful hardware and the huge amount of
data available on the internet, have led to a revolution in ML research and applications.
These advances have massive potential for military applications at the tactical level,
particularly in improving situational awareness and speeding kill chains. One opportunity
for the application of ML to an existing problem set in the military is in the analysis of
Synthetic Aperture Radar (SAR) imagery. Synthetic Aperture Radar imagery is a useful
tool for imagery analysts because it is capable of capturing high-resolution images at
night and regardless of cloud coverage. There is, however, a limited amount of publicly
available SAR data to train a machine learning model. This thesis seeks to demonstrate
that transfer learning from a convolutional neural network trained on the ImageNet
dataset is effective when retrained on SAR images. It then compares the performance of
the neural network to shallow classifiers trained on features extracted from images passed
through the neural network. This thesis shows that cross-modality transfer learning from
features learned on photographs to SAR images is effective and that shallow
classification techniques show improved performance over the baseline neural network in
noisy conditions and as training data is reduced.
TABLE OF CONTENTS
I. INTRODUCTION
   A. PROBLEM STATEMENT
   B. PURPOSE AND RESEARCH QUESTIONS
   C. THESIS ORGANIZATION
V. CONCLUSIONS AND FUTURE WORK
   A. CONCLUSIONS
   B. FUTURE WORK
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS AND ABBREVIATIONS
ACKNOWLEDGMENTS
I would like to thank the people without whom I could not have succeeded during
my time at the Naval Postgraduate School. My thanks to Dr. Ying Zhao and Mr. Tony
Kendall for working with me as my thesis advisors. They saw potential in my idea and
guided me in my research to create this thesis. I’d also like to thank Chris Yerkes for his
patience in dealing with my sometimes basic questions about coding and for his input in
the writing and editing process. Thanks as well to Michael McCarrin for his assistance
in beginning this research as the final project for his Computer Vision class. Thank you
also to the reader of this work. I hope it aids you in your own research or in the development
of the work below into a fieldable system.
I am also thankful for the cohort with which I have had the pleasure of taking so
many classes over the past two years. The camaraderie and commiseration in the Trident
Room and in town will remain my fondest memories of my time in Monterey.
I. INTRODUCTION
A. PROBLEM STATEMENT
The analysis and classification of targets within imagery captured by aerial and
space-based systems provides the Intelligence Community and military geospatial
intelligence (GEOINT) personnel with important insights into adversary force dispositions
and intentions. Satellite imagery has also entered the mainstream thanks to openly available
tools like Google Earth. The high resolution of space-based sensors and the common use of
overhead imagery in everyday life mean that, with the exception of decoys and
camouflage, an average person is now reasonably capable of identifying objects in electro-
optical (EO) imagery. EO images are, however, limited by cloud coverage and daylight.
About half of the time a satellite in low earth orbit can image a target, it is night,
necessitating the use of either an infrared (IR) or a synthetic aperture radar (SAR) sensor.
SAR has the additional advantage over EO/IR sensors of being all-weather: it is not
obscured by the cloud, dust, or smoke cover that can render EO/IR systems ineffective.
Automated target recognition (ATR) seeks to reduce the total workload of analysts
so that their effort can be spent on the more human-centric tasks like presenting and
explaining intelligence to a decision maker. ATR is also intended to reduce the time from
collection to exploitation by screening images at machine speeds rather than manually.
SAR ATR is complicated by the limited data available to train and assess machine learning
models. Unlike other image classification tasks studied today, such as those supporting
self-driving vehicles, there is no large, freely available body of training data for
researchers. The paucity of training data requires creative solutions to achieve acceptable
model performance, particularly with respect to the level needed to support a targeting
decision by a military commander.
B. PURPOSE AND RESEARCH QUESTIONS
The primary goal of this thesis is to demonstrate that a transfer learning approach
using a model pre-trained on photographs can be applied to achieve high precision and
recall rates on Synthetic Aperture Radar (SAR) images of tactically meaningful targets.
This thesis will also explore the utility of a two-step classification method where a
convolutional neural network is used to extract features from test images for classification
by shallow classifiers. Shallow classifiers are machine learning algorithms that do not rely
on layers of artificial neurons; rather, they use statistical or spatial methods to perform
regression or assign labels. Classification performance of this multistep method will be compared to
the base CNN from which features were extracted as the training dataset is reduced and
noise is added to the dataset. This thesis will therefore address the following research
questions:
1. Can a transfer learning approach using a model pre-trained on photographs
achieve high precision and recall rates on SAR images of tactically
meaningful targets?
2. Does using a neural network as a feature extractor for input into a shallow
classification algorithm improve model performance?
C. THESIS ORGANIZATION
Chapter II presents the technical background for this research. Chapter III describes the MSTAR
dataset, the structure of the neural network, the training of the network, and how intermediate
features were extracted from the model prior to classification. Chapter IV compares the
performance of transfer learning approaches and the performance of neural networks to
shallow classifiers trained on extracted features. Chapter V looks at how the results address
the research questions and areas of future research that could be conducted to further this
project.
II. TECHNICAL BACKGROUND
Synthetic Aperture Radar (SAR) is a radar mounted to a moving platform that uses
the platform’s motion to approximate the effect of a large antenna. The high resolution that
can be achieved by creating a radar with an effective aperture much greater in size than is
physically possible allows for radar returns to be processed into images similar to what can
be achieved with a camera (Skolnik, 1981). SAR imagery provides an important tool for
the United States Intelligence Community and military imagery intelligence (IMINT)
analysts because of its all-weather, day/night collection capability. Additionally, some
wavelengths in which SAR imaging systems operate have a degree of foliage- and
ground-penetrating capability, allowing for the detection of buried objects or objects
under tree cover that would not be observable by electro-optical sensors.
These important advantages of SAR imaging for IMINT analysts do come with
some significant drawbacks inherent to SAR images. Because SAR images are not true
optical images, they are susceptible to noise generated by constructive and destructive
interference between radar reflections, which appears as bright or dark spots called “speckle”
in the image (Skolnik, 1981). Also, various materials and geometries reflect the radar
pulses differently, creating blobs or blurs that can obscure an object’s physical dimensions.
SAR sensors operate in three primary modes: strip map, spotlight, and inverse SAR. In strip
map mode the antenna is fixed relative to the platform and collects a swath of radar returns to be processed
into an image. While operating in spotlight mode the physical antenna dwells on the desired
target as the sensor moves through its path. The longer dwell time on the target allows for
higher resolution as well as more averaging of speckle (Skolnik, 1981). The different
operational modes of the SAR sensors can change the way a target is represented in the
formed SAR image. These issues, as well as problems caused by Doppler shift in moving
objects and radar shadows, make the identification and classification of objects in SAR
images a difficult and tedious task requiring a well-trained and experienced analyst.
Figure 1 demonstrates the difficulties an imagery analyst would face when identifying
targets in SAR imagery. The vehicles that are easily recognizable as cars in the EO image
become blurs in SAR. The static display aircraft are also difficult to identify. The wing of
the pictured aircraft is only distinguishable by the radar shadow it casts, and the dimensions
of the helicopters also become difficult to determine. Figure 2 is a photograph of the same
model of aircraft shown in Figure 1 and is representative of the type of photographs that
make up the ImageNet database. The ImageNet dataset and transfer learning will be
discussed in further detail later in this chapter, but it is clear that the type of features learned
on photographs, even of the same equipment as those in SAR images, will be significantly
different.
Figure 1. Comparison of EO and SAR Images. Sources: Google (n.d.) and Sandia National Laboratories (n.d.).
Pictured are static display aircraft at Kirtland Air Force Base, New Mexico. This
comparison demonstrates that objects in SAR imagery are not always easily
identifiable by laymen and require a trained analyst to identify targets.
Figure 2. Grumman HU-16B Albatross. Source: U.S. Air Force (n.d.).
A neural network is, at its most basic, a function that learns to take some input and
map it to an output. In image classification this means taking the red, green, and blue (RGB)
values of each pixel as inputs, performing a function, and selecting the strongest
output as the most likely class. A neural network is made up of artificial neurons that,
like biological neurons, are stimulated by an input and in turn stimulate other neurons to
create a response. This biological analogy does not truly hold up on closer inspection, but
it provides a framework for thinking about how a neural network behaves. The operation
of the most basic form of artificial neuron is shown in Figure 3. The neuron sums weighted
inputs and then applies an overall bias. The sum of the weighted inputs and the bias is compared to a set
threshold value. If the sum exceeds the threshold, the output is defined as 1; if it is below
the threshold, the output is defined as 0 (Nielsen, 2015). This creates a model for making a very
simple decision in which more heavily weighted inputs have a greater impact on a
binary true-or-false decision.
Figure 3. The Artificial Neuron. Adapted from Nielsen (2015).
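To make the threshold behavior concrete, the following Python sketch implements the artificial neuron of Figure 3. It is a minimal illustration only; the weights, bias, and inputs are arbitrary values chosen for the example, not parameters from any model in this thesis.

```python
import numpy as np

def threshold_neuron(inputs, weights, bias, threshold=0.0):
    """Basic artificial neuron: fires (outputs 1) only when the weighted
    sum of the inputs plus the bias exceeds the threshold."""
    activation = np.dot(weights, inputs) + bias
    return 1 if activation > threshold else 0

# Heavily weighted inputs dominate the binary decision.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.6, 0.1, 0.4])
print(threshold_neuron(x, w, bias=-0.5))  # 1.0 - 0.5 = 0.5 > 0, so the neuron fires: 1
```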
Linking many of these neurons together into layers creates a neural network. To
return to the biological model, the neurons are linked to each other as in the brain, and
activations of neurons in turn activate other neurons. The middle layers of a neural network
are commonly called hidden layers, since the user often has little insight into what the
network is doing between the input it receives and the output it provides. Figure 4 depicts
a basic four-layer neural network with two hidden layers. The layers in the example network
are fully connected layers, in which every neuron in a layer is connected to every neuron in
the next layer. Modern image classification techniques make use of convolutional layers that
connect only one portion of a layer to the next. Convolutional layers and the convolutional
neural networks (CNN) made up of these layers are discussed in the next section.
Figure 4. A Basic Neural Network. Source: Nielsen (2015).
The response of individual neurons in a neural network can be made more nuanced
by applying a more complex activation function than a simple threshold. Rather than a
binary on/off response, the neuron activates with a strength determined by the activation
function. This allows the activations of some neurons to have a greater impact on the
network’s final output when they are more strongly activated than other neurons. The rectified
linear unit (ReLU) is a commonly used activation function because it
removes negative activations, is computationally cheap, and simplifies training by
reducing the need for pre-training (Glorot et al., 2011). Equation (1) shows the input/output
equation for the rectifier used in the ReLU activation.
f(x) = max(0, x) (1)
In the output layer, two common activations are the sigmoid function and the
softmax function. The sigmoid function compresses all outputs to a range between zero
and one, allowing a perceptron’s output to be expressed as the probability of belonging
to a class when making a binary classification. The softmax function is a generalized form
of the sigmoid and allows for a probabilistic assignment of a label when there are more
than two classes. The equation for the sigmoid is given in Equation (2).
σ(x) = 1 / (1 + e^(−Σᵢ wᵢxᵢ − b)) (2)
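As a minimal sketch, the three activation functions discussed above can be written in a few lines of NumPy; the example logits are arbitrary values chosen for illustration.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: zeroes out negative activations."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Compresses any real input into (0, 1), a binary-class probability."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Generalizes the sigmoid to many classes; outputs sum to 1."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))     # [2.  0.  0.5]
print(softmax(logits))  # class probabilities that sum to 1
```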
One of the most powerful image classification tools to emerge in recent years is the
CNN. Convolutional layers provide several advantages over fully connected layers, such as
shift invariance and reduced parameterization. In a convolutional layer, a kernel of a given
height, width, and depth is convolved with an input, and the output of that function becomes
the input to the next layer. The kernel is then shifted by a chosen stride length and the
process is repeated. The height and width of the kernel act as the window, while the kernel
depth represents the number of filters within the kernel (Stewart, 2019). An example of
how the convolution function is applied to a region using a simple 3x3x1 kernel is shown
in Figure 5. The convolutional filters provide a means of efficiently describing the local
neighborhood of a pixel and can identify meaningful features in an image. These filters are
convolved throughout the image, allowing them to identify features wherever they appear.
Moving the filters across the image and identifying features within a small
window is what makes CNNs shift invariant. This is useful since one cannot assume that an
object will appear in the same location within an image every time it is imaged.
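The sliding-window operation described above can be sketched directly in NumPy. This is an illustration under simplifying assumptions (single channel, no padding, a hand-built kernel); in a real CNN the kernel weights are learned during training.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide a kernel across a single-channel image, recording the sum of
    the elementwise products at each position (valid padding)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 3x3 vertical-edge kernel responds to the same feature wherever it
# appears in the image, which is the source of shift invariance.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])
image = np.random.rand(8, 8)
print(convolve2d(image, edge_kernel).shape)  # (6, 6)
```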
Another advantage of convolutional layers is the reduced parameterization required
to train the convolutional filters. Even a relatively small 128x128 pixel RGB image
requires a network with 49,152 inputs; if each pixel input were connected to every neuron in
the next layer, and each subsequent hidden layer were also fully connected, the number of
parameters would quickly grow to the point of computational infeasibility. The use of a
convolutional layer limits the number of weights that must be saved and updated with
each training iteration to the number of parameters within the kernel multiplied by the
depth of the previous layer. Additionally, local connectivity between layers means that a
single strong activation from a neuron is not propagated across the entire next layer, as
would occur with a fully connected layer.
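A back-of-the-envelope comparison makes the savings concrete. The 1,024-neuron hidden layer and the 64-filter convolutional layer below are hypothetical sizes chosen only for illustration.

```python
# Fully connected: a 128x128 RGB image yields 128 * 128 * 3 = 49,152 inputs.
inputs = 128 * 128 * 3

# Connecting every input to a hypothetical 1,024-neuron hidden layer:
fc_weights = inputs * 1024        # 50,331,648 weights

# A convolutional layer of 64 filters, each 3x3 over 3 input channels:
conv_weights = 3 * 3 * 3 * 64     # 1,728 weights
print(fc_weights, conv_weights)   # roughly 50.3 million versus 1,728
```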
D. TRANSFER LEARNING
CNNs require a very large amount of data to train an accurate model, and it is not
uncommon for datasets of tens or even hundreds of thousands of images to be needed.
Transfer learning presents one possible solution when training a CNN on a limited dataset
by leveraging knowledge from a previously learned source task to aid in learning a new
target task (Pan & Yang, 2010). In an image classification problem, transfer learning works
by training a CNN on a very large number of images and freezing the parameters of a
certain number of layers before retraining the remaining layers and the final classification layer
(Kang & He, 2016). Low- and mid-level features are likely common across even dissimilar
datasets, which allows the model to leverage these pre-learned features to reduce
training time and the amount of training data required. Alternatively, one could attempt to transfer all the weights
and biases and only retrain the final model output layer, or a more limited model top. While
this could have the advantage of being computationally cheaper and faster, it is more likely
to run into the problem of negative transfer. Transfer learning requires that the source and
target tasks not be too dissimilar. If the source task, such as classifying full color
photographs, is drastically different from the target task, like classifying overhead SAR
images, the transfer learning method may handicap the model performance (Pan & Yang,
2010).
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is one of the
leading challenges for computer vision tasks. Started in 2010, the ILSVRC classification
and localization task consists of a subset of 1000 non-overlapping synsets for which
researchers train classifiers and submit their generated labels for unlabeled test images. The
transfer learning approach explored in this thesis uses a model pre-trained on the ILSVRC-
2014 dataset that consisted of 1.2 million images across the 1000 classes (Russakovsky et
al., 2015).
F. SHALLOW CLASSIFIERS
The use of multilayered neural networks for machine learning is also called deep
learning. Shallow machine learning techniques do not rely on hidden layers to learn
features and classify new data points. This often reduces the computational resources
needed to implement shallow machine learning methods, but it also often reduces their
flexibility on a problem like image classification, where the possible feature space
for an image is functionally infinite. The three “shallow” techniques used in this thesis are
k-nearest neighbor (kNN), support vector machines (SVM), and random forests (RF).
1. K-Nearest Neighbor
Figure 7. Example of kNN. Adapted from Theobald (2017).
For this example, k is set to three. An unweighted classifier trying to assign the star to
a specific class would assign it to the green circle class because two of the three nearest
objects are green circles.
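The kNN algorithm assigns a query point the majority label among its k nearest training points. A minimal sketch of the unweighted vote in Figure 7, using toy two-dimensional points rather than image features:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Unweighted kNN: label the query with the majority class among its
    k nearest training points by Euclidean distance."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data echoing Figure 7: two of the three nearest neighbors are
# circles, so the query is assigned to the circle class.
X = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.0], [0.8, 1.1]])
y = np.array(["circle", "circle", "square", "square"])
print(knn_predict(X, y, query=np.array([1.0, 1.0])))  # circle
```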
2. Support Vector Machines
Support vector machines share many similarities with logistic regression, but
rather than attempting to minimize the distance from all data points to the hyperplane that
divides the classes, the goal is to maximize the distance from the hyperplane to the nearest
point in each of the divided clusters, as shown in Figure 8. This reduces sensitivity to outliers and
reduces the likelihood of misclassifying data points (Gandhi, 2018). In the case of non-
linearly separable data like that shown in Figure 9, the “kernel trick” can be applied to map the
data into a higher-dimensional space where a hyperplane may be found to separate the data.
Figure 8. Example SVM Classifier. Adapted from Theobald (2017).
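The kernel trick can be demonstrated in a few lines with scikit-learn. The concentric-circles data and the RBF kernel below are illustrative choices made because they show non-linear separability clearly; the experiments later in this thesis use a sigmoid kernel instead.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles are not linearly separable in two dimensions.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear hyperplane fails, but the RBF kernel implicitly maps the data
# into a higher-dimensional space where a separating hyperplane exists.
print(SVC(kernel="linear").fit(X, y).score(X, y))  # well below 1.0
print(SVC(kernel="rbf").fit(X, y).score(X, y))     # close to 1.0
```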
3. Random Forests
The final shallow classification algorithm used in this thesis is random forests.
Random forests leverage decision trees, which use binary splits to create branching decisions.
In a random forest, multiple decision trees are each run on a randomly selected subset
of features to predict the class label, and the trees also make their splits based on different
binary decisions. As demonstrated in Figure 10, the model then compares the predicted
labels generated by the individual decision trees, and the majority label is assigned to
the data point. By spreading the classification across multiple trees, the random forest
compensates for error in any single decision tree (Yiu, 2019).
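A brief scikit-learn sketch of the voting scheme in Figure 10, run on synthetic data rather than the features used in this thesis; the 10-tree size mirrors the forests used in the experiments later on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; each tree considers a random subset of
# features at each split, so the trees make different binary decisions.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Individual trees may disagree; the forest aggregates their votes.
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
print(votes, "->", forest.predict(X[:1])[0])
```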
G. RELATED RESEARCH
Due to its availability and ease of access for researchers, the Moving and Stationary
Target Acquisition and Recognition (MSTAR) dataset has become the standard dataset for
SAR image classification research. The MSTAR dataset is described in greater detail in
the methodology section. Initial work on the SAR ATR problem was conducted to support
the Defense Advanced Research Projects Agency’s Semi-Automated IMINT Processing
(SAIP) project, which grew out of the operational experience of hunting for Iraqi mobile
ballistic missile launchers in the Gulf War. A template matching method pursued by
Lincoln Laboratory reported good results but required the construction of 72 templates
per target class covering all orientations of the target (Novak et al., 1997). Research
supporting the SAIP project was eventually able to produce 95.8% accuracy on the 10-
class MSTAR dataset. There was, however, a notable reduction in performance when faced
with certain real-world situations, such as a self-propelled artillery piece deployed
in a revetment or a tank with additional armor applied to it. Lincoln Laboratory was
able to overcome these situations through the creation of additional templates (Novak,
2000). Because this research was in support of the SAIP program, some of the data explored is not
publicly available today. SAR ATR research on the public MSTAR dataset using shallow
classification methods also generally produced good results. An approach using an SVM
classifier for a three-class subset of MSTAR targets with two confuser vehicles reported
93.4% accuracy (Bryant & Garber, 1999). An SVM method proposed by Zhao achieved
91% accuracy in a three-class test (Zhao & Principe, 2001), while a Bayesian classifier
reported 95.05% accuracy in a 10-class test (O’Sullivan et al., 2001).
In recent years, the work on classification of SAR imagery has focused on CNNs.
In 2015, Morgan showed that a relatively small CNN could achieve 92.1% accuracy across
10 classes of the MSTAR dataset, roughly in line with the shallow methods previously
explored. Morgan’s method also showed that a network trained on nine of the MSTAR
target classes could be retrained to include a tenth class 10–20 times faster than training a
10-class classifier from scratch. The ability to more easily adapt the model to changes in
target sets represents an advantage over shallow classification techniques (Morgan, 2015).
This is especially valuable in a military ATR context given the fluid nature of military
operations where changes to the order of battle may necessitate updating a deployed ATR
system. To overcome the limitations caused by the relatively small number of
images in the MSTAR dataset, researchers explored transfer learning from a CNN pre-
trained on simulated SAR images. Artificial SAR images were generated using ray
tracing software and detailed computer-aided design models of target systems. They
showed that model performance was improved, especially in cases where the amount of
training data was reduced (Malmgren-Hansen et al., 2017). The technique of generating
simulated SAR images for training could also be valuable in a military SAR ATR context
where an insufficient amount of training data for some systems may exist.
Chen and Wang explored several approaches to deep learning and feature extraction
for the MSTAR dataset. A method where the convolutional kernels were trained using a
sparse auto-encoder using randomly sampled image patches resulted in 84.7% accuracy
across 10 classes (Chen & Wang, 2014). The employment of a clever data augmentation
scheme that sampled several 88x88 pixel patches from each training image in the MSTAR
dataset increased the training dataset to 2700 images per class. This larger dataset allowed
a CNN to achieve 99.1% accuracy on average across 10 classes (Wang et al., 2015).
While many CNNs use a softmax activation in the output layer of multi-class
classifiers, there is evidence that SVM classifiers can outperform the softmax activation
for some classification tasks (Tang, 2013). A multi-step classification process with transfer
learning from ImageNet to MSTAR and feature extraction for training an SVM classifier
was explored by Al Mufti et al. in 2018. Their methodology compared the performance of an
SVM classifier trained on mid-level feature data extracted from multiple layers of the
AlexNet, GoogLeNet, and VGG16 neural networks without retraining the feature-
extracting network (Al Mufti et al., 2018). The basic workflow of this methodology is
shown in Figure 11.
Figure 11. CNN to SVM Methodology. Source: Al Mufti et al. (2018).
In the preprocessing step, a center 50x53 pixel chip was extracted to reduce noise
around the target. Two augmentations were performed on the training dataset: the mean
grey level was subtracted, and a Laplacian of Gaussian filter was applied to
emphasize target edges (Al Mufti et al., 2018). Combining these augmentations with the
original images tripled the size of the training dataset. They reported 99.1% accuracy when
classifying targets based on features extracted from mid-level convolutional layers of
AlexNet. The best performance reported by Al Mufti et al. for the VGG16 architecture was
92.3% from a mid-level convolutional layer, but only 49.2% and 46.3% from features
extracted from the last two fully connected layers.
Although not focused on SAR images, the application of transfer learning to remote
sensing target detection and classification was previously studied at the Naval Postgraduate
School by Lieutenant Katherine Rice in 2018. Rice showed that a CNN classifier trained
on a photographic dataset could be retrained to perform remote sensing classification of
EO satellite images of ships at sea with a recall of .99 and a precision of .98 when retrained
on a dataset containing 2,000 images. This research supports the feasibility of a transfer learning
approach between modalities (Rice, 2018). Rice’s approach to transfer learning
is the foundation for the approach taken in this thesis.
III. METHODOLOGY
A. MSTAR DATASET
Figure 12. Example Photographs and MSTAR Images by Class. Adapted
from U.S. Air Force (1996), Leahy (1994), and Torin (2008).
The dataset was collected at 15-degree and 17-degree look angles in September
1995 at the Redstone Arsenal, Huntsville, AL, by the Sandia National Laboratories SAR
sensor platform. The sensor was operated in spotlight mode, producing images with a 1-
foot resolution. The dataset consists of 128x128 pixel greyscale images and represents one
of the few publicly available SAR datasets; it is therefore a commonly studied dataset for
SAR ATR. The total number of images per class is described in Table 1. Since there is
not a consistent number of images per class, 200 images collected at a 17-degree look
angle were randomly selected from each class as the training data. The test set consisted of
2700 images collected at a 15-degree look angle, generated by dropping a single image
from the 10 vehicle classes and three SLICY images.
Table 1. Number of SAR Images per Class
B. NETWORK ARCHITECTURE
Figure 13. VGG16 Architecture. Adapted from Simonyan and Zisserman
(2015).
The base architecture is modified for this thesis as shown in Figures 14 and 15. The
model top of three fully connected layers is replaced with a new model top consisting of a
fully connected layer, a dropout layer to mitigate overfitting, and two final fully connected
layers with a softmax activation for classification. For the partial transfer learning approach
the model is initialized with the ImageNet weights, and the first two convolutional/
pooling blocks are frozen during training to take advantage of the broad feature detection of the
pre-trained network (Rice, 2018). This thesis expands on the partial transfer learning
approach by exploring a full transfer of weights in the convolutional base. This is achieved
by freezing all convolutional and pooling layers and only training the new model top.
Figure 14. Modified VGG16 with Partial Transfer Learning and 11-Class
Output. Adapted from Rice (2018).
Figure 15. Modified VGG16 with Full Transfer Learning and 11-Class
Output. Adapted from Rice (2018).
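A minimal Keras sketch of the partial transfer learning configuration in Figure 14. The greyscale MSTAR chips are assumed to be replicated across three channels to fit VGG16's expected input, and the dense-layer sizes and dropout rate are illustrative guesses, except that the penultimate layer is given 1024 units to match the feature dimension described with Figure 16.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG16 convolutional base initialized with ImageNet weights.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(128, 128, 3))

# Partial transfer learning: freeze only the first two convolutional/
# pooling blocks (through "block2_pool"). Freezing every layer of the
# base instead gives the full transfer learning variant of Figure 15.
for layer in base.layers:
    layer.trainable = False
    if layer.name == "block2_pool":
        break

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),   # new fully connected layer
    layers.Dropout(0.5),                     # mitigates overfitting
    layers.Dense(1024, activation="relu"),   # features are extracted here
    layers.Dense(11, activation="softmax"),  # 11-class output
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```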
Figure 16 shows the process of the multistep method. A SAR image is passed through
the CNN to extract features from the second fully connected layer. The extracted features
are saved as a 1024-dimensional vector and used to train the shallow classifiers. Test data
is also passed through the CNNs to vectorize the images, which are then used to
evaluate the performance of the shallow classifiers against the softmax activation layer of
the base CNN.
Figure 16. Multistep Classifier Using a CNN for Feature Extraction. Adapted
from Rice (2018).
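Continuing the sketch above, the 1024-dimensional feature vectors can be pulled from the second fully connected layer with a truncated Keras model; train_images and test_images are assumed to be preprocessed NumPy arrays of image data.

```python
from tensorflow.keras.models import Model

# Truncate the trained model at the second fully connected layer so that
# each image is vectorized into the 1,024 features used by the shallow
# classifiers, as depicted in Figure 16.
feature_extractor = Model(inputs=model.input,
                          outputs=model.layers[-2].output)

train_features = feature_extractor.predict(train_images)  # shape (n, 1024)
test_features = feature_extractor.predict(test_images)
```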
The previously described models are implemented using the Keras application
program interface (API) with TensorFlow as the backend. TensorFlow is an open source
Python machine learning library developed by Google. TensorFlow greatly reduces the
programming workload for implementing deep learning. Rather than requiring a CNN to be
built from scratch, TensorFlow allows for the creation of models by calling functions that create
layers with given parameters. It also provides visualization tools for learning rate, training
time, and other information useful in evaluating the training process of a deep learning
model. The Keras API is a more user-friendly interface to the TensorFlow library; it is
capable of running on top of other machine learning libraries in addition to TensorFlow
and has a prebuilt VGG16 available. The ImageNet weights available in Keras
are ported from the Visual Geometry Group at Oxford University, which developed the VGG16
architecture for the ILSVRC-2014 localization and classification tasks (Simonyan &
Zisserman, 2015).
Orange is an open source data science and machine learning toolkit that allows
users to easily manipulate data through a graphical user interface. Orange has several
built-in machine learning algorithms and simplifies data management and preprocessing,
allowing users to experiment with approaches to machine learning and data
science (Demšar et al., 2004).
1. Measures of Performance
The primary methods for measuring model performance in this thesis are recall
and precision. Recall is the proportion of targets that are correctly classified by the model;
it can also be thought of as the true positive rate. Precision is the proportion of
predicted positives that are true positives. High recall indicates that the model
correctly classifies targets, while high precision indicates that the model produces few
false positives. Both measures have their origins in information retrieval but are a standard way of
measuring machine learning performance.
Recall = True Positives / Total Targets in Class (3)

Precision = True Positives / Total Predicted Positives (4)
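In code, Equations (3) and (4) reduce to two ratios; the counts in the example below are hypothetical.

```python
def recall(true_positives, total_targets_in_class):
    """Proportion of actual targets that the model correctly classified."""
    return true_positives / total_targets_in_class

def precision(true_positives, total_predicted_positives):
    """Proportion of the model's positive predictions that were correct."""
    return true_positives / total_predicted_positives

# Hypothetical class with 200 test images: the model labels 190 images
# as this class, 180 of them correctly.
print(recall(180, 200))     # 0.90
print(precision(180, 190))  # ~0.947
```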
2. Experiment 1
3. Experiment 2
Experiment 2 explores the utility of CNNs as feature extractors for training shallow
classifiers. Features are extracted from the last fully connected layer prior to the final output
layer of the CNN, as depicted in Figure 16, for both the training and test datasets, and are
saved as a 1024-dimensional vector representing each image. The vectorized feature
representations of the images are run through the Orange workflow pictured in Figure 17.
Performance is then compared between the base CNNs and the kNN, SVM, and Random
Forest classifiers trained and evaluated on the extracted features. For the kNN classifier,
k was set to 11, and distance-weighted Euclidean voting was used to determine which class label to
assign to test images. A sigmoid kernel was used in the SVM classifier, and the random
forest consisted of 10 decision trees. The parameters for the shallow classifiers are
unchanged in Experiments 3 and 4.
Figure 17. Orange Workflow
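For readers without Orange, the following scikit-learn sketch approximates the workflow of Figure 17 with the hyperparameters stated above. Orange's internal defaults may differ, and train_features, train_labels, test_features, and test_labels are assumed to come from the feature-extraction step described in the previous section.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# k = 11 with distance-weighted Euclidean voting, a sigmoid-kernel SVM,
# and a 10-tree random forest, matching the parameters described above.
classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=11, weights="distance",
                                metric="euclidean"),
    "SVM": SVC(kernel="sigmoid"),
    "RF": RandomForestClassifier(n_estimators=10),
}

for name, clf in classifiers.items():
    clf.fit(train_features, train_labels)
    print(name, clf.score(test_features, test_labels))
```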
4. Experiment 3
For Experiment 3, random Gaussian noise was added to the images in the test
and training datasets. Two noise levels, set to random values of up to 5%
and 20% of the maximum pixel value, were used to create low-noise and high-noise tests. An example of
the resulting image in the high-noise dataset is shown in Figure 18. The noise-added images
were first classified by the CNN and the multistep process without retraining the CNN, in order
to test the robustness of the classifiers to noise. The CNNs were then retrained on the high-noise
images, and features extracted from the noise-added training dataset were used to train the
shallow classifiers.
Figure 18. Comparison of Original Image and 20% Noise Added Image
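The following sketch shows one plausible reading of the noise procedure, in which the 5% and 20% levels set the scale of zero-mean Gaussian noise relative to the maximum pixel value; the exact scheme used in the experiments may differ.

```python
import numpy as np

def add_gaussian_noise(images, fraction, seed=0):
    """Add zero-mean Gaussian noise scaled to a fraction of the maximum
    pixel value, then clip back to the valid pixel range."""
    rng = np.random.default_rng(seed)
    sigma = fraction * images.max()
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0, images.max())

low_noise = add_gaussian_noise(test_images, 0.05)   # 5% noise level
high_noise = add_gaussian_noise(test_images, 0.20)  # 20% noise level
```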
5. Experiment 4
IV. RESULTS AND ANALYSIS
A. EXPERIMENTAL RESULTS
1. Experiment 1
Table 2 compares the performance of the CNN trained without transfer learning on
the MSTAR data to the two transfer learning approaches described in Figures 14 and 15.
The partial transfer learning approach shows moderate improvement over the CNN
trained exclusively on the MSTAR data. The full transfer learning approach, which transferred
all convolutional weights and retrained only the CNN top, did not match the performance of the
non-transfer-learning approach, suggesting that some negative transfer occurs in the later convolutional
layers. The transfer learning approach also had the advantage of converging much more
quickly than the CNN initialized with random weights. Figure 19 shows the training loss
and training accuracy of the CNN trained without transfer learning and the partial transfer
learning method.
Figure 19. Comparison of Training Loss and Accuracy
The left graph depicts the training accuracy and loss of a CNN initialized with random
weights. The right graph shows a model trained with the partial transfer learning method
described in Figure 14. The training loss of the transfer learning model descends much
more rapidly than the model trained exclusively on the MSTAR dataset. Similarly, the
transfer learning method’s accuracy more rapidly approaches 1.
2. Experiment 2
The multistep method using shallow classifiers trained on features extracted from
the CNN showed some improvement in recall over the base CNN, particularly in the classes
where the base CNN performed worst, such as the 2S1 and T-62. The multistep method also saw
generally higher precision, once again well outperforming the base
CNNs in the low-precision classes such as the ZSU-23-4. As shown in
Table 3, all shallow classifiers showed improved average performance over the base CNN
that was not pre-trained. Tables 4 and 5 show the multistep classifier performance
compared to the CNNs pre-trained on ImageNet. The kNN classifier matched the partial
transfer learning model’s performance and exceeded the SVM classifier. In the full transfer
learning approach, the kNN and SVM classifiers exceeded the CNN’s average performance
in both precision and recall.
Table 3. Comparison of Multistep Classifier without Transfer Learning
Table 5. Comparison of Multistep Classifier with Full Transfer Learning
3. Experiment 3
Table 6. Comparison of CNN Performance on Low-Noise Images without
Retraining
Tables 8, 9, and 10 show the model performance after retraining on the high-noise
dataset. The pre-trained CNNs do not match the performance of the CNN trained
exclusively on MSTAR data. The trend in Experiment 2 of improved average accuracy of
shallow classifiers trained on features from CNNs holds with noisy data as well. The kNN
and SVM classifiers outperformed the models from which features were extracted in both
precision and recall. The addition of noise impacts the transfer learning
approaches more negatively than a CNN trained from scratch on noisy data.
Table 10. Comparison of Multistep Classifier with Full Transfer Learning
Trained on 20% Noisy Images
4. Experiment 4
Building on the performance of the partial transfer learning method with shallow
classifiers shown in Experiments 1 and 2, Experiment 4 seeks to determine the partial
transfer learning method’s performance as training data is reduced. Multistep classification
methods showed significant improvements over the base model as the size of the training
dataset decreased as shown in Tables 11 and 12. When the training data was reduced to
100 images per class the multistep classifiers all performed better on average than the base
CNN. Both the kNN and SVM classifiers maintained precision and recall performance
above .95 while the base CNN performance dropped to .907 for recall and .904 for
precision. When the training dataset was reduced to 50 images per class all multistep
classifiers continued to outperform the baseline model and saw much better performance
on difficult targets like the BTR-60, BTR-70 and T-62.
Table 11. Performance of Multistep Classifier with Partial Transfer Learning
Trained on 100 Images per Class
B. ANALYSIS
The performance of the kNN classifier compared to the softmax activation for
neural network output is notable. Multistep classifier performance on the most difficult
classes for the CNN was significantly better in several of the tests. When trained on 50
images the CNN only achieved a recall score of .371 on the BTR-60 class, well below the
.742 and .814 of the kNN and SVM classifiers. In the same test the base CNN had recall
rates below .65 for the BTR-70, T-62, and ZIL-131, while the kNN and SVM classifiers
were able to achieve results for these cases with a recall above .8. There are also large
reductions in precision for some classes that are not mirrored in the multistep method. The
nature of the SVM and kNN algorithms may provide greater robustness under the evaluated
circumstances when compared to the CNN classifiers. The support vectors and number of
neighbors used to classify new data points do not rely on a probability the way a softmax
activation does and therefore are less likely to be affected as harshly by a reduced sample
size. Since the kNN algorithm calculates the distance to all training data points, its
computational cost increases with the volume of training data used.
The strong performance of the kNN method as training data is reduced, as described in
this thesis, represents a computationally cheap way to build a future classifier.
Excepting the performance of models on noisy data prior to retraining, all models
performed well in both recall and precision on the SLICY target. Performance on the
SLICY class is of interest because it demonstrates the model’s ability to discriminate a
non-valid target from a valid target. All other classes, with the exception of the D7, are
former Soviet Union military equipment. Up-armored versions of the D7 and related
equipment are often used in combat engineering roles. In a military context this means they
are likely to be a valid target. The classification of a SLICY as any other class would
indicate the model is accepting a clearly invalid target as a valid target. As demonstrated
by the high precision in this class across the experiments, valid targets are very infrequently
classified as a SLICY, and the high recall indicates that random objects are not being
accepted as valid targets.
V. CONCLUSIONS AND FUTURE WORK
A. CONCLUSIONS
When examining the research questions proposed in this thesis, we can make the
following observations:
2. Does using a neural network as a feature extractor for input into a shallow
classification algorithm improve model performance?
4. Does a shallow classification method trained on features extracted from a
neural network perform better in noisy conditions or with a small training
dataset?
Both the baseline model employing transfer learning and the shallow classifiers
using a neural network as a feature extractor performed with a high degree of accuracy and
would be valuable in an operational context as an aid to IMINT analysts. The multistep
classification technique also shows that a very large training dataset does not need to be
developed; a system could be trained with as few as 100 unaugmented training images
per class. This presents an opportunity for an operational SAR ATR model to be updated
based on the current units in a specific area of operations or on vehicles that are modified
or damaged in a way that prevents accurate classification by the CNN or multistep
classifier.
B. FUTURE WORK
The MSTAR dataset was also not collected using current DOD remote sensors.
Future research could be conducted under conditions that more closely mirror the
operational environment in which a fully developed model would be fielded. This includes
target detection in a complete scene, a task that was not required due to the format of the
MSTAR dataset, in which the target appears centered in a relatively small image. This thesis
does, however, provide a technical demonstration that transfer learning from one modality
to another has the potential to make the development of accurate models with limited training
datasets easier.
Further work could also be done to explore a kNN classifier as a replacement for
the classification output of a neural network. kNN performance when trained on features
extracted from mid-level convolutional layers and from other neural network architectures
could be studied for SAR images or in other classification tasks. The optimal size for
vectorized images for input to a kNN classifier could also prove a fertile area of study as
the distance measurements used in the kNN algorithm become less computationally
expensive with smaller vectors.
LIST OF REFERENCES
Al Mufti, M., Al Hadhrami, E., Taha, B., & Werghi, N. (2018). Automatic target
recognition in SAR images: Comparison between pre-trained CNNs in a transfer
learning based approach. 2018 International Conference on Artificial Intelligence
and Big Data (ICAIBD), 160–164. https://doi.org/10.1109/ICAIBD.2018.8396186
Bryant, M., & Garber, F. (1999). SVM classifier applied to the MSTAR public data set.
Proc. SPIE, 3721, Algorithms for Synthetic Aperture Radar Imagery VI.
https://doi.org/10.1117/12.357652
Chen, S., & Wang, H. (2014). SAR target recognition based on deep learning. 2014
International Conference on Data Science and Advanced Analytics (DSAA), 541–
547. https://doi.org/10.1109/DSAA.2014.7058124
Demšar, J., Zupan, B., Leban, G., & Curk, T. (2004). Orange: From experimental
machine learning to interactive data mining. In J.-F. Boulicaut, F. Esposito, F.
Giannotti, & D. Pedreschi (Eds.), Knowledge discovery in databases: PKDD 2004
(Vol. 3202, pp. 537–539). Springer. https://doi.org/10.1007/978-3-540-30116-5_58
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-
scale hierarchical image database. 2009 IEEE Conference on Computer Vision
and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
Google. (n.d.). [Kirtland Air Force Base static display]. Retrieved April 14, 2020, from https://earth.google.com/web/@35.05470595,-106.59537596,1623.90184995a,214.4694424d,35y,-142.92018984h,31.60149328t,0r
Kang, C., & He, C. (2016). SAR image classification based on the multi-layer network
and transfer learning of mid-level representations. 2016 IEEE International
Geoscience and Remote Sensing Symposium (IGARSS), 1146–1149.
https://doi.org/10.1109/IGARSS.2016.7729290
Leahy, T. (1994). [Overhead shot of a Kuwait Army BMP-2 infantry fighting vehicle
taking part in Operation VIGILANT WARRIOR in Kuwait]. National Archives.
https://catalog.archives.gov/id/6496793
Malmgren-Hansen, D., Kusk, A., Dall, J., Nielsen, A., Enghold, R., & Skriver, H. (2017).
Improving SAR automatic target recognition models with transfer learning from
simulated data. IEEE Geoscience and Remote Sensing Letters, 14(9), 1484–1488.
https://doi.org/10.1109/LGRS.2017.2717486
Morgan, D. (2015). Deep convolutional neural networks for ATR from SAR imagery.
Proc. SPIE, 9475, Algorithms for Synthetic Aperture Radar Imagery XXII.
https://doi.org/10.1117/12.2176558
Novak, L. M., Owirka, G. J., Brower, W. S., & Weaver, A. L. (1997). The automatic
target-recognition system in SAIP. The Lincoln Laboratory Journal, 10(2), 187–
202.
O’Sullivan, J. A., DeVore, M. D., Kedia, V., & Miller, M. I. (2001). SAR ATR
performance using a conditionally Gaussian model. IEEE Transactions on
Aerospace and Electronic Systems, 37(1), 91–108. https://doi.org/10.1109/7.913670
Pan, S., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on
Knowledge and Data Engineering, 22(10), 1345–1359. https://doi.org/10.1109/TKDE.2009.191
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A., & Fei-Fei, L. (2015). ImageNet large
scale visual recognition challenge. International Journal of Computer Vision, 115,
211–252. https://doi.org/10.1007/s11263-015-0816-y
Sandia National Laboratories. (n.d.). [SAR image of Kirtland Air Force Base static
display]. https://www.sandia.gov/radar/index.html
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. International Conference on Learning Representations 2015.
https://arxiv.org/abs/1409.1556
Stewart, M. (2019, February 26). Simple introduction to convolutional neural networks.
Towards Data Science. https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac
Tang, Y. (2013). Deep learning using linear support vector machines. 2013 ICML
Challenges in Representation Learning. https://arxiv.org/abs/1306.0239
United States Air Force. (n.d.). [Grumman HU-16B Albatross at the National Museum of
the United States Air Force]. https://www.nationalmuseum.af.mil/Upcoming/Photos/igphoto/2000574422/
United States Air Force. (1996). Moving and stationary target acquisition and
recognition public dataset [Data set]. United States Air Force.
https://www.sdms.afrl.af.mil/index.php?collection=mstar
Wang, H., Chen, S., Xu, F., & Jin, Y.-Q. (2015). Application of deep-learning algorithms
to MSTAR data. 2015 IEEE International Geoscience and Remote Sensing
Symposium (IGARSS).
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural
networks. Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics, PMLR, 315–323. http://proceedings.mlr.press/v15/glorot11a.html
Yiu, T. (2019, June 12). Understanding random forest: How the algorithm works and why
it is so effective. Towards Data Science. https://towardsdatascience.com/understanding-random-forest-58381e0602d2#f3c8
Zhao, Q., & Principe, J. (2001). Support vector machines for SAR automatic target
recognition. IEEE Transactions on Aerospace and Electronic Systems, 37(2),
643–654.
INITIAL DISTRIBUTION LIST