
A Review of the Autoencoder and Its Variants

A comparative perspective from target recognition in synthetic-aperture radar images

Ganggang Dong, Guisheng Liao, Hongwei Liu, and Gangyao Kuang

Digital Object Identifier 10.1109/MGRS.2018.2853555
Date of publication: 24 September 2018

In recent years, unsupervised feature learning based on a neural network architecture has become a hot new topic for research [1]–[4]. The revival of interest in such deep networks can be attributed to the development of efficient optimization skills, by which the model parameters can be optimally estimated [5]. The milestone work done by Hinton and Salakhutdinov [6] proposes to initialize the weights that allow deep autoencoder networks to learn low-dimensional codes. The encoding trick introduced works much better than principal component analysis (PCA) in terms of dimension reduction.

The basic neural network models include the autoencoder [7], the restricted Boltzmann machine [8], and the convolutional neural network (CNN) [9]. The deep neural network architecture, including the stacked autoencoder, deep belief network, and deep CNN, is formed by concatenating the basic models hierarchically.



Due to its simple implementation and attractive computation cost, an autoencoder has been used in many fields. Examples include natural language processing [10], image processing [11], object detection [12], biometric recognition [13], and data analysis [14]. This article focuses primarily on autoencoder applications for remote sensing: terrain classification, land-use recognition, image retrieval, semantic annotation, object detection, and target recognition in radar and hyperspectral images [15]–[26]. Although there are a great many previous works concerning this family of topics, a systematic summary is still needed. This article attempts to fill the gap. We first review most of the related works, with their contributions organized into four categories.

Information Fusion
In earlier work, the predefined features or raw pixel values are combined by an autoencoder model, from which a refined representation can be learned. It is then connected with a softmax layer to achieve different classification tasks. Because each set of features represents a certain kind of information, the autoencoder model can be viewed as a black box to achieve information aggregation.

Lin et al. design an autoencoder model for hyperspectral image classification [27], [28]. The spatial patches and spectral bands of hyperspectral images are jointly considered for classification. The basic model and the stacked autoencoder are used to generate the shallow and deep representations. The spatial information and spectral information can then be combined by the autoencoder network. Tao et al. propose adaptively learning a good representation from the unlabeled data [24]. They establish two learning schemes: sparse spectral feature learning and multiscale spatial feature learning. The learned spectral-spatial feature is then embedded into a linear support vector machine (SVM) for classification.

Geng et al. present a deep supervised and contractive neural network for synthetic-aperture radar (SAR) image classification [21], [29]. Three kinds of handcrafted features, a gray level-gradient cooccurrence matrix, Gabor filter banks, and a histogram of oriented gradient descriptors, are jointly fed into a stacked autoencoder network. The learned representation is then input to a softmax classifier for land-cover classification. A similar thought has been employed in [30], where the geometric feature and local texture feature are combined to train a stacked autoencoder for target recognition in SAR images.

Inspired by the change-detection algorithm, Planinšič and Gleich propose to assess the damage caused by fires [31]. They extract the higher-order log cumulants of a fractional Fourier transform from the tunable Q discrete wavelet transform and then use the obtained features to train a stacked autoencoder model to distinguish the changed and unchanged areas. Zhang et al. design a framework to learn robust features from polarimetric SAR (PolSAR) data [32]. The local spatial information has been exploited to train a stacked sparse autoencoder model. The influence of neighboring pixels is controlled according to the spatial distance to the central pixel.

To handle the presence of clouds, Malek et al. use an autoencoder neural network to recover the missing data in multispectral images [33]. They aim to model the relationship between a given cloud-free image (i.e., source image) and a cloud-contaminated one (i.e., target image). Two schemes, pixel-level and patch-level mapping, are developed to exploit the spatial contextual information. The size of the hidden layer is determined by a new solution that combines the minimum descriptive length criterion and a Pareto-like selection procedure.

New Cost Function
Training an autoencoder model refers to estimating the model parameters, weight and bias, and is achieved by solving an optimization problem whose objective function is composed of a loss function and certain regularization terms. Therefore, the performance can be improved by designing a task-specific objective function, into which certain information can be incorporated. The loss function and the regularization term can be tuned according to the task at hand. In addition, the unsupervised learning can be converted to supervised by designing a label-related cost function.

Ma et al. describe a spatially updated deep autoencoder for hyperspectral image classification [17]. They present a new regularization term composed of the similarity of the hidden neurons weighted by the cosine distance of visible nodes. The contextual information can then be exploited. A similar idea is applied in [34], where the objective function is tuned according to the task of target recognition. The authors design a regularization term that measures the total difference of the hidden neurons.



Because the class membership is incorporated into the objective function, it actually represents a supervised learning mode. Similarly, Xie et al. present an improved objective for the autoencoder model according to the task of PolSAR image classification [20]. The deviation of the reconstructed data from the initial value is measured by the Wishart distance, rather than the mean-square error (MSE) or cross entropy. Liang et al. discuss a robust and discriminative feature representation by an improved smooth autoencoder [35]. The encoding of each sample is used to reconstruct its local neighbors. The learned representations are therefore consistent among local neighbors and robust to small variations of the input. Gong et al. propose a multiobjective sparse feature learning model based on an autoencoder [36]. The model parameters are learned by optimizing two objectives, the reconstruction error and the sparsity of hidden units, simultaneously.

Combination with Other Learning Schemes
To further promote performance, some researchers propose introducing a third-party learning scheme, such as active learning or learning via an extreme learning machine (ELM). The key issue is how to utilize the additional learning skill reasonably during neural network training.

Sun et al. propose a hybrid plan where an autoencoder and active learning are simultaneously considered for hyperspectral image classification [37]. The most informative samples selected by an active sampling algorithm are used to pretrain the autoencoder. Three active sampling strategies, mutual information, breaking ties, and random sampling, are verified. Tang et al. present an ELM-based hierarchical framework [38]. The compact and meaningful representations are obtained by unsupervised multilayer encoding and an ELM-based sparse autoencoder. Hou et al. consider both the autoencoder model and the superpixel trick for PolSAR image classification [23]. To integrate the contextual information, the true-color (red, green, blue) image formed by Pauli decomposition is used to produce superpixels. A multilayer autoencoder network is then employed to learn the features by which the multiple categories for each pixel can be distinguished.

Zhou and Wei combine spectral feature learning in a hierarchical fashion [39]. A shallow neural network, i.e., a kernel ELM, is then embedded into the developed hierarchical network. Lv et al. achieve hyperspectral image classification with an ensemble of ELMs and a stacked autoencoder [19]. The latent representations learned from the stacked autoencoder are fed into several ELM base classifiers, from which the inference can be drawn. Chen et al. employ multilayer projective dictionary pair learning (MPDL) and a sparse autoencoder jointly for PolSAR image classification [18]. MPDL is first used to extract the discriminative feature, followed by a sparse autoencoder to realize the nonlinear relationship between the elements of the feature vectors in an adaptive way. Li et al. researched midlevel feature representation by a sparse autoencoder and a pooling skill [22]. A sparse autoencoder is first employed to produce a relatively small number of convolutional features from the input data set. The extended features are then extracted from the learned features by a multiple normalized difference method to compose a derivative feature set. The global statistics (i.e., histogram moments, mean, variance, and standard deviation) are finally used to build the midlevel representation.

Wu et al. propose a hybrid architecture, i.e., deep filter banks, for land-use scene classification [40]. They combine a multicolumn stacked denoising sparse autoencoder and a Fisher vector to automatically learn the representative and discriminative features in a hierarchical manner. Kim and Hirose consider PolSAR image classification by a quaternion autoencoder and a quaternion self-organizing map (SOM) simultaneously [41]. A quaternion autoencoder is employed to extract feature information based on the natural distribution of PolSAR features. The extracted features are classified by the quaternion SOM in an unsupervised manner, by which new and more detailed land categories can be discovered.

Achieve Transfer Learning
Although supervised deep learning methods are currently state of the art, they require a great many labeled data. The labeled image benchmarks available are too small to train a deep supervised network effectively. To handle the problem, some researchers recommend transfer learning via large quantities of unrelated data, such as self-taught learning [42]. One popular idea is to rely on an autoencoder to deal with the small sample size and limited training resources.

Kemker and Kanan achieve hyperspectral image classification by self-taught learning [43]. They extract a bag of generalizable features or filters by training an unsupervised model on a sufficiently large number of unlabeled data that are distinct from the target data set. The trained model is used to produce discriminative features from small labeled target data sets. Two kinds of schemes are developed. The first designs a shallow architecture by means of independent component analysis; the second presents a deep network via a stacked autoencoder. Othman et al. develop a deep model for land-use scene classification [26]. They generate the initial feature representation of a scene image using deep pretrained CNNs. The generated features are then fed into the autoencoder to refine the representation. The CNN features can therefore be transferred to the terminal learning architecture.

Yao et al. consider an autoencoder for the automatic semantic annotation of high-resolution optical satellite images [44]. They learn a high-level feature from an auxiliary satellite image data set using a stacked discriminative sparse autoencoder. The learned high-level features are transferred to semantic annotation, and the transferred representations are further fine-tuned in a weakly supervised fashion by the tile-level annotated training data.


Zhang et al. design an unsupervised feature learning framework for land-use scene classification [3]. They extract a set of representative patches from the salient regions in the image data set. These unlabeled patches are then employed to generate a set of high-level features via a sparse autoencoder. The scene is finally characterized by statistics computed from the learned features. Chen et al. achieve land-use classification by an effective midlevel visual elements-oriented representation [15]. A library of pretrained part detectors, i.e., a partlet, is generated for midlevel visual elements discovery. The responses to a large number of part detectors are then employed to represent the scene image. This is achieved by building a single-hidden-layer autoencoder and a single-hidden-layer neural network with an $\ell_0$-norm sparsity constraint, respectively.

Additionally, the autoencoder has been applied to other related fields. Shao et al. develop a deep-learning-based workflow for mapping forest above-ground biomass by integrating Landsat 8 and Sentinel-1A images with airborne light detection and ranging data [45]. They demonstrate the advantage of a stacked sparse autoencoder network in comparison to five other prediction techniques: multiple stepwise linear regression, k-nearest neighbor (kNN), SVMs, back-propagation neural networks, and random forest. Li and Misra train a variational autoencoder to generate the NMR-T2 distributions along a 300-ft depth interval in a shale petroleum system at 11,000-ft depth below sea level [46]. The trained model successfully predicts the T2 distributions for 100 discrete depths at an R² of 0.75 and a normalized root-mean-square deviation of 15%. Elshamli et al. generate domain-invariant representations by denoising the autoencoder and employing domain-adversarial neural networks [47]. The first builds an unsupervised feature learning followed by a supervised classification, while the second jointly considers the invariant representation learning and classification during training.

De et al. develop a new technique for the classification of urban areas in PolSAR images [48], leveraging a synthetic target database for data augmentation. A new reference database is formed by the uniformly rotated and collated data set. The data are used to train a stacked autoencoder, and the information in the augmented data set is transformed into a compact representation. The classification is performed by a multilayer perceptron network. Windrim et al. extend the spectral angle-stacked autoencoder with the incorporation of a physics-based model for illumination [49]. The proposed algorithm learns a shadow-invariant mapping without the need for any labeled training data, additional sensors, a priori knowledge of the scene, or the assumption of Planckian illumination.

Because of advances in deep learning, there is encouraging evidence that the learned representation could result in better performance than the canonical, predefined image features. Therefore, it is no longer necessary to pay close attention to designing delicate handcrafted features, as studied in earlier works [50]–[60].

CONTRIBUTIONS
Although many studies concerning autoencoders have been published previously, a comprehensive summary is still lacking. Work is seldom dedicated to the appropriate choice of network architecture for remote sensing (i.e., the impact of related factors). This inhibits further development of the technique. To fill the gap, we pursue a comprehensive review of related studies. Multiple comparative experiments are performed from the viewpoint of target recognition in SAR images. Some qualitative propositions can then be concluded. We go on to provide suggestions regarding the logical choice of an appropriate network architecture according to the application at hand. Therefore, the contributions here include the following.
◗ A review of the autoencoder and its variants. Many improved versions have been developed since optimization algorithms for deep networks were first presented [61]. We pursue a systematic review of the autoencoder and its variants, including the shallow framework and neural network architecture.
◗ The verification of network configuration from a comparative perspective. Although the autoencoder has been studied widely in previous works, little research has been devoted to studying the effect of related factors, especially from a comparative perspective. To fill the gap, we verify the performance of an autoencoder with multiple quantitative experiments. The experimental results demonstrate sound proof for some conclusions.

THE BASIC MODEL
The idea of deep learning could not be harnessed until efficient optimization algorithms were developed [61], [62]. The deep network is usually composed of several basic models. In this section, we review the basic autoencoder and its improved versions. (All notations used in this article are listed in Table 1.)

TABLE 1. THE NOTATIONS USED IN THIS ARTICLE.
$x$, $\tilde{x}$ — visible units, corrupted visible units
$h$, $\tilde{h}$ — hidden units, dropped hidden units
$d_v$, $d_h$ — dimension of the visible layer, hidden layer
$W_v$, $W_h$ — weight of the visible units, hidden units
$b_v$, $b_h$ — bias of the visible units, hidden units
$J(\cdot)$, $L(\cdot)$ — cost function, loss function
$\sigma(\cdot)$, $\delta(\cdot)$ — activation functions
$\rho$, $\hat{\rho}$ — desired and real average activation of the hidden units
$*$, $\odot$ — convolution and element-wise product

AUTOENCODER
An autoencoder is a typical unsupervised learning algorithm. It aims to set the target values so that they are equal to the original input. As shown in Figure 1, a generic model is usually composed of three separate phases:



1) encoder: a set of linear feed-forward filters parameterized by the weight matrix and bias (i.e., a feed-forward neural network)
2) activation: a nonlinear mapping that transforms the encoded coefficients into the range [0, 1]
3) decoder: a set of reverse linear filters that produce the reconstruction of the input (i.e., back propagation).

FIGURE 1. The generic flowchart of an autoencoder: the input $x$ is encoded as $Wx + b$, activated by $\sigma(Wx + b)$ to give $h$, and decoded as $z = \delta(W^T h + b')$ by minimizing $\mathrm{Loss}(x, z)$.

An autoencoder is a typical feed-forward neural network that hooks together many simple neurons. The output of a neuron can be the input of another, and the parameters can be estimated using a back-propagation algorithm. A forward pass is first run to compute the activations throughout the network, including the output value of the hypothesis. For each middle node, an error term that measures how much that node was responsible for any errors in the output is computed. For an output node, the difference between the network's activation and the true target value can be measured directly and further used to renew the error term.

Given the visible layer $x \in \mathbb{R}^{d_v}$, the autoencoder first maps it into a hidden layer $h \in \mathbb{R}^{d_h}$ with a weight matrix $W_v$, bias $b_v$, and an activation function $\sigma(\cdot): \mathbb{R} \mapsto [0, 1]$:

$h = \sigma(W_v x + b_v)$. (1)

The hidden units are then cast into the output layer $z \in \mathbb{R}^{d_v}$ in a reverse fashion with weight matrix $W_h$, bias $b_h$, and activation $\delta(\cdot): \mathbb{R} \mapsto [0, 1]$:

$z = \delta(W_h h + b_h)$. (2)

To simplify the network architecture, the tied-weights strategy $W_v = W_h = W$ has usually been employed. The parameters to be determined are $\{W, b_v, b_h\}$. The objective of training an autoencoder is to minimize the cost function,

$\arg\min_{W, b_v, b_h} J(W, b_v, b_h)$. (3)

Given the training samples, the cost function is defined as

$J(W, b_v, b_h) = L(x, z) + \lambda g(W)$, (4)

where $L(x, z)$ is the loss function and $g(W)$ is a regularization (i.e., weight decay). Comparative studies on the loss function can be found in the "Experiments and Discussions" section. The weight decay is the Frobenius norm of the weights, $g(W) = 0.5(\|W_v\|_F^2 + \|W_h\|_F^2)$. It tends to decrease the magnitude of the weights and hence prevents overfitting. The weight matrix and biases are renewed by gradient descent:

$W^{(\mathrm{new})} = W^{(\mathrm{old})} - \alpha \frac{\partial}{\partial W} J(W, b_v, b_h)$
$b_v^{(\mathrm{new})} = b_v^{(\mathrm{old})} - \alpha \frac{\partial}{\partial b_v} J(W, b_v, b_h)$
$b_h^{(\mathrm{new})} = b_h^{(\mathrm{old})} - \alpha \frac{\partial}{\partial b_h} J(W, b_v, b_h)$, (5)

where $\alpha$ is the learning rate, $W^{(\mathrm{old})}, b_v^{(\mathrm{old})}, b_h^{(\mathrm{old})}$ denote the old model parameters, and $W^{(\mathrm{new})}, b_v^{(\mathrm{new})}, b_h^{(\mathrm{new})}$ are the renewed ones.

SPARSE AUTOENCODER
To exploit the inner structure of data, an additional regularization, the sparsity constraint on the hidden units, is developed. A neuron whose output is close to one is active, while one whose output is close to zero is inactive. A sparse autoencoder aims to limit neurons so that they are inactive most of the time.

Given a set of training samples $x_1, x_2, \ldots, x_m$, the average activation of the $j$th hidden unit is $\hat{\rho}_j = (1/m)\sum_{i=1}^{m} h_j(x_i)$. A sparse autoencoder enforces the constraint $\hat{\rho}_j = \rho$, where $\rho$ is the desired average activation value. Most activations of the hidden units are then near zero. The constraint is realized by a penalty,

$\sum_{j=1}^{d_h} \left\{ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right\}$, (6)

which is the Kullback–Leibler (KL) divergence $\sum_{j=1}^{d_h} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$. The cost function of a sparse autoencoder is

$J_{\mathrm{sparse}} = L(x, z) + \beta \sum_{j=1}^{d_h} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$. (7)
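To make (1)–(7) concrete, the following minimal NumPy sketch (not the authors' implementation; the toy data, layer sizes, and step size are assumptions) trains a tied-weight autoencoder by the gradient updates of (5), with the KL sparsity penalty of (6) and (7) added to an MSE loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 64))                  # toy data: 100 samples, d_v = 64 (assumed)
d_h = 32
rho, beta, lam, alpha = 0.05, 3.0, 1e-4, 0.5

W = rng.normal(0.0, 0.01, (d_h, X.shape[1]))   # tied weights: W_v = W_h = W
b_v, b_h = np.zeros(d_h), np.zeros(X.shape[1]) # encoder and decoder biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

m = X.shape[0]
for epoch in range(200):
    h = sigmoid(X @ W.T + b_v)             # encode, eq. (1)
    z = sigmoid(h @ W + b_h)               # decode, eq. (2)
    rho_hat = np.clip(h.mean(axis=0), 1e-6, 1 - 1e-6)

    # Back-propagation; the sigmoid derivative is s(1 - s)
    dz = 2.0 * (z - X) / m * z * (1.0 - z)
    dkl = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / m
    dh = (dz @ W.T + dkl) * h * (1.0 - h)

    # Gradient step on {W, b_v, b_h} as in eq. (5), with weight decay of (4)
    W -= alpha * (h.T @ dz + dh.T @ X + 2.0 * lam * W)
    b_h -= alpha * dz.sum(axis=0)
    b_v -= alpha * dh.sum(axis=0)
```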



DENOISING AUTOENCODER
To deal with small variations of the input data, Vincent et al. propose a trick, i.e., the denoising autoencoder [63]. They destruct the input data manually and verify that even a partially destroyed input could yield almost the same result. This is because a good representation should be capable of capturing the stable structures in the form of dependencies and regularities characteristic of the unknown distribution. The learned representation is robust toward the slight disturbances observed.

The input $x$ is first corrupted to get a partially destroyed version $\tilde{x}$ by means of a stochastic mapping. The random corruption can be implemented in two fashions:
1) Binary noise: randomly choose a fixed number of raw data and force their values to be zero.
2) Gaussian noise: generate a number of Gaussian random values and add them to the initial data.
The corrupted data $\tilde{x}$ are then mapped to the hidden layer

$h = \sigma(W_v \tilde{x} + b_v)$, (8)

from which the following reconstruction can be obtained:

$z = \delta(W_h h + b_h)$. (9)

A diagram of a denoising autoencoder is illustrated in Figure 2.

FIGURE 2. The (a) prototype of the autoencoder and (b) denoising autoencoder. The input data are manually corrupted at a certain level. The corrupted data are then mapped to the hidden layer, from which the initial data can be reconstructed.

FIGURE 3. The (a) archetypal autoencoder and (b) autoencoder drawn with dropout. The input data are mapped to the hidden layer, in which some neurons are deactivated. Only the remaining units are employed to reconstruct the input data.
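The two corruption fashions can be sketched as follows (a minimal example; the corruption level and toy shapes are assumed knobs; the corrupted $\tilde{x}$ is encoded while the clean $x$ remains the reconstruction target):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, mode="binary", level=0.3):
    """Produce the partially destroyed version x~ used in (8).

    binary:   force a random fraction `level` of the entries to zero
    gaussian: add zero-mean Gaussian values with standard deviation `level`
    """
    if mode == "binary":
        return x * (rng.random(x.shape) >= level)
    return x + rng.normal(0.0, level, x.shape)

x = rng.random((5, 16))                  # toy batch
x_tilde = corrupt(x, "binary")           # encode x_tilde, reconstruct x
```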


As shown in the previously discussed works, there is a low-dimensional manifold near which the data concentrate. The denoising autoencoder can then be viewed as a way to define and learn a manifold [64]. It aims to build a mapping from the low-probability event $\tilde{x}$ to the high-probability event $x$. This is achieved by promoting the probability $p(x\,|\,\tilde{x})$, which is similar to minimizing the loss function. The learned representation $h$ can be interpreted as a coordinate system for points on the manifold. The tendency is significant if the dimension of $h$ is smaller than that of $x$. The process shows that the learned representation captures the main change of the input. On the other hand, the latent state $h$ can be viewed as the low-dimensional representation of the input $x$. The autoencoder may also be seen as a linear (without activation) or nonlinear (with activation) dimension-reduction mapping.

AUTOENCODER WITH DROPOUT
Dropout is a strategy popularly used in training large neural networks with limited training resources [65]. It provides an exponential way of approximately combining many different neural network architectures efficiently. Dropout refers to dropping out some hidden units in a neural network. Dropping a unit out means temporarily removing it from the network, along with all of its incoming and outgoing connections. The diagram flow is shown in Figure 3.

Consider a simple neural network with three layers: the input layer $x$, the hidden layer $h$, and the output layer $z$. The feed-forward operation with dropout is expressed as

$h = \sigma(W_v x + b_v)$
$\tilde{h} = h \odot \mathrm{Bernoulli}(p)$
$z = \delta(W_h \tilde{h} + b_h)$, (10)

where $\odot$ is the element-wise product operation and $p$ is the proportion of hidden units to be dropped. The autoencoder with denoising and dropout is shown pictorially in Figure 4.

FIGURE 4. The implementation flow of the autoencoder with denoising and dropout. The blue circles denote the manually corrupted (i.e., dropped) neurons. All are randomly determined.
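A direct sketch of the dropout forward pass of (10); to match the stated meaning of $p$, the Bernoulli mask below keeps a unit with probability $1 - p$ (the weight shapes and tied weights are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_with_dropout(x, W_v, b_v, W_h, b_h, p=0.5):
    """Feed-forward pass of (10): only the surviving hidden units
    (mask = 1) take part in the reconstruction."""
    h = sigmoid(x @ W_v.T + b_v)
    h_tilde = h * rng.binomial(1, 1.0 - p, h.shape)  # 0 = dropped unit
    return sigmoid(h_tilde @ W_h + b_h)

W = rng.normal(0.0, 0.01, (32, 64))      # tied weights, d_h = 32, d_v = 64
x = rng.random((10, 64))
z = forward_with_dropout(x, W, np.zeros(32), W, np.zeros(64))
```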



CONTRACTIVE AUTOENCODER
Both denoising and the dropout trick produce a stochastic neural network model. All of the corrupted or dropped neurons are randomly determined. Differently, Rifai et al. propose to train a deterministic version of a neural network, the contractive autoencoder [66]. They design a well-chosen penalty term, the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. The penalty could generate a localized space contraction, which in turn yields robust features on the activation layer.

To enhance the robustness of the learned representation, an intuitive idea is to penalize the sensitivity to the input. This is realized by the Frobenius norm of the Jacobian matrix $J_\sigma(x)$ of the activations, $h = \sigma(Wx + b_v)$. The penalty term is defined as the sum of the squares of all partial derivatives of the hidden nodes with respect to the input,

$\|J_\sigma(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2$. (11)

Considering the linear encoder, the penalty degrades to the weight decay $g(W)$. The Jacobian penalty encourages the hidden-layer mapping to be contractive in the neighborhood of the input. The objective function is then expressed as

$J_{\mathrm{contractive}} = L(x, z) + \eta \|J_\sigma(x)\|_F^2$. (12)

Because the training examples congregate near a low-dimensional manifold, the variations present in the original data correspond to the local dimensions along the manifold. Specifically, the small variations in the input data correspond to the directions orthogonal to the manifold. The contractive autoencoder tries to make the features invariant in all directions by means of the F-norm of the Jacobian matrix penalty. Simultaneously, it should be able to reconstruct the input. Therefore, it is good at representing the data variations near a lower-dimensional manifold.

VARIATIONAL AUTOENCODER
A variational autoencoder is a recently developed methodology [67], [68]. It is not a precise autoencoder model but a generative one. It could generate training samples not available in the gallery. Different from those of an archetypal autoencoder, the outputs of the encoder and decoder represent samples drawn from a parameterized probability density function, as shown in Figure 5.

From the perspective of coding theory, the hidden variable $h$ can be interpreted as a latent representation or code. The recognition model $q_\phi(h\,|\,x)$ is referred to as a probabilistic encoder. The data are encoded into a soft ellipsoidal region in the latent space, rather than a single point. Given an example $x$, a variational autoencoder produces a distribution (e.g., Gaussian) over the possible values of the code $h$, from which the datapoint $x$ can be generated. The phase of encoding can then be expressed as

Encoder: $\tilde{h} = \sigma(W_0 x + b_0)$, $\mu = W_1 \tilde{h} + b_1$, $\log \sigma^2 = W_2 \tilde{h} + b_2$, $\log q_\phi(h\,|\,x) = \log \mathcal{N}(h;\, \mu,\, \sigma^2 I)$, (13)

where the model parameters and latent states sample the statistical distribution $h \sim q_\phi(h\,|\,x)$.

In the phase of decoding, a similar vein can be pursued. We refer to $p_\theta(x\,|\,h)$ as a probabilistic decoder. Given a code $h$, it produces a distribution over the possible corresponding values of $x$. The decoder is therefore a conditional generative model that estimates the probability of generating $x$ given the latent variable,

Decoder: $\tilde{h}' = \delta(W_3 h + b_3)$, $\mu' = W_4 \tilde{h}' + b_4$, $\log \sigma'^2 = W_5 \tilde{h}' + b_5$, $\log p_\theta(x\,|\,h) = \log \mathcal{N}(x;\, \mu',\, \sigma'^2 I)$. (14)

A variational autoencoder aims to approximate $p_\theta(x\,|\,h)$ by the given distribution $q_\phi(h\,|\,x)$. It is required to balance the reconstruction accuracy and the Gaussian distribution's goodness of fit. The objective can be achieved by the neural network itself. Therefore, the loss function is composed of these items. The reconstruction accuracy can be measured by the MSE, while the difference between the latent variables and the Gaussian distribution is typically quantified by the KL divergence,

$J_{\mathrm{VAE}} = D_{\mathrm{KL}}(q_\phi(h\,|\,x) \,\|\, p_\theta(x\,|\,h)) + \mathrm{MSE}(x, z)$, (15)

where $\phi$ and $\theta$ are the variational parameters and the generative parameters, respectively.

FIGURE 5. An illustration of a variational autoencoder. The encoder outputs a mean and log variance, sampling is driven by N(0, 1) noise, and the decoder reconstructs the input. The model parameters and the latent states are sampled from a parameterized statistical distribution.
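A compact sketch of (13)–(15) using the standard reparameterization $h = \mu + \sigma \cdot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. Two assumptions to flag: the KL term below is taken against the $\mathcal{N}(0, I)$ prior, the usual variational-autoencoder practice that (15) states more loosely, and the decoder variance head of (14) is omitted for brevity, so only the mean $\mu'$ is used as the reconstruction. All shapes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_v, d_e, d_h = 64, 32, 8                     # assumed layer sizes
W0, b0 = rng.normal(0, 0.1, (d_e, d_v)), np.zeros(d_e)
W1, b1 = rng.normal(0, 0.1, (d_h, d_e)), np.zeros(d_h)
W2, b2 = rng.normal(0, 0.1, (d_h, d_e)), np.zeros(d_h)
W3, b3 = rng.normal(0, 0.1, (d_e, d_h)), np.zeros(d_e)
W4, b4 = rng.normal(0, 0.1, (d_v, d_e)), np.zeros(d_v)

def vae_loss(x):
    ht = sigmoid(W0 @ x + b0)                 # shared encoder layer, (13)
    mu, log_var = W1 @ ht + b1, W2 @ ht + b2  # Gaussian code parameters
    h = mu + np.exp(0.5 * log_var) * rng.standard_normal(d_h)
    htp = sigmoid(W3 @ h + b3)                # decoder layer, (14)
    z = W4 @ htp + b4                         # reconstruction mean mu'
    # Closed-form KL(q(h|x) || N(0, I)) plus the MSE term, as in (15)
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return z, kl + np.sum((x - z) ** 2)

z, loss = vae_loss(rng.random(d_v))
```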



For high-dimensional data, similar examples may be distributed in a high-dimensional manifold. The task of representation learning is to predict that manifold either explicitly or implicitly. The observations can then be reconstructed from the low-dimensional latent variants, and the latent variants could approximate the observations. Moreover, the similarity varies smoothly.

CONVOLUTIONAL AUTOENCODER
The fully connected autoencoder ignores the spatial structure of an image. To handle the problem, Masci et al. present a convolutional autoencoder (CAE) [69]. They introduce some redundancy in the model parameters, forcing the learned representation to be global, i.e., spanning the entire visual field. Because the weights and bias are shared among all locations of the input, the spatial locality can be preserved. Given the single-channel image $x$, the latent representation of the $k$th feature map is given by

$h^k = \sigma(x * W^k + b_v)$. (16)

The bias $b_v$ is broadcast to the whole map. Because one bias per pixel would introduce too many degrees of freedom, one bias per latent map is employed. Then, each kernel filter can specialize on features of the whole input.

Following the road map of a CNN, a max-pooling layer is connected with the convolutional layer. Sparsity over the representation may then be introduced. It erases the nonmaximal values in nonoverlapping subregions, which could generate a more broadly applicable feature and avoid trivial solutions, such as having only one weight on. The reconstruction is obtained using

$z = \delta\left( \sum_k h^k * \tilde{W}^k + b_h \right)$, (17)

where $\tilde{W}$ denotes the flip of $W$ over both dimensions of the weights. In the phase of reconstruction, the sparse latent coding decreases the average number of filters contributing to the decoding of each pixel, forcing filters to be more general. Consequently, $\ell_1$ or $\ell_2$ regularization over the hidden layer is not needed. The cost function is

$J_{\mathrm{conv}} = L(x, z) = \sum_i \|x_i - z_i\|_2^2$. (18)
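A literal rendering of (16)–(18) on a toy image, with SciPy's 2-D convolution standing in for $*$; the filter count, kernel size, and the "valid"/"full" mode pairing (chosen so the reconstruction matches the input size) are assumptions:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.random((8, 8))                        # single-channel toy image
kernels = rng.normal(0, 0.1, (4, 3, 3))       # four assumed 3x3 filters W^k
b_v = np.zeros(4)                             # one bias per latent map
b_h = 0.0

# Encoding (16): h^k = sigma(x * W^k + b_v), one 6x6 map per kernel
h = np.stack([sigmoid(convolve2d(x, kernels[k], mode="valid") + b_v[k])
              for k in range(4)])

# Decoding (17): convolve each map with the doubly flipped kernel and sum
z = sigmoid(sum(convolve2d(h[k], kernels[k][::-1, ::-1], mode="full")
                for k in range(4)) + b_h)

J = np.sum((x - z) ** 2)                      # reconstruction cost (18)
```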
ILLUSTRATIVE EXAMPLES
The generic autoencoder model is composed of an encoder and a decoder. The learned feature is the encoded coefficient. For a nonlinearly separable data set, a network with many more hidden neurons is usually built (i.e., an overcomplete representation). For high-dimensional examples, the dimension can be reduced by training a network with fewer hidden neurons. Therefore, an autoencoder can be viewed as a dimension-reduction trick. The network with nonlinear activation plays a similar role to that of nonlinear mappings, such as locally linear embedding and Laplacian eigenmaps, while the network with linear activation resembles the linear modes, such as PCA and linear discriminant analysis. In this section, we demonstrate some illustrative examples in which PCA and an autoencoder with linear and nonlinear activation are compared.

To visually compare an autoencoder and PCA, we perform a group of experiments. The images are 96 × 96 pixels in size and represent three different classes. The details can be found in the "Experiments and Discussions" section. We produce the two-dimensional representations for the bag of images by a trained autoencoder and PCA; the learned features are displayed in Figure 6. Here, the representations learned by the autoencoder are more scattered, while the ones produced by PCA overlap. Therefore, it is much easier to differentiate these samples using the representations learned by the autoencoder.

We then evaluate the recognition performance. The learned representations are connected with a softmax layer to implement target classification. The experimental results are displayed in Figure 7, where the tendency is obvious. As the dimension of the learned representation increases, the recognition accuracy changes. The 256-D representation always produces the best performance. On the other hand, the recognition accuracy obtained using PCA is consistently poorer than that obtained using the autoencoder, whether linear activation or rectified linear unit (ReLU) activation is used. Simultaneously, we found the performance obtained using an autoencoder with ReLU activation is always much better than that with linear activation for all six dimensions of representation.

FIGURE 6. The scattering map of the learned two-dimensional feature by (a) the autoencoder and (b) PCA.



FIGURE 7. The recognition performance obtained using the autoencoder and PCA. AE: autoencoder.

THE DEEP MODEL
A deep architecture is formed by the composition of multiple levels of representations. Likewise, the basic autoencoder can be concatenated to build the deep model, a stacked autoencoder. It is composed of a visible layer, several hidden layers, and an output layer. The output of each previous layer is wired to the input of the successive layer, as illustrated in Figure 8. A stacked autoencoder inherits the benefits of a deep network, which has greater expressive power. An autoencoder model aims to learn a good representation of the input. The first hidden layer of the stacked autoencoder tends to learn first-order features from the input data. The successive hidden layers typically generate high-level features corresponding to patterns in the appearance of those from previous levels. Therefore, it captures a useful hierarchical grouping or part–whole decomposition of the input [70].

FIGURE 8. A stacked autoencoder with two hidden layers. The successive layers are connected with the preceding layer by the model parameters. To implement classification, the last hidden layer is usually connected with a softmax layer, whose output is the probability of the class taking each possible value.

For a stacked autoencoder with $N$ hidden layers, the weights and biases of the $k$th layer are denoted by $W_v^{(k)}, W_h^{(k)}, b_v^{(k)}, b_h^{(k)}$. The encoding is implemented layer by layer in a forward order:

$h^{(k)} = \sigma(z^{(k)})$, $z^{(k+1)} = W_v^{(k)} h^{(k)} + b_v^{(k)}$, (19)

where $z^{(k)} = x^{(k-1)}$ is the output of the previous layer. Similarly, the decoding is run layer by layer in a reverse order:

$h^{(n+k)} = \delta(z^{(n+k)})$, $z^{(n+k+1)} = W_h^{(n-k)} h^{(n+k)} + b_h^{(n-k)}$. (20)

The information is contained in the transition period, i.e., the activation of the deepest layer of hidden units. The model parameters are estimated by a layer-wise greedy strategy [61]. Specifically, the first hidden layer is trained on the input data, resulting in the parameters $W_v^{(1)}, W_h^{(1)}, b_v^{(1)}, b_h^{(1)}$. The activations of its hidden neurons are then used to train the second hidden layer, producing the parameters $W_v^{(2)}, W_h^{(2)}, b_v^{(2)}, b_h^{(2)}$. The procedure is transferred to the final hidden layer. Each layer is trained individually, i.e., the parameters of the network that are not involved are frozen. To boost the performance, fine-tuning via back-propagation is proposed. This improves the results by tuning the parameters of all layers.

The implementation flow is summarized as follows (see the sketch after this list):
◗ Compute the activations of each hidden layer.
◗ For the last layer $n_l$, compute the partial derivative $\delta^{(n_l)} = -(\nabla_{h^{(n_l)}} J) \cdot \sigma'(z^{(n_l)})$.
◗ For $l = n_l - 1, n_l - 2, \ldots, 2$, set $\delta^{(l)} = ((W^{(l)})^T \delta^{(l+1)}) \cdot \sigma'(z^{(l)})$.
◗ Produce the desired partial derivatives $\nabla_{W^{(l)}} J(W, b_h, b_v) = \delta^{(l+1)} (h^{(l)})^T$ and $\nabla_{b^{(l)}} J(W, b_h, b_v) = \delta^{(l+1)}$.
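A minimal NumPy sketch of the layer-wise greedy strategy (a bare-bones MSE trainer stands in for any of the single-layer variants above; the sizes, epochs, and step size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_layer(X, d_h, epochs=100, alpha=0.1):
    """Train one tied-weight autoencoder on X and return (W, b_v)."""
    m, d_v = X.shape
    W = rng.normal(0.0, 0.01, (d_h, d_v))
    b_v, b_h = np.zeros(d_h), np.zeros(d_v)
    for _ in range(epochs):
        h = sigmoid(X @ W.T + b_v)          # forward encoding, cf. (19)
        z = sigmoid(h @ W + b_h)            # reverse decoding, cf. (20)
        dz = 2.0 * (z - X) / m * z * (1.0 - z)
        dh = (dz @ W.T) * h * (1.0 - h)
        W -= alpha * (h.T @ dz + dh.T @ X)
        b_h -= alpha * dz.sum(axis=0)
        b_v -= alpha * dh.sum(axis=0)
    return W, b_v

def pretrain_stack(X, layer_sizes):
    """Layer-wise greedy strategy [61]: layer k is trained on the
    activations of layer k - 1; the layers not involved stay frozen."""
    feats, stack = X, []
    for d_h in layer_sizes:
        W, b_v = train_layer(feats, d_h)
        stack.append((W, b_v))
        feats = sigmoid(feats @ W.T + b_v)  # input to the next layer
    return stack, feats                     # deepest codes feed a softmax

X = rng.random((200, 64))                   # toy data, shapes assumed
stack, codes = pretrain_stack(X, [32, 16])
```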
Due to its representation power, the stacked autoencoder has been widely used in earlier works. Chen et al. build a deep learning architecture by stacking autoencoders, with which useful high-level features can be learned from hyperspectral data [28].



Geng et al. refine hand-engineered features by a contractive neural network [21]. Hao et al. encode the spectral values of the input pixel using a stacked denoising autoencoder and handle the corresponding image patch with a deep CNN [71]. Sun et al. learn a discriminative deep feature using a trained stacked autoencoder, where a constraint of label consistency on a neighborhood region has been introduced [16]. Zhang et al. consider the spatial and spectral information jointly using a recursive autoencoder network [72]. A weighting scheme is developed to enhance the representation power, and they determine the weights using the spectral similarity between the neighboring pixels and the investigated pixel.

Zhang et al. build a stacked denoising autoencoder to learn salient prior knowledge from auxiliary annotated data sets and then transfer the learned knowledge to estimate the intrasaliency for each image in cosaliency data sets [73]. Cao et al. prove that sparsity regularization and a denoising mechanism seem to be mandatory for constructing interpretable feature representations [74]. Feng et al. propose a stacked marginal discriminative autoencoder model to handle limited training samples [75]. The marginal samples obtained by k nearest neighbors between different classes are used to fine-tune the defined network. Paul and Kumar present a mutual information-based, segmented stacked autoencoder model to reduce computational complexity [76]. A nonparametric dependency-measure-based spectral segmentation is defined to consider both the linear and nonlinear interband dependency for spectral segmentation of the hyperspectral bands. Han et al. present an unsupervised convolutional sparse autoencoder model to represent the spatial-spectral features around the central pixel within a spatial neighborhood window [77].

CLASSIFICATION
Although an autoencoder represents unsupervised learning, it can be appended with a softmax layer to achieve pattern classification. This section first provides a simple review of linear regression and its special form, logistic regression, followed by multiclass logistic regression (i.e., the softmax classifier).

LINEAR REGRESSION
Linear regression models the relationship between a scalar dependent variable $y$ (response) and a set of explanatory variables $x_1, x_2, \ldots, x_m$,

$y = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m = \theta^T x$, (21)

where $x = [1, x_1, x_2, \ldots, x_m]$ is an observed instance and $\theta = [\theta_0, \theta_1, \ldots, \theta_m]$ are the regression coefficients. Given a set of labeled training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\}$, the cost function is defined as

$J(\theta; (x_i, y_i)_{i=1}^M) = \frac{1}{2M} \sum_{i=1}^{M} (y_i - h_\theta(x_i))^2$. (22)

The regression parameter $\theta$ is obtained by optimizing the objective $\hat{\theta} = \arg\min_\theta J(\theta; (x_i, y_i)_{i=1}^M)$. It is solved by the gradient descent method,

$\theta_j^{(\mathrm{new})} = \theta_j^{(\mathrm{old})} - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$, (23)

where $\alpha$ is the learning rate, which is important to the solution of the regression parameters $\theta$.

LOGISTIC REGRESSION
Logistic regression is a special form of linear regression in which the dependent variable is categorical (i.e., binary). It typically produces the binary classification $y \in \{0, 1\}$. The hypothesis concerning the label of a sample $x$ takes the form

$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)}$. (24)

It is obtained by a logarithm operation on the ratio of the positive and negative hypotheses:

$\mathrm{logit}(x) = \ln\left( \frac{P(y=1\,|\,x)}{P(y=0\,|\,x)} \right) = \ln\left( \frac{P(y=1\,|\,x)}{1 - P(y=1\,|\,x)} \right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m = \theta^T x$. (25)

The cost function can then be expressed as

$J(\theta; (x_i, y_i)) = -\frac{1}{M} \left[ \sum_{i=1}^{M} y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) \right]$.

MULTICLASS LOGISTIC REGRESSION
Multiclass logistic regression is a generalization of logistic regression to the case of multiple classes. The labels of logistic regression are binary, while the softmax classifier allows multiple classes, $y \in \{1, 2, \ldots, K\}$. For a query sample $x \in \mathbb{R}^d$, the output of the softmax classifier is the probability of the class label taking each of the $K$ different possible values:

$h_\theta(x) = \begin{bmatrix} P(y=1\,|\,x;\theta) \\ P(y=2\,|\,x;\theta) \\ \vdots \\ P(y=K\,|\,x;\theta) \end{bmatrix} = \frac{1}{\sum_j \exp(\theta_j^T x)} \begin{bmatrix} \exp(\theta_1^T x) \\ \exp(\theta_2^T x) \\ \vdots \\ \exp(\theta_K^T x) \end{bmatrix}$,

where $\theta = [\theta_1, \theta_2, \ldots, \theta_K]$ are the model parameters. The probability function is

$P(y\,|\,x;\theta) = \prod_{j=1}^{K} \left( \frac{\exp(\theta_j^T x)}{\sum_{i=1}^{K} \exp(\theta_i^T x)} \right)^{I\{y=j\}}$, (26)

where $I\{\cdot\}$ is an indicator function. The likelihood function is

$L(\theta; (x_i, y_i)) = \prod_{i=1}^{N} \prod_{j=1}^{K} \left( \frac{\exp(\theta_j^T x_i)}{\sum_{k=1}^{K} \exp(\theta_k^T x_i)} \right)^{I(y_i = j)}$. (27)



Taking the logarithm operation, this can then be written as

$\log L(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{K} I\{y_i = j\} \log \left( \frac{\exp(\theta_j^T x_i)}{\sum_{k=1}^{K} \exp(\theta_k^T x_i)} \right)$.

The cost function is

$J(\theta) = -\left[ \sum_{i=1}^{N} \sum_{k=1}^{K} I\{y_i = k\} \log \frac{\exp(\theta_k^T x_i)}{\sum_{j=1}^{K} \exp(\theta_j^T x_i)} \right]$. (28)
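A sketch of (26)–(28) as a plain gradient loop (a $1/M$ normalization and a max-shift for numerical stability are added as implementation conveniences; the data and step size are assumptions; logistic regression (24) is the $K = 2$ special case):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_probs(X, Theta):
    """h_theta(x) of (26) for every row of X: a K-vector of class
    probabilities (a max-shift keeps the exponentials stable)."""
    s = X @ Theta
    s -= s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cost_grad(X, y, Theta, K):
    """Cost (28), normalized by 1/M, and its gradient; y holds integer
    labels 1..K, expanded into the indicator I{y_i = k}."""
    M = len(y)
    P = softmax_probs(X, Theta)
    onehot = np.eye(K)[y - 1]
    return -np.sum(onehot * np.log(P)) / M, X.T @ (P - onehot) / M

X = rng.random((120, 8))                    # toy problem with K = 3
y = rng.integers(1, 4, 120)
Theta = np.zeros((8, 3))
for _ in range(300):
    J, g = softmax_cost_grad(X, y, Theta, 3)
    Theta -= 0.5 * g                        # gradient step, as in (23)
```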
APPLICATION TO TARGET RECOGNITION
With the development of integrated circuit technology, huge numbers of high-resolution images are collected by various kinds of sensors. It is urgent to achieve image interpretation automatically. Target recognition (i.e., of military vehicles on the ground, such as a main-battle tank, an armored personnel carrier, or a car) in SAR images is a typical research topic in image interpretation. We aim to differentiate these military vehicles by their radar images and use the term target recognition to classify the scenery of a radar image as one of several classes.

Although studied widely, target recognition is still an open problem due to the complicated battlefield environment and mutable imaging conditions. The conventional methods used for optical sensor images are not effective for radar imaging. Researchers have gradually recommended learning the effective representation from the data themselves via deep neural networks [78]–[82]. In this section, we provide a simple review of prior works on hand-engineered features. The achievement of target recognition in SAR images via an autoencoder neural network is then presented.

PREVIOUS WORKS
The performance of radar target recognition depends mainly on two key issues: 1) how to design an effective feature from a radar image and 2) how to implement classification with the defined feature. We provide a simple summary of the hand-engineered features developed for SAR images, followed by the typical classification methods popularly used.

HANDCRAFTED FEATURES
In the SAR imaging process, multiple scattering mechanisms contribute to the backscattered signal [83]. The interaction of the electromagnetic wave and the object includes the direct backscatter, the single-direction double bounce, and the return-direct multibounce. Due to the complicated scattering mechanisms, it is difficult to extract features from radar images. The conventional features developed for optical sensor images are no longer effective. Therefore, some delicate features specific to SAR images have been presented.
◗ Raw intensity. In early works, the raw intensity values of the enhanced images are directly employed to produce a feature [84], [85]. The defined feature is then input to a trained classifier. Therefore, the performance depends mainly on a discriminative classifier rather than the feature itself. Although easily defined, the recognition performance is limited, especially in nonliteral settings.
◗ Geometry features. These features are obtained by separating the target and the radar shadow, from which some region (or shape) descriptors can be obtained [86], [87], such as the shape context descriptor of the radar shadow and target as well as various kinds of statistical moments. This family of features usually needs a fine segmentation of the target, which is an open problem in radar imaging.
◗ Projection coefficients. Some researchers propose projecting the initial image onto a designed linear subspace and defining a feature using the transformed coefficients. The representative tricks include PCA, locality preserving projection, and nonnegative matrix factorization [53]. There is considerable ambiguity about what these features actually mean.
◗ Filter banks. To deal with extended conditions such as translation and distortion, some researchers propose defining the feature using a family of filter banks. Representatives include the Fourier transform, wavelets, and Gabor filters [88], [89]. These methods are usually effective for a specific data set or task but are not easily extended to other generalized applications.
◗ Scattering center models. This family of features refers to the returns received by the interactions between the incident radio wave and the physical structure of an object. The concept has been pursued by Potter and Moses [90], who present a framework for feature extraction predicated on parametric models for the radar returns. The models are motivated by the scattering behavior predicted using the geometrical theory of diffraction. Some improved methods have also been developed [91]–[95].

TYPICAL CLASSIFIER
The generated feature is input to a trained classifier to predict the class membership. The methods popularly used include the kNN classifier, kernel-based classifiers, Bayesian inference, and sparse representation. A kNN classifier predicts the identity by measuring the similarity between the probe and the training samples. The key is to define an appropriate metric, such as the MSE [96], KL divergence [97], [98], or Wishart distance [99]. A kernel-based classifier projects the initial data onto an abstract feature space with high or even infinite dimensionality. The class separability may then be enhanced in the feature space. One of the most representative methods is an SVM with a linear kernel, sigmoid kernel, Gaussian radial basis function, or polynomial kernel.

Zhao and Principe present an application of SVMs for SAR automatic target recognition [85]. They demonstrate the advantage of the SVM compared with conventional classifiers in both closed and open sets.


Sun et al. propose to classify SAR images using the adaptive boosting algorithm with the radial basis function network as the base learner [50]. Kemker and Kanan implement hyperspectral image classification by feeding the learned representation into an SVM classifier [43]. The Bayesian inference family of classifiers predicts the identification using Bayesian theory, such as the maximum a posteriori probability [51], maximum likelihood [100], and generalized likelihood ratio [101]. The sparse-representation-based classifier is a recently developed method that considers multiple linear regression models for the classification problem and addresses it with the theory of sparse representation [84].

TARGET RECOGNITION VIA AUTOENCODER
Although great efforts have been pursued, feature generation from radar imagery is still challenging due to the unique imaging mechanism. The handcrafted and predefined features are usually effective for a specific data set or task but may not transfer to another data set. In prior studies, researchers focus mainly on how to design a delicate feature by which target recognition can be achieved. Works are rarely devoted to learning the latent representation from the original data themselves. To break with this tendency and improve performance, advanced learning skills, such as deep neural networks [32], [102], allow an effective representation to be learned.

This article considers target classification in radar images. To avoid the multiple complicated procedures required in conventional methods, we propose learning an effective representation using an unsupervised learning skill: an autoencoder neural network [103]. The learned representation can then be input to a third-party trained classifier, or a softmax layer, to predict the class membership of the query. To form an autoencoder model, we reshape the radar image to be a single array whose entries are the nodes of the visible layer. The visible layer is then cast into the hidden layer by the model parameters (i.e., weights and biases) and the activation function. Each visible node is connected with all of the hidden neurons, some of which are deactivated or dropped. The network is allowed to transmit only forward; hence, it is also called a feed-forward neural network.
The deep neural network architecture is built by concatenating the basic autoencoder models hierarchically. To estimate the model parameters, a generic method is to optimize a loss function composed of the reconstruction error and some regularization terms. The deviation is allowed to transmit only from the end to the beginning, i.e., back-propagation. The latent representation is obtained by training the network layer by layer [61]. To implement target classification, the learned representation is then input to a softmax layer or a third-party classifier, by which the identification can be predicted. The procedure is shown in Figure 9.

The implementation flow is twofold, summarized as follows (a sketch follows the list):
1) Generate the representation by training an autoencoder neural network.
2) Predict the identification via a softmax (or a third-party) classifier.

FIGURE 9. Target recognition in SAR images via the autoencoder network.
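A minimal sketch of this two-step flow in tf.keras (the article's experiments used Keras with a Theano back end under Python 3.6; the file names below are placeholders, and the 9,216-node input, 1,600-unit code, and three-class softmax echo the settings discussed in the experiments rather than a verified script):

```python
import numpy as np
from tensorflow import keras

# Placeholders: X holds N flattened 96 x 96 chips, y integer class labels
X = np.load("sar_chips.npy").reshape(-1, 9216).astype("float32")
y = np.load("labels.npy")                      # 3 classes: BMP2, BTR70, T72

# Step 1: learn the representation with an unsupervised autoencoder
inp = keras.Input(shape=(9216,))
code = keras.layers.Dense(1600, activation="relu")(inp)
out = keras.layers.Dense(9216, activation="sigmoid")(code)
ae = keras.Model(inp, out)
ae.compile(optimizer="adam", loss="mse")
ae.fit(X, X, epochs=50, batch_size=64)         # reconstruct the input

# Step 2: predict the identity with a softmax layer on the learned codes
encoder = keras.Model(inp, code)
clf = keras.Sequential([keras.layers.Dense(3, activation="softmax")])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
clf.fit(encoder.predict(X), y, epochs=30, batch_size=64)
```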



FIGURE 10. The center (a) 48 × 48-, (b) 56 × 56-, (c) 64 × 64-, (d) 72 × 72-, (e) 80 × 80-, (f) 88 × 88-, (g) 96 × 96-, and (h) 104 × 104-pixel patches generated from the original image.
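The patch standardization behind Figure 10 — and the experiments that follow — crops the center of each 128 × 128 chip; a simple sketch:

```python
import numpy as np

def center_crop(img, size):
    """Return the central size x size patch of a (128 x 128) SAR chip."""
    r0 = (img.shape[0] - size) // 2
    c0 = (img.shape[1] - size) // 2
    return img[r0:r0 + size, c0:c0 + size]

chip = np.zeros((128, 128))                       # placeholder image
patches = {s: center_crop(chip, s)
           for s in (48, 56, 64, 72, 80, 88, 96, 104, 112)}
```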

112 × 112 patches, corresponding to 2,304-, 3,136-, 4,096-,


TABLE 2. THE ASPECT VIEW NUMBER FOR BMP2, BTR70, 5,184-, 6,400-, 7,744-, 9,216-, 10,816-, and 12,544-node visi-
AND T72.
ble layers. Figure 10 displays a set of the cropped images. The
Target Series NUMBER Training Testing cropped patches are used to train an autoencoder. The num-
BMP2 SN_9563 233 195 ber of the hidden unit is changed in a range [400, 1,200].
SN_9566 — 196 The results are reported as some statistics (e.g., the mini-
SN_c21 — 196 mum, median, maximum, and 25th and 75th percentiles)
BTR70 SN_c71 233 196 of different settings, as box-plotted in Figure 11. As can be
BMP2 SN_132 232 196
seen, the input data do play an important role in terms of
SN_812 — 195
the effectiveness of learned features. The recognition accu-
SN_s7 — 191
racies are gradually increased when the size of the visible
Total 698 1,365
layer increases from 2,304-D to 12,544-D. The lowest ac-
curacy is obtained by 48 × 48-pixel patches, in which part
layer, activation, loss function, optimizer, the variants of the of the radar shadow and most of the background have been
autoencoder, and the depth of the network. Images of BMP2, excluded. Contrarily, the best rate is produced by 96 × 96-
BTR70, and T72 are used. The details on training and testing and 112 × 112-pixel patches. The most stable performance
are shown in Table 2. is obtained using 96 × 96-pixel patches, while the least
stable performance is generated by 56 × 56-pixel input. The
THE SIZE OF THE VISIBLE LAYER recognition performance is much poorer when the size of
The size of the input layer refers to the dimension of input the visible layer is too small. The results demonstrate that
data. It influences the effectiveness of the learned repre- 96 × 96-pixel patches are suitable for the task of target rec-
sentation to some degree. To study the effect, we pursue a ognition. This set of experiments also confirms that both
set of experiments. The initial images are 128 × 128 pix- radar shadow and background have some discriminative
els. To produce input data of a different size, we standard- power. The performance can be improved by jointly con-
ize the images by cropping the center to 48 × 48, 56 × 56, sidering target, radar shadow, and background.
64 × 64, 72 × 72, 80 × 80, 88 × 88, 96 × 96, 104 × 104, and
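The patch standardization described above can be written in a few lines of NumPy. This is a sketch under assumed conventions (2-D magnitude chips; the helper name is ours):

```python
# Center-crop a 128 x 128 chip to the requested patch size and flatten it
# into a visible-layer vector (illustrative sketch).
import numpy as np

def center_crop(img, size):
    """Return the size x size center patch of a 2-D radar chip."""
    r0 = (img.shape[0] - size) // 2
    c0 = (img.shape[1] - size) // 2
    return img[r0:r0 + size, c0:c0 + size]

chip = np.random.rand(128, 128)  # placeholder SAR chip
patches = {s: center_crop(chip, s)
           for s in (48, 56, 64, 72, 80, 88, 96, 104, 112)}
visible = patches[96].reshape(-1)  # 9,216-node visible layer
```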
THE SIZE OF THE HIDDEN LAYER
The autoencoder neural network is an unsupervised learning trick. The learned representation is the encoded coefficients (i.e., the hidden neurons). Therefore, the dimension of the learned feature is determined by the size of the hidden layer. To study which number of hidden units is appropriate for the task of target recognition, we pursue a group of experiments. The 88 × 88- and 96 × 96-pixel patches are employed as the input, while the hidden layer is changed from 300 to 2,000 units.
The results are reported as the overall recognition rate, shown in Figure 12. We can see that the recognition rates vary irregularly when the hidden layer's size increases. The two kinds of input data demonstrate a similar trend. For 96 × 96-pixel input, the lowest rate is produced by a 300-hidden-unit network. Correspondingly, 300-, 700-, and 2,000-hidden-unit networks produce poor performance for 88 × 88-pixel input. The recognition accuracies are not proportional to the size of the hidden layer. The best recognition rate is obtained using a 1,600-hidden-node network with 96 × 96-pixel input. The results demonstrate the importance of tuning the hidden layer to the appropriate size for the mission at hand. Neither small nor large hidden layers could learn an effective representation.

FIGURE 12. The recognition accuracy across the number of hidden units.
ACTIVATION
The activation function is a nonlinear mapping. It projects the encodings or decodings into a certain range (e.g., (0, 1) for the sigmoid) and converts a linear encoder or decoder into a nonlinear one. The typical activations include the sigmoid function, the hyperbolic tangent, and the ReLU:
◗◗ sigmoid: σ(x) = 1/(1 + exp(−x))
◗◗ hyperbolic tangent (tanh): σ(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x))
◗◗ ReLU: σ(x) = max(x, 0).
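For reference, the three activations can be written directly in NumPy; this short sketch simply mirrors the formulas in the list above:

```python
# The three typical activations, written as in the formulas above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # range (0, 1)

def tanh(x):                               # range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):                               # range [0, inf)
    return np.maximum(x, 0.0)
```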
To study the effect of activation, a set of experiments is performed. We set the size of the visible layer as 96 × 96 pixels and vary the units of the hidden layer from 400 to 1,600. Figure 13 draws the recognition accuracy across the size of the hidden layer. Three activation functions are compared.
The results demonstrated in Figure 13(a) and (b) are slightly different. When the cross-entropy function is used to measure the deviation, the ReLU function produces the best performance, the accuracies gradually decrease when the number of hidden units increases, and the sigmoid function generates a much poorer performance. By contrast, the hyperbolic tangent function provides a great advantage when the MSE function is used to quantify the deviation, and the improvement is significant in terms of recognition accuracy. The performance obtained using the sigmoid function varies sharply, even increasing from 0.8200 to 0.8852. The hyperbolic tangent function always produces the most stable recognition performance. From this round of experiments, we may conclude that the hyperbolic tangent function is better used in conjunction with the MSE loss.

FIGURE 13. A comparison of the activation functions with the loss function of (a) cross entropy and (b) the MSE.
THE LOSS FUNCTION
The loss function is used to quantify the deviation of the reconstructed sample from the initial input data. An ideal autoencoder neural network aims to build an identity function with which the input data can be perfectly reconstructed. The deviation of the reconstruction from the original value is expected to be as small as possible. The metrics popularly used are the MSE and the cross entropy:
◗◗ MSE: L(x, z) = (1/(2d_v)) Σ_{i=1}^{d_v} (x_i − z_i)²
◗◗ cross entropy: L(x, z) = −Σ_{i=1}^{d_v} [x_i · log(z_i) + (1 − x_i) · log(1 − z_i)].
To study the impact of the loss function, a set of experiments is pursued. We set the size of the visible layer to 96 × 96 and change the number of hidden units from 400 to 1,600. The experimental results are shown in Figure 14, where the two loss functions are compared.
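Both metrics are easy to state in NumPy. The sketch below assumes x and z are flattened vectors of equal length d_v, with entries in (0, 1) for the cross entropy (e.g., after a sigmoid output):

```python
# Reconstruction losses as defined above (illustrative sketch).
import numpy as np

def mse_loss(x, z):
    d_v = x.size
    return np.sum((x - z) ** 2) / (2.0 * d_v)

def cross_entropy_loss(x, z, eps=1e-12):
    z = np.clip(z, eps, 1.0 - eps)           # guard against log(0)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))
```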


Figure 14(a) and (b) demonstrate totally different tendencies. The recognition accuracy obtained using the MSE function is much better than that of the cross-entropy function when the hyperbolic tangent is used to achieve activation. In that case, the larger the hidden layer, the better the performance, and the 1,200-hidden-neuron network produces the best result. By contrast, the performance produced by the cross-entropy function is much better than that of the MSE function when the ReLU function is employed to implement activation; the recognition accuracy gradually decreases as the number of hidden units increases, and the 400-hidden-unit network generates the best performance. The results prove that large-scale hidden-layer networks are suitable for pairing with the MSE function, while small-scale hidden-layer networks are better configured with the cross-entropy function.

FIGURE 14. A comparison of the loss functions with the activation of (a) the hyperbolic tangent and (b) the ReLU.
OPTIMIZER
Having determined the architecture of the neural network, the next problem is to solve for the model parameters (i.e., the weights and bias) by an optimization scheme. The most popularly used method is the gradient descent optimization algorithm. In the community of deep learning, many variants of gradient descent have been presented. Representatives include SGD, RMSprop, adagrad, adadelta, adaptive moment estimation (adam), adamax, and Nesterov-accelerated adaptive moment estimation (Nadam). A comprehensive review of these optimizers can be found in [104]. To verify the performance of these algorithms, we pursue a set of experiments. The results are reported as some statistics (i.e., the minimum, median, maximum, and 25th and 75th percentiles) of ten sample runs, drawn in Figure 15.
As Figure 15 shows, the family of optimization algorithms demonstrates sharp differences in recognition performance. The best performance is obtained by the adam optimizer, as widely studied in prior works. The poorest performance is generated by adadelta, which produces a nearly 20% drop in recognition accuracy. The result conforms to the prior study [104]. The recognition accuracies produced by the SGD, adadelta, and adam optimizers are much more robust than those of the remaining algorithms, while the fluctuation of the recognition rate produced by Nadam is more drastic than that of the remaining algorithms. Therefore, it can be concluded that the adam algorithm is a good and general choice to solve for the model parameters.

FIGURE 15. The recognition performance obtained using different optimizers.
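In Keras, swapping optimizers amounts to recompiling the same model under a different identifier, so the comparison can be scripted as follows. This is a sketch: build_autoencoder() stands in for the model definition given earlier, and the repeated training runs are elided:

```python
# Compare gradient descent variants by retraining the same architecture.
from keras.layers import Input, Dense
from keras.models import Model

def build_autoencoder(visible_dim=96 * 96, hidden_dim=800):
    x_in = Input(shape=(visible_dim,))
    h = Dense(hidden_dim, activation='sigmoid')(x_in)
    x_rec = Dense(visible_dim, activation='sigmoid')(h)
    return Model(x_in, x_rec)

for opt in ['sgd', 'rmsprop', 'adagrad', 'adadelta',
            'adam', 'adamax', 'nadam']:
    model = build_autoencoder()              # fresh weights per optimizer
    model.compile(optimizer=opt, loss='mse')
    # model.fit(x_train, x_train, ...) is repeated over ten sample runs,
    # and the downstream recognition rate is recorded for each run.
```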
SPARSE AUTOENCODER
A sparse autoencoder introduces a sparse constraint, the KL divergence between the average activation of the hidden units and a desired value, with which the inner structure of the input can be exploited. To validate the performance of the sparse constraint, we conduct a set of experiments. The input layer is set as 96 × 96-pixel patches, while the number of hidden units is specified as 800. We manually change the desired activation value from 0.01 to 0.8. Two different network structures are configured. The first scheme achieves the nonlinear mapping with the hyperbolic tangent function and measures the reconstruction error using the cross-entropy function. The second scheme implements the activation using the sigmoid function and quantifies the loss using the MSE function. The experimental results are drawn in Figure 16.


As can be seen from Figure 16(a) and (b), two totally different tendencies are displayed. Only a sparse autoencoder with the desired sparsity values of 0.01, 0.1, and 0.5 outperforms the prototype autoencoder when the hyperbolic tangent activation and the cross-entropy loss function are configured. Conversely, the recognition accuracy obtained using a sparse autoencoder with all desired sparsity values (from 0.01 to 0.8) is better than that of the plain autoencoder when the sigmoid activation is used in conjunction with the MSE loss function. The results prove that the sparse autoencoder is effective for the network configuration of sigmoid activation and MSE loss. The desired sparsity value should be tuned according to the task at hand; an inappropriate sparsity constraint may degrade the recognition performance.

FIGURE 16. A comparison of the autoencoder with and without the sparse constraint. (a) The hyperbolic tangent activation and cross-entropy loss and (b) the sigmoid activation and MSE loss.
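One common way to realize this constraint in Keras is a custom activity regularizer that adds the KL divergence between the batch-average activation and the desired value. The sketch below assumes a sigmoid hidden layer; the weighting beta and the helper name are our choices:

```python
# Sparse autoencoder via a KL-divergence activity regularizer (sketch).
from keras import backend as K
from keras.layers import Input, Dense
from keras.models import Model

def kl_sparsity(rho=0.05, beta=1e-3):
    def reg(h):
        # Average activation of each hidden unit over the batch.
        rho_hat = K.clip(K.mean(h, axis=0), 1e-6, 1.0 - 1e-6)
        kl = rho * K.log(rho / rho_hat) \
            + (1.0 - rho) * K.log((1.0 - rho) / (1.0 - rho_hat))
        return beta * K.sum(kl)
    return reg

x_in = Input(shape=(96 * 96,))
h = Dense(800, activation='sigmoid',
          activity_regularizer=kl_sparsity(rho=0.05))(x_in)
x_rec = Dense(96 * 96, activation='sigmoid')(h)
sparse_ae = Model(x_in, x_rec)
sparse_ae.compile(optimizer='adam', loss='mse')
```

Here rho plays the role of the desired average activation value that is varied in the experiments above.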

DENOISING AUTOENCODER
A denoising autoencoder is developed to deal with slight disturbances of the observed input. It trains the network using manually corrupted data. To verify the performance, a group of experiments is conducted. The size of the visible layer is set as 96 × 96 pixels, while 600- and 1,200-hidden-neuron networks are built. We enhance the corruption level gradually from 0.01 to 0.5. The recognition accuracies across the level of corruption are drawn in Figure 17, which compares a denoising autoencoder with the prototype autoencoder.
As seen in Figure 17, the two network configurations produce a similar trend. Significant improvement has been obtained when the input data are manually destructed. The best recognition rate is produced by a denoising autoencoder with a 600-hidden-neuron network, which is 3.04% better than that of the prototype autoencoder. Similarly, the best recognition rate for the denoising autoencoder is 0.9237 when a 1,200-hidden-unit network is built, 2.93% better than the prototype autoencoder. Only the accuracy of denoising with a 0.05 corruption level is slightly lower than that of the autoencoder. A similar conclusion can be made based on the preceding study [63]. The results prove that the effectiveness of the learned representation can be promoted by manually corrupting the input data.

FIGURE 17. A comparison of the autoencoder with and without the denoising trick. (a) A 600-hidden-node network and (b) a 1,200-hidden-node network.
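The denoising trick itself is a one-line corruption of the training inputs. The sketch below uses masking noise (zeroing a random fraction of entries), one common choice; the clean patches remain the reconstruction target:

```python
# Manual corruption of the input for a denoising autoencoder (sketch).
import numpy as np

def corrupt(x, level=0.1, rng=np.random):
    """Zero out roughly a `level` fraction of the entries of each input."""
    mask = rng.rand(*x.shape) >= level
    return x * mask

# Training pairs: corrupted input, clean target, e.g.,
# denoising_ae.fit(corrupt(x_train, level=0.1), x_train, epochs=10)
```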


DROPOUT
The denoising trick corrupts the input data manually, by which a slight perturbance can be tolerated. Conversely, the dropout mode abandons the hidden neurons randomly. Its goal is to handle small training samples. To verify the performance, we pursue a group of experiments. The visible layer is set as 96 × 96 pixels, while 600- and 1,200-hidden-neuron networks are configured. We promote the level of dropout gradually from 0.05 to 0.4. The results are reported in Figure 18, where the recognition accuracy over the level of dropout is plotted.
As Figure 18 shows, an autoencoder with and without the dropout trick demonstrates a trend similar to that of an autoencoder with and without denoising, as shown in Figure 17. An autoencoder with all levels of dropout performs much better than the prototype autoencoder. The best recognition accuracy is produced by the 600-hidden-unit network with the dropout trick, which is 2.07% better than that of an autoencoder without dropout. Similarly, for the 1,200-hidden-neuron network, the best recognition rate for an autoencoder with dropout is 0.9104, 1.63% better than that of the archetypal autoencoder. At the same time, the curves drawn in Figure 18(a) and (b) are slightly different: the recognition accuracy increases approximately monotonically for the 1,200-hidden-unit network, while the recognition rate produced by the 600-hidden-neuron network forms a unimodal sequence.

FIGURE 18. The performance obtained using the autoencoder with and without dropout. Two structural networks with (a) 600 hidden neurons and (b) 1,200 hidden neurons are built. The recognition accuracy of the former network forms a unimodal sequence, while the performance of the latter network increases monotonically.
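In Keras, the dropout mode is configured by inserting a Dropout layer after the encoding, as in this sketch (the drop level and layer sizes are illustrative):

```python
# Autoencoder with dropout on the hidden neurons (sketch).
from keras.layers import Input, Dense, Dropout
from keras.models import Model

drop_level = 0.2                                  # varied from 0.05 to 0.4
x_in = Input(shape=(96 * 96,))
h = Dense(600, activation='sigmoid')(x_in)
h_drop = Dropout(drop_level)(h)                   # active only in training
x_rec = Dense(96 * 96, activation='sigmoid')(h_drop)
dropout_ae = Model(x_in, x_rec)
dropout_ae.compile(optimizer='adam', loss='mse')
```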
CONTRACTIVE AUTOENCODER
In contrast to the denoising and dropout tricks, where the corrupted or dropped nodes are randomly determined, a contractive autoencoder presents a deterministic constraint. The new constraint term is defined as the Frobenius norm of the Jacobian matrix of the hidden neurons; with it, a robust representation can be learned. A group of experiments is performed to validate the performance. We set the visible layer at 96 × 96 pixels and change the number of hidden neurons from 400 to 800. The experimental results are shown in Figure 19.
We found that a similar conclusion can be reached from Figure 19(a) and (b). The performance obtained using a contractive autoencoder is consistently better than that using the archetypal autoencoder, except for the 400-hidden-neuron network with sigmoid activation. The improvement is much more significant if the hyperbolic tangent function is employed to achieve activation, and the recognition accuracy increases monotonically when the number of hidden neurons grows. By contrast, the recognition rate of the network with sigmoid activation forms an approximately unimodal pattern. Still, although a contractive autoencoder can improve the recognition performance, its computational cost is much more intensive than that of an autoencoder, and its memory consumption is much larger as well. Our 2-GB graphics processing unit (GPU), a GeForce GTX 750 Ti, could build the network with at most 800 hidden neurons; a network with more hidden neurons results in GPU memory overflow.

FIGURE 19. A comparison of the autoencoder with and without the Jacobian constraint. (a) The hyperbolic tangent activation and (b) the sigmoid activation.
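For a sigmoid hidden layer h = σ(Wx + b), the squared Frobenius norm of the Jacobian ∂h/∂x has the closed form Σ_j (h_j(1 − h_j))² ‖W_j‖², so the penalty can be added directly to the reconstruction loss. The following sketch (the layer name and the weight lam are our assumptions) also hints at why the memory cost is high, because the full kernel enters the loss:

```python
# Contractive autoencoder: MSE plus the Jacobian penalty (sketch).
from keras import backend as K
from keras.layers import Input, Dense
from keras.models import Model

lam = 1e-4
x_in = Input(shape=(96 * 96,))
encode = Dense(600, activation='sigmoid', name='encode')
h = encode(x_in)
x_rec = Dense(96 * 96, activation='sigmoid')(h)
contractive_ae = Model(x_in, x_rec)

def contractive_loss(x_true, x_pred):
    mse = K.mean(K.square(x_true - x_pred), axis=-1)
    W = encode.kernel                      # shape: (9216, 600)
    dh = h * (1.0 - h)                     # derivative of the sigmoid
    # Per-sample squared Frobenius norm of the Jacobian dh/dx.
    jac = K.sum(K.square(dh) * K.sum(K.square(W), axis=0), axis=-1)
    return mse + lam * jac

contractive_ae.compile(optimizer='adam', loss=contractive_loss)
```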
VARIATIONAL AUTOENCODER
A variational autoencoder is usually composed of a generative model and a recognition model. It can be viewed as a special form of autoencoder, where the recognition model plays the role of the encoder and the generative model is regarded as the decoder. The most arresting characteristic of a variational autoencoder consists in the model parameters, which are sampled from a certain statistical distribution.


A variational autoencoder imposes constraint terms on the hidden neurons; hence, it is similar to a contractive autoencoder. To verify the performance, we perform a set of experiments using the Keras library by manually changing the dimension of the latent representation. The experimental results are given in Figure 20, where some statistics (i.e., the minimum, median, maximum, and 25th and 75th percentiles) of ten sample runs are reported. The model parameters are sampled from the normal distribution N(0, 1).
As Figure 20 shows, the recognition performance varies only slightly. Although the recognition accuracy is gradually promoted as the dimension of the learned representation increases, the improvement is slight. The 128-D, 160-D, and 192-D learned representations produce similar performances. The performance obtained by the 224-D latent representation is much better than the others; moreover, its recognition accuracy is much more stable than the baseline. The 256-D learned representation generates the poorest performance, and the fluctuation of its recognition accuracy is much sharper than that of the others. Therefore, we may conclude that the recognition performance is not proportional to the dimension of the learned representation.

FIGURE 20. The recognition performance obtained using a variational autoencoder.
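A minimal Keras sketch of this encoder/sampler/decoder structure follows. The latent code is drawn with the standard reparameterization trick, z = μ + σ · ε with ε ~ N(0, 1); all of the sizes are illustrative:

```python
# Variational autoencoder with the reparameterization trick (sketch).
from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

latent_dim = 128
x_in = Input(shape=(96 * 96,))
hid = Dense(512, activation='relu')(x_in)          # recognition model
z_mean = Dense(latent_dim)(hid)
z_log_var = Dense(latent_dim)(hid)

def sample_z(args):
    mu, log_var = args
    eps = K.random_normal(shape=K.shape(mu))       # eps ~ N(0, 1)
    return mu + K.exp(0.5 * log_var) * eps

z = Lambda(sample_z)([z_mean, z_log_var])
dec = Dense(512, activation='relu')(z)             # generative model
x_rec = Dense(96 * 96, activation='sigmoid')(dec)
vae = Model(x_in, x_rec)

def vae_loss(x_true, x_pred):
    rec = K.sum(K.binary_crossentropy(x_true, x_pred), axis=-1)
    kl = -0.5 * K.sum(1.0 + z_log_var - K.square(z_mean)
                      - K.exp(z_log_var), axis=-1)
    return rec + kl

vae.compile(optimizer='adam', loss=vae_loss)
```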
CONVOLUTIONAL AUTOENCODER
A CAE is developed by combining a CNN and an autoencoder. In contrast to the archetypal autoencoder, the CAE shares the kernel weights and bias among all locations in the input and could preserve the spatial locality. To evaluate the performance, we conduct a group of experiments. We manually change the size of the convolutional kernel and pursue a quantitative comparison with the deep CNN. Three kinds of convolutional kernels, 3 × 3, 5 × 5, and 7 × 7, are verified. The experimental results are displayed in Figure 21, where some statistics (i.e., the maximum, mean, minimum, and percentiles) are reported. A CNN is employed as the baseline. To avoid overfitting, three-layer CNN and CAE architectures are constructed. The experiments are pursued using the Keras library with a tensorflow backend. We build three convolutional layers, each followed by the 2 × 2 max-pooling trick. For the CAE decoder, an upsampling layer is followed by the deconvolutional layer. For the three-layer CNN, the learned representation is flattened and input to a softmax classifier, a fully connected layer.

FIGURE 21. The recognition accuracy obtained using a CAE and a CNN. Three kinds of convolutional kernels, 3 × 3, 5 × 5, and 7 × 7, are tested.
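The three-layer CAE described above can be sketched as follows. The filter counts are our assumptions, while the kernel size k is the factor varied in the experiment:

```python
# Three-layer convolutional autoencoder (sketch).
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

k = 5                                       # kernel size: 3, 5, or 7
x_in = Input(shape=(96, 96, 1))
e = Conv2D(16, (k, k), activation='relu', padding='same')(x_in)
e = MaxPooling2D((2, 2))(e)                 # 48 x 48
e = Conv2D(8, (k, k), activation='relu', padding='same')(e)
e = MaxPooling2D((2, 2))(e)                 # 24 x 24
e = Conv2D(8, (k, k), activation='relu', padding='same')(e)
encoded = MaxPooling2D((2, 2))(e)           # 12 x 12 x 8 feature maps

d = UpSampling2D((2, 2))(encoded)
d = Conv2D(8, (k, k), activation='relu', padding='same')(d)
d = UpSampling2D((2, 2))(d)
d = Conv2D(16, (k, k), activation='relu', padding='same')(d)
d = UpSampling2D((2, 2))(d)
x_rec = Conv2D(1, (k, k), activation='sigmoid', padding='same')(d)

cae = Model(x_in, x_rec)
cae.compile(optimizer='adam', loss='mse')
```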


As Figure 21 shows, the CAE's recognition performance differs across convolutional kernel sizes. The recognition accuracy obtained using a 5 × 5 convolutional kernel is much better than that using a 3 × 3 or 7 × 7 convolutional kernel. In addition, CAE(5) produces the most robust recognition performance. As for the three-layer CNN, it always outperforms the CAE, no matter which convolutional kernel is employed. The result is not surprising given their working mechanisms. The deep CNN was initially developed for large-scale pattern classification, e.g., on ImageNet, with architectures such as GoogLeNet or VGGNet. The CAE, by contrast, is a variant of the autoencoder; although it inherits some advantages of a CNN, its fundamental purpose is to learn a good representation. A good representation will not necessarily produce ideal classification accuracy, but we found that the performance obtained using a CAE is much more stable than that using a CNN.
THE DEPTH OF NETWORK
An autoencoder's neural network architecture is formed by concatenating the basic model layer by layer. The output (i.e., the hidden state) of a previous layer is wired to the inputs of the successive layer. To study which network depth is appropriate for this task, we conduct groups of experiments. Several different structural networks are configured, as sketched below. The sigmoid function is used to achieve activation. The experimental results are given in Table 3, where the overall recognition rate is reported.
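Greedy layer-wise training [61] underlies these stacked models: each autoencoder is fit on the codes produced by the previous encoder, and only the encoders are kept and concatenated before fine-tuning with a softmax layer. A sketch follows (the hidden sizes match the Stacked AE (2) configuration of Table 3; the data are placeholders):

```python
# Greedy layer-wise pretraining of a stacked autoencoder (sketch).
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

def train_layer(data, hidden_dim):
    x_in = Input(shape=(data.shape[1],))
    h = Dense(hidden_dim, activation='sigmoid')(x_in)
    x_rec = Dense(data.shape[1], activation='sigmoid')(h)
    ae = Model(x_in, x_rec)
    ae.compile(optimizer='adam', loss='mse')
    ae.fit(data, data, epochs=10, batch_size=32, verbose=0)
    return Model(x_in, h)                   # keep the encoder only

x_train = np.random.rand(100, 96 * 96)      # placeholder patches
enc1 = train_layer(x_train, 1000)           # first hidden layer
code1 = enc1.predict(x_train)
enc2 = train_layer(code1, 400)              # second hidden layer
# The stacked encoders are then wired to a softmax layer and fine-tuned.
```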
TABLE 3. THE RECOGNITION ACCURACIES OBTAINED USING A STACKED AUTOENCODER. THE INPUT LAYER HAS 96 × 96 = 9,216 NODES, AND THE OUTPUT LAYER IS A THREE-CLASS SOFTMAX CLASSIFIER.
Model            Hidden Layers (First → Last)               Accuracy
Autoencoder      1,000                                      0.8711
Stacked AE (2)   1,000 → 400                                0.9119
Stacked AE (3)   1,000 → 400 → 1,000                        0.9119
Stacked AE (4)   1,000 → 400 → 1,000 → 500                  0.9193
Stacked AE (5)   1,000 → 400 → 1,000 → 500 → 1,000          0.9237
Stacked AE (6)   1,000 → 400 → 1,000 → 500 → 1,000 → 500    0.9244

As can be seen, the performance obtained using a stacked autoencoder is much better than that using the single-hidden-layer model. The recognition accuracy for a two-hidden-layer stacked autoencoder is 4.08% better than that of the archetypal autoencoder. The recognition accuracies gradually improve as the network deepens: the deeper the network, the better the performance. However, the recognition performance reaches a plateau when the network depth goes beyond three hidden layers. A stacked autoencoder with six hidden layers outperforms one with two hidden layers by only 1.25%. Therefore, we can conclude that, although the representation power is inherited by a stacked autoencoder, the recognition performance is not proportional to the network depth. For the task of classification with limited training samples, a deep architecture usually sinks into overfitting. Therefore, it is important to configure a network with an appropriate structure.

EXTENDED EVALUATION
Factors related to an autoencoder have been studied previously, with all experiments performed under standard conditions. This section is devoted to extended evaluation under nonliteral settings, including changes to the aspect view, configuration and depression angle, articulation, and occlusion. Two popular learning strategies, the SVM and sparse representation-based classification (SRC) [105], are used to provide the baseline performance. Their input features result directly from the raw intensity values, and a linear kernel is employed for target classification. In prior work [106], a family of correlation pattern recognition methods was reviewed. The representative applications to radar target recognition include the optimal tradeoff synthetic discriminant function (OTSDF) filter [107] and the minimum noise and correlation energy (MINACE) filter [108]. A set of correlation filters is generated from the Fourier-transformed coefficients, and the inference is reached according to the correlation response to the generated filters. Monogenic-signal-representation-based classification (MSRC) [89], [109] is a recently developed strategy for target recognition in radar images. The monogenic signal is used to characterize a target's signature, while sparse signal modeling is utilized to implement the classification. All of these methods employ global, predefined, handcrafted features for target recognition and hence are specified as the baseline.

CONFIGURATION CHANGE
The configuration refers to the addition or removal of discrete components on the target, such as physical differences and structural modifications. All variants can be categorized as a single class in the military sense. Examples of configuration variation are listed in Table 4.

TABLE 4. TYPICAL EXAMPLES OF CONFIGURATION CHANGE.
Category                 Examples
Version variants         Smoke grenade, launchers, side skirts
Configuration variants   Two cables, fuel barrels
Structural changes       Dented fenders, broken antenna mount

To evaluate the performance under configuration change, images of four targets, BMP2, BTR60, T72, and T62, are used, among which BMP2 and BTR60 are armored personnel carriers and T72 and T62 are main-battle tanks. Both pairs of vehicles demonstrate similar scattering phenomenology.
Moreover, the BMP2 and T72 still have several configuration variants, noted by the series number. The details are listed in Table 5. The standards (BMP2_SN_9563 and T72_SN_132), taken at a 17° depression angle, are used to train the algorithms, while the variants (BMP2_SN_9566 and BMP2_SN_c21, and T72_SN_812 and T72_SN_s7), captured at a 15° depression angle, are specified for testing. The configurations used for training are not contained in the query set. Therefore, the task of target recognition is more challenging than in the previous experiments.

TABLE 5. THE ASPECT VIEWS OF DIFFERENT CONFIGURATIONS.
Target   Series Number   Training   Testing
BMP2     SN_9563         233        —
         SN_9566         —          196
         SN_c21          —          196
BTR60    k10yt7532       256        195
T72      SN_132          232        —
         SN_812          —          195
         SN_s7           —          191
T62      A51             299        273
Total                    1,020      1,246

Table 6 reports the experimental results, where "Stacked AE (n)" denotes a stacked autoencoder with n hidden layers. The recognition accuracies of SVM-raw and SRC-raw are consistent with the results reported in preceding works [88], [89]. As Table 6 shows, the overall performance is much poorer than in the standard evaluation of the previous experiments. The recognition accuracy is 0.8443 for an autoencoder, which is lower than in the standard verification. The drop in recognition accuracy can be attributed to the hard settings: in this round of experiments, both the configuration and the depression angle are significantly different between the images available for training and those for testing. The performance obtained using an autoencoder is comparable to or slightly better than that obtained using an SVM with raw intensity values and the correlation filters OTSDF and MINACE. Some improvement has been achieved using stacked autoencoders. The recognition accuracy is 0.8540 for Stacked AE (2), 0.8581 for Stacked AE (3), 0.8694 for Stacked AE (4), and 0.8726 for Stacked AE (5), i.e., 2.34%, 2.75%, 3.88%, and 4.20% better, respectively, than the single-hidden-layer autoencoder. Although the performance can be improved using stacked autoencoders, the improvement is slight: the recognition accuracy is merely comparable to that of the shallow sparse representation and the three-layer CNN, and it is lower than that presented in prior work [89], 0.8748. The performance under configuration changes should be further promoted.

TABLE 6. THE PERFORMANCE UNDER CONFIGURATION CHANGES.
Algorithm        Handcrafted Feature and Network Architecture    Accuracy
SVM-Raw          The raw intensity values                        0.8331
SRC-Raw          The raw intensity values                        0.8708
MSRC             The monogenic signal                            0.8748
OTSDF            The Fourier-transformed coefficients            0.8373
MINACE           The Fourier-transformed coefficients            0.8354
Autoencoder      9,216 → 1,200 → 4                               0.8443
Stacked AE (2)   9,216 → 1,200 → 500 → 4                         0.8540
Stacked AE (3)   9,216 → 1,200 → 500 → 1,200 → 4                 0.8581
Stacked AE (4)   9,216 → 1,200 → 500 → 1,200 → 500 → 4           0.8694
Stacked AE (5)   9,216 → 1,200 → 500 → 1,200 → 500 → 600 → 4     0.8726
CNN              Three-layer CNN                                 0.8664
ARTICULATION AND OCCLUSION
In the field of radar image interpretation, articulation and occlusion generally refer to the relative movement between different attached parts on the target and are usually designated as continuous (tank turret rotation and gun elevation) or discrete (opening and closing the hatches and doors). A pair of examples is shown in Figure 22.

FIGURE 22. The target articulation of the ZSU23/4 with the turret (a) straight and (b) articulated.

To validate the performance under articulation and occlusion, we pursue a set of experiments. Images of three military vehicles, ZSU23/4, 2S1, and BRDM_2, are employed, among which BRDM_2 and ZSU23/4 have several articulated variants. The standards taken at a 17° depression angle are used for training, while the variants collected at a 45° depression angle are employed for testing. The details are tabulated in Table 7. The images available for training and those for testing are taken under two significantly different operating conditions. Moreover, the target may be in different states, such as with a moved gun. Therefore, the task of target recognition is much more difficult than in previous experiments.

TABLE 7. THE STANDARD AND ARTICULATED ASPECT VIEWS.
                                     Testing
Target    Series Number   Training   Standard   Articulated
2S1       b01             299        303        —
BRDM_2    E-71            298        303        120
ZSU23/4   d08             299        303        119
Total                     896        1,148 (overall)
The results are reported as the overall recognition rate, listed in Table 8. The recognition performance is much poorer than in the previous experiments: even the best recognition accuracy is lower than 0.75. The sharp drop in recognition accuracy is caused mainly by the harsh experimental setting. Because the images used for training are taken at a 17° depression angle while the ones used for testing are collected at a 45° depression angle, a drastic change of 28° exists between the images available for training and those for testing. In addition, BRDM_2 and ZSU23/4 have several articulated variants, as detailed in Table 7. The recognition accuracies obtained using the correlation filters (i.e., OTSDF and MINACE) are only around 0.4. The performance obtained using the autoencoder neural network is much better than the baseline, even though the experimental setting is difficult. The single-hidden-layer autoencoder produces a recognition accuracy of 0.6727, which is 14.66%, 13.61%, and 3.33% better, respectively, than its competitors, SVM-raw, SRC-raw, and MSRC; it is slightly lower than that of the three-layer CNN, 0.6871. The best performance, 0.7436, is obtained by Stacked AE (4). It is 0.46%, 3.0%, 4.72%, and 7.09% better, respectively, than Stacked AE (5), Stacked AE (3), Stacked AE (2), and the autoencoder. The recognition accuracy obtained using Stacked AE (4) is better than that using Stacked AE (5). This phenomenon proves that classification accuracy is not necessarily proportional to the depth of the neural network. Therefore, it is necessary to tune an appropriate structure of the neural network according to the task at hand.

TABLE 8. THE PERFORMANCE ON ARTICULATION AND DEPRESSION VARIATIONS.
Algorithm        Architecture                                 Accuracy
SVM-Raw          Raw intensity values                         0.5261
SRC-Raw          Raw intensity values                         0.5366
MSRC             Monogenic signal                             0.6394
OTSDF            Fourier-transformed coefficients             0.4486
MINACE           Fourier-transformed coefficients             0.3998
Autoencoder      9,216 → 2,400 → 3                            0.6727
Stacked AE (2)   9,216 → 2,400 → 400 → 3                      0.6964
Stacked AE (3)   9,216 → 2,400 → 400 → 800 → 3                0.7136
Stacked AE (4)   9,216 → 2,400 → 400 → 800 → 600 → 3          0.7436
Stacked AE (5)   9,216 → 2,400 → 400 → 800 → 600 → 500 → 3    0.7390
CNN              Three-layer CNN                              0.6871

DISCUSSION
This section is devoted to the verification of the autoencoder and its variants. Two families of experimental plans, fundamental validation and extended evaluation, are pursued. The former aims to study the impacts of the related factors and verify the configuration of the neural networks. The latter is meant to evaluate the performance under extended operating conditions, such as those involving depression angle and configuration changes, articulation, and occlusion. The comparative studies demonstrate the following.
◗◗ An improvement can be obtained using a stacked autoencoder compared to the basic model.
◗◗ The network configuration plays an important role in representation learning.
◗◗ The autoencoder's neural network could handle the nonliteral experimental settings to some degree.
Although an autoencoder and the deep models could handle extended operating conditions, this problem is far from being solved. It is very difficult to tune a suitable network structure. In addition, the performance needs to be promoted further.


CONCLUSIONS
In this article, we attempt to pursue a systematic review of the autoencoder and its variants. Much attention is paid to the applications in remote sensing and radar image interpretation. The contributions of the existing works are categorized into four areas: 1) the implementation of information fusion, 2) the development of the cost function, 3) the renewal of the learning process, and 4) the achievement of transfer learning. Some comparative studies are performed from the perspective of target recognition in SAR images. The performance is validated by two kinds of experimental plans: fundamental verification under the standard setting and extended evaluation under nonliteral conditions. The quantitative comparison leads to interesting suggestions about the logical configuration of a deep network for the task at hand. The main conclusions reached from our study are fourfold.
◗◗ The latent representation learned by deep neural networks outperforms handcrafted, shallow, predefined features, as proved in prior works.
◗◗ A drastic fluctuation in recognition performance can be produced if the networks are configured unsuitably. A neural network with inappropriate configurations may degrade the performance.
◗◗ There is no particular network structure that could consistently provide optimal performance. Therefore, it is important to tune the network flexibly for the task at hand.
◗◗ The performance can be further improved by deep neural networks, especially for target recognition under extended operating conditions.
This article discusses the attempt to apply an autoencoder to target recognition in SAR images. Although some improvement has been achieved, increasingly difficult work should be pursued further. The most urgent problem is how to handle the real-world sources of variability. Continued research will focus on dealing with target recognition under nonliteral conditions. Some new tricks will be developed, including initialization via pretraining, updating of the learning algorithm, and adaptive neural network architectures (i.e., the tuning skill of the network structure). Although related studies have been completed in prior works [2], [43], [110], further research in SAR target recognition is still needed.

ACKNOWLEDGMENT
This work was supported by the National Science Fund for Distinguished Young Scholars of China under Grant 61525105. We would like to thank the associate editor and the anonymous reviewers for their great contribution to this article. Dr. Ganggang Dong would like to thank Dr. Zhouhan Lin for his kind help.
AUTHOR INFORMATION
Ganggang Dong ([email protected]) received his M.S. and Ph.D. degrees in information and communication engineering from the National University of Defense Technology, Changsha, China, in 2012 and 2016, respectively. Since 2014, he has authored more than 20 scientific papers in peer-reviewed journals and conferences, including IEEE Transactions on Image Processing, IEEE Transactions on Geoscience and Remote Sensing, IEEE Geoscience and Remote Sensing Magazine, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Geoscience and Remote Sensing Letters, and IEEE Signal Processing Letters. His research interests include the applications of compressed sensing and sparse representation, pattern recognition, manifold learning, and deep neural networks.
Guisheng Liao ([email protected]) received his B.S. degree from Guangxi University, Nanning, China, in 1985 and his M.S. and Ph.D. degrees from Xidian University, Xi'an, China, in 1990 and 1992, respectively. He is a professor with Xidian University, where he is also dean of the School of Electronic Engineering. He has been a senior visiting scholar with the Chinese University of Hong Kong. His research interests include synthetic-aperture radar (SAR), space-time adaptive processing, SAR ground moving target indication, and distributed small satellite SAR system design. He is a member of the National Outstanding Person and the Cheung Kong Scholars in China.
Hongwei Liu ([email protected]) received his B.Eng. degree in electronic engineering from the Dalian University of Technology, China, in 1992 and his M.Eng. and Ph.D. degrees in electronic engineering from Xidian University, Xi'an, China, in 1995 and 1999, respectively. He is currently director and a professor with the National Laboratory of Radar Signal Processing, Xidian University. His research interests include radar automatic target recognition, radar signal processing, and adaptive signal processing.
Gangyao Kuang ([email protected]) received his B.S. and M.S. degrees from the Central South University of Technology, Changsha, China, in 1991 and 1998, respectively, and his Ph.D. degree from the National University of Defense Technology, Changsha, in 1995. He is currently a professor and director of the Remote Sensing Information Processing Laboratory, National University of Defense Technology. His current research interests include remote sensing, synthetic aperture radar (SAR) image processing, change detection, SAR ground moving target indication, and classification with polarimetric SAR images.

REFERENCES
[1] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[2] A. Romero, C. Gatta, and G. Camps-Valls, "Unsupervised deep feature extraction for remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1349–1362, Mar. 2016.
[3] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2175–2184, Apr. 2015.
[4] Y. Li, X. Huang, and H. Liu, "Unsupervised deep feature learning for urban village detection from high-resolution remote sensing images," ISPRS J. Photogrammetry Remote Sens., vol. 83, no. 8, pp. 567–579, Aug. 2017.
[5] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[6] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[7] G. Hinton and R. Zemel, "Autoencoders, minimum description length and Helmholtz free energy," in Proc. 6th Int. Conf. Neural Information Processing Systems (NIPS), 1994, pp. 3–10.
[8] N. L. Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, June 2008.
[9] S. Lawrence, C. Giles, A. C. Tsoi, and A. Back, "Face recognition: A convolutional neural-network approach," IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 98–113, Jan. 1997.
[10] D. T. Grozdic and S. T. Jovicic, "Whispered speech recognition using deep denoising autoencoder and inverse filtering," IEEE/ACM Trans. Audio, Speech, Language Processing, vol. 25, no. 12, pp. 2313–2322, Dec. 2017.
[11] Y. Dai and G. Wang, "Analyzing tongue images using a conceptual alignment deep autoencoder," IEEE Access, vol. 6, no. 3, pp. 1137–1145, Mar. 2018.
[12] D. Park, Y. Hoshi, and C. C. Kemp, "A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 1544–1551, July 2018.
[13] J. Yu, C. Hong, Y. Rui, and D. Tao, "Multitask autoencoder model for recovering human poses," IEEE Trans. Ind. Electron., vol. 65, no. 6, pp. 5060–5068, June 2018.
[14] M. Ma, C. Sun, and X. Chen, "Deep coupling autoencoder for fault diagnosis with multimodal sensory data," IEEE Trans. Ind. Informat., vol. 14, no. 3, pp. 1137–1145, Mar. 2018.
[15] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 53, pp. 4238–4249, Aug. 2015.
[16] X. Sun, F. Zhou, J. Dong, F. Gao, Q. Mu, and X. Wang, “Encoding [31] P. Planinsic and D. Gleich, “Temporal change detection in SAR
spectral and spatial context information for hyperspectral image images using log cumulants and stacked autoencoder,” IEEE
classification,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 12, pp. Geosci. Remote Sens. Lett., vol. 15, pp. 297–301, Feb. 2018.
2250–2254, Dec. 2017. [32] L. Zhang, W. Ma, and D. Zhang, “Stacked sparse autoencoder in
[17] X. Ma, H. Wang, and J. Geng, “Spectral spatial classification of PolSAR data classification using local spatial information,” IEEE
hyperspectral image based on deep auto-encoder,” IEEE J. Sel. Geosci. Remote Sens. Lett., vol. 13, pp. 1359–1363, Sept. 2016.
Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 9, pp. 4073– [33] S. Malek, F. Melgani, Y. Bazi, and N. Alajlan, “Reconstructing
4085, Sept. 2016. cloud-contaminated multispectral images with contextualized
[18] Y. Chen, L. Jiao, Y. Li, and J. Zhao, “Multilayer projective dic- autoencoder neural networks,” IEEE Trans. Geosci. Remote Sens.,
tionary pair learning and sparse autoencoder for PolSAR image vol. 56, pp. 2270–2282, Apr. 2018.
classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, pp. 6683– [34] S. Deng, L. Du, C. Li, J. Ding, and H. Liu, “SAR automatic target
6694, Dec. 2017. recognition based on Euclidean distance restricted autoencod-
[19] F. Lv, M. Han, and T. Qiu, “Remote sensing image classification er,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no.
based on ensemble extreme learning machine with stacked au- 7, pp. 3323–3333, July 2017.
toencoder,” IEEE Access, vol. 5, pp. 9021–9031, 2017. [35] K. Liang, H. Chang, Z. Cui, S. Shan, and X. Chen, “Represen-
[20] W. Xie, L. Jiao, B. Hou, W. Ma, J. Zhao, S. Zhang, and F. Liu, tation learning with smooth autoencoder,” in Proc. Asian Conf.
“POLSAR image classification via Wishart-AE model or Wishart- Computer Vision, Singapore, 2014, pp. 72–86.
CAE model,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., [36] M. Gong, J. Liu, H. Li, Q. Cai, and L. Su, “A multiobjective sparse
vol. 10, no. 8, pp. 3604–3615, May. 2017. feature learning model for deep neural networks,” IEEE Trans. Neu-
[21] J. Geng, H. Wang, J. Fan, and X. Ma, “Deep supervised and con- ral Netw. Learn. Syst., vol. 26, no. 12, pp. 3263–3277, Dec. 2015.
tractive neural network for SAR image classification,” IEEE Trans. [37] Y. Sun, J. Li, W. Wang, A. Plaza, and Z. Chen, “Active learning
Geosci. Remote Sens., vol. 55, no. 4, pp. 2442–2459, Apr. 2017. based autoencoder for hyperspectral imagery classification,”
[22] E. Li, P. Du, A. Samat, Y. Meng, and M. Che, “Mid-level feature in Proc. IEEE Int. Geosciences and Remote Sensing Symp., Beijing,
representation via sparse autoencoder for remotely sensed scene China, July 2016, pp. 469–472.
classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., [38] J. Tang, C. Deng, and G. B. Huang, “Extreme learning machine
vol. 10, no. 3, pp. 1068–1081, Mar. 2017. for multilayer perceptron,” IEEE Trans. Neural Netw. Learn. Syst.,
[23] B. Hou, H. Kou, and L. Jiao, “Classification of Polarimetric SAR vol. 27, no. 4, pp. 809–821, Apr. 2016.
images using multilayer autoencoders and superpixels,” IEEE J. [39] Y. Zhou and Y. Wei, “Learning hierarchical spectral-spatial fea-
Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 7, pp. 3072– tures for hyperspectral image classification,” IEEE Trans. Cybern.,
3081, July 2016. vol. 46, no. 7, pp. 1667–1678, July 2016.
[24] C. Tao, H. Pan, Y. Li, and Z. Zou, “Unsupervised spectral-spatial [40] H. Wu, B. Liu, W. Su, W. Zhang, and J. Sun, “Deep filter banks for
feature learning with stacked sparse autoencoder for hyperspec- land-use scene classification,” IEEE Geosci. Remote Sens. Lett., vol.
tral imagery classification,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 12, pp. 1895–1899, Dec. 2016.
12, no. 12, pp. 2438–2442, Dec. 2015. [41] H. Kim and A. Hirose, “Unsupervised fine land classification
[25] W. Zhou, Z. Shao, C. Diao, and Q. Cheng, “High-resolution re- using quaternion autoencoder-based polarization feature extrac-
mote-sensing imagery retrieval using sparse features by auto-en- tion and self-organizing mapping,” IEEE Trans. Geosci. Remote
coder,” Remote Sensing Lett., vol. 6, no. 10, pp. 775–683, Aug. 2015. Sens., vol. 56, pp. 1839–1851, Mar. 2018.
[26] E. Othmana, Y. Bazi, N. Alajlan, H. Alhichri, and F. Melgani, [42] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught
“Using convolutional features and a sparse autoencoder for land- learning: Transfer learning from unlabeled data,” in ACM Proc.
use scene classification,” Int. J. Remote Sens., vol. 37, no. 10, pp. 24th Int. Conf. Mach. Learn., June 2007, pp. 759–766.
1977–1995, 2016. [43] R. Kemker and C. Kanan, “Self-taught feature learning for hy-
[27] Z. Lin, Y. Chen, X. Zhao, and G. Wang, “Spectral-spatial classi- perspectral image classification,” IEEE Trans. Geosci. Remote Sens.,
fication of hyperspectral image using autoencoders,” in Proc. 9th vol. 55, no. 5, pp. 2693–2705, May 2017.
Int. Conf. Information, Communications and Signal Processing, Dec. [44] X. Yao, J. Han, G. Cheng, X. Qian, and L. Guo, “Semantic anno-
2013. tation of high-resolution satellite images via weakly supervised
[28] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning- learning,” IEEE Trans. Geosci. Remote Sens., vol. 54, pp. 3660–
based classification of hyperspectral data,” IEEE J. Sel. Topics 3671, June 2016.
Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, June [45] Z. Shao, L. Zhang, and L. Wang, “Stacked sparse autoencoder
2014. modeling using the synergy of airborne LiDAR and satellite opti-
[29] J. Geng, J. Fan, H. Wang, X. Ma, B. Li, and F. Chen, “High-res- cal and SAR data to map forest above-ground biomass,” IEEE J.
olution SAR image classification via deep convolutional au- Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 12, pp.
toencoders,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 5569–5582, Dec. 2017.
2351–2355, Nov. 2015. [46] H. Li and S. Misra, “Prediction of subsurface NMR T2 distribu-
[30] M. Kang, K. Ji, X. Leng, X. Xing, and H. Zou, “Synthetic aperture tions in a shale petroleum system using variational autoencoder-
radar target recognition with feature fusion based on a stacked based neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 14,
autoencoder,” MDPI Sensors, vol. 17, pp. 192, 2017. pp. 2395–2397, Dec. 2017.

66 ieee Geoscience and remote sensing magazine september 2018


[47] A. Elshamli, G. W. Taylor, A. Berg, and S. Areibi, “Domain ad- Proc. Advances in Neural Information Processing Systems, Vancouver,
aptation using representation learning for the classification of Canada, 2007, pp. 1137–1144.
remote sensing images,” IEEE J. Sel. Topics Appl. Earth Observ. Re- [63] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, “Extracting and
mote Sens., vol. 10, no. 9, pp. 4198–4209, Sept. 2017. composing robust features with denoising autoencoders,” in Proc.
[48] S. De, L. Bruzzone, A. Bhattacharya, F. Bovolo, and S. Chaud- ACM 25th Int. Conf. Machine Learning, July 2008, pp. 1096–1103.
huri, “A novel technique based on deep learning and a synthetic [64] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manza-
target database for classification of urban areas in PolSAR data,” gol, “Stacked denoising autoencoders: Learning useful represen-
IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 1, tations in a deep network with a local denoising criterion,” J.
pp. 154–170, Jan. 2018. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010.
[49] L. Windrim, R. Ramakrishnan, A. Melkumyan, and R. J. Mur- [65] N. Srivastava, G. Hinton, A. Krizhevsky, and I. Sutskever, “Drop-
phy, “A physics-based deep learning approach to shadow invari- out: A simple way to prevent neural networks from overfitting,”
ant representations of hyperspectral images,” IEEE Trans. Image J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
Process, vol. 27, no. 2, pp. 665–677, Feb. 2018. [66] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Con-
[50] Y. Sun, Z. Liu, S. Todorovic, and J. Li, “Adaptive boosting for SAR tractive auto-encoders: Explicit invariance during feature extrac-
automatic target recognition,” IEEE Trans. Aerosp. Electron. Syst., tion,” in Proc. ACM 28th Int. Conf. Machine Learning, Bellevue,
vol. 43, no. 1, pp. 112–125, Jan. 2007. Washington, 2011, pp. 833–840.
[51] U. Srinivas, V. Monga, and R. G. Raf, “SAR automatic target rec- [67] D. Kingma, S. Mohamed, D. Rezende, and M. Welling, “Semi-
ognition using discriminative graphical models,” IEEE Trans. supervised learning with deep generative models,” in Proc. Ad-
Aerosp. Electron. Syst., vol. 50, no. 1, pp. 591–606, Jan. 2014. vances in Neural Information Processing Systems, Montreal, Canada,
[52] A. A. Popescu, I. Gavat, and M. Datcu, “Contextual descriptors 2014, pp. 3581–3589.
for scene classes in very high resolution SAR images,” IEEE Geos- [68] D. Rezende, S. Mohamed, and D. Wierstra, “Stochastic back-
ci. Remote Sens. Lett., vol. 9, no. 1, pp. 80–84, Jan. 2012. propagation and approximate inference in deep generative mod-
[53] M. Liu, Y. Wu, P. Zhang, Q. Zhang, Y. Li, and M. Li, “SAR target els,” in Proc. ACM 31st Int. Conf. Machine Learning, June 2014, pp.
configuration recognition using locality preserving property and 1278–1286.
Gaussian mixture distribution,” IEEE Geosci. Remote Sens. Lett., [69] J. Masci, U. Meier, D. Cirean, and J. Schmidhuber, “Stacked con-
vol. 10, no. 2, pp. 268–272, Mar. 2013. volutional auto-encoders for hierarchical feature extraction,” in
[54] M. Li, Y. Guo, M. Li, G. Luo, and X. Kong, “Coupled dictionary Proc. Int. Conf. Artificial Neural Networks, 2011, pp. 52–59.
learning for target recognition in SAR images,” IEEE Geosci. Re- [70] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, and C. Suen. (2013). UFLDL
mote Sens. Lett., vol. 14, no. 6, pp. 791–795, June 2017. Tutorial: Stacked Autoencoders. [Online]. Available: http://ufldl
[55] B. Ding, G. Wen, X. Huang, C. Ma, and X. Yang, “Target recogni- .stanford.edu/wiki/index.php/Stacked_ Autoencoders
tion in synthetic aperture radar images via matching of attribut- [71] S. Hao, W. Wang, Y. Ye, T. Nie, and L. Bruzzone, “Two-stream
ed scattering centers,” IEEE J. Sel. Topics Appl. Earth Observ. Remote deep architecture for hyperspectral image classification,” IEEE
Sens., vol. 10, no. 7, pp. 3334–3347, July 2017. Trans. Geosci. Remote Sens, vol. 56, no. 4, pp. 2349–2361, 2018.
[56] A. M. Cheriyadat, “Unsupervised feature learning for aerial [72] X. Zhang, Y. Liang, C. Li, N. Huyan, L. Jiao, and H. Zhou, “Recur-
scene classification,” IEEE Trans. Geosci. Remote Sens., vol. 52, pp. sive autoencoders-based unsupervised feature learning for hy-
439–451, Jan. 2014. perspectral image classification,” IEEE Geosci. Remote Sens. Lett.,
[57] Y. Yang and S. Newsam, “Geographic image retrieval using local vol. 14, no. 11, pp. 1928–1932, 2017.
invariant features,” IEEE Trans. Geosci. Remote Sens., vol. 51, pp. [73] D. Zhang, J. Han, J. Han, and L. Shao, “Cosaliency detection
818–832, Feb. 2013. based on intrasaliency prior transfer and deep intersaliency
[58] X. Zheng, X. Sun, K. Fu, and H. Wang, “Geographic image re- mining,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp.
trieval using local invariant features,” IEEE Geosci. Remote Sens. 1163–1176, June 2016.
Lett., vol. 10, pp. 652–656, July 2013. [74] L. Cao, W. Huang, and F. Sun, “Building feature space of extreme
[59] W. Shao, W. Yang, and G. Xia, “Extreme value theory-based cali- learning machine with sparse denoising stacked-autoencoder,”
bration for the fusion of multiple features in high-resolution sat- Neurocomputing, vol. 174, no. Part A, pp. 60–71, Jan. 2016.
ellite scene classification,” Int. J. Remote Sens., vol. 34, pp. 8588– [75] J. Feng, L. Liu, X. Zhang, R. Wang, and H. Liu, “Hyperspectral
8602, Dec. 2013. image classification based on stacked marginal discriminative
[60] D. M. McKeown, S. D. Cochran, S. J. Ford, J. C. McGlone, J. A. autoencoder,” in Proc. IEEE Int. Geosciences and Remote Sensing
Shufelt, and D. A. Yocum, “Fusion of HYDICE hyperspectral data Symp., July 2017, pp. 3688–3671.
with panchromatic imagery for cartographic feature extraction,” [76] S. Paul and D. Kumar, “Spectral-spatial classification of hyper-
IEEE Trans. Geosci. Remote Sens., vol. 37, pp. 1261–1277, May 1999. spectral data with mutual information based segmented stacked
[61] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy autoencoder approach,” ISPRS J. Photogrammetry Remote Sens.,
layer-wise training of deep networks,” in Proc. Advances in Neu- vol. 138, no. 1, pp. 265–280, Apr. 2018.
ral Information Processing Systems, Vancouver, Canada, 2007, pp. [77] X. Han, Y. Zhong, and L. Zhang, “Spatial-spectral unsupervised
153–160. convolutional sparse auto-encoder classifier for hyperspectral
[62] M. Ranzato, C. Poultney, S. Chopra, and Y. Cun, “Efficient learn- imagery,” ISPRS J. Photogrammetry Remote Sens., vol. 83, no. 3, pp.
ing of sparse representations with an energy-based model,” in 195–206, Mar. 2017.

september 2018 ieee Geoscience and remote sensing magazine 67


[78] Z. Lin, K. Ji, M. Kang, X. Leng, and H. Zou, “Deep convolutional highway unit network for SAR target classification with limited labeled training data,” IEEE Geosci. Remote Sens. Lett., vol. 14, pp. 1091–1095, July 2017.
[79] Q. Song and F. Xu, “Zero-shot learning of SAR target feature space with deep generative neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 14, pp. 2245–2249, Dec. 2017.
[80] Z. Zhang, H. Wang, F. Xu, and Y.-Q. Jin, “Complex-valued convolutional neural network and its application in polarimetric SAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, pp. 7177–7188, Dec. 2017.
[81] G. Cheng, Y. Wang, S. Xu, H. Wang, S. Xiang, and C. Pan, “Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network,” IEEE Trans. Geosci. Remote Sens., vol. 55, pp. 3322–3337, June 2017.
[82] S. Chen, H. Wang, F. Xu, and Y.-Q. Jin, “Target classification using the deep convolutional networks for SAR images,” IEEE Trans. Geosci. Remote Sens., vol. 54, pp. 4806–4817, Aug. 2016.
[83] E. Keydel, S. Lee, and J. Moore, “MSTAR extended operating conditions—A tutorial,” in Proc. SPIE Algorithms for Synthetic Aperture Radar Imagery III, June 1996, pp. 228–242.
[84] J. Thiagarajan, K. Ramamurthy, P. Knee, A. Spanias, and V. Berisha, “Sparse representation for automatic target classification in SAR images,” in Proc. Int. Symp. Communication, Control and Signal Processing, Mar. 2010, pp. 1–4.
[85] Q. Zhao and J. C. Principe, “Support vector machines for SAR automatic target recognition,” IEEE Trans. Aerosp. Electron. Syst., vol. 37, no. 2, pp. 643–654, 2001.
[86] S. Papson and R. M. Narayanan, “Classification via the shadow region in SAR imagery,” IEEE Trans. Aerosp. Electron. Syst., vol. 48, no. 2, pp. 969–980, Apr. 2012.
[87] J. Zhu, X. Qiu, Z. Pan, Y. Zhang, and B. Lei, “An improved shape contexts based ship classification in SAR images,” Remote Sensing, vol. 9, no. 2, 2017.
[88] G. Dong, G. Kuang, N. Wang, and W. Wang, “Classification via sparse representation of steerable wavelet frames on Grassmann manifold: Application to target recognition in SAR image,” IEEE Trans. Image Process., vol. 26, no. 6, pp. 2892–2904, June 2017.
[89] G. Dong and G. Kuang, “Classification on the monogenic scale space: Application to target recognition in SAR image,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2527–2539, Aug. 2015.
[90] L. Potter and R. Moses, “Attributed scattering centers for SAR ATR,” IEEE Trans. Image Process., vol. 6, no. 1, pp. 79–91, 1997.
[91] J. Zhou, Z. Shi, X. Cheng, and Q. Fu, “Automatic target recognition of SAR images based on global scattering center model,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3713–3729, Oct. 2011.
[92] H. Liu, B. Jiu, F. Li, and Y. Wang, “Attributed scattering center extraction algorithm based on sparse representation with dictionary refinement,” IEEE Trans. Antennas Propag., vol. 65, no. 5, pp. 2604–2614, May 2017.
[93] M. Koets and R. Moses, “Feature extraction using attributed scattering center models on SAR imagery,” in Proc. SPIE Algorithms for Synthetic Aperture Radar Imagery VI, Apr. 1999, pp. 104–115.
[94] E. Ertin and R. L. Moses, “Through-the-wall SAR attributed scattering center feature estimation,” IEEE Trans. Geosci. Remote Sens., vol. 47, pp. 1338–1348, May 2009.
[95] M. Martorella, E. Giusti, A. Capria, F. Berizzi, and B. Bates, “Automatic target recognition by means of polarimetric ISAR images and neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 47, pp. 3786–3794, Nov. 2009.
[96] L. Novak, G. Owirka, and W. Brower, “Performance of 10- and 20-target MSE classifiers,” IEEE Trans. Aerosp. Electron. Syst., vol. 36, pp. 1279–1289, Oct. 2000.
[97] A. M. Atto, E. Trouve, Y. Berthoumieu, and G. Mercier, “Multidate divergence matrices for the analysis of SAR image time series,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 4, pp. 1922–1938, Apr. 2013.
[98] J. Inglada and G. Mercier, “A new statistical similarity measure for change detection in multitemporal SAR images and its extension to multiscale change analysis,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 5, pp. 1432–1445, May 2007.
[99] A. C. Frery, A. D. C. Nascimento, and R. J. Cintra, “Analytic expressions for stochastic distances between relaxed complex Wishart distributions,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 2, pp. 1213–1226, Feb. 2014.
[100] P. Guccione, A. M. Guarnieri, and M. Zonno, “Azimuth antenna maximum likelihood estimation by persistent point scatterers in SAR images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 2, pp. 947–955, Feb. 2014.
[101] P. Iervolino and R. Guida, “A novel ship detector based on the generalized-likelihood ratio test for SAR imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3616–3630, May 2017.
[102] P. Ghamisi, J. Plaza, Y. Chen, J. Li, and A. J. Plaza, “Advanced spectral classifiers for hyperspectral images: A review,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 1, pp. 8–32, Mar. 2017.
[103] H. Kamyshanska and R. Memisevic, “The potential energy of an autoencoder,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 6, pp. 1261–1273, June 2015.
[104] S. Ruder. (2016). An overview of gradient descent optimization algorithms. arXiv. [Online]. Available: https://arxiv.org/abs/1609.04747
[105] J. Wright, A. Yang, and A. Ganesh, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[106] V. K. Kumar, M. Savvides, and C. Xie, “Correlation pattern recognition for face recognition,” Proc. IEEE, vol. 94, no. 11, pp. 1963–1976, Nov. 2006.
[107] R. Singh and B. Kumar, “Performance of the extended maximum average correlation height filter and the polynomial distance classifier correlation filter for multiclass SAR detection and classification,” in Proc. SPIE Algorithms for SAR Imagery IX, vol. 4727, Aug. 2002, pp. 265–279.
[108] R. Patnaik and D. Casasent, “MINACE filter classification algorithms for ATR using MSTAR data,” in Proc. SPIE Automatic Target Recognition XV, vol. 5807, Aug. 2005, pp. 100–111.
[109] G. Dong, N. Wang, and G. Kuang, “Sparse representation of monogenic signal: With application to target recognition in SAR images,” IEEE Signal Process. Lett., vol. 21, no. 8, pp. 952–956, Aug. 2014.
[110] M. S. Seyfioglu, A. M. Ozbayoglu, and S. Z. Gurbuz, “Deep convolutional autoencoder for radar-based classification of similar aided and unaided human activities,” IEEE Trans. Aerosp. Electron. Syst., 2018.

