FIGURE 2. The (a) prototype of the autoencoder and (b) denoising autoencoder. The input data are manually corrupted at a certain level. The corrupted data are then mapped to the hidden layer, from which the initial data can be reconstructed.

FIGURE 3. The (a) archetypal autoencoder and (b) autoencoder with dropout. The input data are mapped to the hidden layer, in which some neurons are deactivated. Only the remaining units are employed to reconstruct the input data.
z = σ(W_h h + b_h). (9)
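The decoder in (9) maps the hidden representation h back to a reconstruction z through the weights W_h and bias b_h. A minimal pure-Python sketch of this forward pass; the toy weights and input below are hypothetical placeholders, since in practice they would be learned from data:

```python
import math

def sigmoid(vec):
    # Elementwise logistic activation.
    return [1.0 / (1.0 + math.exp(-v)) for v in vec]

def affine(W, x, b):
    # Computes W x + b for a list-of-rows weight matrix.
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

# Hypothetical toy weights: 3-D input -> 2-D code -> 3-D reconstruction.
W_v, b_v = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]            # encoder
W_h, b_h = [[0.4, 0.2], [-0.3, 0.7], [0.6, -0.1]], [0.05, 0.0, -0.05]  # decoder

x = [0.2, 0.7, 0.1]
h = sigmoid(affine(W_v, x, b_v))   # hidden representation
z = sigmoid(affine(W_h, h, b_h))   # reconstruction, as in Eq. (9)
```

The reconstruction z has the same dimension as the input x, and training would tune the weights so that z approximates x.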
FIGURE 5. An illustration of a variational autoencoder. The model parameters and the latent states are sampled from a parameterized statistical distribution.
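The sampling step of the variational autoencoder in Figure 5 is commonly implemented with the reparameterization trick: the encoder outputs a mean and a log-variance, and the latent state is drawn as z = μ + exp(½ log σ²) · ε with ε ~ N(0, 1). A minimal sketch, with hypothetical encoder outputs standing in for real ones:

```python
import math, random

def reparameterize(mean, log_var, rng=random):
    # z_i = mu_i + sigma_i * eps_i, with eps_i drawn from N(0, 1).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

random.seed(0)
mean, log_var = [0.0, 1.0], [0.0, -2.0]   # hypothetical encoder outputs
z = reparameterize(mean, log_var)
```

Writing the sample this way keeps the stochastic node outside the computation graph, so gradients can flow through the mean and log-variance.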
ILLUSTRATIVE EXAMPLES
The generic autoencoder model is composed of an encoder and a decoder. The learned feature is the encoded coefficient. For a nonlinearly separable data set, a network with many more hidden neurons is usually built (i.e., an overcomplete representation). For high-dimensional examples, the dimension can be reduced by training a network with fewer hidden neurons. Therefore, an autoencoder can be viewed as a dimension-reduction technique. The network with nonlinear activation plays a similar role to that of nonlinear mappings, such as locally linear embedding and the Laplacian representation.

FIGURE 6. The scatter map of the two-dimensional feature learned by (a) the autoencoder and (b) PCA.
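To make the dimension-reduction analogy concrete, the linear baseline of Figure 6(b) can be sketched as power-iteration PCA, which projects data onto the leading principal direction; an undercomplete autoencoder replaces this fixed linear projection with a learned, possibly nonlinear one. The toy data below are hypothetical:

```python
def pca_project_1d(X, iters=200):
    # Leading principal direction via power iteration on the
    # covariance matrix (pure-Python sketch, no dependencies).
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mu[j] for j in range(d)] for row in X]
    v = [1.0] * d
    for _ in range(iters):
        s = [sum(r[j] * v[j] for j in range(d)) for r in Xc]           # Xc v
        w = [sum(s[i] * Xc[i][j] for i in range(n)) / n for j in range(d)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    # Scores: projection of each centered sample onto the direction.
    return [sum(r[j] * v[j] for j in range(d)) for r in Xc], v

# Hypothetical toy data whose variance is dominated by the first axis.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 0.1], [-2.0, -0.1]]
scores, direction = pca_project_1d(X)
```

On such data the recovered direction aligns with the first axis, exactly the kind of linear structure PCA captures and an autoencoder generalizes.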
∇_{b^(l)} J(W, b_h, b_v) = δ^(l+1).
THE DEEP MODEL
A deep architecture is formed by the composition of multiple levels of representations. Likewise, the basic autoencoder can be concatenated to build the deep-model stacked autoencoder. It is composed of a visible layer, several hidden layers, and an output layer. The output of each previous layer serves as the input of the following one.

Due to its representation power, the stacked autoencoder has been widely used in earlier works. Chen et al. build a deep learning architecture by stacking autoencoders, with which useful high-level features can be learned from hyperspectral data [28]. Geng et al. refine the hand-engineered features by a contractive neural network.
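The stacking described above can be sketched as a composition of trained encoder layers, where the output of one layer is fed to the next. The weights below are hypothetical placeholders; in practice they would come from layer-wise pretraining:

```python
import math

def sigmoid_vec(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def encode_layer(W, b, x):
    # One encoder layer: sigma(W x + b).
    return sigmoid_vec([sum(w * v for w, v in zip(row, x)) + bi
                        for row, bi in zip(W, b)])

def stacked_encode(layers, x):
    # Each entry of `layers` is one trained encoder (W, b); the output
    # of every previous layer is the input of the next one.
    h = x
    for W, b in layers:
        h = encode_layer(W, b, h)
    return h

# Hypothetical 4 -> 3 -> 2 stack.
layers = [
    ([[0.2, -0.1, 0.4, 0.0], [0.1, 0.3, -0.2, 0.5], [0.0, 0.2, 0.1, -0.3]],
     [0.0, 0.1, -0.1]),
    ([[0.5, -0.4, 0.2], [0.3, 0.1, -0.2]], [0.0, 0.0]),
]
h = stacked_encode(layers, [0.9, 0.1, 0.4, 0.7])
```

The final h is the deep feature that a classifier or decoder would consume.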
FIGURE 10. The center (a) 48 × 48-, (b) 56 × 56-, (c) 64 × 64-, (d) 72 × 72-, (e) 80 × 80-, (f) 88 × 88-, (g) 96 × 96-, and (h) 104 × 104-pixel
patches generated from the original image.
ACTIVATION
The activation function is a nonlinear mapping. It projects the encodings or decodings into a certain range, e.g., (0, 1), and converts a linear encoder or decoder into a nonlinear one. Typical activations include the sigmoid function, the hyperbolic tangent, and the ReLU:
◗◗ sigmoid: σ(x) = 1/(1 + exp(−x))
◗◗ hyperbolic tangent (tanh): σ(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x))
◗◗ ReLU: σ(x) = max(x, 0).
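The three activations can be written directly in Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))        # range (0, 1)

def tanh_act(x):
    # (exp(x) - exp(-x)) / (exp(x) + exp(-x)), range (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    return max(x, 0.0)                       # range [0, inf)
```

The bounded sigmoid and tanh squash large inputs, while the ReLU passes positive inputs unchanged; this difference drives the experimental contrasts reported below.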
To study the effect of activation, a set of experiments is performed. We set the size of the visible layer as 96 × 96 pixels and vary the units of the hidden layer from 400 to 1,600. Figure 13 draws the recognition accuracy across the size of the hidden layer. Three activation functions are compared.

FIGURE 12. The recognition accuracy across the number of hidden units.
The results demonstrated in Figure 13(a) and (b) are slightly different. When the cross-entropy function is used to measure the deviation, the ReLU function produces the best performance. The accuracies gradually decrease as the number of hidden units increases. The sigmoid function generates a much poorer performance, while the hyperbolic tangent

To study the impact of the loss function, a set of experiments is pursued. We set the size of the visible layer to 96 × 96 and change the number of hidden units from 400 to 1,600. The experimental results are shown in Figure 14, where two loss functions are compared.

FIGURE 13. A comparison of the activation functions with the loss function of (a) cross entropy and (b) the MSE.
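For reference, the two reconstruction losses compared in this experiment, the MSE and the (binary) cross entropy, can be sketched as follows for inputs and reconstructions in [0, 1]; the example vectors are hypothetical:

```python
import math

def mse_loss(x, z):
    # Mean-squared reconstruction error.
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / len(x)

def cross_entropy_loss(x, z, eps=1e-12):
    # Elementwise binary cross entropy; assumes x, z in [0, 1].
    return -sum(xi * math.log(zi + eps) + (1 - xi) * math.log(1 - zi + eps)
                for xi, zi in zip(x, z)) / len(x)

x = [0.0, 1.0, 0.5]   # hypothetical target
z = [0.1, 0.8, 0.5]   # hypothetical reconstruction
```

The MSE penalizes deviations quadratically, whereas the cross entropy penalizes confident wrong reconstructions much more sharply, which is one reason the two losses interact differently with each activation.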
OPTIMIZER
Having determined the architecture of the neural network, the next problem is to solve for the model parameters (i.e., the weights and biases) by an optimization scheme. The most popularly used method is the gradient descent optimization algorithm. In the deep learning community, many variants of gradient descent have been presented. Representatives include SGD, RMSprop, Adagrad, Adadelta, adaptive moment estimation (Adam), Adamax, and Nesterov-accelerated adaptive moment estimation (Nadam). A comprehensive review of optimization can be found in [104]. To verify the performance of these algorithms, we pursue a set of experiments. The results are reported as statistics (i.e., the minimum, median, maximum, and 25th and 75th percentiles) of ten sample runs, drawn in Figure 15.
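To illustrate how these variants differ, the sketch below contrasts a plain SGD step with the Adam update on a single parameter. The quadratic objective is a toy stand-in for the reconstruction loss, and the hyperparameter values are the commonly used defaults rather than those of the experiments reported here:

```python
import math

def sgd_step(w, grad, lr=0.01):
    # Vanilla gradient descent: move against the gradient.
    return w - lr * grad

def adam_step(w, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first- and second-moment estimates.
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    state[:] = [m, v, t]
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy objective f(w) = (w - 3)^2, gradient 2(w - 3).
w_sgd, w_adam, state = 0.0, 0.0, [0.0, 0.0, 0]
for _ in range(5000):
    w_sgd = sgd_step(w_sgd, 2 * (w_sgd - 3), lr=0.05)
    w_adam = adam_step(w_adam, 2 * (w_adam - 3), state, lr=0.01)
```

SGD scales the step by the raw gradient, while Adam normalizes it by a running estimate of the gradient's second moment, which is what makes its effective step size adaptive.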
FIGURE 14. A comparison of the loss functions with the activation of (a) the hyperbolic tangent and (b) the ReLU.

As Figure 15 shows, the family of optimization algorithms demonstrates sharp differences in recognition performance. The best performance is obtained by the Adam optimizer, as widely studied in prior works. The poorest performance is generated by Adadelta, which produces a nearly 20% drop in recognition accuracy. The result conforms to the prior study [104]. The recognition accuracies produced by the SGD, Adadelta, and Adam optimizers are much more robust than those of the remaining algorithms. The fluctuation of the recognition rate produced by Nadam is more drastic than that of the remaining algorithms. Therefore, it can be concluded
FIGURE 16. A comparison of the autoencoder with and without the sparse constraint. (a) The hyperbolic tangent activation and cross-entropy loss and (b) the sigmoid activation and MSE loss. (Horizontal axis: desired average activation values.)

FIGURE 17. A comparison of the autoencoder with and without a denoising trick. (a) A 600-hidden-node network and (b) a 1,200-hidden-node network. (Horizontal axis: corruption level.)
FIGURE 19. A comparison of the autoencoder with and without a Jacobian constraint. (a) A hyperbolic tangent activation and (b) a sigmoid activation. (Horizontal axis: number of hidden units.)

FIGURE 20. The recognition performance obtained using a variational autoencoder. (Horizontal axis: dimension of the representation.)

FIGURE 21. The recognition accuracy obtained using a CAE and a CNN. Three kinds of convolutional kernels, 3 × 3, 5 × 5, and 7 × 7, are tested.
ACKNOWLEDGMENT
This work was supported by the National Science Fund for Distinguished Young Scholars of China under Grant 61525105. We would like to thank the associate editor and the anonymous reviewers for their great contributions to this article. Dr. Ganggang Dong would like to thank Dr. Zhouhan Lin for his kind help.

AUTHOR INFORMATION
Ganggang Dong ([email protected]) received his M.S. and Ph.D. degrees in information and communication engineering from the National University of Defense Technology, Changsha, China, in 2012 and 2016, respectively. Since 2014, he has authored more than 20 scientific papers in peer-reviewed journals and conferences, including IEEE Transactions on Image Processing, IEEE Transactions on Geoscience and Remote Sensing, IEEE Geoscience and Remote Sensing Magazine, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Geoscience and Remote Sensing Letters, and IEEE Signal Processing Letters. His research interests include the applications of compressed sensing and sparse representation, pattern recognition, manifold learning, and deep neural networks.

Guisheng Liao ([email protected]) received his B.S. degree from Guangxi University, Nanning, China, in 1985 and his M.S. and Ph.D. degrees from Xidian University, Xi'an, China, in 1990 and 1992, respectively. He is a professor with Xidian University, where he is also dean of the School of Electronic Engineering. He has been a senior visiting scholar with the Chinese University of Hong Kong. His research interests include synthetic-aperture radar (SAR), space–time adaptive processing, SAR ground moving target indication, and distributed small satellite SAR system design. He is a member of the National Outstanding Person and the Cheung Kong Scholars in China.

Hongwei Liu ([email protected]) received his B.Eng. degree in electronic engineering from the Dalian University of Technology, China, in 1992 and his M.Eng. and Ph.D. degrees in electronic engineering from Xidian University, Xi'an, China, in 1995 and 1999, respectively. He is currently director and a professor with the National Laboratory of Radar Signal Processing, Xidian University. His research interests include radar automatic target recognition, radar signal processing, and adaptive signal processing.

Gangyao Kuang ([email protected]) received his B.S. and M.S. degrees from the Central South University of Technology, Changsha, China, in 1991 and 1998, respectively, and his Ph.D. degree from the National University of Defense Technology, Changsha, in 1995. He is currently a professor and director of the Remote Sensing Information Processing Laboratory, National University of Defense Technology.

REFERENCES
[1] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[2] A. Romero, C. Gatta, and G. Camps-Valls, "Unsupervised deep feature extraction for remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1349–1362, Mar. 2016.
[3] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2175–2184, Apr. 2015.
[4] Y. Li, X. Huang, and H. Liu, "Unsupervised deep feature learning for urban village detection from high-resolution remote sensing images," ISPRS J. Photogrammetry Remote Sens., vol. 83, no. 8, pp. 567–579, Aug. 2017.
[5] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[6] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[7] G. Hinton and R. Zemel, "Autoencoders, minimum description length and Helmholtz free energy," in Proc. 6th Int. Conf. Neural Information Processing Systems (NIPS), 1994, pp. 3–10.
[8] N. L. Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, June 2008.
[9] S. Lawrence, C. Giles, A. C. Tsoi, and A. Back, "Face recognition: A convolutional neural-network approach," IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 98–113, Jan. 1997.
[10] D. T. Grozdic and S. T. Jovicic, "Whispered speech recognition using deep denoising autoencoder and inverse filtering," IEEE/ACM Trans. Audio, Speech, Language Processing, vol. 25, no. 12, pp. 2313–2322, Dec. 2017.
[11] Y. Dai and G. Wang, "Analyzing tongue images using a conceptual alignment deep autoencoder," IEEE Access, vol. 6, no. 3, pp. 1137–1145, Mar. 2018.
[12] D. Park, Y. Hoshi, and C. C. Kemp, "A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 1544–1551, July 2018.
[13] J. Yu, C. Hong, Y. Rui, and D. Tao, "Multitask autoencoder model for recovering human poses," IEEE Trans. Ind. Electron., vol. 65, no. 6, pp. 5060–5068, June 2018.
[14] M. Ma and C. S. X. Chen, "Deep coupling autoencoder for fault diagnosis with multimodal sensory data," IEEE Trans. Ind. Informat., vol. 14, no. 3, pp. 1137–1145, Mar. 2018.
[15] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 53, pp. 4238–4249, Aug. 2015.