Liu 2017
Bing Liu, Xuchu Yu, Pengqiang Zhang, Xiong Tan, Anzhu Yu & Zhixiang Xue
To cite this article: Bing Liu, Xuchu Yu, Pengqiang Zhang, Xiong Tan, Anzhu Yu & Zhixiang Xue
(2017) A semi-supervised convolutional neural network for hyperspectral image classification,
Remote Sensing Letters, 8:9, 839-848, DOI: 10.1080/2150704X.2017.1331053
REMOTE SENSING LETTERS, 2017
VOL. 8, NO. 9, 839–848
https://doi.org/10.1080/2150704X.2017.1331053
1. Introduction
Hyperspectral remote sensing has become a research focus in remote sensing as the
spectral resolution of remote sensors continues to improve. Classification is an
important topic in hyperspectral image processing and application; its purpose is to
assign a unique label to each pixel in an image. Hyperspectral images consist of several
hundred narrow contiguous wavelength bands and can provide a wealth of spectral and
spatial information for classification. At the same time, the complex structure of
hyperspectral images makes feature extraction difficult. Given the complex data
structures and limited labeled samples, the classification of hyperspectral remote
sensing images still faces great challenges.
In the early stage of hyperspectral image classification, several types of discriminant
functions were applied, such as nearest neighbor, decision trees and linear functions.
However, the main problem of these classic classifiers is their sensitivity to the Hughes
effect (Bioucas-Dias et al. 2013). Support vector machines (SVM) with kernel methods were
then introduced to deal with the Hughes phenomenon and remained the mainstream
classification method for a long time (Camps-Valls et al. 2005). Meanwhile, extreme learning
machines (Li et al. 2015), active learning (Sun et al. 2015), sparse representation (Liu et al.
2013) and other classifiers for hyperspectral images were investigated to obtain higher
classification accuracy.
More recently, deep learning-based methods have drawn increasing attention in remote
sensing image analysis (Li et al. 2016). The stacked autoencoder (SAE) is a simple deep
learning method and was first introduced into hyperspectral image classification by Chen
et al. (2014). Later, a series of improved hyperspectral image classification methods based
on SAE were proposed to obtain better performance (Ma, Wang, and Geng 2016). The
convolutional neural network (CNN) was first used to extract spectral features for
hyperspectral image classification, where it outperformed SVM (Hu et al. 2015). CNNs were
then used to extract spatial-spectral features for hyperspectral image classification with
excellent performance (Yue et al. 2015; 2016; Ghamisi, Chen, and Zhu 2016). In general, deep
learning methods have a large number of parameters to be tuned. Consequently, the majority
of deep learning-based methods for hyperspectral image classification yield promising
results only when the number of labeled training samples is sufficiently large. A virtual
sample enhanced method was proposed to tackle the problem of limited labeled samples (Chen
et al. 2016). A pixel-pair method was also proposed to significantly increase the number of
training samples, which allows the advantages of CNN to be exploited (Li et al. 2016). Many
semi-supervised algorithms (Tuia and Gustavo 2009; Muñoz-Marí et al. 2010) have demonstrated
that unlabeled data can improve classification performance. However, current deep
learning-based hyperspectral image classification methods do not take good advantage of
the enormous amount of available unlabeled data.
The main goal of this letter is to deal with the complex data structures and limited
labeled samples in hyperspectral images. In more detail, the main contributions of this letter
can be summarized as follows. 1) A CNN architecture is designed to directly extract
spatial-spectral features from the hyperspectral image cube. 2) The ladder network is
introduced into the CNN architecture in order to make the network suitable for
semi-supervised learning. 3) In order to deal with limited labeled samples, the CNN is
trained by a semi-supervised method that minimizes the sum of the supervised and
unsupervised cost functions.
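Figure 1. Architecture of the semi-supervised CNN. $\tilde{x} \to \tilde{z}^{(1)} \to \tilde{z}^{(2)} \to \tilde{z}^{(3)} \to \tilde{y}$ is the corrupted
encoder, $x \to z^{(1)} \to z^{(2)} \to z^{(3)} \to y$ is the clean encoder, and $\tilde{y} \to \hat{z}^{(3)} \to \hat{z}^{(2)} \to \hat{z}^{(1)} \to \hat{x}$ is the decoder.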
In equation (1), $h^{(0)} = x$, $y = z^{(4)}$, and $N_B(x_i) = (x_i - \hat{\mu}_{x_i}) / \hat{\sigma}_{x_i}$ is the component-wise batch
normalization. $x_i$ is a component of $W^{(l)} h^{(l-1)}$. $\hat{\mu}_{x_i}$ and $\hat{\sigma}_{x_i}$ are the mean and standard deviation of the
minibatch, respectively. $W^{(l)}$ is the weight matrix between layer $l$ and layer $l - 1$. $\gamma^{(l)}$
and $\beta^{(l)}$ are the trainable parameters. $\phi(x)$ is the softmax activation function for the output
layer and the rectified linear unit (ReLU) activation function for the other layers.
We choose the $K \times K \times B$ neighborhood of a pixel as the input of the network, where $B$ is the
number of hyperspectral image bands. $W^{(1)}$ is a $3 \times 3 \times B_1$ kernel with a stride of 1 for the
convolutional layer. $W^{(2)}$ is a $3 \times 3 \times B_1$ kernel with a stride of 2 for the pooling layer. $B_1$ is the
number of output bands of the convolutional and pooling layers. Pooling is achieved by
convolution with $W^{(2)}$ in order to reduce the dimensionality of the intermediate
representations. After the pooling layer, the features are flattened so that they can be
fed to the fully connected layer.
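The neighborhood input described above can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: the function name `extract_patches`, the toy cube and the reflect padding at the image border are hypothetical choices, since the letter does not say how border pixels are handled.

```python
import numpy as np

def extract_patches(cube, coords, k=9):
    """Return an (N, k, k, B) array of k x k x B neighborhoods centered on coords.

    cube   : (H, W, B) hyperspectral image
    coords : iterable of (row, col) pixel positions
    Reflect padding at the borders is an assumption, not taken from the letter.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that the patch has a center pixel")
    r = k // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    # Pixel (i, j) of the cube sits at (i + r, j + r) in the padded array,
    # so its k x k neighborhood starts at (i, j) in padded coordinates.
    patches = [padded[i:i + k, j:j + k, :] for i, j in coords]
    return np.stack(patches)

rng = np.random.default_rng(0)
toy_cube = rng.random((20, 15, 103))          # random stand-in, not real data
patches = extract_patches(toy_cube, [(0, 0), (10, 7), (19, 14)], k=9)
print(patches.shape)  # (3, 9, 9, 103)
```

Each patch keeps the labeled pixel at its center, so the label of the center pixel can serve as the label of the whole patch.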
The corrupted encoder is formulated similarly to the clean encoder. In more detail,
each layer in the corrupted encoder is formulated as

$$
\begin{aligned}
\tilde{z}_{pre}^{(l)} &= W^{(l)} \tilde{h}^{(l-1)}, & l &= 1, 2, 3, 4 \\
\tilde{z}^{(l)} &= N_B(\tilde{z}_{pre}^{(l)}) + n^{(l)}, & l &= 1, 2, 3 \\
\tilde{h}^{(l)} &= \phi(\gamma^{(l)} (\tilde{z}^{(l)} + \beta^{(l)})), & l &= 1, 2, 3
\end{aligned}
\tag{2}
$$

in which $\tilde{h}^{(0)} = \tilde{x} = x + n^{(0)}$, $n^{(l)} \sim N(0, \sigma^2)$ $(l = 0, 1, 2, 3)$ is the Gaussian noise,
$\tilde{y} = N_B(\tilde{z}_{pre}^{(4)})$, and the other parameters are the same as for the clean encoder. We need to
collect $\tilde{z}_{pre}^{(l)}$ to calculate the unsupervised cost.
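One hidden layer of equation (2) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' TensorFlow code: it uses a fully connected layer instead of a convolution, stores samples as rows (so $W h$ becomes `h @ W`), and the names `encoder_layer`, `W1`, `gamma1` and `beta1` are hypothetical. Setting `sigma=0` skips the noise and gives the corresponding clean layer.

```python
import numpy as np

def batch_norm(z_pre, eps=1e-8):
    """Component-wise N_B: subtract the minibatch mean, divide by the minibatch std."""
    mean = z_pre.mean(axis=0)
    std = z_pre.std(axis=0)
    return (z_pre - mean) / (std + eps)

def relu(x):
    return np.maximum(x, 0.0)

def encoder_layer(h_prev, W, gamma, beta, sigma=0.0, rng=None):
    """One hidden layer of equation (2); sigma=0 gives the clean encoder layer."""
    z_pre = h_prev @ W                                  # z_pre = W h (samples are rows)
    z = batch_norm(z_pre)
    if sigma > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        z = z + rng.normal(0.0, sigma, size=z.shape)    # add n ~ N(0, sigma^2)
    h = relu(gamma * (z + beta))
    return z_pre, z, h

rng = np.random.default_rng(0)
x = rng.random((32, 50))                  # toy minibatch: 32 samples, 50 features
W1 = rng.normal(0.0, 0.1, (50, 20))       # hypothetical layer weights
gamma1, beta1 = np.ones(20), np.zeros(20)
sigma = 0.1                               # standard deviation, i.e. sigma^2 = 0.01

_, z_clean, h_clean = encoder_layer(x, W1, gamma1, beta1)
x_tilde = x + rng.normal(0.0, sigma, size=x.shape)      # corrupted input x + n^(0)
z_pre_tilde, z_tilde, h_tilde = encoder_layer(x_tilde, W1, gamma1, beta1, sigma, rng)
```

The clean and corrupted passes share `W1`, `gamma1` and `beta1`; only the injected noise differs.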
layer is trained by minimizing the difference between $z^{(l)}$ and $\hat{z}^{(l)}$. The reconstruction
$\hat{z}^{(l)}$ is calculated from $\hat{z}^{(l+1)}$, i.e. $\hat{z}^{(l)} = g(\hat{z}^{(l+1)})$, where $g(\cdot)$ is the reconstruction function.
$\hat{z}^{(l+1)}$ therefore needs to preserve many details in order to reconstruct $\hat{z}^{(l)}$ with small
error. However, supervised training makes the network focus on classification. This is
the contradiction between supervised learning and unsupervised learning: unsupervised
learning requires the retention of sufficient detail to reconstruct the original
observation, whereas supervised learning only requires the retention of information
useful for classification (Rasmus et al. 2015).
The ladder network proposed by Valpola (2015) adds a skip connection between
each layer of the encoder and the decoder. With this skip connection, the
reconstruction $\hat{z}^{(l)}$ is calculated from both $\hat{z}^{(l+1)}$ and $\tilde{z}^{(l)}$, i.e. $\hat{z}^{(l)} = g(\hat{z}^{(l+1)}, \tilde{z}^{(l)})$, where $g(\cdot)$
can be treated as a denoising function. This serves three purposes. Firstly, it allows
the network to focus on abstract invariant features at the higher levels (Rasmus, Raiko,
and Valpola 2014). Secondly, it makes the network more robust to noise by learning the
denoising function. Thirdly, such skip connections make it possible for higher levels of
the network to leave some of the details for lower levels to represent, which makes the
network a good fit for semi-supervised learning (Rasmus et al. 2015a; Pezeshki et al.
2016; Rasmus et al. 2015b). All of these help to improve classification accuracy,
so we apply the ladder network to our network.
The added noise follows a Gaussian distribution, so we follow Rasmus et al. (2015) and
choose the parametrization that supports the optimal denoising of Gaussian latent
variables. The corrupted layer variable $\tilde{z}^{(l)}$ has the form $\tilde{z}^{(l)} = z^{(l)} + n^{(l)}$, where $z^{(l)}$ is
the variable of clean encoder layer $l$ and has a Gaussian distribution with
variance $\sigma_z^2$, and $n^{(l)}$ is the Gaussian noise with variance $\sigma_n^2$. The goal of the denoising
function is to learn to estimate $\hat{z}^{(l)}$ from $\tilde{z}^{(l)}$ by minimizing the difference between $\hat{z}^{(l)}$ and
$z^{(l)}$. When the functional form of $\hat{z}^{(l)} = g(\tilde{z}^{(l)})$ is linear, the denoising cost can be minimized.
Specifically, the result can be described by a weight $v^{(l)}$ and a prior $\mu^{(l)}$, as shown
in equation (3). $v^{(l)}$ and $\mu^{(l)}$ are denoising parameters to be trained.
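The linear denoising of a Gaussian variable can be checked numerically. The sketch below assumes that equation (3) takes the form used by Rasmus et al. (2015b), $\hat{z} = (\tilde{z} - \mu) v + \mu$, for which the cost-minimizing weight is $v = \sigma_z^2 / (\sigma_z^2 + \sigma_n^2)$. The numerical values of `mu`, `sigma_z` and `sigma_n` are illustrative only and do not come from the letter.

```python
import numpy as np

def linear_denoise(z_tilde, v, mu):
    """Linear denoiser: z_hat = (z_tilde - mu) * v + mu."""
    return (z_tilde - mu) * v + mu

rng = np.random.default_rng(0)
mu, sigma_z, sigma_n = 0.5, 1.0, 0.5                     # illustrative values only
z = rng.normal(mu, sigma_z, size=200_000)                # clean latent variable
z_tilde = z + rng.normal(0.0, sigma_n, size=z.shape)     # corrupted version

# Weight that minimizes the expected squared denoising error for Gaussian z and n.
v_opt = sigma_z**2 / (sigma_z**2 + sigma_n**2)

def cost(v):
    """Empirical denoising cost: mean squared difference between z_hat and z."""
    return np.mean((linear_denoise(z_tilde, v, mu) - z) ** 2)

for v in (0.0, 0.5, v_opt, 1.0):
    print(f"v = {v:.2f}  cost = {cost(v):.4f}")
```

With `v = 1` the denoiser trusts the corrupted value completely and with `v = 0` it falls back on the prior; the optimal weight lies in between and moves toward the prior as the noise variance grows.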
Furthermore, we assume that the latent variables are independent conditional on the
latent variables of the layer above. The final formulation of the denoising function is
shown in equation (4), where $V^{(l+1)}$ is the weight matrix between layer $l + 1$ and
layer $l$ of the decoder and has the same dimension as the transpose of $W^{(l+1)}$; $\hat{z}_i^{(l)}$, $\tilde{z}_i^{(l)}$ and $u_i^{(l)}$
are the components of $\hat{z}^{(l)}$, $\tilde{z}^{(l)}$ and $u^{(l)}$, respectively; $\mu_i(\cdot)$ and $v_i(\cdot)$ are functions of $u^{(l)}$;
$\mathrm{sigmoid}(x) = (1 + e^{-x})^{-1}$ is the sigmoidal function; and $a_{1,i}^{(l)}, \ldots, a_{10,i}^{(l)}$ are the skip
connection parameters to be trained.
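A minimal sketch of such a denoising function is given below. It assumes that equation (4) follows the parametrization of Rasmus et al. (2015b), in which $\mu_i$ and $v_i$ are each a sigmoid term plus a linear term in $u_i$, using five of the ten parameters each. The function name `denoise`, the parameter layout `a` and the parameter values are hypothetical, and `u` is a random stand-in for the signal coming from the layer above through $V^{(l+1)}$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoise(z_tilde, u, a):
    """Denoising function g(z_tilde, u) with ten per-unit parameters a[0..9].

    z_tilde, u : arrays of shape (batch, units)
    a          : array of shape (10, units)
    Assumed form: z_hat = (z_tilde - mu(u)) * v(u) + mu(u).
    """
    mu = a[0] * sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
    v = a[5] * sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
    return (z_tilde - mu) * v + mu

rng = np.random.default_rng(0)
z_tilde = rng.normal(size=(4, 6))     # corrupted activations (toy values)
u = rng.normal(size=(4, 6))           # stand-in for the top-down signal

# Illustrative parameter setting, not the authors' initialization:
# mu = 0 and v = 1, so the denoiser simply passes z_tilde through.
a_identity = np.zeros((10, 6))
a_identity[9] = 1.0
z_hat = denoise(z_tilde, u, a_identity)

# Opposite extreme: v = 0 and mu = 0.3, so the output ignores z_tilde.
a_prior = np.zeros((10, 6))
a_prior[4] = 0.3
z_hat_prior = denoise(z_tilde, u, a_prior)
```

The two parameter settings show the range the skip connection covers: the output can rely entirely on the lateral input $\tilde{z}^{(l)}$, entirely on the top-down prior, or on a learned mixture of both.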
3. Experimental results
The proposed semi-supervised CNN (SS-CNN) is implemented using the TensorFlow library.
The results are generated on a PC equipped with a 2.7 GHz Intel Core i7-5700HQ, an
Nvidia GeForce GTX 970M and 32 GB of memory. The University of Pavia data set,
consisting of 103 spectral bands with 610 × 340 pixels, is employed to evaluate the
performance of SS-CNN. There are 42,776 labeled pixels in nine classes in the University
of Pavia data set. For each class, 200 labeled samples are randomly selected for supervised
training, and all remaining samples are used, with their labels withheld, for testing. The
numbers of labeled training samples and testing samples are listed in Table 1. The results
obtained with different proportions of unlabeled data are shown in Figure 3, which
demonstrates that using a large amount of unlabeled samples can improve the classification
accuracy, whereas using only a small amount of unlabeled samples leads to overfitting and
lower classification accuracy. We therefore utilize all unlabeled samples to train the network.
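The per-class sampling described above can be sketched as follows. The function `split_labeled` is a hypothetical helper, and the label vector is a random stand-in with the same number of labeled pixels and classes as the data set, not the real ground truth.

```python
import numpy as np

def split_labeled(labels, per_class=200, seed=0):
    """Pick `per_class` random training indices for every class; the rest are test indices."""
    rng = np.random.default_rng(seed)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if idx.size < per_class:
            raise ValueError(f"class {c} has only {idx.size} labeled samples")
        train.append(rng.choice(idx, size=per_class, replace=False))
    train = np.sort(np.concatenate(train))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test

# Random stand-in labels: 42,776 labeled pixels in nine classes (1..9).
toy_labels = np.random.default_rng(1).integers(1, 10, size=42776)
train_idx, test_idx = split_labeled(toy_labels, per_class=200)
print(train_idx.size, test_idx.size)  # 1800 40976
```

With 200 samples for each of the nine classes, 1,800 pixels are used for supervised training and the remaining 40,976 labeled pixels are left for testing.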
The input of the SS-CNN for the University of Pavia data set is the $9 \times 9 \times 103$ ($K = 9$, $B = 103$)
neighborhood of a pixel. Selecting a larger neighborhood as the input gives better accuracy,
but the network then needs more time to train. The results with different neighborhood sizes
are listed in Table 2. Considering that there is strong correlation between different bands of
a hyperspectral image and that the labeled samples are limited, $B_1$ is set to 80, smaller than
$B$, in order to decrease the number of parameters. In addition, we experiment with different
values of $B_1$ and find that the effect of $B_1$ on classification accuracy is relatively small when
the value of $B_1$ (e.g. 60, 80, 120, 200) is large enough. $\sigma^2$ (the variance of the Gaussian noise) is
set to 0.01. We also test various values of $\sigma^2$, namely 0.5, 0.1, 0.05, 0.01, 0.005 and 0.001, for
which the corresponding accuracies (%) are 85.43, 90.14, 97.87, 98.32, 98.29 and 98.31,
respectively. A larger $\sigma^2$ reduces the classification accuracy, so $\sigma^2$ should be set relatively
small compared with the pixel values of the hyperspectral image. Note that the hyperspectral
image is scaled to the range [0, 1] before training the network. The learning rate is set to
0.001 empirically. The number of epochs is set to 40. $\lambda_l$ is the weight of the unsupervised
loss for layer $l$. We fix $\lambda_2 = 1.0$, $\lambda_3 = 1.0$, $\lambda_4 = 1.0$ and increase the value of $\lambda_1$ (e.g. 1.0, 10.0,
100.0). The classification accuracy (e.g. 96.81%, 97.47%, 97.96%) increases with $\lambda_1$, which
reveals that the unsupervised loss of the lower layer contributes more to classification
performance. Consequently, we set $\lambda_1 = 10$, $\lambda_2 = 1$, $\lambda_3 = 0.1$, $\lambda_4 = 0.1$ and obtain promising
results. The classification accuracies of SS-CNN with and without batch normalization are
shown in Figure 4, which demonstrates that batch normalization accelerates convergence
and improves classification accuracy.
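The way the layer weights $\lambda_l$ enter the training objective can be sketched as below. This is an illustration under assumptions: the supervised cost is taken to be the cross-entropy and each unsupervised cost the mean squared difference between $z^{(l)}$ and $\hat{z}^{(l)}$, as in Rasmus et al. (2015b); the exact cost terms are not restated in this section, and the function names and toy arrays are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs is (N, C) softmax output, labels is (N,) ints."""
    n = labels.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

def total_cost(probs, labels, z_clean, z_hat, lambdas):
    """Supervised cost plus the lambda-weighted reconstruction cost of each layer."""
    supervised = cross_entropy(probs, labels)
    unsupervised = sum(
        lam * np.mean((z - zh) ** 2)
        for lam, z, zh in zip(lambdas, z_clean, z_hat)
    )
    return supervised + unsupervised

lambdas = (10.0, 1.0, 0.1, 0.1)       # lambda_1 .. lambda_4 as chosen in the text

# Toy values: two labeled samples, two classes, four layers of width 5.
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
z_clean = [np.zeros((2, 5)) for _ in lambdas]
z_hat = [np.zeros((2, 5)) for _ in lambdas]
z_hat[0] = z_hat[0] + 0.1             # reconstruction error only in the first layer

cost_value = total_cost(probs, labels, z_clean, z_hat, lambdas)
```

Because $\lambda_1$ is the largest weight, a reconstruction error of a given size in the first layer adds 100 times more to the total cost than the same error in the third or fourth layer.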
Figure 4. Classification accuracy for different numbers of training epochs. BN denotes SS-CNN with
batch normalization; Non-BN denotes SS-CNN without batch normalization.
Training and testing times of the different classifiers are shown in Table 4. Batch normalization
can accelerate the convergence. So the number of total trainable parameters is set to be 40 ×
41 in SS-CNN. The number of total trainable parameters is set to be 81,408 and 629,648 in CNN
and CNN-PPF, respectively. Consequently, the training time of SS-CNN is much less than that of
CNN and CNN-PPF. However, owing to the use of the enormous amount of unlabeled samples, training
the network is more time-consuming than SVM and ISODATA-SVM. Note that the training time
Table 3. Class-specific accuracy (%) and overall accuracy (OA) with different techniques.
Class no.   SVM     CNN     S-CNN   S-CNN-LN   ISODATA-SVM   CNN-PPF   SS-CNN
1           86.46   88.38   85.15   87.02      94.16         97.42     97.16
2           90.17   91.27   96.51   97.30      97.62         95.76     98.72
3           85.04   85.88   92.71   93.76      84.33         94.05     96.86
4           96.64   97.24   97.81   98.17      94.88         97.52     99.25
5           99.78   99.91   100.0   100.0      97.92         100.0     100.0
6           94.89   96.41   95.17   95.70      95.27         99.13     98.59
7           95.19   93.62   91.88   92.93      98.72         96.19     98.80
8           85.36   87.45   88.02   89.27      97.94         93.62     96.88
9           99.89   99.57   99.58   99.68      99.89         99.60     100.0
OA          90.62   92.27   93.88   94.72      96.08         96.48     98.32
Figure 5. Experiment on the University of Pavia data set. (a) Ground-truth map. (b) Classification map
obtained by SS-CNN. Legend: Background, Asphalt, Meadows, Gravel, Trees, Sheets, Bare Soil, Bitumen,
Bricks, Shadows. Scale bar: 0 to 200 m.
Figure 6. Classification accuracy for different methods with different number of labeled samples per class.
4. Conclusion
In this letter, a semi-supervised convolutional neural network is constructed for
hyperspectral image classification. The experimental results demonstrate that the
convolutional neural network with the ladder network can effectively extract
spatial-spectral features from the original hyperspectral image cube, and that the
semi-supervised training strategy, which uses the enormous amount of unlabeled samples,
can improve classification accuracy even with a small number of labeled training samples.
However, because so many unlabeled samples are used, training the network is time-consuming.
Acknowledgement
We thank Prof. Paolo Gamba for providing the Pavia data set.
Funding
This work was supported by the State Key Laboratory of Geo-information Engineering under
Grants SKLGIE2015-M-3-1 and SKLGIE2015-M-3-2; the National Natural Science Foundation of China
under Grant 41201477; and the Scientific and Technological Project in Henan Province under
Grant 152102210014.
References
Bioucas-Dias, J. M., A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot. 2013.
“Hyperspectral Remote Sensing Data Analysis and Future Challenges.” IEEE Geoscience and
Remote Sensing Magazine 1 (2): 6–36. doi:10.1109/MGRS.2013.2244672.
Camps-Valls, G., and L. Bruzzone. 2005. “Kernel-Based Methods for Hyperspectral Image Classification.”
IEEE Transactions on Geoscience and Remote Sensing 43 (6): 1351–1362. doi:10.1109/TGRS.2005.846154.
Chen, Y., H. Jiang, L. Chunyang, X. Jia, and P. Ghamisi. 2016. “Deep Feature Extraction and Classification
of Hyperspectral Images Based on Convolutional Neural Networks.” IEEE Transactions on Geoscience
and Remote Sensing 54 (10): 6232–6251. doi:10.1109/TGRS.2016.2584107.
Chen, Y., Z. Lin, X. Zhao, G. Wang, and G. Yanfeng. 2014. “Deep Learning-Based Classification of
Hyperspectral Data.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing 7 (6): 2094–2097. doi:10.1109/JSTARS.2014.2329330.
Ghamisi, P., Y. Chen, and X. X. Zhu. 2016. “A Self-Improving Convolution Neural Network for the
Classification of Hyperspectral Data.” IEEE Geoscience and Remote Sensing Letters 13 (10): 1537–
1541. doi:10.1109/LGRS.2016.2595108.
Hu, W., Y. Huang, L. Wei, F. Zhang, and H. Li. 2015. “Deep Convolutional Neural Networks for
Hyperspectral Image Classification.” Journal of Sensors 2015: 1–12. doi:10.1155/2015/258619.
Ioffe, S., and C. Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift.” Arxiv Preprint Arxiv:1502.03167 2015 37: 448–456.
Li, W., C. Chen, S. Hongjun, and D. Qian. 2015. “Local Binary Patterns and Extreme Learning
Machine for Hyperspectral Imagery Classification.” IEEE Transactions on Geoscience and Remote
Sensing 53 (7): 1–13. doi:10.1109/TGRS.2014.2381602.
Li, W., W. Guodong, F. Zhang, and D. Qian. 2016. “Hyperspectral Image Classification Using Deep Pixel-
Pair Features.” IEEE Transactions on Geoscience and Remote Sensing. doi:10.1109/TGRS.2016.2603190.
Liu, J., W. Zebin, Z. Wei, L. Xiao, and L. Sun. 2013. “Spatial-Spectral Kernel Sparse Representation for
Hyperspectral Image Classification.” IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing 6 (6): 2462–2471. doi:10.1109/JSTARS.2013.2252150.
Ma, X., H. Wang, and J. Geng. 2016. “Spectral-Spatial Classification of Hyperspectral Image Based on Deep
Auto-Encoder.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (9):
4073–4085.
Muñoz-Marí, J., F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls. 2010. “Semisupervised
One-Class Support Vector Machines for Classification of Remote Sensing Data.” IEEE Transactions
on Geoscience and Remote Sensing 48 (8): 3188–3197.
Pezeshki, M., L. Fan, P. Brakel, A. Courville, and Y. Bengio. 2016. “Deconstructing the Ladder
Network Architecture.” In editors C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R.
Garnett International Conference on Machine Learning. 2368–2376.
Rasmus, A., T. Raiko, and H. Valpola. 2014. “Denoising Autoencoder with Modulated Lateral
Connections Learns Invariant Representations of Natural Images.” Arxiv Preprint Arxiv 31(4): 55-63.
Rasmus, A., H. Valpola, and T. Raiko. 2015a. “Lateral Connections in Denoising Autoencoders
Support Supervised Learning.” Computer Science 31 (4): 555–563.
Rasmus, A., M. Berglund, M. Honkala, H. Valpola, and T. Raiko. 2015b. “Semi-Supervised Learning
with Ladder Networks.” In Advances in Neural Information Processing Systems, Montreal, Canada:
Curran Associates, Inc. 3546–3554.
Sun, S., P. Zhong, H. Xiao, and R. Wang. 2015. “Active Learning with Gaussian Process Classifier for
Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 53 (4):
1746–1760. doi:10.1109/TGRS.2014.2347343.
Tuia, D., and C.-V. Gustavo. 2009. “Semisupervised Remote Sensing Image Classification with Cluster
Kernels.” IEEE Geoscience and Remote Sensing Letters 6 (2): 224–228. doi:10.1109/LGRS.2008.2010275.
Valpola, H. 2015. “From Neural PCA to Deep Unsupervised Learning.” Adv. in Independent
Component Analysis and Learning Machines 2015: 143–171.
Yue, J., S. Mao, and L. Mei. 2016. “A Deep Learning Framework for Hyperspectral Image Classification Using
Spatial Pyramid Pooling.” Remote Sensing Letters 7 (9): 875–884. doi:10.1080/2150704X.2016.1193793.
Yue, J., W. Zhao, S. Mao, and H. Liu. 2015. “Spectral-Spatial Classification of Hyperspectral Images
Using Deep Convolutional Neural Networks.” Remote Sensing Letters 6 (6): 468–477.
doi:10.1080/2150704X.2015.1047045.
Yuliya, T., J. A. Benediktsson, and J. Chanussot. 2009. “Spectral-Spatial Classification of
Hyperspectral Imagery Based on Partitional Clustering Techniques.” IEEE Transactions on
Geoscience and Remote Sensing 47 (8): 2973–2987. doi:10.1109/TGRS.2009.2016214.