1. Introduction
Hyperspectral remote sensing, which combines spatial, radiometric and spectral information, is a frontier area of remote-sensing technology. With high spectral resolution (5~10 nm) and a wide spectral range (0.4 μm~2.5 μm), a hyperspectral sensor collects information in dozens or even hundreds of narrow spectral bands. Together, the bands form a continuous and complete spectral curve that covers the electromagnetic spectrum from the visible to the near-infrared wavelengths. A hyperspectral image (HSI) thus integrates the spatial and spectral information of remote-sensing data and supports important remote-sensing applications, e.g., agriculture [
1], environmental monitoring [
2], and physics [
3].
Traditional spectral-based methods such as k-nearest neighbors [
4], multinomial logistic regression (MLR) [
5], and support vector machines (SVM) [
6], tend to treat the raw pixels directly as input. However, because an HSI has a large number of spectral bands, the classifier must operate in a high-dimensional feature space, and the limited number of labeled samples makes it difficult to train a classifier with high accuracy. This problem is known as the curse of dimensionality or the Hughes phenomenon. To tackle this problem, dimensionality reduction such as feature selection [
7] or feature extraction [
8] is a common tactic. Moreover, considering that neighboring pixels probably belong to the same class, another line of research aims at focusing on spatial information. Gu et al. [
9] and Fang et al. [
10] used SVM as a classifier with a multiple kernel learning strategy to process the HSI data and obtained the desired results. In [
11], the original HSI data was fused with multi-scale superpixel segmentation maps and then fed into SVM for processing. Methods of this sort essentially implement feature engineering with the help of spectral–spatial information on the HSI and then create a classification map.
However, the aforementioned approaches can be considered to be traditional feature engineering, which means that the performance depends on the handcrafted features. Furthermore, as these methods belong to shallow models, the generated features should also be regarded as shallow features, which are unable to capture the essential characteristics of the observed object and therefore tend to underperform in sophisticated scenarios [
12].
Due to its impressive ability to automatically extract non-linear hierarchical features, deep learning (DL) has gradually supplanted numerous traditional algorithms in recent years, gaining an overwhelming advantage in many computer vision tasks including object detection [
13], semantic segmentation [
14], and image generation [
15]. Naturally, HSI classification, as a typical classification task, is constantly benefiting from the state-of-the-art deep-learning techniques. Several deep-learning-based methods have been proposed for HSI classification. In [
16], Chen et al. introduced a stacked autoencoder (SAE) to extract abundant features for HSI classification. Zhao et al. [
17] also leveraged a stacked sparse autoencoder to derive more abstract, deeper hierarchical features from spectral vectors, spatial vectors and spectral–spatial vectors. Li et al. [
18] investigated deep belief networks (DBNs) for spectral–spatial features extraction, improving the accuracy of HSI classification. Zhong et al. [
19] improved prior diversity during pre-training and fine-tuning of the DBN model, resulting in improved HSI classification performance.
Among the DL-based methods, the convolutional neural network (CNN) [
20] is the predominant formulation for extracting spectral–spatial features by virtue of its local perception and parameter sharing characteristics. Mei et al. [
21] proposed a CNN model incorporating spectral features with spatial context by computing the mean of the pixel neighborhood and the mean and standard deviation of each spectral band in that neighborhood. Similarly, Lee et al. [
22] presented a contextual deep CNN (CDCNN) for feature extraction. Moreover, Zhao and Du [23] combined a spatial feature extraction process with a spectral feature extraction process based on the CNN model. Concretely, the local discriminative embedding is performed first, followed by feature stacking and classification. Although these methods use various techniques alongside the CNN to extract spectral and spatial information separately, they do not fully leverage the joint spectral–spatial information. Because hyperspectral data can be represented as a 3D cube, 3D convolution over the spectral and spatial dimensions is a natural way to extract the spectral–spatial features of HSI simultaneously [
24,
25]. Furthermore, inspired by deeper networks such as the residual network (ResNet) [
26] and the dense convolutional network (DenseNet) [
27], Zhong et al. [
28] proposed a spectral–spatial residual network (SSRN), which stacks the spectral and spatial residual blocks consecutively. Wang et al. [
29] employed DenseNet in their fast dense spectral–spatial convolution (FDSSC) algorithm.
On the other hand, it is worth noting that different spectral bands and different spatial patches in the HSI cube may make different contributions to feature extraction. Accordingly, there has been a surge of interest in the attention mechanism [
30,
31,
32]. By focusing on important features and suppressing unnecessary features, attention mechanisms can augment model sensitivity to informative spectral bands and spatial positions. Thus, Ma et al. [
33] designed a double-branch multi-attention mechanism network (DBMA), obtaining desirable results. Furthermore, based on DBMA and dual-attention network (DANet) [
34], Li et al. [
35] proposed the double-branch dual-attention mechanism network (DBDA) for HSI classification.
In this paper, inspired by these advanced techniques, we propose an attention-aided spectral–spatial CNN model for hyperspectral image classification. Instead of following the traditional approach of using standard 3D convolution to extract features from HSI, we apply pyramidal convolution, which can extract hierarchical features at different scales. Our new deep model is composed of two branches, which extract spectral and spatial features, respectively. In each branch, pyramidal convolution is first used to exploit abundant multi-scale features, and a recently proposed iterative attention mechanism is then applied to refine the feature maps. The features of the two branches are fused by concatenation or weighted addition. Finally, the fused spectral–spatial features are fed into the fully connected layer to obtain classification results with the SoftMax function. The main contributions of this article are as follows:
- (1) A new double-branch model based on pyramidal 3D convolution is proposed for HSI classification. Two branches can separately extract spatial features and spectral features efficiently.
- (2) A new iterative attention mechanism, expectation-maximization attention (EMA), is introduced to HSI classification. It can refine the feature map by highlighting relevant bands or pixels and suppressing the interference of irrelevant bands or pixels.
- (3) Some effective techniques, such as the new activation function Mish, dynamically varying learning rates and early stopping, are applied in the proposed model and satisfactory results are obtained.
The rest of this paper is organized as follows: In
Section 2, we briefly describe the related work. Our proposed architecture is described in detail in
Section 3. In
Section 4 and
Section 5, we conduct several experiments and analyze the experimental results. Finally, conclusions and future work are presented in
Section 6.
3. Methodology
This section is structured as follows. First, we introduce the framework of the proposed method. Second, two branches respectively focusing on spectral information and spatial information are described in detail. Third, fusion operations of spectral and spatial branches are discussed. Finally, several techniques aimed at boosting the network performance are covered.
3.1. Framework of the Proposed Model
The flowchart in
Figure 4 depicts the proposed model for HSI classification. It consists of two branches: the spectral branch and the spatial branch. Expectation-maximization attention modules are incorporated into both branches to apply attention-based feature refinement. Concatenation or a weighted sum is then used to fuse the features of the two branches. Finally, classification is performed with the SoftMax function.
Concretely, let the HSI data set be $X \in \mathbb{R}^{H \times W \times B}$, where $H$, $W$ and $B$ denote the height and width of the spatial dimensions and the number of spectral bands. Assume that $X$ is composed of $N$ labeled pixels $\{x_1, x_2, \ldots, x_N\}$ and the corresponding category label set is $Y = \{y_1, y_2, \ldots, y_N\}$ with $y_i \in \{1, \ldots, C\}$, where $C$ represents the number of land cover classes. To effectively exploit the inherent information in HSI, a common practice is to form a 3D patch cube with several pixels surrounding the given pixel. In this manner, $X$ can be decomposed into a new data set $Z = \{z_1, z_2, \ldots, z_N\}$ with $z_i \in \mathbb{R}^{p \times p \times B}$, where $p$ is the width of the cubes. If the target pixel is on the edge of the image, the values of adjacent missing pixels are set to zero. Then, $Z$ is randomly divided into training, validation and testing sets denoted by $Z_{train}$, $Z_{val}$ and $Z_{test}$. Accordingly, their corresponding label sets are $Y_{train}$, $Y_{val}$ and $Y_{test}$. For each configuration of the model, the training set is used to optimize the parameters while the validation set is used to supervise the training process and select the best-trained model. Finally, the test set is used to verify the performance of the best-trained model.
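The patch construction and random split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `extract_patch` and `random_split`, the nested-list image layout, the odd patch width and the fixed seed are all hypothetical choices, not details taken from the proposed model.

```python
import random

def extract_patch(image, row, col, p):
    """Return the p x p x B cube centred on (row, col), zero-padded at the borders.

    `image` is a nested list indexed as image[row][col][band]; p is assumed odd.
    """
    height, width, bands = len(image), len(image[0]), len(image[0][0])
    half = p // 2
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < height and 0 <= c < width:
                patch_row.append(list(image[r][c]))
            else:
                patch_row.append([0.0] * bands)  # missing neighbour is set to zero
        patch.append(patch_row)
    return patch

def random_split(n_samples, train_frac, val_frac, seed=0):
    """Shuffle sample indices and split them into train/val/test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = max(1, round(n_samples * train_frac))
    n_val = max(1, round(n_samples * val_frac))
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

# Toy 4 x 5 image with 3 bands; every pixel value encodes its position.
toy = [[[float(10 * r + c + 1)] * 3 for c in range(5)] for r in range(4)]
corner = extract_patch(toy, 0, 0, 3)          # patch around the top-left pixel
train_idx, val_idx, test_idx = random_split(100, 0.03, 0.03)
```

For the corner pixel, the first row and first column of the patch fall outside the image and are therefore filled with zeros, while the centre of the patch holds the target pixel itself.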
3.2. Pyramidal Spectral Branch and Pyramidal Spatial Branch
As shown in
Figure 4, the spectral branch and the spatial branch consist of PyConv and EMA. First, the pyramidal blocks used in two branches will be described in detail.
Generally, a 3D convolutional layer is first applied to perform a feature transformation on the HSI cube in the spectral dimension, reducing the computational overhead. Then, a pyramidal spectral block is attached. As shown in
Figure 5, each layer in the pyramidal convolution consists of three 3D convolution operations with decreasing levels in the spectral dimension, discriminated by blue, yellow and red, respectively. The kernel sizes of the 3D convolution operations in each layer are set to
,
,
, respectively. Furthermore, to make the network powerful and converge rapidly, each convolution is subsequently followed by a batch normalization (BN) layer to regularize and an activation function Mish [
46] to learn a non-linear representation. The number of output channels in each layer is consistent and can be set to
, then the number of the final output of the block can be formulated as:
where
is the number of the output channel of the preceding 3D convolution layer and
actually is the number of 3D convolution kernels. However, since only the spectral dimension of these convolution kernels varies and is never equal to 1, it can be assumed that mainly the spectral information is explored.
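To make the idea of several kernels of different spectral extent more concrete, the following toy sketch applies three 1D kernels to the spectrum of a single pixel and stacks the results as channels. It is a simplified illustration, not the block used in the proposed model: the kernel lengths 7, 5 and 3, the averaging weights and the stacking of outputs as separate channels are assumptions, and the batch normalization and Mish steps are omitted.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding so the output keeps the input length.

    The kernel length is assumed to be odd.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def pyramidal_spectral_layer(spectrum, kernels):
    """Apply every kernel to the same spectrum and stack the results as channels."""
    return [conv1d_same(spectrum, kernel) for kernel in kernels]

# Hypothetical kernel lengths 7, 5 and 3 (decreasing along the spectral axis);
# simple averaging weights stand in for learned parameters.
kernels = [[1.0 / n] * n for n in (7, 5, 3)]
spectrum = [float(b) for b in range(10)]
features = pyramidal_spectral_layer(spectrum, kernels)
```

Each output channel has the same length as the input spectrum, so responses computed over wide and narrow spectral windows stay aligned band by band.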
Similar to the pyramidal spectral block, the pyramidal spatial block is built by leveraging the interspatial relationships of feature maps. As illustrated in
Figure 6, in contrast to the pyramidal spectral block, the kernel size of the pyramidal spatial block changes in the spatial dimension while keeping fixed in the spectral dimension. Moreover, a 3D convolution layer is also applied before to compact the spectral dimension of the HSI cube, which is exhibited in
Figure 4. Again, each layer in the block combines a 3D convolutional layer with a batch normalization layer and a Mish activation function layer. The relationship between the input and output of the pyramidal spatial block follows Equation (7).
3.3. Expectation-Maximization Attention Block
After the pyramidal spectral or spatial block, a 3D convolutional layer is needed to ‘resize’ the intermediate feature maps for subsequent input to the EMA block, which then refines the feature maps. For the same object, the spectral response may vary dramatically across bands. In addition, different positions of the extracted feature maps can provide different semantic information for HSI classification. The performance of HSI classification can be improved if such prior information is properly taken into account, which is why the EMA block is introduced. The two EMA blocks, located in the spectral and spatial branches, share a similar structure. The EMA block located in the spectral branch iterates the attention map along the spectral dimension (denoted as spectral attention), while the EMA block located in the spatial branch iterates the attention map along the spatial dimension (denoted as spatial attention).
As shown in
Figure 7, given an intermediate feature map $F$ as input, a compact base set is initialized with Kaiming’s initialization [47]. Then, attention maps can be generated in the E step and the base set can be updated in the M step, as described in Section 2.3. After a few iterations, with the converged bases and attention maps, a new refined feature map $\tilde{F}$ can be obtained. Instead of outputting $\tilde{F}$ directly, a small factor $\lambda$ is adopted to balance $\tilde{F}$ against $F$: multiplying $\tilde{F}$ by $\lambda$ and then adding it to $F$ yields the final output $F' = F + \lambda \tilde{F}$. This operation stabilizes training, and its effectiveness is confirmed empirically.
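The iteration can be sketched in plain Python as follows. The exact E and M steps are those of Section 2.3; this sketch assumes the commonly used formulation in which the E step is a softmax over inner products between features and bases, and the M step replaces each base with the attention-weighted mean of the features. The function name `em_attention`, the fixed initial bases (instead of Kaiming initialization) and the value of the scaling factor are hypothetical.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def em_attention(features, bases, iterations=3, scale=0.1):
    """Refine `features` (N vectors) with K `bases`; `iterations` must be >= 1.

    Returns (output, bases) with output = features + scale * reconstruction.
    """
    dim = len(features[0])
    for _ in range(iterations):
        # E step: responsibility of every base for every feature vector.
        attention = [softmax([dot(x, mu) for mu in bases]) for x in features]
        # M step: each base becomes the attention-weighted mean of the features.
        new_bases = []
        for k in range(len(bases)):
            weight = sum(row[k] for row in attention)
            new_bases.append([
                sum(row[k] * x[d] for row, x in zip(attention, features)) / weight
                for d in range(dim)
            ])
        bases = new_bases
    # Reconstruct the feature map from the last attention maps and bases.
    refined = [[sum(a * mu[d] for a, mu in zip(row, bases)) for d in range(dim)]
               for row in attention]
    # Residual connection with a small scaling factor.
    output = [[xi + scale * ri for xi, ri in zip(x, r)]
              for x, r in zip(features, refined)]
    return output, bases

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
initial_bases = [[1.0, 0.0], [0.0, 1.0]]
output, bases = em_attention(features, initial_bases, iterations=3, scale=0.1)
```

Because every updated base is a weighted mean of the input features, the bases stay inside the range spanned by the features, and the output keeps the shape of the input feature map.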
The initialization of the bases deserves particular attention. The procedure described above only covers the steps to implement EMA on a single image. However, thousands of images must be processed in the HSI classification task. The spectral feature distribution and spatial feature distribution are distinct for each image, so the bases $\mu$ computed on one image should not serve as the template for all images. In this paper, we choose to run EMA on each image and continually update the initial values of the bases $\mu^{(0)}$ during the training process with the following strategy:
$$\mu^{(0)} \leftarrow \alpha \mu^{(0)} + (1 - \alpha) \mu^{(T)}$$
where $\mu^{(0)}$ represents the initial values of bases, $\mu^{(T)}$ is generated after iterating over an image and $\alpha \in [0, 1]$.
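A minimal sketch of such an update is given below, assuming it takes the form of a moving average between the current initial bases and the bases obtained after the EM iterations on one image. The function name `update_initial_bases` and the momentum value 0.9 are hypothetical.

```python
def update_initial_bases(initial_bases, converged_bases, momentum=0.9):
    """Moving average: new_initial = momentum * initial + (1 - momentum) * converged."""
    return [[momentum * m0 + (1.0 - momentum) * mt
             for m0, mt in zip(base0, base_t)]
            for base0, base_t in zip(initial_bases, converged_bases)]

initial = [[1.0, 0.0], [0.0, 1.0]]
converged = [[0.5, 0.5], [0.2, 0.8]]
updated = update_initial_bases(initial, converged, momentum=0.9)
```

A momentum close to 1 keeps the initial bases stable across images, whereas a momentum close to 0 lets the most recent image dominate.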
3.4. Fusion of Spectral and Spatial Branches
With the aid of the spectral branch and spatial branch, multiple feature maps are generated. The remaining question is how to fuse them to obtain a desirable classification result. Generally, there are two options: addition or concatenation. For addition, the spatial features and spectral features are summed with weights that are adjusted by back-propagation during the training process. Both fusion operations are evaluated and the results are detailed in
Section 5.5. Once the fusion is finished, the feature maps subsequently flow through the fully connected layer and the SoftMax activation function and finally the classification result is obtained.
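The two fusion options can be illustrated on flattened feature vectors. The sketch assumes one scalar weight per branch for the weighted addition; in training these weights would be learnable parameters, whereas here they are plain hypothetical numbers, and the function names are illustrative only.

```python
def fuse_concat(spectral, spatial):
    """Concatenation keeps both feature vectors side by side."""
    return list(spectral) + list(spatial)

def fuse_weighted_sum(spectral, spatial, w_spectral, w_spatial):
    """Weighted addition needs both vectors to have the same length."""
    if len(spectral) != len(spatial):
        raise ValueError("weighted addition requires equal-length features")
    return [w_spectral * a + w_spatial * b for a, b in zip(spectral, spatial)]

spectral_features = [1.0, 2.0]
spatial_features = [3.0, 4.0]
concatenated = fuse_concat(spectral_features, spatial_features)
summed = fuse_weighted_sum(spectral_features, spatial_features, 0.5, 0.5)
```

Concatenation doubles the input width of the fully connected layer, while weighted addition keeps it unchanged but requires the two branches to produce features of the same size.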
3.5. Network Training
3.5.1. A New Activation Function
The activation function is an important element in a deep neural network and the rectified linear unit (ReLU) is often favored. Recently, Mish [
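46], a self-regularized non-monotonic activation function, has received increasing attention. The formula for Mish is as follows:
$$\mathrm{Mish}(x) = x \cdot \tanh\left(\mathrm{softplus}(x)\right) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right)$$
where $x$ is the input of the activation function.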
The graph of Mish and ReLU can be seen in
Figure 8. Unlike ReLU, which prunes all negative inputs, Mish allows small negative values to flow through, which improves model performance while retaining a degree of sparsity. Moreover, Mish is smooth and continuously differentiable, which is beneficial to optimization and generalization.
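For reference, Mish is defined as x · tanh(softplus(x)), and it can be compared with ReLU using only the standard library. The numerically stable form of softplus used here is an implementation choice for this sketch.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

def relu(x):
    return max(x, 0.0)

for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  mish={mish(value):.4f}")
```

For negative inputs ReLU returns exactly zero, whereas Mish returns a small negative value; for large positive inputs the two functions are nearly identical.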
3.5.2. Other Training Tricks
To mitigate the overfitting problem, dropout [
48] is a typical strategy. Given a drop probability $p$, which is set to 0.5 in the proposed model, the network temporarily drops hidden or visible units at random. With stochastic gradient descent, a different sub-network is therefore trained on each mini-batch. Moreover, dropout leaves only a few units in the network with high activation, which is conducive to the sparsity of the network. In our framework, a dropout layer is applied after the EMA block.
In addition, early stopping and dynamic learning rate adjustment are adopted to make network training more efficient. Specifically, early stopping means halting the training if the loss function no longer decreases for a given number of training epochs (20 in our method). Dynamic learning rate means that we adjust the learning rate during the training process to keep the model from becoming trapped in a local optimum. Herein, we use the cosine annealing [49] strategy, which is formulated as follows:
$$\eta_t = \eta_{\min}^{i} + \frac{1}{2}\left(\eta_{\max}^{i} - \eta_{\min}^{i}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$
where $\eta_t$ is the learning rate within the $i$-th run while $\eta_{\min}^{i}$ and $\eta_{\max}^{i}$ are ranges for the learning rate. $T_{cur}$ denotes how many epochs have been executed since the last restart and $T_i$ represents the number of epochs in one restart cycle.
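Both training strategies are easy to sketch with the standard library. The schedule below assumes the usual cosine annealing form, in which the learning rate falls from its maximum to its minimum over one restart cycle; the minimum learning rate of 0 and the cycle length of 10 epochs are hypothetical values, and the `EarlyStopping` helper is an illustrative name rather than part of the proposed model.

```python
import math

def cosine_annealing_lr(t_cur, t_i, lr_min, lr_max):
    """Learning rate after t_cur epochs of a restart cycle that lasts t_i epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

class EarlyStopping:
    """Signal a stop once the loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Learning rates over one hypothetical 10-epoch cycle, starting from 0.0005.
schedule = [cosine_annealing_lr(t, 10, 0.0, 0.0005) for t in range(11)]

stopper = EarlyStopping(patience=3)
decisions = [stopper.step(loss) for loss in (1.0, 0.9, 0.95, 0.97, 0.96)]
```

With a patience of 3, the stopper in this toy run fires on the fifth epoch, three epochs after the best loss of 0.9 was recorded.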
4. Experiment
4.1. Datasets Description
In the experiments, four publicly available datasets are used: the Pavia University (UP) dataset, the Indian Pines (IP) dataset, the Salinas Valley (SV) dataset, and the Botswana (BS) dataset.
Pavia University (UP): captured by the reflective optics imaging spectrometer (ROSIS-3) sensor at the University of Pavia, northern Italy, the Pavia University dataset comprises 103 bands with a spatial resolution of 1.3 mpp in the wavelength range from 0.43 to 0.86 μm. The spatial size is 610 × 340 pixels and 9 land cover classes are involved.
Indian Pines (IP): captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in north-western Indiana, the Indian Pines dataset comprises 200 bands with a spatial resolution of 20 mpp in the wavelength range from 0.4 to 2.5 μm. The spatial size is 145 × 145 pixels and 16 land cover classes are involved.
Salinas Valley (SV): captured by the AVIRIS sensor over the agricultural area of Salinas Valley, CA, USA, the Salinas Valley dataset comprises 204 bands with a spatial resolution of 3.7 mpp in the wavelength range from 0.4 to 2.5 μm. The spatial size is 512 × 217 pixels and 16 land cover classes are involved.
Botswana (BS): captured by the NASA EO-1 satellite over the Okavango Delta, Botswana, the Botswana dataset comprises 145 bands with a spatial resolution of 30 mpp in the wavelength range from 0.4 to 2.5 μm. The spatial size is 1476 × 256 pixels and 14 land cover classes are involved.
The performance of deep-learning-based models strongly depends on the data. Generally, the more labeled data used for training, the better the model performs. Currently, many HSI classification methods can achieve almost 100% accuracy with sufficient training samples, so performance under a shortage of training samples is the more informative test. Therefore, the training and validation sets in the experiments are kept relatively small to challenge the proposed model. In addition, for convenient comparison with previous methods, we follow the settings in [35], i.e., the proportions of samples for training and for validation are both set to 3% for IP, 0.5% for UP and SV and 1.2% for BS.
4.2. Experimental Configuration
All experiments were executed on the same platform, configured with an Intel Core i7-8700K processor at 3.70 GHz, 32 GB of memory and an NVIDIA GeForce GTX 1080Ti GPU. The software environment is Windows 10 Home (64-bit) with the PyTorch deep-learning framework.
Optimization is performed by Adam optimizer with the batch size of 16 and learning rate of 0.0005. To assess the results quantitatively, three metrics are adopted: overall accuracy (OA), average accuracy (AA), and Kappa coefficient.
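The three metrics can be computed from a confusion matrix as sketched below, using their standard definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and Kappa corrects OA for the agreement expected by chance. The function name `classification_metrics` and the toy confusion matrix are illustrative, and the sketch assumes every class has at least one sample.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) for confusion[i][j] = count of true class i predicted as j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]

    oa = correct / total
    aa = sum(confusion[i][i] / row_sums[i] for i in range(n_classes)) / n_classes
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy two-class example: 45 + 40 of 100 samples are classified correctly.
oa, aa, kappa = classification_metrics([[45, 5], [10, 40]])
```

In this toy example OA and AA are both 0.85, while the chance agreement of 0.5 gives a Kappa of 0.7.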
To assess the effectiveness of our approach, several methods are adopted for comparison. The SVM with a radial basis function (RBF) kernel [
6] is selected as a representative of the traditional methods. CDCNN [
22], SSRN [
28] and FDSSC [
29] are chosen on behalf of the deep-learning-based approaches. DBMA [
33] and DBDA [
35], which share the two-branch structure of our model, are selected as the state-of-the-art double-branch models. The parameters of each model are set according to the original paper. Since the source code of these methods is available, the classification results reported for them on the four datasets come from our own replication. For a fair comparison, all algorithms are executed ten times and the best results are retained.
4.3. Classification Results
4.3.1. Classification Results for the IP Dataset
The accuracy for the IP dataset obtained by different methods is shown in
Table 1, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in
Figure 9.
The proposed model yields the best results, i.e., 95.90% in OA, 96.19% in AA and 0.9532 in Kappa, as shown in
Table 1. CDCNN obtains the lowest accuracy since the training samples are too limited for the 2D-CNN-based model. SVM performs slightly better than CDCNN; however, its salt-and-pepper noise is quite severe, as shown in
Figure 9b. Owing to the integration of spatial and spectral information by 3D CNN, both SSRN and FDSSC are far superior to SVM and CDCNN, exceeding them by almost 20% in OA. Furthermore, FDSSC draws on dense connections, resulting in better performance. DBMA and DBDA follow basically the same idea, i.e., two branches are used to extract spectral and spatial features and an attention mechanism is introduced. However, they are prone to overfitting when the training samples are limited. Moreover, the attention mechanisms they use are simple and cannot distinguish different classes well. In contrast, our proposed model not only uses two branches to extract features, but also introduces an attention mechanism based on the EM algorithm, which iteratively updates the attention map and reduces the intra-class feature variance, thus making it easier to distinguish targets of different classes. As can be seen in
Table 1, our model performs well and evenly across categories, without extremely low scores. This demonstrates the discriminative capability of our model for each category.
4.3.2. Classification Results for the UP Dataset
The accuracy for the UP dataset obtained by different methods is shown in
Table 2, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in
Figure 10.
As shown in
Table 2, our method achieves the best results on the three metrics. In particular, the improvements over the second-best model, DBDA, are +1.29%, +1.37% and +1.74% for the OA, AA, and Kappa metrics, respectively. For individual classes, our method obtains the best results in 5 out of 9 classes. Class 8, which is the most difficult to classify, is represented by the dark gray line in
Figure 10a, which is too slender for most models to capture. Only DBMA, DBDA and our method achieve an accuracy over 80% on class 8, which illustrates the advantage of the attention mechanism in capturing fine features. Moreover, only our method exceeds 90% on this class, indicating that the attention mechanism adopted by our model stands out.
4.3.3. Classification Results for the SV Dataset
The accuracy for the SV dataset obtained by different methods is shown in
Table 3, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in
Figure 11.
Again, the proposed model obtains the best results with 98.33% OA, 98.91% AA, and 0.9814 Kappa. For class 15, none of the methods except ours achieves over 90% accuracy. This can be observed in
Figure 11. In the yellow area and the gray area in the upper left corner of the classification maps, the two classes are heavily confused with each other by all the models except ours.
4.3.4. Classification Results for the BS Dataset
The accuracy for the BS dataset obtained by different methods is shown in
Table 4, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in
Figure 12.
Since the BS dataset is small, with only 3248 labeled samples, training samples are scarce for the model. Nevertheless, the proposed method yields the best results, which demonstrates the competency of our method in exploiting spectral information and spatial information.
6. Conclusions
In this paper, we propose a novel HSI classification method that consists of two branches with pyramidal convolution and an iterative attention mechanism. First, the input of the whole framework is not subjected to dimensionality reduction such as PCA; the original 3D data is cropped into 3D cubes as input. Then, two branches are constructed with two novel techniques, namely pyramidal convolution and an iterative attention mechanism, EM attention, to extract spectral features and spatial features, respectively. Meanwhile, a new activation function, Mish, is introduced to accelerate the network convergence and improve the network performance. Finally, through several experiments, we analyze our model from multiple perspectives and demonstrate that the proposed model yields the best or competitive results on four datasets in comparison with other algorithms.
A future direction of our work is to explore better attention mechanisms to obtain finer feature representations. Furthermore, it would be interesting to further reduce the data requirements with techniques such as few-shot learning or zero-shot learning.