1. Introduction
A hyperspectral image (HSI) consists of hundreds of narrow spectral bands that provide a detailed spectrum reflecting the physical properties of materials, together with abundant spatial information that enhances the characterization of HSI scenes. Benefiting from such rich spectral and spatial features, HSI has been widely applied in numerous fields, such as military detection [1], change detection [2], and environmental monitoring [3]. HSI classification has become one of the most popular technologies in the field of hyperspectral remote sensing. However, it still struggles with many challenges, such as the lack of labeled samples, the curse of dimensionality, and large spatial variability of spectral signatures. Therefore, it remains a relevant but challenging research topic in remote sensing.
Early work focused on utilizing spectral information to address the curse of dimensionality. Representative algorithms included band selection [4], linear discriminant analysis [5], the collaborative representation classifier [6], maximum likelihood [7], etc. In addition to spectral features, spatial dependency was also incorporated into many classification frameworks, such as Markov random fields [8], superpixel segmentation [9], 3-D morphological profiles [10], and multiple kernel learning [11]. Although the abovementioned methods obtained good classification accuracy, they did not fully integrate spectral and spatial information. Consequently, methods combining spatial and spectral information for classification were proposed. Li et al. proposed a spectral–spatial kernel SVM to obtain spectral and spatial features [12]. Based on structural similarity, a nonlocal weighted joint sparse representation classification (SRC) method was built [13]. These classification approaches, whether based on spatial features, spectral features, or joint spectral–spatial features, all relied on prior knowledge and lacked robust representation and generalization ability.
More recently, owing to its powerful feature extraction capacity, deep learning (DL) has shown promising performance and has gradually been introduced into HSI classification. Hu et al. first applied convolutional neural networks (CNNs) to HSI classification [14]. Li et al. designed a CNN to extract pixel-pair features capturing the correlation between hyperspectral pixels [15]. However, these methods transform the input data into a 1-D vector, resulting in the loss of rich spatial information. To further improve classification accuracy, many approaches based on 2-D and 3-D CNNs have been developed to extract spectral and spatial information. Cao et al. built a compressed CNN, composed of a teacher model and a student model, for HSI classification [16]. Roy et al. combined 2-D and 3-D CNNs to excavate joint spatial–spectral features [17]. Zhang et al. presented a novel CNN exploiting diverse region inputs to capture contextual interaction information [18]. To extract high-quality feature maps, Ahmad devised a fast 3-D CNN [19]. To address the fixed geometry of traditional convolutional kernels, Zhu et al. constructed a deformable CNN for HSI classification [20]. Li et al. trained a 3-D CNN that directly extracts joint spatial–spectral information from the original HSI, outperforming traditional 2-D CNN-based methods [21].
With the breakthrough of DL, auxiliary techniques such as residual learning, dense connection, multiscale feature extraction, and multilevel feature fusion have emerged. For example, considering the strong complementarity among different layers, Xie et al. proposed a multiscale densely connected convolutional network that makes full use of information at diverse scales for HSI classification [22]. To eliminate redundant information and improve processing efficiency, Xu et al. designed a multiscale spectral–spatial CNN based on a novel image classification framework [23]. Zhang et al. presented a spectral–spatial fractal residual CNN to effectively excavate spectral–spatial features [24]. Gao et al. devised a multiscale dual-branch feature fusion and attention network, which integrated the feature reuse of residual learning with the feature exploration capacity of dense connection [25]. Song et al. improved classification performance by introducing a deep residual network [26]. To obtain spectral-, spatial-, and multiscale-enhanced representations, Li et al. built a long short-term memory neural network for classification tasks [27].
To obtain more discriminative and representative features, attention mechanisms have also been applied to CNNs. To highlight the contribution of sensitive pixels, Zhou et al. developed an attention module [28]. Yang et al. utilized a cross-spatial attention block to capture spatial and spectral information [29]. Hang et al. adopted a spectral attention subnetwork to classify spectral information and a spatial attention subnetwork to classify spatial information, and then aggregated the two classification results through adaptive weighted summation [30]. To boost classification accuracy, Xiang et al. constructed a multilevel hybrid attention end-to-end model to acquire spatial–spectral fusion features [31]. Tu et al. designed a local–global hierarchical weighting fusion network composed of a spectral subnetwork and a spatial subnetwork, including a pooling strategy based on local attention [32].
Inspired by the abovementioned approaches, in this article, we propose a multiscale cross interaction attention network (MCIANet) for HSI classification. First, we design an interaction attention module (IAM) that highlights the distinguishability of HSI by learning the importance of different spectral bands, spatial pixels, and cross dimensions, while suppressing redundant information. Then, the resulting attention-enhanced features are fed into a multiscale cross feature extraction module (MCFEM), which extracts spectral–spatial features across different convolutional layers, scales, and branches. Finally, we introduce global average pooling to compress the multiscale spectral–spatial features and use two dropout layers, two fully connected layers, and a softmax layer to produce the classification results.
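The final classification stage described above (global average pooling, two dropout layers, two fully connected layers, and a softmax) can be sketched in NumPy. The layer sizes, number of classes, and dropout rate below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classification_head(feature_maps, w1, w2, drop_p=0.5, train=False):
    """Global average pooling -> dropout -> FC -> dropout -> FC -> softmax.

    feature_maps: (batch, channels, height, width) multiscale features.
    """
    x = feature_maps.mean(axis=(2, 3))            # global average pooling
    if train:                                      # inverted dropout, training only
        x = x * (rng.random(x.shape) > drop_p) / (1 - drop_p)
    x = np.maximum(x @ w1, 0)                      # first FC layer + ReLU
    if train:
        x = x * (rng.random(x.shape) > drop_p) / (1 - drop_p)
    return softmax(x @ w2)                         # per-class probabilities

# Illustrative sizes: 64 feature channels, 9 land-cover classes (hypothetical).
feats = rng.standard_normal((4, 64, 7, 7))
w1 = rng.standard_normal((64, 32)) * 0.1
w2 = rng.standard_normal((32, 9)) * 0.1
probs = classification_head(feats, w1, w2)
print(probs.shape)        # (4, 9)
```

Each row of `probs` sums to one, so the predicted class for a pixel patch is simply the `argmax` over the last axis.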
The main contributions of this article are summarized as follows:
- (1)
To strengthen the distinguishability of HSI and suppress the interference of redundant information, we design an interaction attention module (IAM). IAM highlights spectral–spatial features favorable for classification by learning the importance of different spectral bands, spatial contexts, and cross dimensions.
- (2)
To enrich the diversity of spectral–spatial information, we devise a multiscale cross feature extraction module (MCFEM) based on an innovative multibranch lower triangular fusion structure. On one hand, MCFEM utilizes multiple receptive fields to extract multiscale spectral–spatial features. On the other hand, MCFEM introduces "up-to-down" and "down-to-up" fusion strategies to make maximal use of the information flow between different convolutional layers and branches.
- (3)
IAM and MCFEM together constitute the proposed HSI classification method. Experimental results on three benchmark datasets show performance competitive with state-of-the-art DL methods, indicating that the proposed method can capture more discriminative and representative multiscale spectral–spatial features.
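The multibranch "lower triangular" fusion idea can be illustrated structurally. In this sketch, each branch's convolution is replaced by a simple moving average with a branch-specific receptive field, and each branch passes its fused output down to the next branch; the receptive-field sizes and the exact fusion rule are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def moving_avg(x, k):
    # Stand-in for a convolution with receptive field k (same-length output).
    pad = k // 2
    xp = np.pad(x, (pad, k - 1 - pad), mode="edge")
    c = np.cumsum(np.insert(xp, 0, 0.0))
    return (c[k:] - c[:-k]) / k

def mcfem_sketch(x, scales=(3, 5, 7)):
    """Structural sketch of a multibranch fusion with cross-branch flow.

    Branch i filters at its own scale, then forwards its output (fused with
    the input) to branch i+1, so later branches see earlier branches' features
    -- a simplified reading of the "up-to-down"/"down-to-up" strategies.
    """
    branches = []
    carry = x
    for k in scales:
        feat = moving_avg(carry, k)   # this branch's receptive field
        carry = feat + x              # fuse and pass down to the next branch
        branches.append(feat)
    return np.concatenate(branches)   # aggregate all branch outputs

x = np.linspace(0.0, 1.0, 12)         # a toy per-pixel spectrum
print(mcfem_sketch(x).shape)          # (36,)
```

The point of the sketch is the dependency pattern: unlike a plain parallel concatenation, each branch's input already contains the previous branches' features.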
The remainder of this article is organized as follows. Section 2 reviews related work on HSI classification. Section 3 presents the overall framework of the proposed model. Section 4 provides the experimental results and analysis on three benchmark datasets. Finally, Section 5 draws conclusions.
2. Related Works
HSI classification methods are generally divided into two categories: machine learning (ML)-based and deep learning (DL)-based methods. ML-based methods usually design features manually and then feed these features into classifiers for training. Representative algorithms are principal component analysis (PCA) [33], the support vector machine (SVM) [34], and 3-D Gabor filters [35]. These methods rely on handcrafted features with insufficient generalization ability, leading to unsatisfactory classification results. In contrast, DL-based approaches can not only automatically capture high-level features in a hierarchical manner but also provide excellent classification performance. DL-based methods include stacked autoencoders (SAEs) [36], recurrent neural networks (RNNs) [37], convolutional neural networks (CNNs) [38], deep belief networks (DBNs) [39], generative adversarial networks (GANs) [40], and graph convolutional networks (GCNs) [41]. Among these, CNN-based classification methods exhibit outstanding capability for HSI classification.
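The ML-based pipeline above (reduced or hand-designed features fed to a conventional classifier) can be illustrated with a minimal NumPy sketch: PCA compresses each pixel's spectral bands, and a nearest-centroid rule stands in for the classifier. This is a simplification for illustration; the cited works use SVMs and other classifiers, and the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

def pca_fit_transform(X, n_components):
    # Project pixel spectra onto the top principal components.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

# Synthetic "HSI pixels": 200 pixels x 100 bands, two spectrally distinct classes.
class0 = rng.normal(0.2, 0.05, size=(100, 100))
class1 = rng.normal(0.8, 0.05, size=(100, 100))
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)

Z = pca_fit_transform(X, n_components=5)         # dimensionality reduction
centroids = np.stack([Z[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
print((pred == y).mean())                         # well separated: prints 1.0
```

Because the features are fixed before classification, any discriminative information that PCA discards is lost to the classifier, which is exactly the generalization limitation noted above.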
Numerous existing HSI classification networks are devoted to extracting spectral–spatial features at different scales to boost classification performance, and many multiscale-feature-based methods have been developed. For example, to capture complex multiscale spatial–spectral features, Wang et al. presented a multiscale dense connection network for HSI classification [42]. Yu et al. built a dual-channel convolutional network that not only learns global features but also takes full advantage of spectral–spatial features at different scales [43]. Gao et al. constructed a multiscale feature extraction module to obtain granular-level features [25]. Lee et al. placed a multiscale filter bank on the first layer of their contextual deep CNN to achieve multiscale feature extraction [44]. To learn spectral–spatial features at different scales, Li et al. devised a multiscale deep middle-level feature fusion network [45]. Zhao et al. trained a multiscale CNN to extract contextual information at different scales for HSI classification [46]. To reduce parameters and obtain contextual features at different scales, Xu et al. constructed a multiscale octave 3-D CNN [47]. Fu et al. designed a segmentation model utilizing multiscale 2-D singular spectrum analysis to capture joint spectral–spatial features [48]. Most existing multiscale HSI classification methods use a functional module to obtain spectral–spatial features at different scales. Such modules usually fall into two main categories: the first applies multiple receptive fields in parallel to capture spectral–spatial features at diverse scales and then concatenates them into multiscale features; the second adopts a multibranch strategy in which each branch uses a different receptive field, and likewise concatenates the per-branch features. However, both categories integrate features from different receptive fields or branches with a simple concatenation and do not exploit the cross interaction between them, which causes spectral–spatial information loss and harms classification accuracy.
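The first category described above (several receptive fields applied in parallel to the same input, then a plain concatenation) can be sketched as follows; the window sizes are illustrative, and a moving average stands in for each convolution:

```python
import numpy as np

def multi_rf_features(x, scales=(1, 3, 5)):
    """Category one: apply several receptive fields to the same input in
    parallel, then concatenate the per-scale features. Note that no
    information flows between scales -- the limitation discussed above."""
    feats = []
    for k in scales:
        pad = k // 2
        xp = np.pad(x, (pad, k - 1 - pad), mode="edge")
        c = np.cumsum(np.insert(xp, 0, 0.0))
        feats.append((c[k:] - c[:-k]) / k)   # moving average ~ conv, field k
    return np.concatenate(feats)             # simple concatenation fusion

x = np.linspace(0.0, 1.0, 16)                # a toy per-pixel spectrum
out = multi_rf_features(x)
print(out.shape)                             # (48,)
```

With `k=1` the "convolution" is the identity, so the first third of the output reproduces the input; each scale is computed independently and only joined at the end, with no cross interaction between receptive fields.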
The attention mechanism has been successfully used in various visual tasks, such as salient object detection [49,50], super-resolution reconstruction [51,52,53], and semantic segmentation [54,55,56]. Owing to its ability to locate and extract salient information from input data, the attention mechanism has also been applied to remote sensing problems. Guo et al. combined a spatial attention module with a spectral attention module to enhance the distinguishability of spatial and spectral information [57]. Xiong et al. utilized dynamic routing between attention inception modules to adaptively learn the proposed architecture [58]. To improve classification accuracy, Mou et al. introduced an end-to-end spectral attention block [59]. An end-to-end attention recurrent CNN was developed to classify high-resolution remote sensing scenes [60]. To enhance the discriminative capacity of spectral–spatial features, Xue et al. used the attention mechanism to adaptively weight spectral-wise and spatial-wise responses [21]. Xi et al. designed a hybrid residual attention to alleviate the overfitting problem [
62]. Most existing HSI classification approaches use spectral, spatial, or joint spectral–spatial attention mechanisms to enhance the representation ability of HSI. However, these approaches rarely consider the close interdependencies among the different dimensions of HSI.
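The spectral attention idea common to several of the cited works can be shown with a minimal NumPy sketch: spatial information is squeezed by global average pooling, a small bottleneck produces per-band weights, and the input bands are reweighted. This is a generic squeeze-and-excitation style sketch under assumed sizes, not the architecture of any specific cited work:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(x, w1, w2):
    """x: (bands, height, width) HSI patch. Returns band-reweighted patch
    and the learned per-band weights."""
    squeeze = x.mean(axis=(1, 2))             # per-band global average pooling
    hidden = np.maximum(w1 @ squeeze, 0)      # bottleneck FC + ReLU
    weights = sigmoid(w2 @ hidden)            # per-band weights in (0, 1)
    return x * weights[:, None, None], weights

bands = 16
x = rng.random((bands, 9, 9))
w1 = rng.standard_normal((4, bands)) * 0.5    # reduction ratio 4 (illustrative)
w2 = rng.standard_normal((bands, 4)) * 0.5
out, w = spectral_attention(x, w1, w2)
print(out.shape)                              # (16, 9, 9)
```

Each band is scaled independently; such a module captures spectral importance but, as noted above, does not model the cross-dimension interdependencies that the proposed IAM targets.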