A Multiscale Dual-Branch Feature Fusion and Attention Network For Hyperspectral Images Classification
Abstract—Recently, hyperspectral image classification based on deep learning has attracted considerable attention. Many convolutional neural network (CNN) classification methods have emerged and exhibited superior classification performance. However, most methods focus on extracting features with fixed convolution kernels and layer-wise representations, resulting in one-sided feature extraction. Additionally, the feature fusion process is rough and simple: numerous methods fuse different levels of features by stacking modules hierarchically, which ignores the combination of shallow and deep spectral-spatial features. To overcome these issues, a novel multiscale dual-branch feature fusion and attention network is proposed. Specifically, we design a multiscale feature extraction (MSFE) module to extract spatial-spectral features at a granular level and expand the range of receptive fields, thereby enhancing the multiscale feature extraction ability. Subsequently, we develop a dual-branch feature fusion interactive module that integrates the residual connection's feature-reuse property and the dense connection's feature-exploration capability, obtaining more discriminative features in both the spatial and spectral branches. Additionally, we introduce a novel shuffle attention mechanism that allows for adaptive weighting of spatial and spectral features, further improving classification performance. Experimental results on three benchmark datasets demonstrate that our model outperforms other state-of-the-art methods while incurring a lower computational cost.

Index Terms—Convolutional neural network (CNN), dual-branch feature fusion (DBFM), hyperspectral image (HSI) classification, multiscale feature extraction (MSFE) module, shuffle attention block.

Manuscript received June 15, 2021; revised July 25, 2021; accepted August 4, 2021. Date of publication August 12, 2021; date of current version August 30, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 62071168, in part by the National Key Research and Development Program of China under Grant 2018YFC1508106, in part by the Fundamental Research Funds for the Central Universities of China under Grant B200202183, and in part by the China Postdoctoral Science Foundation under Grant 2021M690885. (Corresponding author: Chenming Li.) The authors are with the College of Computer and Information, Hohai University, Nanjing 211100, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/JSTARS.2021.3103176. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

HYPERSPECTRAL images (HSIs) have recently gained increased attention in the field of remote sensing. Hyperspectral remote sensing is a multidimensional signal acquisition technology that combines imaging and spectroscopy, detecting not only the 2-D spatial characteristics but also the 1-D spectral information of targets. HSIs have the following advantages over conventional remote sensing images. First, the spectral resolution is high, allowing the acquisition of continuous spectral curves for a variety of ground objects. Second, the spectral coverage range is expanded, allowing more of the ground objects' responses to electromagnetic waves to be detected. Additionally, HSIs incorporate both spatial and spectral features and contain a greater amount of detailed information. Owing to these characteristics, HSIs play a significant role in agricultural detection [1], [2], medical diagnosis [3], [4], atmospheric monitoring [5], [6], hydrological detection [7], and other fields. The essence of hyperspectral remote sensing image classification is assigning each pixel vector to a specific land-cover class; fully exploiting the abundant spatial and spectral features is a great challenge in HSIs classification.

The traditional classification methods of HSIs are all based on handcrafted features. Early-stage classification methods such as the support vector machine (SVM) [8], random forest (RF) [9], and multinomial logistic regression [10] all aim at utilizing the 1-D spectral features to complete the classification. Although a large number of spectral bands usually implies more potential information, the classification accuracy rises at first and then decreases owing to the high-dimensional data characteristics of HSIs, which give rise to the Hughes phenomenon [11]. As a result, more and more studies focus on dimensionality reduction along the spectral dimension [12]. Currently, widely used methods include principal component analysis (PCA) [13] and linear discriminant analysis (LDA) [14]. While these methods compress the spectral dimension and reduce the spectral redundancy, noise caused by lighting and imaging equipment typically remains. Moreover, due to the limited spatial resolution and the complexity of the imaging process, the same land cover may exhibit spectral dissimilarity, while the spectral properties of different materials may be indistinguishable [15].

In recent years, deep learning (DL) has occupied a dominant position in computer vision due to its robust feature representation ability. DL eliminates the tedious process of feature engineering: through an end-to-end structure, the network automatically extracts abstract features hierarchically. DL has achieved great success in the fields of image classification [16], target recognition [17], and semantic segmentation [18].
GAO et al.: MULTISCALE DUAL-BRANCH FEATURE FUSION AND ATTENTION NETWORK FOR HSIs CLASSIFICATION 8181
For the first time, Chen et al. [19] applied DL to HSIs classification. Since then, more and more DL methods [20]–[25] have appeared in HSIs classification. For instance, [26] performs feature extraction and classification simultaneously with a deep belief network, and Yuan et al. [27] generate HSIs classification maps with a combination of stacked encoders. While both of the preceding methods have demonstrated considerable success, they rely on the spectral vectors of pixels to complete the classification and miss the spatial distribution of the image pixels: the spatial context of the original data is destroyed, resulting in the loss of useful spatial information. As a result, research on HSIs classification needed to be carried further, and researchers began to place great emphasis on the spatial structure information of HSIs. Many methods based on 2-D CNNs have been proposed for HSIs classification [28]–[31]. For instance, Makantasis et al. [28] developed a neural network model based on a 2-D CNN in which the central pixels are packed into fixed-size cubes by filling in the surrounding pixels and then sent into the neural network to extract spatial information; this data processing technique is quite novel and achieves excellent classification performance. Li et al. [32] proposed a novel pixel-pair method to exploit the similarity between pixels and used a majority voting strategy to generate the final label. Pan et al. [33] designed a small-scale data-driven method, the multigrained network, to deal with the limited samples in HSIs classification. Cao et al. [34] developed a Bayesian HSIs classification method, which combines a CNN and a smooth Markov random field to exploit the spatial information. However, the most distinguishing features of HSIs are their spectral diversities; these studies frequently place a greater emphasis on spatial characteristics but appear to overlook spectral ones.

Therefore, later research began to explore the combination of spatial and spectral features to complete the classification task. For the first time, [35] proposed an HSI classification algorithm based on spectral-spatial features, in which the spectral information was fused with the spatial information through the transformation of the network; the classification was carried out on the fused features, and the results were excellent. Li et al. [36] proposed a double-branch spatial-spectral extraction and fusion method based on a 2-D convolutional network, which further improved the discriminative feature extraction capacity. Liu et al. [37] introduced LSTM to HSIs classification in a novel way, with spectral and spatial LSTM blocks: the method passes each pixel's spectral and spatial features to the softmax layer, which generates two distinct types of results, and then uses decision fusion to generate the classification maps.

CNNs have strong feature extraction capabilities and can obtain high-level abstract features by stacking modules and deepening the network. Nonetheless, on the one hand, a deeper network introduces additional parameters into the training process and lengthens the training time; on the other hand, gradient vanishing impairs backpropagation and degrades the classification performance. For the former, the continued development of high-performance graphics processing units (GPUs) [38] has significantly reduced the training time of networks with large numbers of parameters. For the latter, He et al. [39] proposed a residual network that uses skip connections to ensure that gradients circulate smoothly in deeper networks, alleviating the problem of gradient vanishing. Soon after, residual networks gained popularity in the field of computer vision and were also applied to HSIs classification. For instance, Zhong et al. [40] designed a spectral-spatial residual network (SSRN) with two consecutive residual blocks to learn the discriminative features of HSIs, which performs well even with small training samples. Lee et al. [41] enhanced the learning efficiency of traditional CNN models by introducing a residual network and used multiscale convolution kernels to explore the spatial-spectral features of HSIs. Song et al. [42] developed a deep residual network with an attention mechanism to learn discriminative HSIs features and obtained a further improvement in classification performance. Paoletti et al. [43] designed a deep pyramidal residual network for HSIs classification.

Recent works have shown attention mechanisms to be an extremely powerful tool for boosting classification performance. According to biological cognitive research, human beings receive significant information by focusing on a few critical items and ignoring others [44]. Attention in neural networks has the same function and has been successfully applied to various computer vision tasks [45], [46]. In HSIs classification tasks, many methods based on existing attention mechanisms have also emerged, demonstrating their effectiveness in improving classification performance.

Meanwhile, multiscale feature extraction (MSFE) is a critical component of HSI classification, as it has a significant impact on classification performance. Existing multiscale extractors [47] are limited to extracting features from fixed receptive fields and thus cannot extract global and local features simultaneously. Moreover, even when multiscale features have been extracted in the front layers of the network, the feature fusion process is rough, resulting in the loss of information from those layers.

Drawing intuition from the success of the abovementioned methods, a novel 3-D dual-branch feature extraction and fusion attention network is proposed for HSIs classification. The main contributions of this article are summarized as follows.
1) Many existing CNN-based classification methods extract multiscale spatial-spectral features with layer-wise representations and fixed kernel sizes. Different from them, we design a 3-D MSFE module which exploits multiple available receptive fields at a more granular level and is capable of performing multiscale feature extraction in a lightweight and efficient manner.
2) In order to fully excavate the potential of the spatial and spectral feature representations of HSIs, a 3-D dual-branch feature interactive module (DBFM) is proposed for classification. Different from existing parallel processing networks with stacked convolution modules, the DBFM is a dual-branch structure consisting of multiple filters with additive links and concatenative links. In brief, the concatenative links focus on exploring new effective features in HSIs, while the additive links enhance the reuse of features from previous layers. We integrate the two types of links
8182 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021
Fig. 1. (a) Overall flowchart of the proposed MSDBFA. (b) Structure of the MSFE. (c) Structure of the DBFM.
in the DBFM for fusing the spatial and spectral features at different levels of the network and for assimilating particular features from previous layers.
3) Given the significant contribution of distinct spatial and channel features to classification results in HSIs, we introduce a 3-D spatial-channel attention block to boost the network's feature representation capability. Existing attention blocks focus on capturing dependencies in the spectral dimension, while our proposed 3-D attention block improves classification performance by creatively altering the conventional weight distribution method in both the channel and spatial dimensions.
4) Extensive experiments on three publicly available datasets are conducted. The results indicate that our model outperforms state-of-the-art methods.
The rest of this article is organized as follows. The proposed MSFE, DBFM, attention block, and corresponding algorithms are described in Section II. Section III details the associated experiments and analysis. Finally, Section IV concludes this article.

II. PROPOSED METHOD

This section begins with a brief overview of the proposed MSDBFA model; we then elaborate on the MSFE, DBFM, and attention block.

A. Overview of Proposed Model

The main procedure of the proposed MSDBFA is shown in Fig. 1(a). We take the Indian Pines dataset as an example to illustrate the detailed process of the algorithm. First, PCA is applied to reduce the spectral dimension and suppress the band noise in the original HSIs; PCA also effectively mitigates the Hughes effect and thereby improves classification performance. The HSIs are then segmented into 3-D image cubes centered on labeled pixels and sent to the MSFE module, which extracts multiscale spatial-spectral features at a granular level and thus expands the range of receptive fields. Following that, the MSFE-processed image is evenly divided into two feature subsets and fed into the 3-D DBFM. To achieve deep feature fusion in both the spatial and spectral dimensions, we use hierarchical layers comprised of three DBFM modules with different kernel filters; each DBFM has a corresponding spatial and spectral branch. The two branches combine shallow and deep features via additive and concatenative links to produce discriminative spatial-spectral features. Additionally, a shuffle attention block is inserted into the network to adaptively filter out the features critical for classification, allowing the network to focus on sensitive features while suppressing weaker ones. As a result, we obtain discriminative feature maps for the various classes. After completing the abovementioned operations, the feature maps are converted to vectors using an average pooling layer and then fed
TABLE I
CONFIGURATION OF THE MSDBFA MODEL FOR THE INDIAN PINES DATASET (SPATIAL SIZE=15×15)
into the fully connected layers and a softmax function to obtain the final classification maps. A detailed summary of the proposed model in terms of layer names, input map dimensions, and numbers of parameters is given in Table I.

B. Structure of MSFE

Multiscale feature representations are essential for various computer vision tasks. At the moment, the majority of methods rely on stacking multiple kernel filters in hierarchical layers to extract multiscale features. For instance, [46] makes use of spatial pyramid pooling to enhance the multiscale ability of each layer, and Deng et al. [47] develop a feature pyramid that combines features at various scales. However, these methods extract features in a layer-wise manner and with relatively fixed receptive fields. In contrast to these existing methods, we aim to improve the layer-wise multiscale representation capability and to achieve multiple available receptive fields at a more granular level. As a result, we developed the MSFE module for extracting multiscale features from HSIs. As shown in Fig. 1(b), the input can be denoted as U ∈ R^{C×D×W×H}, where U represents the image patch, and C, D, W, and H denote the channel count, spectral dimension, width, and height of the image patch, respectively. We subdivide the original feature map into four subsets along the channel dimension, denoted by U_i, where i ∈ {1, 2, 3, 4}. They all retain the same spatial size and spectral dimension, but the channel count is reduced to 1/4 of that of the original feature map U. Subsequently, each feature subset is sent to a convolutional sequence (Conv-BatchNorm-ReLU) with kernel size 3×3×3 to generate a new feature map. To avoid size inconsistency,
we pad the data accordingly. The convolutional sequence operation is denoted by F_i(·). To strengthen the reuse of features from the previous layer and reduce the number of parameters, we omit the convolution for U_1 in the forward propagation, namely Y_1 = U_1. After being added to the output of F_{i-1}(·), each remaining feature subset is fed into F_i(·). As a result, Y_i can be written as

    Y_i = U_i,                  i = 1
    Y_i = F_i(U_i),             i = 2                    (1)
    Y_i = F_i(U_i + Y_{i-1}),   2 < i ≤ 4

According to the forward propagation, the potential receptive field of each convolutional layer covers a segment of {U_i, i ≤ 4}. Each time a convolutional operator is applied, the outputs gain a larger receptive field, so through this combination the final output of the module may contain multiple receptive fields of varying scales. To further improve the representative ability of the model, we concatenate all the feature subsets and pass them through a 1×1×1 convolution with ReLU activation to obtain more nonlinear characteristics.

C. Structure of DBFM

As is well known, ResNet [39] is built by sequentially stacking residual blocks: the features are added element-wise to the outputs through shortcut connections, which not only enhances information propagation but also speeds up the network's training. Meanwhile, the concatenative links in DenseNet [50] enable each layer to receive the raw data generated by all preceding layers, which is useful for exploring new features. Fig. 2 shows the difference in connection pattern between additive links and concatenative links. Building on both, we propose a 3-D DBFM that fuses these multiscale features in a novel way, using multiple filters with additive and concatenative links to obtain discriminative spatial-spectral fused features. As shown in Fig. 1(c), the original input HSI cube is denoted by X ∈ R^{C×D×H×W} and is evenly divided into two cubes along the channel dimension C, denoted by X_i ∈ R^{C/2×D×H×W}, i ∈ {1, 2}. We pass the two feature subsets separately into the spatial and spectral branches of the DBFM block. In the spatial branch, for the feature subset X_1, we adopt 1×3×3 spatial kernels with subsampling strides of (2, 1, 1) to obtain feature maps with representative spatial features. Similarly, in the spectral branch we apply 5×3×3 kernels with strides of (2, 1, 1). The features produced by the two branches are then fused as

    X_fus = X_12 ⊕ X_22
    X_spatial = concat[X_11, X_fus]                      (2)
    X_spectral = concat[X_21, X_fus]

where X_fus represents the fused features, which contain discriminative spatial and spectral features, and concat[·] indicates the concatenative links between the original feature subsets and the fused features. These links enhance the feature fusion and explore new features in another way.

D. Structure of Attention Block

As we all know, different features contribute differently to HSIs classification. Based on this fact, we introduce an attention mechanism here to allow the network to focus on useful features and neglect insignificant ones. Existing attention blocks include SENet [51], CBAM [52], GCNet [53], etc. Among them, SENet is a representative channel attention architecture which applies global average pooling (GAP) and fully connected layers to recalibrate the channel-wise feature responses and remodel the interdependencies between channels. GCNet is a lightweight and effective attention block used to construct global context features. CBAM separates spatial and channel attention in order to capture representative features respectively, and then combines them to create a weighted feature map. Since both spatial and channel attention are critical for HSI classification, and inspired by [54], we propose a novel lightweight spatial-channel attention block capable of effectively combining the two distinct types of attention. As shown in Fig. 3, for a given HSI cube X ∈ R^{C×D×W×H}, where C, D, W, and H refer to the channel, spectral dimension, width, and height, respectively, we divide X into G groups along the channel dimension, denoted by X = [X_1, ..., X_G], X_k ∈ R^{C/G×D×H×W}. Each feature map X_k is then segmented along the channel dimension into two branches, denoted by X_k1, X_k2 ∈ R^{C/2G×D×H×W}. One branch is used to generate channel attention feature maps by acquiring the inter-relationships of channels, while the other branch generates spatial attention maps by analyzing spatial location relationships.

For the channel attention branch, we first obtain channel-wise statistics of the feature map X_k1 via the GAP operation and then use a sigmoid activation to recalibrate the channel-wise feature map as
    X'_k1 = σ( w_1 · ( 1/(H×W) Σ_{i=1..H} Σ_{j=1..W} X_k1(i, j) ) + b_1 ) · X_k1        (3)

where 1/(H×W) Σ_{i=1..H} Σ_{j=1..W} X_k1(i, j) represents the GAP operation, w_1 ∈ R^{C/2G×1×1×1} and b_1 ∈ R^{C/2G×1×1×1} are two dynamic parameters that scale the feature map, (i, j) refers to a specific spatial position in the HSI cube, and σ is the sigmoid activation. For the spatial attention branch, we first use group norm (GN) to obtain spatial-wise statistics and then, similarly to the channel attention branch, introduce a sigmoid activation to create a gating mechanism and generate the weighted feature map. As a result, the final output of the spatial attention branch is given by

    X'_k2 = σ( w_2 · GN(X_k2) + b_2 ) · X_k2        (4)

where w_2 and b_2 are the corresponding scale parameters of the spatial branch.
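For concreteness, the forward pass of the described attention block can be sketched framework-agnostically in NumPy. The learned scale parameters w_1, b_1, w_2, b_2 are randomly initialized placeholders here, and the final channel-shuffle step is an assumption carried over from the shuffle attention design of [54]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def group_norm(x, eps=1e-5):
    # Normalize each grouped feature map over its channel and spatial extent.
    mean = x.mean(axis=(1, 2, 3, 4), keepdims=True)
    var = x.var(axis=(1, 2, 3, 4), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def shuffle_attention(x, groups=4, rng=np.random.default_rng(0)):
    """Forward pass of the spatial-channel attention block on an
    (N, C, D, H, W) cube; C must be divisible by 2*groups. w1/b1
    (channel branch) and w2/b2 (spatial branch) stand in for the
    learned per-channel scale parameters of (3) and (4)."""
    n, c, d, h, w = x.shape
    cg = c // (2 * groups)                       # channels per half-group
    x = x.reshape(n * groups, c // groups, d, h, w)
    xk1, xk2 = x[:, :cg], x[:, cg:]              # split into two branches

    w1, b1 = rng.standard_normal((1, cg, 1, 1, 1)), np.zeros((1, cg, 1, 1, 1))
    w2, b2 = rng.standard_normal((1, cg, 1, 1, 1)), np.zeros((1, cg, 1, 1, 1))

    # Channel branch, Eq. (3): GAP -> affine rescale -> sigmoid gate.
    gap = xk1.mean(axis=(2, 3, 4), keepdims=True)
    xk1 = sigmoid(w1 * gap + b1) * xk1

    # Spatial branch, Eq. (4): group-norm statistics -> sigmoid gate.
    xk2 = sigmoid(w2 * group_norm(xk2) + b2) * xk2

    out = np.concatenate([xk1, xk2], axis=1).reshape(n, c, d, h, w)
    # Channel shuffle so information flows across groups.
    return out.reshape(n, 2, c // 2, d, h, w).swapaxes(1, 2).reshape(n, c, d, h, w)
```

In a trained network, w and b would be learnable parameters and the group count G a hyperparameter; the sigmoid gates guarantee that every output response is a damped copy of its input.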
TABLE II
NUMBERS OF TRAINING AND TESTING SAMPLES FOR IP DATASET

optimize the parameters. The initial learning rate is 0.0025 and decreases by 1% every 50 epochs. We repeat all the experiments five times and average the results in order to avoid random errors.

C. Analysis of Parameters

The classification performance depends on the proposed model structure and on the selection of the network parameters. PCA is first used to process the HSIs in order to obtain the C principal components. The input datasets are then neighborhood blocks of size C×d×s×s centered on the labeled pixels, where s×s refers to the spatial size of the input data. We elaborate on the analysis of the effects of these parameters below.

TABLE III
NUMBERS OF TRAINING AND TESTING SAMPLES FOR SA DATASET

1) Effect of the Number of Principal Components C: This section examines the effect of varying the number of principal components C on the proposed model's classification performance. We adopt PCA to reduce the spectral dimension, with C empirically set to 10, 20, 30, and 40. It can be observed in Fig. 5(a) that the overall accuracies rise significantly from 10 to 30 and peak at 30; when the number of principal components exceeds 30, the OA begins to decline. This phenomenon demonstrates that, up to a point, the greater the number of principal components, the more detailed the spectral information retained from the HSIs, and the more discriminative features the neural network can extract from these components. As the number of principal components continues to increase, however, the classification performance degrades due to spectral redundancy, and excessive principal components inevitably add computational complexity. Therefore, C is set to 30 for all three datasets.

TABLE IV
NUMBERS OF TRAINING AND TESTING SAMPLES FOR BT DATASET

2) Effect of Spatial Size s×s: In HSIs classification, the spatial size of the image cube determines how many pixels are processed simultaneously by the neural network. We select image patches of different sizes to test the classification performance; specifically, the spatial sizes are varied from 7×7 to 17×17 with an interval of 2. The overall accuracies of our model for the different spatial sizes are shown in Fig. 5(b). From the figure we find that the 7×7 spatial size has the worst performance, as it is too small to provide sufficient spatial-spectral information for classification. With the continuous expansion of the spatial size, the patch contains more discriminative information, and the classification performance improves steadily. The peak value for the different datasets appears at 15×15; when the spatial size exceeds 15×15, the overall accuracies begin to decline due to excessive redundant features. As a result, we conclude that either an excessively large or an excessively small spatial size is detrimental to classification performance.
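As a concrete illustration, the preprocessing fixed above (PCA down to C = 30 components followed by extraction of 15×15 neighborhood blocks) can be sketched as follows; the eigendecomposition-based PCA and the reflect-padding of image borders are implementation assumptions, and the random cube merely stands in for a real scene such as Indian Pines:

```python
import numpy as np

def pca_reduce(cube, n_components=30):
    """Project an (H, W, B) hyperspectral cube onto its first
    n_components principal components along the spectral axis."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)
    flat -= flat.mean(axis=0)                       # center each band
    cov = np.cov(flat, rowvar=False)                # (B, B) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
    comps = eigvecs[:, ::-1][:, :n_components]      # top components first
    return (flat @ comps).reshape(h, w, n_components)

def extract_patch(cube, row, col, s=15):
    """Cut the s×s neighborhood block centered on a labeled pixel,
    reflect-padding the borders so every pixel gets a full patch."""
    r = s // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + s, col:col + s, :]

# toy 103-band scene standing in for real data
hsi = np.random.default_rng(0).random((20, 20, 103))
reduced = pca_reduce(hsi, n_components=30)   # (20, 20, 30)
patch = extract_patch(reduced, 0, 0, s=15)   # (15, 15, 30)
```

Each such patch, together with the label of its central pixel, forms one training sample for the network.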
Fig. 5. Results of parameter analysis and ablation study. (a) Effect of C on overall accuracies on three HSIs datasets. (b) Effect of spatial size on overall accuracies on three HSIs datasets. (c) Effect of the MSFE module on three HSIs datasets. (d) Effect of the attention block on three HSIs datasets. (e) Effect of the DBFM module on three HSIs datasets.
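The granular forward rule of (1), whose contribution is ablated in Fig. 5(c), can be illustrated with a small framework-agnostic sketch; the scalar stage functions below are placeholders for the actual Conv-BatchNorm-ReLU sequences:

```python
import numpy as np

def msfe_forward(u, stages):
    """Granular multiscale pass of Eq. (1): the (C, D, H, W) input U is
    split into four channel subsets; U1 passes through untouched,
    U2 is transformed by its stage, and each later subset is added to
    the previous output before its own stage, enlarging the receptive
    field step by step. `stages` maps subset index i (2..4) to F_i."""
    subsets = np.split(u, 4, axis=0)            # U1..U4 along channels
    ys = [subsets[0]]                           # Y1 = U1 (no convolution)
    prev = stages[2](subsets[1])                # Y2 = F2(U2)
    ys.append(prev)
    for i in (3, 4):                            # Yi = Fi(Ui + Y(i-1))
        prev = stages[i](subsets[i - 1] + prev)
        ys.append(prev)
    return np.concatenate(ys, axis=0)           # fused multiscale output

# scalar placeholder stages standing in for Conv-BN-ReLU blocks
stages = {i: (lambda t, k=i: k * t) for i in (2, 3, 4)}
u = np.ones((8, 5, 7, 7))                       # toy (C, D, H, W) cube
y = msfe_forward(u, stages)                     # same shape as u
```

In the real module the concatenated output would additionally pass through the 1×1×1 convolution with ReLU described in Section II-B.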
TABLE V
CLASSIFICATION RESULTS OF THE PROPOSED MODEL WITH DIFFERENT TRAINING RATIOS

1.5% of each land-cover category. Table V gives the overall accuracies for the different ratios of training samples on the three datasets. It can be observed that the overall accuracies rise steadily as the number of training samples increases. At the same time, the proposed model exhibits robust performance when the training samples are insufficient.

E. Ablation Study

In order to demonstrate the effectiveness of the proposed MSFE module, attention block, and DBFM module, we design three specific ablation experiments. The models used for comparison are consistent with the network of the proposed method except that the MSFE module or the attention block is removed from the original network; for the DBFM, we replace the module with a single-branch layer-wise 3-D CNN. The number of principal components and the spatial size are set to 30 and 15×15 to guarantee the fairness of the experiments. The overall accuracies of the comparison models on the three datasets are displayed in Fig. 5(c)–(e). It can be observed that the MSFE module improves the overall accuracies by approximately 0.18%–0.41%. The reason is that the MSFE module introduces multiple kernel sizes to capture the rich spatial-spectral information and fuses information at different scales more effectively. The model with the attention block achieves higher overall accuracies (by approximately 0.19%–0.52%) than the model without it, demonstrating that the proposed attention block can adaptively assign different weights to spatial-channel regions and selectively strengthen valuable features during HSI feature extraction. Likewise, the proposed DBFM module exhibits superior performance compared with the single-branch 3-D CNN network on the three HSIs datasets. The DBFM module is composed of multiple filters with additive and concatenative links, which improves spatial-spectral feature fusion for low-resolution land covers and captures discriminative features.
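As a minimal sketch of the dual-branch fusion of (2) exercised in this ablation, the following framework-agnostic code replaces the convolutional branch stages with simple placeholder operators (the real stages follow the 1×3×3 and 5×3×3 configurations of Section II-C):

```python
import numpy as np

def dbfm_fuse(x1, x2, spatial_stage, spectral_stage):
    """Dual-branch fusion following Eq. (2): each branch transforms its
    half of the input (X11 -> X12, X21 -> X22), the stage outputs are
    added to form Xfus, and each branch concatenates its own features
    with Xfus (a concatenative link on top of the additive link)."""
    x12 = spatial_stage(x1)                    # spatial branch output
    x22 = spectral_stage(x2)                   # spectral branch output
    x_fus = x12 + x22                          # additive link: Xfus
    x_spatial = np.concatenate([x1, x_fus], axis=0)   # concat links
    x_spectral = np.concatenate([x2, x_fus], axis=0)
    return x_spatial, x_spectral

# placeholder "convolution stages" with matched output shapes
spatial_stage = lambda t: 0.5 * t
spectral_stage = lambda t: t - t.mean()

x1 = np.ones((4, 9, 9))    # X11: spatial-branch half of the cube
x2 = np.ones((4, 9, 9))    # X21: spectral-branch half
x_spa, x_spe = dbfm_fuse(x1, x2, spatial_stage, spectral_stage)
```

The single-branch baseline of the ablation corresponds to dropping one branch and both concatenative links, which is exactly the capacity the measured accuracy gap attributes to the DBFM.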
Fig. 6. Classification maps for IP. (a) Ground truth. (b)–(j) Predicted classification maps for SVM (OA=81.78%), MLR (OA=75.49%), RF (OA=73.09%), 1-D-CNN (OA=80.60%), 2-D-CNN (OA=90.27%), Hybrid-SN (OA=97.17%), SSRN (OA=97.69%), A2S2KResNet (OA=98.63%), and the proposed MSDBFA (OA=99.14%).
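The OA values quoted in this caption, together with the AA and Kappa figures reported in the comparison, are standard confusion-matrix statistics and can be computed as follows (the 3-class matrix is purely illustrative):

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and Cohen's Kappa from a confusion matrix whose rows
    are true classes and columns are predicted classes."""
    conf = np.asarray(conf, dtype=np.float64)
    total = conf.sum()
    oa = np.trace(conf) / total                      # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))   # mean per-class accuracy
    # expected chance agreement from row/column marginals
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / total ** 2
    kappa = (oa - pe) / (1 - pe)                     # chance-corrected accuracy
    return oa, aa, kappa

# illustrative 3-class confusion matrix
oa, aa, kappa = classification_metrics([[50, 0, 0],
                                        [5, 45, 0],
                                        [0, 0, 50]])
```

OA weights every test pixel equally, AA weights every class equally (so it penalizes failure on rare classes such as Oats), and Kappa discounts agreement expected by chance.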
F. Comparison With Different Methods

In order to evaluate the performance of our proposed method MSDBFA, we select eight classification methods to compare with our model: SVM with a radial basis function kernel, multinomial logistic regression (MLR), RF, spectral CNN (1-D-CNN), spatial CNN with 2-D kernels (2-D-CNN), Hybrid-SN [55], SSRN [40], and A2S2KResNet [56]. Figs. 6–8 show the classification maps on the IP, SA, and Botswana datasets. Among them, SVM, RF, and MLR are classical machine learning classification methods; they complete the classification using the spectral dimension of HSIs and have been widely used in previous classification research, but achieve low accuracies. In order to highlight the progressiveness of our method, we also employ a variety of DL-based methods. 1-D-CNN is an early neural network model using the spectral dimension to classify HSIs; 2-D-CNN classifies based on spatial features; SSRN is a classical spatial-spectral classification model that incorporates residual connections to mitigate gradient vanishing and shorten training time; Hybrid-SN creatively utilizes 3-D and 2-D convolutions to explore the shallow and deep features of HSIs, respectively; and A2S2KResNet introduces an adaptive spectral-spatial kernel improved residual network with spectral attention for the purpose of capturing discriminative spectral-spatial features in HSIs.

In order to ensure the fairness of the experiments, the spatial size and the number of principal components are set to 15×15 and 30 for all DL methods, respectively. Because SSRN does not use PCA in its original paper, we do not apply the PCA operation to the SSRN model; the other network parameters are configured according to the respective papers. Our proposed method outperforms the other methods by approximately 0.51%–26.05% in terms of OA, 0.5%–40.78% in terms of AA, and 0.58%–30.27% in terms of Kappa on the IP dataset. The sample distribution is extremely unbalanced across the IP dataset's various classes: the Alfalfa, Grass-p-m, and Oats classes, for example, have only 46, 28, and 20 samples, respectively. This poses a great challenge for HSIs classification, resulting in unbalanced sample training. Notably, our proposed method achieves 100% accuracy on the Grass-pasture, Grass-t, Grass-p-m, Hay-w, and Oats classes. By comparison, SVM, RF, and MLR have relatively poor classification performance. To be precise, the SVM classifier has the highest OA among the three machine learning methods, the MLR classifier's values fall precipitously when dealing with small sample sizes, and the RF classifier performs the worst (OA = 76.09%). Comparatively, the DL classification methods have superior performance; for example, the A2S2KResNet method achieves the best classification results among all the comparison methods, with an OA of 98.63%.

For the SA and Botswana datasets, our proposed model MSDBFA also achieves the highest OA. Among all classification methods, the RF model again performs the worst, indicating that the random forest algorithm cannot deal well with the complex spatial-spectral features in HSIs. At the same time, for simple DL methods such as 1-D-CNN and 2-D-CNN, the classification accuracy is significantly improved compared with the conventional machine learning methods; however, they still have their own limitations. For example, 1-D-CNN utilizes the redundant spectral information to complete the classification, which is bound to be affected by the Hughes phenomenon, while 2-D-CNN relies on the spatial distribution and characteristics of ground objects and ignores the rich spectral characteristics of HSIs. The many recent DL methods that employ spatial-spectral feature fusion strategies, including SSRN, Hybrid-SN, and A2S2KResNet, generally outperform the former methods (1-D-CNN and 2-D-CNN), especially when the number of training samples is relatively small. This demonstrates that, in the case of a limited number of training samples, the hierarchical fusion mechanism
GAO et al.: MULTISCALE DUAL-BRANCH FEATURE FUSION AND ATTENTION NETWORK FOR HSIs CLASSIFICATION 8189
Fig. 7. Classification maps for SA. (a) Ground truth. (b)–(j) Predicted classification maps for SVM (OA=90.25%), MLR (OA=89.79%), RF (OA=86.18%),
1-D-CNN (OA=90.81%),2-D-CNN (OA=95.58%), Hybird-SN (OA=97.68%), SSRN (OA=97.09%), A2S2KResNet (OA=97.01%), and proposed HSMSN-HFF
(99.71%).
Fig. 8. Classification maps for Botswana. (a) Ground truth. (b)–(j) Predicted classification maps for SVM (OA=92.65%), MLR (OA=92.40%), RF (OA=84.49%),
1D-CNN (OA=93.51%), 2-D-CNN (OA=98.63%), Hybird-SN (OA=98.07%), SSRN (OA=98.27%), A2S2KResNet (OA=98.31%), and proposed HSMSN-HFF
(99.91%).
8190 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021
TABLE VI
CLASSIFICATION RESULTS OF DIFFERENT METHODS ON THE IP DATASET
TABLE VII
CLASSIFICATION RESULTS OF DIFFERENT METHODS ON THE SA DATASET
can combine the complementary and relevant information from the outputs of distinct convolutional layers, making the extracted features more effective for classification. Tables VI–VIII show the results in terms of OA, AA, and Kappa for the above-mentioned methods.
Additionally, to evaluate the computational cost and complexity of the proposed model, Table IX gives the total trainable parameters (TTP), floating-point operations (FLOPs), and training times of the various models on the IP dataset. As can be seen, Hybrid-SN has the most parameters and the highest FLOPs among the compared methods, owing to its large kernel filters and batch size. A2S2KResNet has approximately the same number of parameters as SSRN, but SSRN requires more FLOPs because it does not use PCA to reduce the spectral dimension of the HSIs as conventional methods do; instead, it utilizes a 3-D kernel filter to squeeze the dimension hierarchically. Our method has the lowest FLOPs, thanks to the lightweight multiscale extraction module and the effective shuffle attention block. In terms of training time, our method takes only slightly longer than Hybrid-SN but achieves significantly better classification performance on all three datasets. On the whole, the MSDBFA model outperforms the other methods in terms of both classification performance and computational cost.
TABLE VIII
CLASSIFICATION RESULTS OF DIFFERENT METHODS ON THE BOTSWANA DATASET
TABLE IX
TRAINABLE PARAMETERS, FLOPS, AND TRAINING TIMES OF DIFFERENT MODELS FOR IP DATASET
IV. CONCLUSION
In this article, a novel multiscale dual-branch feature fusion and attention network has been proposed. Specifically, we propose an MSFE module built from multiple residual-like connections, so that the module can obtain multiscale features at a granular level. Moreover, we design the DBFM to complete the deep fusion of spatial-spectral features via concatenative and additive links, which not only enhances feature reuse at the shallow level but also explores new discriminative information from the fused spatial-spectral features. In addition, we introduce a novel shuffle attention block that improves the performance of the network by creatively altering the conventional weight distribution method in the channel and spatial dimensions, thereby enhancing the representation ability of the feature map. The results obtained on three HSI datasets reveal that our proposed MSDBFA model provides competitive classification performance compared with other state-of-the-art approaches.
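As a concrete illustration of the preprocessing described in the experiments, the spatial size is fixed to 15×15 and the spectral dimension is reduced to 30 principal components before the DL models are trained. The following is a minimal NumPy sketch of that pipeline; the function names and the zero-padding choice at image borders are our own illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def pca_reduce(cube, n_components=30):
    """Reduce the spectral dimension of an HSI cube (H, W, B) via PCA."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)
    flat -= flat.mean(axis=0)                       # center each band
    # Eigendecomposition of the band-by-band covariance matrix.
    cov = flat.T @ flat / (flat.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors of the n_components largest eigenvalues.
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return (flat @ top).reshape(h, w, n_components)

def extract_patch(cube, row, col, size=15):
    """Cut a size x size spatial patch centered on (row, col), zero-padded."""
    r = size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)))
    return padded[row:row + size, col:col + size, :]
```

Each 15×15×30 patch then serves as the network input for the pixel at its center; border pixels are handled here by zero padding, though mirror padding is an equally common choice.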
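For reference, the three accuracy measures reported in Tables VI–VIII (OA, AA, and the Kappa coefficient) can all be derived from a confusion matrix. The sketch below shows one standard way to compute them; the function name and interface are illustrative, not from the authors' implementation.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Compute OA, AA, and Kappa from true and predicted label vectors."""
    # Confusion matrix: rows = true class, columns = predicted class.
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    # Overall accuracy: fraction of correctly classified samples.
    oa = np.trace(cm) / total
    # Average accuracy: mean of the per-class recalls.
    per_class = np.diag(cm) / cm.sum(axis=1)
    aa = per_class.mean()
    # Kappa: agreement corrected for chance agreement p_e.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

OA alone can be misleading on class-imbalanced scenes such as IP, which is why AA (equal weight per class) and Kappa (chance-corrected agreement) are reported alongside it.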
Hongmin Gao (Member, IEEE) received the Ph.D. degree in computer application technology from Hohai University, Nanjing, China, in 2014.
He is currently a Professor with the College of Computer and Information, Hohai University. His research interests include deep learning, information fusion, and image processing in remote sensing.

Yiyan Zhang (Student Member, IEEE) received the B.S. degree in Internet of Things engineering from Jiangsu University of Technology, Changzhou, China, in 2020.
He is a graduate student with the College of Computer and Information, Hohai University, Nanjing, China. His research interests include hyperspectral image processing, pattern recognition, and deep learning.

Zhonghao Chen (Student Member, IEEE) received the B.S. degree in electronics and information engineering from West Anhui University, Luan, China, in 2019.
He is a graduate student with the College of Computer and Information, Hohai University, Nanjing, China. His research interests include photogrammetry and remote sensing image processing.

Chenming Li received the B.S., M.S., and Ph.D. degrees in computer application technology from Hohai University, Nanjing, China, in 1993, 2003, and 2010, respectively.
He is currently a Professor and the Deputy Dean with the College of Computer and Information, Hohai University. His research interests include information processing systems and applications, system modeling and simulation, multisensor systems, and information processing.
Dr. Li is a Senior Member of the China Computer Federation and the Chinese Institute of Electronics.