Article
An Enhanced Spectral Fusion 3D CNN Model for Hyperspectral
Image Classification
Junbo Zhou, Shan Zeng *, Zuyin Xiao, Jinbo Zhou, Hao Li and Zhen Kang
School of Mathematics and Computer Science, Wuhan Polytechnic University, Wuhan 430023, China
* Correspondence: [email protected]
Abstract: With the continuous development of hyperspectral image technology and deep learning
methods in recent years, an increasing number of hyperspectral image classification models have
been proposed. However, due to the numerous spectral dimensions of hyperspectral images, most
classification models suffer from issues such as breaking spectral continuity and poor learning of
spectral information. In this paper, we propose a new classification model called the enhanced
spectral fusion network (ESFNet), which contains two parts: an optimized multi-scale fused spectral
attention module (FsSE) and a 3D convolutional neural network (3D CNN) based on the fusion of
different spectral strides (SSFCNN). Specifically, after sampling the hyperspectral images, our model
first implements the weighting of the spectral information through the FsSE module to obtain spectral
data with a higher degree of information richness. Then, the weighted spectral data are fed into
the SSFCNN to realize the effective learning of spectral features. The new model can maximize the
retention of spectral continuity and enhance the spectral information while being able to better utilize
the enhanced information to improve the model’s ability to learn hyperspectral image features, thus
improving the classification accuracy of the model. Experimental results on the Indian Pines and Pavia
University datasets demonstrate that our method outperforms other relevant baselines in terms of
classification accuracy and generalization performance.
Keywords: deep learning; hyperspectral image classification; attention mechanism; feature fusion; 3D CNN

1. Introduction
In recent years, with the continuous development of hyperspectral image (HSI) technology [1], the analysis and processing of hyperspectral data has become one of the hotspots in many research areas [2]. HSIs are characterized by high information content, strong spectral continuity, high spectral resolution and so on. These characteristics allow HSIs to be used in an increasingly wide range of applications, such as environmental monitoring, agricultural production, mineral development and other fields [3–5]. Among the many applications of HSIs, the classification of pixels in images is one of the main research tasks [6].

HSI classification is more complex than traditional image classification to a certain extent. This is mainly reflected in two points: First, the number of HSIs is much smaller than the number of conventional images. Taking the COCO public dataset [7] as an example, it contains 91 easily recognizable object categories with a total of 2.5 million tagged instances in 328,000 images. The common public dataset of HSIs generally has only one original dataset containing spectral data and label data. Second, the spectral dimension of an HSI is much larger than that of a traditional image. The large amount of spectral data makes it difficult for general classifiers to achieve high accuracy, especially when the training samples are extremely limited. Therefore, HSI classification can be studied from two aspects: classification and spectral information processing.

Early researchers faced with the problem of how to deal with complex spectral information mainly processed spectral information via the following methods: principal
convolutional neural networks or other deep learning methods are FusionNet [35],
HSI bidirectional encoder representation from transformers (HSI-BERT) [36], spatial–
spectral transformers (SST) [37] and two-stream spectral-spatial residual networks
(TSRN) [38]. With the CNN model being studied for a long time, the 3D CNN model
was proposed by Tran et al. [39]. The biggest advantage of 3D CNN over 2D CNN
is that the features of the channel dimension can be extracted, which is very suitable
for HSIs. Chen et al. [40] applied 3D CNN to the classification of HSIs. After that,
many researchers have begun using 3D CNN for HSI classification. For example,
Ahmad et al. [41] proposed a 3D CNN model that can rapidly classify hyperspectral
images. Zhong et al. [42] designed a residual module based on 3D CNN to extract
spatial and spectral information and applied it to HSI classification. Laban et al. [43]
proposed a 3D deep learning framework which combined PCA and 3D CNN. Due
to the advantages of 3D convolution, other 3D CNN-based models for hyperspectral image
analysis include spectral four-branch multi-scale networks (SFBMSN) [44], 3D ×
2D CNN [45] and 3D ResNet50 [46]. However, as 3D CNN has the ability to extract
both spatial and spectral information, there is no need to extract spatial and spectral
features separately.
By analyzing these two main lines, we can find some new ideas or problems that can
be solved: (1) Can the two main lines of research be better integrated? Although research
on HSIs serves classification models, ordinary classification models are not effective in
extracting the main features of the spectra due to the complexity of the original spectral
information of hyperspectral images. So, can we design a network structure that can better
learn the spectral features after processing? (2) In terms of classification models, 3D CNN
is theoretically well suited to HSIs. It is worthwhile to try to make the design idea of the
new network structure more closely fit 3D CNN.
In order to solve the above problems, we designed and tested a new HSI classification
model (ESFNet). The innovations of our model can be divided into two parts:
(1) We optimize the SeKG module [29]; the optimized module is termed FsSE. In order to better process and utilize
the spectral information while preserving the continuity between spectra as much as
possible, we reduce the convolution of multiple scales in the SeKG module to two
scales and set the scaling parameter in the excitation layer to 1. These two optimiza-
tions allow the module to extract correlations between spectra more efficiently while
retaining maximum spectral continuity, so that the classification model can better
learn the spectral features.
(2) We propose a new network named the spectral stride fusion network (SSFCNN). The
new network implements the fusion of different strides by taking advantage of the fact
that 3D CNN can slide in the spectral dimension. This structure not only enhances the
learning ability of the model regarding spectral features, but also solves the problem
of redundant spectra.
Our model effectively solves the problem of integrating the two main lines mentioned
above. On the one hand, the usefulness of the FsSE module cannot be realized if the
enhanced information of this module is not effectively utilized. On the other hand, without
the support of enhanced features, the advantages of SSFCNN cannot be better demonstrated.
Therefore, the two parts are complementary and indispensable, which greatly enhances
the model’s ability to learn spectral characteristics. A series of experiments shows that
our proposed ESFNet is effective, and its overall accuracy is better than that of other
classification models.
The rest of the paper is organized as follows. Section 2 introduces the FsSE module
and SSFCNN. Section 3 presents the datasets used for the experiments, the experimental
environment and the training and test sets. Section 4 focuses on the relevant experimental
analysis. Finally, conclusions and discussions are summarized in Section 5.
The input of the model is the hyperspectral data $X \in \mathbb{R}^{h \times w \times c}$, where h and w are the length and width of the input data, respectively, and c is the number of spectral bands. After global average pooling, a one-dimensional spectral channel vector $X_c = \{X_1, X_2, \ldots, X_c\}$ can be obtained. The formula can be expressed as:

$$X_l = \frac{\sum_{i=1}^{h} \sum_{j=1}^{w} X_l(i,j)}{h \times w}, \quad l = 1, 2, 3, \ldots, c \qquad (1)$$
After obtaining the one-dimensional spectral channel vector, we use multi-scale convolution to weight the spectral characteristics and enhance the correlation between the spectra. We only set up two convolution kernels of different scales, $K_s = \{K_{s1}, K_{s2}\}$. The layer is a 1D convolution, and the size of the convolution kernel is $1 \times 1 \times c_k$, $c_k \in \{3, 5, 7, \ldots\}$. The size of the convolution kernel can be adjusted according to the experiment. The convolution kernel slides in the direction of the spectral dimension, and the stride length is 1. The value generated by the convolution represents the correlation between the spectra at that scale. The size after convolution is ensured to be the same as the original size by zero-padding. Finally, we use the ReLU function to ensure that the channel correlation is positive. The specific calculation formula is as follows:
$$Y_l = \sum_{i=0}^{c_k - 1} X_{l+i} \cdot K_s^{(i+1)} + b, \quad l = 1, 2, 3, \ldots, c; \qquad X_{l+i} = 0 \ \text{for} \ l+i > c \qquad (2)$$
where $Y_c = \{Y_1, Y_2, \ldots, Y_c\}$ represents the result of the convolution at one scale, and the size of the output is still $1 \times 1 \times c$; since we only use two scales, two such output vectors are obtained (the $Y_1$ and $Y_2$ used below). b represents the bias value.
In order to obtain a wealth of spectral information, we fused the results of the obtained
spectral correlations at different convolution scales by channel. This can be expressed in a
formula as:
Fs = Xc ⊕ Y1 ⊕ Y2 (3)
Fs represents the spectral features after fusion. Xc represents the original one-dimensional
spectral channel vector. Y1 and Y2 represent the results obtained at two convolution scales.
⊕ represents the summation of these three vectors. Each channel of the fused feature
contains the original spectral information and the related features of the adjacent spectrum,
which can better generate channel weights that match the HSI.
In order to obtain the mask of the spectral channels, we need to input the fused results
into an excitation module consisting of two fully connected layers. In the SE module and
SeKG module, in order to reduce the amount of computation, the first fully connected
layer usually reduces the dimensionality of the data to c/r (c represents the number of
channels, while r represents the scaling parameter). The second fully connected layer
restores the dimensions to the original dimensions. This processing is a good choice for
ordinary images. However, for HSIs, the wealth of spectral information is the biggest
characteristic. In addition, we further enriched the spectral information by multi-scale
fusion. The method to reduce the dimension before restoring it will undoubtedly cause
part of this rich information to be lost. Therefore, we used a fully connected layer with the
same number of nodes in both layers (i.e., the scaling parameter is set to 1), which not only
reduces the effectiveness of this module but also preserves the spectral information. The
formula for calculating the channel mask can be expressed as
In Formula (4), L represents the sigmoid function. ∂ represents the ReLU function.
Li represents the fully connected layer. After obtaining the final channel weights M, the
weighted result is obtained by multiplying M with the two-dimensional matrix input to the
module through the scale operation.
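To make the data flow of the FsSE module concrete, the following is a minimal PyTorch sketch assembled from Equations (1)-(4): global average pooling, two zero-padded 1D convolutions over the spectral axis, fusion by summation, a two-layer excitation with the scaling parameter set to 1, and the final scale operation. The class name, the tensor layout and the default kernel sizes (5 and 7, following the combination found best in Section 4.2.1) are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class FsSE(nn.Module):
    """Sketch of the multi-scale fused spectral attention (FsSE) module.

    Assumes input of shape (batch, channels=c, height, width), where the
    channel axis holds the c spectral bands.
    """
    def __init__(self, num_bands: int, k1: int = 5, k2: int = 7):
        super().__init__()
        # Two 1D convolutions over the spectral dimension (Equation (2)),
        # zero-padded so the output length stays c.
        self.conv1 = nn.Conv1d(1, 1, kernel_size=k1, padding=k1 // 2, bias=True)
        self.conv2 = nn.Conv1d(1, 1, kernel_size=k2, padding=k2 // 2, bias=True)
        self.relu = nn.ReLU()
        # Excitation with scaling parameter r = 1: both layers keep c nodes.
        self.fc1 = nn.Linear(num_bands, num_bands)
        self.fc2 = nn.Linear(num_bands, num_bands)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Equation (1): global average pooling -> spectral channel vector X_c.
        xc = x.mean(dim=(2, 3))                                   # (b, c)
        # Equation (2): spectral correlations at two scales, kept positive by ReLU.
        y1 = self.relu(self.conv1(xc.unsqueeze(1))).squeeze(1)    # (b, c)
        y2 = self.relu(self.conv2(xc.unsqueeze(1))).squeeze(1)    # (b, c)
        # Equation (3): fuse the original vector with both correlation vectors.
        fs = xc + y1 + y2
        # Equation (4): two fully connected layers produce the channel mask M.
        m = self.sigmoid(self.fc2(self.relu(self.fc1(fs))))       # (b, c)
        # Scale operation: reweight each spectral band of the input.
        return x * m.view(b, c, 1, 1)

# Example: 200-band Indian Pines patches of size 11 x 11.
fsse = FsSE(num_bands=200)
weighted = fsse(torch.randn(4, 200, 11, 11))   # same shape as the input
```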
Figure 2. The difference between 3D convolution and 2D convolution. (a) Schematic diagram of 2D convolution; (b) Schematic diagram of 3D convolution.
In 3D CNN, the calculation formula of the output value $O_{xyz}$ of the neuron node (x, y, z) is as follows:

$$O_{xyz} = \sum_{i=0}^{K_w-1} \sum_{j=0}^{K_h-1} \sum_{m=0}^{K_c-1} I_{(x+i)(y+j)(z+m)} \cdot K_{(i+1)(j+1)(m+1)} + b \qquad (5)$$

In Formula (5), $K_w$, $K_h$ and $K_c$ represent the width, height and number of channels of the kernel, respectively. $I_{xyz}$ is the input, and b is the bias value.
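As a small, self-contained illustration (not code from the paper), Formula (5) is exactly the computation performed by a standard 3D convolution. In the PyTorch sketch below, placing the spectral bands on the depth axis is an assumption of this sketch, and the first entry of the stride tuple acts as the spectral stride used later in SSFCNN.

```python
import torch
import torch.nn as nn

# A single 3D convolution over an HSI patch, mirroring Formula (5).
# Assumed input layout: (batch, 1, bands, height, width).
conv = nn.Conv3d(
    in_channels=1,
    out_channels=8,
    kernel_size=(7, 3, 3),   # (K_c, K_h, K_w): 7 bands x 3 x 3 spatial window
    stride=(3, 1, 1),        # spectral stride 3, spatial stride 1
    bias=True,
)

patch = torch.randn(4, 1, 200, 11, 11)   # e.g. 200-band 11 x 11 patches
out = conv(patch)
print(out.shape)   # torch.Size([4, 8, 65, 9, 9])
```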
2.2.2. SSFCNN
HSIs are rich in spectral information. In order to make better use of this important characteristic, we designed a 3D CNN model based on spectral feature fusion named the spectral stride fusion network (SSFCNN). The structure we designed can both ensure that enough spectral information is collected and make the model learn more abundant features by fusing different spectral information. The reason why we want to emphasize the learning ability of the model for spectral features is that in an actual HSI, there are certain similarities between the spectra of different ground objects. Taking the Indian Pines dataset as an example, we plotted the spectral curves of these 16 types of samples. From Figure 3, we can see that the trend of the spectral curves of these 16 types of samples is basically the same and has strong continuity. This requires the model to have a stronger spectral learning capability. Therefore, our original intention of designing SSFCNN is to solve this problem.
Figure 3. The spectral curves of 16 types of samples.

Compared with the traditional HSI classification model using a convolutional neural network, a 3D convolutional neural network (3D CNN) can classify images relatively quickly without manual dimensionality reduction. The difference between the method in this paper and the general 3D CNN model is that the structure we designed allows the model to extract spectral features under different spectral strides and then fuse those features. This structure allows for both dimensionality reduction and for the network to learn different spectral features, which can better guide the model to classify targets. Figure 4 shows the network structure of our model.
Taking the Indian Pines dataset as an example, through Figure 4, we aimed to fuse the results of two different spectral strides. The values of stride for each layer are (1, 3) and (1, 5), respectively. The results of the two different strides are concatenated by the concat operation. We use different spectral sampling strides for concatenation because the spectral features extracted at different strides are not the same. With a small stride, the model can extract more spectral features, but it also extracts some redundant information; with a large stride, the model extracts less redundant information, but the wealth of the spectral features is likewise reduced. As a result, we planned to combine the results at different stages so that they could complement each other. With the extraction of the two strides, the model can learn more abundant spectral information. The first two layers of the model are designed according to this idea. In Layer3, in order to facilitate the final model output, the fusion strategy is no longer used. However, the size of the feature map arriving at Layer3 will be different due to the size of the patch. When the patch size is less than 9, the size of the feature map reaching that layer is already a one-dimensional vector, and there is no need to downsample the feature map. Therefore, we set a discriminator to judge the feature maps input to Layer3. Finally, the final output is calculated through the fully connected layer.
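The stride-fusion idea described above can be sketched as a single block in PyTorch. This is a minimal illustration under stated assumptions: the tensor layout, kernel sizes, output channels and the way the two branches are aligned before concatenation are choices made for the sketch, not the authors' exact SSFCNN architecture.

```python
import torch
import torch.nn as nn

class StrideFusionBlock(nn.Module):
    """One fusion layer: two 3D convolutions with different spectral strides,
    concatenated along the feature dimension (the 'concat' operation above).

    Assumed input layout: (batch, features, bands, height, width).
    """
    def __init__(self, in_ch: int, out_ch: int, spectral_strides=(1, 3)):
        super().__init__()
        s1, s2 = spectral_strides
        self.branch1 = nn.Conv3d(in_ch, out_ch, kernel_size=(7, 3, 3),
                                 stride=(s1, 1, 1), padding=(0, 1, 1))
        self.branch2 = nn.Conv3d(in_ch, out_ch, kernel_size=(7, 3, 3),
                                 stride=(s2, 1, 1), padding=(0, 1, 1))
        self.relu = nn.ReLU()

    def forward(self, x):
        f1 = self.relu(self.branch1(x))   # small stride: denser spectral features
        f2 = self.relu(self.branch2(x))   # large stride: less redundancy
        # The two branches produce different spectral lengths, so they are
        # cropped to a common length here (a simplifying assumption of this sketch).
        d = min(f1.shape[2], f2.shape[2])
        return torch.cat([f1[:, :, :d], f2[:, :, :d]], dim=1)

# Example with the stride pair (1, 3) mentioned above for Layer1:
block = StrideFusionBlock(in_ch=1, out_ch=8, spectral_strides=(1, 3))
fused = block(torch.randn(2, 1, 200, 11, 11))
print(fused.shape)   # torch.Size([2, 16, 65, 11, 11])
```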
3. Experimental Setting

3.1. Dataset
Two public datasets, the Indian Pines dataset and the Pavia University dataset, were used in this experiment. The Indian Pines dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Northwest Indiana, United States, in 1992. This dataset consists of 145 × 145 pixels with a spatial resolution of 20 m. There are 220 continuous bands in the wavelength range of 400~2500 nm, with 20 water absorption and low signal-to-noise ratio bands (104~108, 150~163, 220) removed. The ground truth includes 16 types of samples, most of which are crops at different growth stages. The spectral features of these 16 types of samples are relatively similar, and the image resolution is low, which can easily produce mixed pixels, thus causing some difficulties in image classification. Figure 5 shows the pseudo-color image and the ground truth, respectively.
Figure 5. Pseudo-color image (R: 50, G: 30, B: 20) and ground truth of Indian Pines dataset.
Figure 6 shows the Pavia University dataset. This dataset was acquired in 2002 using ROSIS sensors over Pavia, Italy. It includes nine types of samples, such as roads, meadows and roofs. The image consists of 610 × 340 pixels with a spatial resolution of 1.3 m. There are 115 bands in the wavelength range of 430~860 nm, of which 103 bands are reserved for testing after removing 12 bands with strong noise and water absorption.
Figure 6. Pseudo-color image (R: 60, G: 30, B: 2) and ground truth of Pavia University dataset.

3.2. Running Environment
The processor used for the experiments is an i7-10750H from Intel with a main frequency of 2.60 GHz. The graphics card used for the experiments is an RTX2060 from NVIDIA with 6 GB of video memory. The experimental device has 16 GB of memory. The system used is Windows 10. The deep learning framework used is Pytorch.

3.3. Dataset Processing
In this paper, a fully supervised learning approach is used, and the dataset is divided into the training set and the test set. Figure 7 shows the training set and test set of two datasets.
Figure 8. Comparison results of the four models on two datasets. (a) Accuracies of the four models on the Indian Pines dataset; (b) Accuracies of the four models on the Pavia University dataset.
In Figure 8, we can clearly see that the effect of ESFNet is better than the other three groups of comparison experimental models. We know that the FsSE module enhances the information contained in each band of the hyperspectral image (HSI), but the global averaging pooling in the FsSE module obtains a global receptive field and does not take into account the spatial information. SSFCNN has the capability of spatial learning while enhancing the learning ability of spectra, but if the classification model is allowed to learn the original spectral bands directly, the effective band information cannot be extracted efficiently. In the results shown in Figure 8, the accuracy only has a little difference when either method is used alone. However, after combining FsSE with SSFCNN, the accuracy of prediction can be significantly improved. The reason why the combined model can improve the accuracy is that it can both ensure the continuity of the HSI spectra and enhance the spectral information of the HSI while allowing the enhanced information to be better utilized in the classification model. Therefore, our idea of designing this new classification model for HSIs is effective.

4.2. Parameter Sensitivity Analysis
Some hyperparameter settings in the ESFNet module can have an impact on the effect of the model. After analysis, we mainly analyze the performance of the model from three aspects: the size of the convolution kernel in the FsSE module, the combination of strides in SSFCNN and the patch size of the input. We set the batch size to 16, used the RMSprop algorithm as the optimizer of the loss function and set the epoch of all models to 200. The test set is evaluated by selecting the model with the highest detection accuracy on the validation set, and finally the best choice of these three components is used as the final choice of the model, which is compared with other models in Section 4.3. It should be noted that the other parameters of the model are the same when we analyze a particular parameter.
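For reference, the training protocol described in this paragraph (batch size 16 via the data loader, RMSprop optimizer, 200 epochs, and keeping the weights with the highest validation accuracy) can be written as the following sketch; the learning rate and the use of cross-entropy loss are assumptions, since they are not stated here.

```python
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, device="cuda"):
    """Sketch of the training protocol: RMSprop, 200 epochs, model selection
    by highest validation accuracy. The batch size of 16 is assumed to be set
    in the DataLoaders; the learning rate below is an assumption."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.RMSprop(model.parameters(), lr=1e-3)  # lr assumed
    best_acc, best_state = 0.0, None
    for epoch in range(200):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        # Validation pass: keep the weights with the highest accuracy.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                pred = model(x).argmax(dim=1)
                correct += (pred == y).sum().item()
                total += y.numel()
        acc = correct / max(total, 1)
        if acc > best_acc:
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_acc
```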
4.2.1. Impact of Convolution Kernel Size of the FsSE Module on Model Accuracy
We will introduce the advantages of this module in Section 3. However, due to
the different settings of the convolution kernel size, the extracted spectral correlations
are different. Common convolution kernels are typically of size 3, 5 or 7. Therefore,
we conducted three sets of comparison experiments for both datasets. Figure 9 shows
the results.
Figure 9. Accuracy of different combinations of convolution kernels in the FsSE module for both datasets. (a) The result of the Indian Pines dataset; (b) The result of the Pavia University dataset.

From Figure 9, it can be seen that for extracting the correlation between the spectra, a high accuracy has been achieved by using the convolution of two scales. Because of the strong continuity existing between spectra, the use of two convolution kernels with little difference in size is enough to ensure that the model can extract sufficient spectral correlation. Therefore, our optimization of the SeKG module is effective. The best combination of convolution kernels for the Indian Pines dataset is 1 × 1 × 5 and 1 × 1 × 7 because the model has the highest accuracy with this combination. For the Pavia University dataset, the accuracy of the convolution kernel combinations 1 × 1 × 3 & 1 × 1 × 7 and 1 × 1 × 5 & 1 × 1 × 7 is the same. Thus, we need to analyze these three combinations from other perspectives. Table 1 shows the training time and the number of parameters for the three combinations on the Pavia University dataset.

Table 1. Training time and number of parameters required for three different combinations of convolutional kernel sizes on the Pavia University dataset.

Model   Training Time/s   Total Params
3&5     1570              249,816
3&7     1455              249,818
5&7     1160              249,820

From Table 1, it is obvious that the combination 5&7 takes the least time to train among the three combinations. Although the number of parameters is the largest among the three, the number of extra parameters is very small. Therefore, in the case that the two combinations of 3&7 and 5&7 have the same accuracy for the Pavia University dataset, the less time-consuming combination of 1 × 1 × 5 and 1 × 1 × 7 is chosen.
4.2.2. Impact of Patch Size on Model Accuracy
Because the image in the public hyperspectral dataset is only one piece, if the whole image is input into the network for training, it is not only disadvantageous for the network training, but also the amount of data is far from enough. Therefore, we need to sample the image and send the sampled part into the network for training, which can both reduce the training time of the model and increase the training volume of the model. Taking the Indian Pines dataset as an example, the size of the original image is 145 × 145 × 200, and we select a block of M × M × 200 pixels to input into the model for training. The choice of patch size, however, can have a significant impact on model accuracy and training time as well. If the selected size is too small, the model will not be trained properly; if the selected size is too large, the model training time will increase. Therefore, in order to select the best patch size, we chose seven different sizes of sampling windows for comparison. Figure 10 shows the accuracy for the two datasets when faced with different patch sizes.
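A simple way to realize the sampling described above (cutting M × M × 200 blocks out of the single labeled cube) is sketched below; centering each patch on a labeled pixel and mirror-padding the image borders are assumptions of this sketch, not details given in the paper.

```python
import numpy as np

def extract_patches(cube: np.ndarray, gt: np.ndarray, patch_size: int):
    """Cut M x M x bands blocks out of a single HSI cube (e.g. 145 x 145 x 200
    for Indian Pines). Each patch is centered on a labeled pixel; the borders
    are mirror-padded so edge pixels also get full-size patches."""
    m = patch_size // 2
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode="reflect")
    patches, labels = [], []
    for r in range(cube.shape[0]):
        for c in range(cube.shape[1]):
            if gt[r, c] == 0:          # 0 = unlabeled background
                continue
            patches.append(padded[r:r + patch_size, c:c + patch_size, :])
            labels.append(gt[r, c] - 1)
    return np.stack(patches), np.array(labels)

# Example with the patch size found best for Indian Pines in Figure 10a:
# X, y = extract_patches(indian_pines_cube, indian_pines_gt, patch_size=11)
```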
Figure 10. Accuracy of different patch sizes for two datasets. (a) Results for different patch sizes for the Indian Pines dataset. (b) Results for different patch sizes for the Pavia University dataset.

From Figure 10a, we can observe that the model has the highest accuracy in classifying the Indian Pines dataset when the size of the sampling is increased to 11, and then the accuracy decreases as the patch size increases. From Figure 10b, we can see that the accuracy of the model is highest when the patch size is 5. After that, the accuracy of the model on the Pavia University dataset decreases when the patch size continues to increase. Also, we know that the patch size not only affects the accuracy of the model, but also has an impact on the training time of the model. Table 2 shows the training time of the model with seven patch sizes.

Table 2. Training time for seven patch sizes for the two datasets.

Dataset: Indian Pines              Dataset: Pavia University
Patch Size   Training Time/s       Patch Size   Training Time/s
5            208                   5            703
7            293                   7            962
9            284                   9            865
11           620                   11           1160
13           688                   13           1779
15           1073                  15           1887
17           1021                  17           2223

Table 2 clearly shows the relationship between the training time and the patch size. With increasing size, the training time increases as well. For the Pavia University dataset, a patch size of 5 gives the best results and takes the least amount of time to train. For the Indian Pines dataset, although the training time is at its minimum when the patch size is set to 5, the accuracy is 7.296% lower than when the size is 11. Therefore, for the Indian Pines dataset, a patch size setting of 11 is optimal.
4.2.3. Impact of Stride Combinations on Model Accuracy
We already know that there is a certain degree of redundant information in the spectra of an HSI, which is the theoretical basis for the use of descending dimension methods such as PCA in numerous studies of HSI classification. In the same way, even if we weight the spectral information by the set FsSE module, the redundant information still exists, which requires us to find ways to make the 3D CNN model learn as much effective information as possible. Therefore, we designed SSFCNN to solve this problem. However, with different spectral strides, the extracted spectral features are different. To study the effect of this part on the model, we set up multiple combinations. Tables 3 and 4 show the accuracy and training time of these combinations on the two datasets.
Table 3. Accuracy and training time for different stride combinations on the Indian Pines dataset.
Combination of Strides   Training Time/s   Overall Accuracy/%
1_3&1_3                  585               88.585
1_3&1_5                  620               90.125
1_3&1_7                  506               88.770
1_5&1_3                  895               88.737
1_5&1_5                  410               88.379
1_5&1_7                  736               87.902
1_7&1_3                  617               88.444
1_7&1_5                  464               87.458
1_7&1_7                  474               88.466

Table 4. Accuracy and training time for different stride combinations on the Pavia University dataset.

Combination of Strides   Training Time/s   Overall Accuracy/%
1_3&1_3                  966               95.343
1_3&1_5                  703               95.558
1_3&1_7                  991               95.979
1_5&1_3                  819               95.660
1_5&1_5                  771               95.701
1_5&1_7                  1028              95.984
1_7&1_3                  769               95.522
1_7&1_5                  754               96.044
1_7&1_7                  897               95.561

From Table 3, the combination of Layer1 and Layer2 of the model is 1_3 and 1_5. The model has the best classification effect on the Indian Pines dataset, and the accuracy rate is generally 1%–2% higher compared with other combinations. Although the training time is a bit higher than for other combinations, it is still within an acceptable range. The accuracy on the Pavia University dataset can be analyzed by Table 4. It can be seen that the accuracies of different combinations are close. Figure 11 shows the accuracies more visually.

Figure 11. Accuracy of different stride combinations on the Pavia University dataset.

From the training time, when the combination is 1_7 and 1_5, the training time is the second lowest among all combinations, and the accuracy is also the highest. However, due to the large amount of data in the Pavia University dataset, the training time of the model is increased compared to the Indian Pines dataset. In summary, we use the combination 1_3&1_5 for the Indian Pines dataset and the combination 1_7&1_5 for the Pavia University dataset.
4.3. Comparison with Other Baselines

4.3.1. Baseline
In order to verify the advantages of the models in this paper, we have selected some mainstream models in the field of HSI classification for comparison. The implementation details of the comparison models are as follows:
(1) SVM: The SVM model in this paper used the radial basis function (RBF kernel), which
classifies by raw spectral features. We implemented the model using the SVM function
in the Sklearn module (a minimal configuration sketch is given after this list).
(2) ANN: The original spectral features are classified by an artificial neural network
(ANN), which contains four fully connected layers and a dropout layer, and was
trained with a learning rate of 0.0001 using the Adam algorithm.
(3) 1D CNN: We used the same 1D CNN structure as in [24], Pytorch to implement the
model and the stochastic gradient descent algorithm to train the model with a learning
rate of 0.01.
(4) 3D CNN: A structure proposed in [40] was used for the 3D CNN model, which is a
conventional structure consisting of three convolution-pooling layers and one fully
connected layer. The model was implemented in Pytorch and trained with a learning
rate of 0.003 using the stochastic gradient descent algorithm.
(5) Hamida (3D CNN + 1D classifier) [47]: We implemented the model in Pytorch, where
we extracted a 5 × 5 × 200 cube from the image as an input to the model. The
characteristic of the model is that it utilizes one-dimensional convolution instead of
the usual pooling method and finally utilizes one-dimensional convolution instead of
a fully connected layer. The model was trained with a learning rate of 0.01 using the
stochastic gradient descent algorithm.
(6) HybridSN: The model used the specific structure proposed in [48], and the model was
implemented in Pytorch. The patch size is 25 × 25. The model contains a total of four
convolutional layers and two fully connected layers, where the four convolutional
layers include three 3D convolutional layers and one 2D convolutional layer, with the
3D convolutional layer for learning spatial-spectral features and the 2D convolutional
layer for learning spatial features.
(7) RNN: We used an RNN model for HSI classification, which is similar to [31]. We
replaced the activation function with a tanh function and implemented the model
in Pytorch.
(8) SpectralFormer (SF): We implemented the model directly using the model code pro-
vided in [49]. The model is an improvement of Transformer with the addition of two
new modules, GSE and CAF, in order to improve the detail-capturing capacity of
subtle spectral discrepancies and enhance the information transitivity between layers,
respectively. We implemented it in Pytorch.
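As an example of how baseline (1) can be reproduced, the following sketch builds an RBF-kernel SVM on raw per-pixel spectra with scikit-learn; the placeholder data, the added feature standardization and the C/gamma values are assumptions, since the paper only states that the RBF kernel and the Sklearn SVM function were used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder arrays standing in for raw per-pixel spectra (n_pixels, n_bands)
# and their class labels; in practice these come from the labeled HSI pixels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 200)), rng.integers(0, 16, size=1000)
X_test, y_test = rng.normal(size=(200, 200)), rng.integers(0, 16, size=200)

# RBF-kernel SVM on raw spectral features; C and gamma are assumed values.
svm_baseline = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=100, gamma="scale"))
svm_baseline.fit(X_train, y_train)
print("Overall accuracy:", svm_baseline.score(X_test, y_test))
```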
Class Name [F1 Scores (%)]   SVM   RNN   ANN   1D CNN   SF   3D CNN   Hamida   HybridSN   ESFNet
1. Alfalfa 36.1 8.7 82.1 0.0 10.9 95.3 74.2 100.0 75.8
2. Corn-notill 74.8 63.3 79.3 52.8 69.2 90.2 89.4 93.7 93.5
3. Corn-mintill 72.6 44.9 70.4 31.1 68.7 54.5 77.9 59.2 85.6
4. Corn 64.4 44.1 71.0 2.7 64.5 55.4 76.3 46.2 88.3
5. Grass-pasture 86.5 75.6 91.0 8.6 84.0 70.2 92.5 73.6 93.0
6. Grass-trees 93.8 89.3 93.0 76.1 91.6 97.2 98.3 99.5 97.2
7. Grass-pasture-mowed 85.7 60.0 95.8 0.0 86.3 93.6 35.3 83.7 93.6
8. Hay-windrowed 94.7 91.7 97.6 87.3 93.3 67.7 96.8 74.1 95.8
9. Oats 52.6 0.0 71.4 0.0 66.7 80.0 86.5 94.7 50.0
10. Soybean-notill 73.2 56.1 77.4 34.7 71.9 83.9 87.6 87.0 91.4
11. Soybean-mintill 80.4 67.0 82.6 66.6 78.9 90.4 90.3 92.2 94.1
12. Soybean-clean 82.1 57.2 74.8 15.8 68.6 75.0 81.0 79.2 88.8
13. Wheat 93.7 90.2 96.2 81.9 95.7 100.0 99.2 97.6 99.5
14. Woods 91.8 90.2 94.5 82.2 91.5 82.5 95.5 85.0 97.9
15. B-G-T-D 62.8 56.1 69.5 12.9 53.5 37.5 76.5 39.5 68.5
16. Stone-Steel-Towers 91.0 81.0 86.7 90.3 90.3 74.1 97.6 91.1 97.6
OA(%) 81.0 68.9 83.2 59.6 78.3 72.9 88.5 76.3 90.1
Kappa × 100 0.783 0.645 0.808 0.522 0.751 0.698 0.869 0.735 0.888
Class Name [F1 Scores (%)]   SVM   RNN   ANN   1D CNN   SF   3D CNN   Hamida   HybridSN   ESFNet
1. Asphalt 91.5 90.5 95.8 90.2 92.9 90.8 97.2 93.5 97.9
2. Meadows 95.1 95.3 97.1 91.2 94.3 83.9 95.5 86.8 96.9
3. Gravel 79.3 73.3 85.8 56.3 73.1 83.4 93.0 87.2 93.4
4. Trees 92.8 93.4 96.3 90.5 92.7 93.6 96.7 94.6 98.4
5. Painted metal sheets 99.2 99.5 99.6 99.1 99.4 100.0 100.0 99.8 100.0
6. Bare Soil 84.5 88.8 93.4 70.3 83.0 94.5 94.6 100.0 99.6
7. Bitumen 71.2 73.5 91.8 80.1 84.7 95.4 95.2 100.0 96.6
8. Self-Blocking Bricks 85.8 80.6 88.5 81.6 78.8 98.2 95.5 98.8 96.4
9. Shadows 99.9 99.6 99.6 99.9 99.9 97.9 99.9 98.8 100.0
OA(%) 91.2 90.5 94.7 86.3 89.9 83.2 94.5 86.2 96.1
Kappa × 100 0.882 0.875 0.930 0.816 0.866 0.792 0.927 0.828 0.948
In Figure 12, the spectral curves of the classes Oats and Grass_trees are very close
to each other. In the band range of 50–100, the spectral curves of these two classes of
features almost overlap. Reflecting on the specific classification effect, our model classified
50% of the class Oats into the class Grass_trees, which can be seen in Figure 16j shown
later. The reason for this is that the model in this paper learns some spatial features while
enhancing the spectral learning ability. However, as we do not emphasize the learning
of spatial features, coupled with the very small number of class Oats in the training set,
the final features of Oats learned by our model are closer to those of Grass_trees, which
led to misclassification. In contrast, HybridSN was designed with a convolutional layer
dedicated to extracting spatial features. Therefore, the classification of such samples has
some advantages. ESFNet, however, has enhanced its ability to learn spectral features,
enabling it to gain an advantage in the classification of most categories. The reason is
that these two fusion operations can effectively extract the effective features of the sample
spectrum so that the model can be trained to achieve better results.
Although our model has average results on very few categories, it can stay ahead in most of the categories, which means that by our design, we can make our model learn enough features in most categories and make the model learn more complex spectral features by fusing the results of different strides, finally obtaining excellent classification results. For hyperspectral image classification, we are more concerned with performance in overall accuracy and performance in most categories, and our method is ahead of the other methods.
Figure 13 shows the training loss and validation accuracy of seven deep learning models on the Indian Pines dataset and the Pavia University dataset. Through these graphs, we can see that the 1D CNN model converges the slowest, our model converges the fastest on the Indian Pines dataset, and HybridSN converges the fastest on the Pavia University dataset. The validation accuracy of HybridSN is the highest of all the models, but when combined with the final test accuracy, it shows a certain degree of overfitting. There are two reasons for this situation: one is that the network layers of HybridSN are deeper compared to other networks, and the other is that the number of training samples is smaller. Although HybridSN is able to extract both spatial and spectral features, the smaller number of training samples makes the model not learn sufficiently, while the deeper network layers aggravate this problem. HybridSN had faster convergence and higher validation accuracy, but the model was still overfitted due to the two problems mentioned above. Our model performs well in terms of training loss and validation accuracy, and its convergence speed is also fast. Combining the validation accuracy and testing accuracy, our model does not have a serious problem of overfitting and has good generalization performance. In order to better show the differences between models, we plotted the confusion matrix of the nine models on two types of datasets as a significance test. The results are shown in Figures 14 and 15.

Figure 16. The ground truth and classification maps of the nine models on the Indian Pines dataset. (a) The ground truth. (b) SVM. (c) RNN. (d) ANN. (e) 1D CNN. (f) SF. (g) 3D CNN. (h) Hamida. (i) HybridSN. (j) ESFNet.
To determine the significant differences between the models, we used the Friedman
test [50] for statistical significance. We compared the significance of the models among the
categories in the two datasets.
In the Friedman test, we used the chi-square distribution to approximate the Friedman
test statistic. We calculated the ranking of the models in the above experiments in terms
of F1 scores in each category of the datasets. The results are shown in Tables 7 and 8. We
assume that there is no difference between the models, and thus the $R_j^2$ should be equal. Based on the following equation and the data in Tables 7 and 8, the value of the Friedman test statistic can be calculated.

$$X_{r,1}^2 = \frac{12}{n_1 k(k+1)} \sum_{j=1}^{k} R_j^2 - 3 n_1 (k+1) = 71.825, \quad \text{dataset: Indian Pines}$$
$$X_{r,2}^2 = \frac{12}{n_2 k(k+1)} \sum_{j=1}^{k} R_j^2 - 3 n_2 (k+1) = 39.489, \quad \text{dataset: Pavia University} \qquad (6)$$

Figure 17. The ground truth and classification maps of the nine models on the Pavia University dataset. (a) The ground truth. (b) SVM. (c) RNN. (d) ANN. (e) 1D CNN. (f) SF. (g) 3D CNN. (h) Hamida. (i) HybridSN. (j) ESFNet.

In Equation (6), $n_i$ is the number of categories in the ith dataset, and $R_j$ indicates the total rank of the jth model in Tables 7 and 8.
Class Name SVM RNN ANN 1D CNN SF 3D CNN Hamida HybridSN ESFNet
1. Alfalfa 6 8 3 9 7 2 5 1 4
2. Corn-notill 6 8 5 9 7 3 4 1 2
3. Corn-mintill 3 8 4 9 5 7 2 6 1
4. Corn 5 8 3 9 4 6 2 7 1
5. Grass-pasture 4 6 3 9 5 8 2 7 1
6. Grass-trees 5 8 6 9 7 3.5 2 1 3.5
7. Grass-pasture-mowed 5 7 1 9 4 2.5 8 6 2.5
8. Hay-windrowed 4 6 1 7 5 9 2 8 3
9. Oats 6 8.5 4 8.5 5 3 2 1 7
10. Soybean-notill 6 8 5 9 7 4 2 3 1
11. Soybean-mintill 6 8 5 9 7 3 4 2 1
12. Soybean-clean 2 8 6 9 7 5 3 4 1
13. Wheat 7 8 5 9 6 1 3 4 2
14. Woods 4 6 3 9 5 8 2 7 1
15. B-G-T-D 4 5 2 9 6 8 1 7 3
16. Stone-Steel-Towers 4 8 7 5.5 5.5 9 1.5 3 1.5
Total Rank 77 118.5 63 138 92.5 82 45.5 68 35.5
Table 8. Rankings of the nine models by F1 score in each category of the Pavia University dataset.
Class Name SVM RNN ANN 1D CNN SF 3D CNN Hamida HybridSN ESFNet
1. Asphalt 6 8 3 9 5 7 2 4 1
2. Meadows 5 4 1 7 6 9 3 8 2
3. Gravel 6 7 4 9 8 5 2 3 1
4. Trees 7 6 3 9 8 5 2 4 1
5. Painted metal sheets 8 6 5 9 7 2 2 4 2
6. Bare Soil 7 6 5 9 8 4 3 1 2
7. Bitumen 9 8 5 7 6 3 4 1 2
8. Self-Blocking Bricks 6 8 5 7 9 2 4 1 3
9. Shadows 3.5 6.5 6.5 3.5 3.5 9 3.5 8 1
Total Rank 57.5 59.5 37.5 69.5 60.5 46 25.5 34 15
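As a quick numerical check of Equation (6), the statistic can be recomputed directly from the "Total Rank" rows of Tables 7 and 8 (k = 9 models; n1 = 16 categories for Indian Pines and n2 = 9 for Pavia University). The short Python sketch below is illustrative only; the function and variable names are ours and are not part of the paper's code.

def friedman_statistic(rank_sums, n):
    # Friedman chi-square statistic from the per-model rank sums over n categories,
    # following Equation (6).
    k = len(rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)

indian_pines = [77, 118.5, 63, 138, 92.5, 82, 45.5, 68, 35.5]        # Total Rank, Table 7
pavia_university = [57.5, 59.5, 37.5, 69.5, 60.5, 46, 25.5, 34, 15]  # Total Rank, Table 8

print(round(friedman_statistic(indian_pines, n=16), 3))       # 71.825
print(round(friedman_statistic(pavia_university, n=9), 3))    # 39.489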
5. Conclusions
In this paper, we proposed a new enhanced spectral fusion network (ESFNet) for
hyperspectral image classification. The new model can improve the classification accuracy
of hyperspectral images by targeted learning based on the characteristics of hyperspectral
images. Firstly, we optimized the SeKG module and termed the optimized module the FsSE
module. The FsSE module is designed to enhance the spectral information of hyperspectral
images and to maximally preserve the spectral continuity. Secondly, to enable the classification model to learn as much effective spectral information as possible, we designed the SSFCNN model, which fuses features extracted with different spectral strides. Different strides filter out redundant features to different degrees, and fusing the resulting feature maps, which carry different levels of learning, allows the outputs of the different strides to complement each other. In addition, because there are currently few 3D CNN networks with complex structures, we hope that our proposed SSFCNN can provide ideas for designing more complex 3D CNN networks in the future. In the experiments of this paper, we used two public hyperspectral datasets. Through a series of experiments, we demonstrated that the proposed ESFNet significantly improves classification performance by enhancing the model's ability to learn spectral features. In future work, we will explore better feature fusion methods and further improve the classification accuracy for hyperspectral images.
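To make the two components summarized above more concrete, the short PyTorch sketch below illustrates the general idea of spectral re-weighting followed by 3D convolutions with different spectral strides whose outputs are fused. All module names, layer sizes, kernel shapes, and the concatenation-based fusion are our own assumptions chosen for illustration; this is not the exact FsSE or SSFCNN implementation.

# Illustrative sketch only: SE-style spectral re-weighting followed by the fusion
# of 3D convolutions with different spectral strides. Sizes and the fusion rule
# are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class SpectralSE(nn.Module):
    # Squeeze-and-excitation over the spectral axis of a (B, 1, S, H, W) cube.
    def __init__(self, bands, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(bands, bands // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(bands // reduction, bands),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, 1, S, H, W)
        w = x.mean(dim=(1, 3, 4))                # squeeze channel/spatial dims -> (B, S)
        w = self.fc(w).view(x.size(0), 1, -1, 1, 1)
        return x * w                             # band-wise re-weighted cube


class StrideFusion3D(nn.Module):
    # Two 3D conv branches that differ only in spectral stride; their outputs are
    # aligned to a common spectral depth and fused by concatenation.
    def __init__(self, out_channels=8):
        super().__init__()
        self.branch1 = nn.Conv3d(1, out_channels, kernel_size=(7, 3, 3),
                                 stride=(1, 1, 1), padding=(3, 1, 1))
        self.branch2 = nn.Conv3d(1, out_channels, kernel_size=(7, 3, 3),
                                 stride=(2, 1, 1), padding=(3, 1, 1))
        self.align = nn.AdaptiveAvgPool3d((16, None, None))   # common spectral depth

    def forward(self, x):
        f1 = self.align(torch.relu(self.branch1(x)))
        f2 = self.align(torch.relu(self.branch2(x)))
        return torch.cat([f1, f2], dim=1)        # complementary stride features


# Toy usage: a batch of two 11x11 patches with 30 spectral bands.
cube = torch.randn(2, 1, 30, 11, 11)
features = StrideFusion3D()(SpectralSE(bands=30)(cube))
print(features.shape)                            # torch.Size([2, 16, 16, 11, 11])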
Author Contributions: All authors have made great contributions to the work. Conceptualization,
J.Z. (Junbo Zhou) and S.Z.; software, J.Z. (Junbo Zhou); validation, J.Z. (Junbo Zhou), S.Z. and Z.X.;
formal analysis, J.Z. (Junbo Zhou), J.Z. (Jinbo Zhou) and H.L.; investigation, Z.K.; writing—original
draft preparation, J.Z. (Junbo Zhou) and S.Z.; writing—review and editing, J.Z. (Jinbo Zhou), Z.K.
and Z.X. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Hubei Province Natural Science Foundation for Distin-
guished Young Scholars, grant No. 2020CFA063, and funded by excellent young and middle-aged
scientific and technological innovation teams in colleges and universities of Hubei Province, grant
No. T2021009.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Goetz, A.F.H. Three decades of hyperspectral remote sensing of the Earth: A personal view. Remote Sens. Environ. 2009, 113,
S5–S16. [CrossRef]
2. Nalepa, J. Recent Advances in Multi- and Hyperspectral Image Analysis. Sensors 2021, 21, 6002. [CrossRef] [PubMed]
3. Kemker, R.; Kanan, C. Self-Taught Feature Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017,
55, 2693–2705. [CrossRef]
4. Lu, B.; Dao, P.D.; Liu, J.G.; He, Y.H.; Shang, J.L. Recent Advances of Hyperspectral Imaging Technology and Applications in
Agriculture. Remote Sens. 2020, 12, 2659. [CrossRef]
5. Kruse, F.A. Identification and mapping of minerals in drill core using hyperspectral image analysis of infrared reflectance spectra.
Int. J. Remote Sens. 1996, 17, 1623–1632. [CrossRef]
6. Wang, Z.M.; Du, B.; Zhang, L.F.; Zhang, L.P.; Jia, X.P. A Novel Semisupervised Active-Learning Algorithm for Hyperspectral
Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3071–3083. [CrossRef]
7. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September
2014; pp. 740–755.
8. Zeng, S.; Wang, Z.Y.; Gao, C.J.; Kang, Z.; Feng, D.G. Hyperspectral Image Classification With Global-Local Discriminant Analysis
and Spatial-Spectral Context. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 5005–5018. [CrossRef]
9. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
10. Blanzieri, E.; Melgani, F. Nearest neighbor classification of remote sensing images with the maximal margin principle. IEEE Trans.
Geosci. Remote Sens. 2008, 46, 1804–1811. [CrossRef]
11. Yager, R.R. An extension of the naive Bayesian classifier. Inf. Sci. 2006, 176, 577–588. [CrossRef]
12. Zhang, Y.X.; Liu, K.; Dong, Y.N.; Wu, K.; Hu, X.Y. Semisupervised Classification Based on SLIC Segmentation for Hyperspectral
Image. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1440–1444. [CrossRef]
13. Shinde, P.P.; Shah, S. A review of machine learning and deep learning applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; pp. 1–6. [CrossRef]
14. Zhu, X.X.; Tuia, D.; Mou, L.C.; Xia, G.S.; Zhang, L.P.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive
Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [CrossRef]
15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM
2017, 60, 84–90. [CrossRef]
16. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef] [PubMed]
17. Ma, W.P.; Zhang, J.; Wu, Y.; Jiao, L.C.; Zhu, H.; Zhao, W. A Novel Two-Step Registration Method for Remote Sensing Images
Based on Deep and Local Features. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4834–4843. [CrossRef]
18. Ma, J.Y.; Tang, L.F.; Fan, F.; Huang, J.; Mei, X.G.; Ma, Y. SwinFusion: Cross-domain Long-range Learning for General Image
Fusion via Swin Transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217. [CrossRef]
19. Zeng, N.Y.; Wang, Z.D.; Zhang, H.; Kim, K.E.; Li, Y.R.; Liu, X.H. An Improved Particle Filter With a Novel Hybrid Proposal
Distribution for Quantitative Analysis of Gold Immunochromatographic Strips. IEEE Trans. Nanotechnol. 2019, 18, 819–829.
[CrossRef]
20. Rawat, W.; Wang, Z.H. Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Comput.
2017, 29, 2352–2449. [CrossRef]
21. Xu, H.; Ma, J.Y.; Jiang, J.J.; Guo, X.J.; Ling, H.B. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans. Pattern
Anal. Mach. Intell. 2022, 44, 502–518. [CrossRef]
22. Chen, Y.S.; Lin, Z.H.; Zhao, X.; Wang, G.; Gu, Y.F. Deep Learning-Based Classification of Hyperspectral Data. IEEE J. Sel. Top.
Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [CrossRef]
23. Lv, W.J.; Wang, X.F. Overview of Hyperspectral Image Classification. J. Sens. 2020, 2020, 4817234. [CrossRef]
24. Hu, W.; Huang, Y.Y.; Wei, L.; Zhang, F.; Li, H.C. Deep Convolutional Neural Networks for Hyperspectral Image Classification.
J. Sens. 2015, 2015, 258619. [CrossRef]
25. Imani, M.; Ghassemian, H. An overview on spectral and spatial information fusion for hyperspectral image classification: Current
trends and challenges. Inf. Fusion 2020, 59, 59–83. [CrossRef]
26. Luo, F.L.; Zou, Z.H.; Liu, J.M.; Lin, Z.P. Dimensionality Reduction and Classification of Hyperspectral Image via Multistructure
Unified Discriminative Embedding. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [CrossRef]
27. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E.H. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42,
2011–2023. [CrossRef] [PubMed]
28. Zhao, Q.; Cai, X.; Chen, C.; Lv, L.; Chen, M. Commented content classification with deep neural network based on attention
mechanism. In Proceedings of the 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control
Conference (IAEAC), Chongqing, China, 25–26 March 2017; pp. 2016–2019.
29. Ma, W.P.; Ma, H.X.; Zhu, H.; Li, Y.T.; Li, L.W.; Jiao, L.C.; Hou, B. Hyperspectral image classification based on spatial and spectral
kernels generation network. Inf. Sci. 2021, 578, 435–456. [CrossRef]
30. Chen, Y.S.; Zhao, X.; Jia, X.P. Spectral-Spatial Classification of Hyperspectral Data Based on Deep Belief Network. IEEE J. Sel. Top.
Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [CrossRef]
31. Mou, L.C.; Ghamisi, P.; Zhu, X.X. Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci.
Remote Sens. 2017, 55, 3639–3655. [CrossRef]
32. Zhao, W.Z.; Du, S.H. Spectral-Spatial Feature Extraction for Hyperspectral Image Classification: A Dimension Reduction and
Deep Learning Approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554. [CrossRef]
33. Zhang, M.M.; Li, W.; Du, Q. Diverse Region-Based CNN for Hyperspectral Image Classification. IEEE Trans. Image Process. 2018,
27, 2623–2634. [CrossRef] [PubMed]
34. Guo, A.J.X.; Zhu, F. A CNN-Based Spatial Feature Fusion Algorithm for Hyperspectral Imagery Classification. IEEE Trans. Geosci.
Remote Sens. 2019, 57, 7170–7181. [CrossRef]
35. Yang, L.M.; Yang, Y.H.; Yang, J.H.; Zhao, N.Y.; Wu, L.; Wang, L.G.; Wang, T.R. FusionNet: A Convolution-Transformer Fusion
Network for Hyperspectral Image Classification. Remote Sens. 2022, 14, 4066. [CrossRef]
36. He, J.; Zhao, L.N.; Yang, H.W.; Zhang, M.M.; Li, W. HSI-BERT: Hyperspectral Image Classification Using the Bidirectional Encoder
Representation From Transformers. IEEE Trans. Geosci. Remote Sens. 2020, 58, 165–178. [CrossRef]
37. He, X.; Chen, Y.S.; Lin, Z.H. Spatial-Spectral Transformer for Hyperspectral Image Classification. Remote Sens. 2021, 13, 498.
[CrossRef]
38. Khotimah, W.N.; Bennamoun, M.; Boussaid, F.; Sohel, F.; Edwards, D. A High-Performance Spectral-Spatial Residual Network for
Hyperspectral Image Classification with Small Training Data. Remote Sens. 2020, 12, 3137. [CrossRef]
39. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In
Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 4489–4497.
40. Chen, Y.S.; Jiang, H.L.; Li, C.Y.; Jia, X.P.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral Images Based on
Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [CrossRef]
41. Ahmad, M.; Khan, A.M.; Mazzara, M.; Distefano, S.; Ali, M.; Sarfraz, M.S. A Fast and Compact 3-D CNN for Hyperspectral Image
Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [CrossRef]
42. Zhong, Z.L.; Li, J.; Luo, Z.M.; Chapman, M. Spectral-Spatial Residual Network for Hyperspectral Image Classification: A 3-D
Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858. [CrossRef]
43. Laban, N.; Abdellatif, B.; Ebeid, H.M.; Shedeed, H.A.; Tolba, M.F. Reduced 3-D Deep Learning Framework for Hyperspectral
Image Classification. In International Conference on Advanced Machine Learning Technologies and Applications; Springer: Cham,
Switzerland, 2020; pp. 13–22.
44. Shi, C.P.; Sun, J.W.; Wang, L.G. Hyperspectral Image Classification Based on Spectral Multiscale Convolutional Neural Network.
Remote Sens. 2022, 14, 1951. [CrossRef]
45. Diakite, A.; Jiangsheng, G.; Xiaping, F. Hyperspectral image classification using 3D 2D CNN. IET Image Process. 2021, 15,
1083–1092. [CrossRef]
46. Firat, H.; Hanbay, D. Classification of Hyperspectral Images Using 3D CNN Based ResNet50. In Proceedings of the 2021 29th
Signal Processing and Communications Applications Conference (SIU), Istanbul, Turkey, 9–11 June 2021; pp. 1–4.
47. Ben Hamida, A.; Benoit, A.; Lambert, P.; Ben Amar, C. 3-D Deep Learning Approach for Remote Sensing Image Classification.
IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434. [CrossRef]
48. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D-2-D CNN Feature Hierarchy for Hyperspectral
Image Classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 277–281. [CrossRef]
49. Hong, D.F.; Han, Z.; Yao, J.; Gao, L.R.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image
Classification With Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [CrossRef]
50. Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures; CRC Press: Boca Raton, FL, USA, 2003. [CrossRef]