A Lightweight Transformer Network For Hyperspectral Image Classification
Abstract—Transformer is a powerful tool for capturing long-range dependencies and has shown impressive performance in hyperspectral image (HSI) classification. However, such power comes with a heavy memory footprint and a huge computation burden. In this article, we propose two types of lightweight self-attention modules (a channel lightweight multihead self-attention (CLMSA) module and a position lightweight multihead self-attention (PLMSA) module) to reduce both memory and computation while associating each pixel or channel with global information. Moreover, we discover that transformers are ineffective in explicitly extracting local and multiscale features due to the fixed input size and tend to overfit when dealing with a small number of training samples. Therefore, a lightweight transformer (LiT) network, built with the proposed lightweight self-attention modules, is presented. LiT adopts convolutional blocks to explicitly extract local information in early layers and employs transformers to capture long-range dependencies in deep layers. Furthermore, we design a controlled multiclass stratified (CMS) sampling strategy to generate appropriately sized input data, ensure balanced sampling, and reduce the overlap of feature extraction regions between training and test samples. With appropriate training data, convolutional tokenization, and lightweight transformer blocks, LiT mitigates overfitting and enjoys both high computational efficiency and good performance. Experimental results on several HSI datasets verify the effectiveness of our design.

Index Terms—Deep learning (DL), hyperspectral image (HSI) classification, transformer.

Manuscript received 15 June 2023; accepted 14 July 2023. Date of publication 21 July 2023; date of current version 7 August 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 42101321 and Grant 42001319, in part by the Scientific Research Program of the Education Department of Shaanxi Province under Grant 21JK0762, and in part by the University–Industry Collaborative Education Program of Ministry of Education of China under Grant 220802313200859. (Corresponding author: Qingjiu Tian.)

Xuming Zhang and Qingjiu Tian are with the International Institute for Earth System Science and the Jiangsu Provincial Key Laboratory of Geographic Information Science and Technology, Nanjing University, Nanjing 210023, China (e-mail: [email protected]; [email protected]).

Yuanchao Su is with the Department of Remote Sensing, College of Geomatics, Xi'an University of Science and Technology, Xi'an 710054, China (e-mail: [email protected]).

Lianru Gao and Xingfa Gu are with the Key Laboratory of Computational Optical Imaging Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China (e-mail: [email protected]).

Lorenzo Bruzzone is with the Department of Information Engineering and Computer Science, University of Trento, 38050 Trento, Italy (e-mail: [email protected]).

Digital Object Identifier 10.1109/TGRS.2023.3297858

I. INTRODUCTION

HYPERSPECTRAL images (HSIs) contain tens or even hundreds of narrow and continuous spectral bands ranging from the visible to the infrared [1]. The abundant spectral information in HSIs enables object identification and detection at fine-grained spectral scales, making HSIs useful in a wide range of applications, such as forest monitoring [2], medical imaging [3], and urban development observation [4]. One of the fundamental tasks in these applications is classification, which involves assigning a specific category label to each pixel.

Traditional classifiers focus only on spectral information, including the support vector machine (SVM) [5], random forest [6], and logistic regression [7]. Complex scenarios and the spectral heterogeneity of objects require modeling spatial correlations. This can be achieved by different approaches, such as Gabor filtering [8], morphological profiles [9], the gray-level co-occurrence matrix [10], and 3-D wavelets [11]. As an alternative, kernel-based methods [12], [13] have also been introduced to capture the spatial information of HSIs. However, these traditional approaches rely heavily on prior knowledge and are limited to shallow features, resulting in poor generalization and robustness [14].

Deep learning (DL), an automatic feature learning technique, can learn richer representations than traditional approaches. Several DL techniques have been applied to HSI classification, such as deep belief networks [15], convolutional neural networks (CNNs) [16], graph convolutional networks (GCNs) [17], [18], and transformers [19]. Among these DL techniques, CNNs are the most widely used due to their inherent properties of local connectivity and weight sharing. These properties impose strong constraints on convolution weights that hardcode inductive biases into networks, thus leading to more sample-efficient and parameter-efficient models [20].
Many works [14], [21], [22], [23] have attempted to bring the power of CNNs to HSI classification. The existing methods mainly fall into two broad categories: patch-based classification and fully convolutional network (FCN)-based segmentation frameworks. Spectral–spatial DL networks, such as the spectral–spatial residual network (SSRN) [21], the dual-level deep spatial manifold representation network (SMRN) [24], and the spectral–spatial self-attention network (SSSAN) [14], follow the design rule of patch-based classification to facilitate feature learning and classifier training. Features are extracted from the spatial patch centered on the sample pixel and further processed to assign a specific category to the center pixel. However, redundant computation on overlapping regions between adjacent patches is inevitable, which severely hampers large-scale applications. In contrast, FCNs exhibit better accuracy–speed trade-offs. FCNs feed data into the network and perform feature extraction and pixel-to-pixel classification. Examples of these models are the spectral–spatial FCN (SSFCN) [22] and encoder–decoder architectures [23].
Although these CNN-based architectures can learn information from a larger receptive field by stacking multiple layers, they still lack global connectivity. Moreover, convolutional filter weights are usually fixed after training and cannot be dynamically adapted to different inputs [25].

Recently, transformers have shown promising results in various visual tasks, such as image classification [26] and semantic segmentation [27]. As a core component of transformers, self-attention has the properties of long-range modeling and adaptive spatial aggregation. Benefiting from flexible self-attention, transformers can learn more robust and accurate representations than CNNs from extensive data [28]. The vision transformer (ViT) [29] is the first pure transformer backbone for vision tasks. It replaces the inductive biases inherent in convolution with global processing driven by self-attention, demonstrating that large-scale training can trump inductive biases. Several studies have investigated the use of transformers for HSI classification. SpectralFormer [19] generates groupwise spectral embeddings by learning local sequence information from adjacent bands of HSIs. A spectral–spatial transformer network [30] employs consecutive spatial and spectral transformer blocks to learn spectral–spatial information from input patches. To fully exploit the abundant spectral information in HSIs, a two-branch pure transformer [31] uses a spectral transformer and a spatial transformer to learn spectral sequences and spatial features, respectively. A multilevel spectral–spatial transformer network [32] processes HSIs into sequences, learns feature representations with a pure transformer encoder, and then processes multilevel features with decoders to produce classification maps. Huang et al. [33] extended the original Swin transformer [34] to a 3-D structure and proposed a 3-D Swin transformer to explore the rich spatial–spectral information of HSIs. However, these transformer architectures are still inferior to their CNN counterparts when trained on small-scale datasets, as they lack proper inductive biases and, thus, require substantial data and computational resources to compensate [25]. Consequently, CNNs are still the preferred models for HSI classification, since they require less time, data, and memory for training, whereas they do not enjoy long-range modeling [35].

Hybrid architectures combining transformers and convolutions have received much attention in constructing lightweight, high-performance models [36]. Hyper-ES2T [37] embeds convolution layers before each spatial–spectral transformer block. In [38], 3-D convolution is embedded in a two-branch transformer to capture global–local dependencies in both the spectral and spatial domains. The grouped multi-attention network (GMA-Net) [39] is also a two-branch architecture, where one branch is responsible for spectral–spatial feature learning using CNN and multiattention modules, and the other for pixelwise spectral feature learning using convolutions. Some works [35], [40], [41] first employ convolution to perform shallow feature extraction and then use transformers to capture the global relationships between different tokens. The hyperspectral image transformer (HiT) [42] embeds convolution operations into the transformer structure. It uses 3-D convolution layers to produce local spatial–spectral information and then uses depthwise and pointwise convolution operations to encode spatial–spectral representations along the height, width, and spectral dimensions, respectively.

Although these transformer-based networks have shown impressive performance in HSI classification, the high dimensionality, together with limited labeled samples, still makes the classification task very challenging. There are still three main aspects that need to be addressed.

1) The existing transformer-based models mainly follow patch-based classification, which not only results in redundant computation but also hinders long-range spatial dependency modeling. Thus, an excellent approach to build effective architectures is to follow the FCN-based framework.
2) Self-attention in transformers incurs a heavy memory footprint and a huge computation burden, which limits real-time applications.
3) Transformers are ineffective in explicitly extracting local and multiscale features and are prone to overfitting with small numbers of training samples.

To overcome the above three drawbacks, we propose a lightweight transformer (LiT) network and a controlled multiclass stratified (CMS) sampling strategy. Specifically, LiT is proposed to merge and exploit the advantages of transformers (i.e., long-range dependence and input-adaptive weighting) and CNNs (i.e., local spatial modeling and inductive bias) for end-to-end pixel-to-pixel classification by following the FCN-based framework. LiT has four stages that generate feature maps at different scales. The first two stages deploy convolutional blocks to explicitly extract local information, and the last two stages employ transformer blocks to capture long-range dependencies. In LiT, two lightweight self-attention modules are designed to reduce both computation and memory while associating each pixel/channel with global information. The CMS sampling strategy is designed to provide appropriately sized inputs and ensure balanced sampling while reducing the overlap of feature extraction regions between training and test samples. The proposed LiT combined with the CMS sampler can mitigate overfitting. Superior performance on three HSI datasets demonstrates the effectiveness of our approaches.

The main contributions of this study are summarized as follows.

1) Extending transformer-based research to FCN-based HSI classification by introducing LiT, which incorporates the merits of transformers and CNNs to efficiently model image hierarchies at local and long range, achieving high performance and fast inference speed.
2) Developing two simple yet effective self-attention modules, the position lightweight multihead self-attention (PLMSA) module and the channel lightweight multihead self-attention (CLMSA) module, to construct compact long-range dependencies in the spatial and spectral dimensions, respectively, resulting in performance boosts with less computational and memory cost.
3) Designing a CMS sampling strategy to provide appropriately sized input data and ensure balanced sampling. It can also significantly reduce the overlap of feature extraction regions between training and test samples, providing a more objective benchmark for evaluating method performance. An illustrative sampling sketch follows this list.
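To make the sampling goals in 3) concrete, the following is a minimal sketch of a multiclass stratified sampler in Python/NumPy. It only illustrates the stated goals (balanced per-class draws whose feature-extraction windows overlap as little as possible); the per-class quota, the minimum inter-sample distance, and the function name are our assumptions, not the CMS settings detailed in Section III.

```python
import numpy as np

def stratified_sample(label_map, n_per_class=50, min_dist=16, seed=0):
    """Illustrative stratified sampler (not the exact CMS algorithm):
    draws a balanced number of pixels per class while rejecting
    candidates whose feature-extraction windows would overlap those of
    already-selected training pixels."""
    rng = np.random.default_rng(seed)
    train = []                                   # chosen (row, col) pairs
    for c in np.unique(label_map):
        if c == 0:                               # 0 = unlabeled background
            continue
        rows, cols = np.nonzero(label_map == c)
        picked = 0
        for i in rng.permutation(len(rows)):
            r, q = int(rows[i]), int(cols[i])
            # Chebyshev distance >= min_dist keeps windows of roughly
            # min_dist x min_dist pixels around training samples disjoint.
            if all(max(abs(r - r2), abs(q - q2)) >= min_dist for r2, q2 in train):
                train.append((r, q))
                picked += 1
            if picked == n_per_class:
                break
    return np.array(train)  # the remaining labeled pixels act as the test set
```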
The remainder of this study is organized as follows. Section II reviews transformer-related work and classical CNN-based networks for HSI classification. Section III describes the proposed LiT network and CMS sampling strategy in detail. Extensive experiments with ablation studies are performed and discussed in Section IV. Finally, some concluding remarks and a brief outlook on future research are presented in Section V.

II. RELATED WORKS

A. Vision Transformer

ViT [29] provides an alternative design paradigm to CNNs. It reshapes the input image into a sequence of flattened patches, which are then projected into a sequence of patch embeddings using a trainable linear projection. These patch embeddings append a class token and then incorporate position embeddings to obtain a sequence of input tokens $z_0$.

The transformer encoder, consisting of $L$ transformer layers, is applied to $z_0$ to learn interpatch representations. A transformer layer consists of a multihead self-attention (MSA) block (1) followed by a feed-forward network (FFN) (2). Layer normalization (LN) is applied before each block, and a shortcut connection is added after each block. Given the input $z_{l-1}$, a transformer layer can be written as follows:
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \tag{1}$$

$$z_l = \mathrm{FFN}(\mathrm{LN}(z'_l)) + z'_l \tag{2}$$

where $l = 1, \ldots, L$. The MSA module is defined by considering $h$ "heads." Specifically, the input $x_{\mathrm{in}}$ is divided equally into $h$ "heads" by channel (3), the self-attention function is applied to each "head" (4), and the $h$ resulting sequences are concatenated (5)

$$x_{\mathrm{in}} \rightarrow \left[x_{\mathrm{in}}^{1}, \ldots, x_{\mathrm{in}}^{h}\right] \tag{3}$$

$$h_i\left(x_{\mathrm{in}}^{i}\right) = \mathrm{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k^{i}}}\right) V_i \tag{4}$$

$$\mathrm{MSA}(x_{\mathrm{in}}) = \mathrm{Concat}\left(h_1\left(x_{\mathrm{in}}^{1}\right), \ldots, h_h\left(x_{\mathrm{in}}^{h}\right)\right) \tag{5}$$

where $Q_i = x_{\mathrm{in}}^{i} W_Q^{i}$, $K_i = x_{\mathrm{in}}^{i} W_K^{i}$, and $V_i = x_{\mathrm{in}}^{i} W_V^{i}$; $W_Q^{i}$, $W_K^{i}$, and $W_V^{i}$ are the learnable projection matrices that project $x_{\mathrm{in}}^{i}$ into different feature spaces, $i = 1, \ldots, h$. The FFN consists of two linear layers separated by an activation function, i.e.,

$$Y = f(Z W_1) W_2 \tag{6}$$

where $W_1 \in \mathbb{R}^{c \times \gamma c}$, $W_2 \in \mathbb{R}^{\gamma c \times c}$, $f(\cdot)$ denotes an activation function, and $Y$ is the output of the FFN. $c$ is the channel dimension of $Z$, and $\gamma$ is the dimension expansion ratio, usually set to 4. The bias term is omitted for simplicity.
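For concreteness, the following is a minimal PyTorch sketch of one such pre-LN transformer layer implementing (1)–(6). The class name and hyperparameters are ours, and the output projection after head concatenation is a common addition not written out in (5).

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Pre-LN transformer layer per (1)-(6): MSA and FFN blocks, each
    preceded by LayerNorm and wrapped in a shortcut connection."""
    def __init__(self, dim, heads=4, gamma=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dk = heads, dim // heads
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # W_Q, W_K, W_V for all heads
        self.proj = nn.Linear(dim, dim, bias=False)      # output projection (common addition)
        self.ffn = nn.Sequential(                        # (6): Y = f(Z W1) W2
            nn.Linear(dim, gamma * dim, bias=False),
            nn.GELU(),
            nn.Linear(gamma * dim, dim, bias=False),
        )

    def forward(self, z):                                # z: (batch, n_tokens, dim)
        b, n, d = z.shape
        x = self.ln1(z)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (3): split channels into heads.
        q, k, v = (t.view(b, n, self.heads, self.dk).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.dk**0.5  # (4): scaled dot-product
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n, d)       # (5): concatenate heads
        z = z + self.proj(out)                           # (1): shortcut after MSA
        return z + self.ffn(self.ln2(z))                 # (2): shortcut after FFN
```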
ViT is computationally intensive, challenging to optimize, and requires extensive data to avoid overfitting due to the absence of inductive bias [25]. Various design strategies have been explored to incorporate the advantages of CNNs into transformer models to improve performance. To generate hierarchical representations and reduce computational cost, the hierarchical visual transformer (HVT) [43] gradually pools visual tokens to reduce the sequence length as the layers go deeper, which is similar to downsampling in CNNs. CoAtNet [25] improves the generalization and capacity of the model by naturally unifying depthwise convolution and self-attention through simple relative attention. Some works [25], [44] apply convolution before transformer layers to generate richer tokens and preserve local information. Mobile-former [45] is a bridge-connected parallel architecture of MobileNet and transformer, combining the advantages of MobileNet for local processing and transformer for global interaction. CNNs meet transformers (CMT) [26] embeds depthwise convolution into transformer blocks to enhance local information. Li et al. [46] introduce locality guidance provided by a trained CNN to accelerate convergence and improve the performance of ViTs on tiny datasets. MobileViT [36] is a general and lightweight ViT for mobile devices, in which the MobileViT block replaces local convolutional processing with global processing using transformers, allowing it to have both ViT- and CNN-like properties.

Furthermore, many efficient attention mechanisms [47], [48], [49] have been developed to boost performance and save computational cost. One way is to restrict self-attention to small windows, as in the Swin transformer [34]. In addition, BiFormer [47] designs a novel bi-level routing attention to save computation and memory, where each query attends to a small subset of the most semantically relevant key–value pairs. EfficientViT [48] presents a cascaded group attention module that feeds attention heads with different splits of the full features to reduce computational redundancy. A super token attention mechanism [49] is designed to promote effective and efficient global context modeling at the early stages of a network.

However, these transformer-based networks are still heavyweight for the HSI classification task. Combining the strengths of CNNs and transformers to build ViT models for HSI classification remains an open question.

B. CNN for HSI Classification

CNNs have become the dominant technique in HSI classification in recent years because of their strong representation ability, light weight, and easy optimization. In [16], patches centered on sample pixels are generated and fed into a sequence of convolution and pooling layers, followed by linear layers for feature extraction and classification. A CNN with pixel-pair features was proposed in [50] to improve feature learning. Some networks use parallel filters with different kernel sizes [51] or short connections [52] to promote multiscale information learning. Two-branch CNN-based architectures [14], [53] employ a 2-D CNN and other algorithms (e.g., a 1-D CNN or stacked autoencoders) to learn spatial and spectral information, respectively. The 3-D CNNs have been introduced to extract spectral–spatial features for HSI classification, considering the 3-D structural characteristics of HSIs [21]. These patch-based classification methods have difficulty achieving fast inference speed due to the redundant computation of overlapping regions between adjacent patches, which limits practical applications. FCN-based segmentation networks [22], [23], [54], [55] mitigate the redundant computation by performing pixel-to-pixel classification. For example, SSFCN [22] and a deep FCN with an efficient nonlocal module (ENL-FCN) [54] take an entire HSI as input and perform feature extraction without reducing the spatial dimensions. FreeNet [23] and the spectral–spatial-dependent global learning (SSDGL) [55] networks follow the encoder–decoder architecture to perform pixel-to-pixel classification.
Fig. 1. Overview illustration of the proposed LiT network for HSI classification.
III. METHODOLOGY

This section introduces the proposed approaches: the LiT network and the CMS sampling strategy.

A. LiT Network

This study develops a simple yet effective network, LiT, which aims to improve the feature representation for HSI classification. In ViT, linear projections poorly model the structural information present in patches. To alleviate this problem, we use convolutional tokenization to generate richer tokens and preserve more local information. Our LiT has four stages that produce a hierarchical representation of the data, as illustrated in Fig. 1. The first two stages deploy Fused-MBConv blocks to introduce inductive bias and extract local features. The last two stages employ a sequence of transformer blocks to model long-range dependencies. The CNN stages are implemented before the transformer stages based on the prior knowledge that convolution is good at encoding local information, which is essential for processing low-level features [56]. Considering the redundancy of spectral information in HSIs, a dimensionality reduction convolution is applied at the end of each stage to aggregate local spectral features. This preserves more details without introducing additional parameters. In this way, LiT improves context aggregation for better identification and refines coarse results. The output of stage 4 is fed into a 1 × 1 convolutional layer for classification. In the transformer block, a twin MSA (TwinMSA) module is designed to promote spatial–spectral information learning and reduce the high computational and memory cost of traditional self-attention. As shown in Fig. 2, the TwinMSA consists of PLMSA and CLMSA modules. Details of the Fused-MBConv block and the PLMSA and CLMSA modules are presented below.
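A minimal PyTorch skeleton of this four-stage layout is sketched below. The stage widths, block counts, and the simple stand-in blocks are illustrative assumptions (the actual layout is specified in Table I); the Fused-MBConv and PLSA sketches that follow would replace the stand-ins.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):     # stand-in for a Fused-MBConv block (Sec. III-A1)
    def __init__(self, ch):
        super().__init__(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())

class AttnBlock(nn.Sequential):     # stand-in for a TwinMSA transformer block (Sec. III-A2)
    def __init__(self, ch):
        super().__init__(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.SiLU())

class LiTSkeleton(nn.Module):
    """Four-stage hybrid layout from the text: conv stages 1-2 for local
    features, transformer stages 3-4 for long-range dependencies, a
    dimensionality-reduction conv closing each stage to aggregate local
    spectral features, and a 1x1 conv classifier on the stage-4 output."""
    def __init__(self, bands=200, n_classes=16, dims=(64, 64, 96, 96)):
        super().__init__()
        layers, in_ch = [], bands
        for i, ch in enumerate(dims):
            block = ConvBlock(in_ch) if i < 2 else AttnBlock(in_ch)
            layers += [block, nn.Conv2d(in_ch, ch, 1)]   # block, then spectral reduction
            in_ch = ch
        self.backbone = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(in_ch, n_classes, 1)

    def forward(self, x):            # x: (B, bands, H, W) -> (B, n_classes, H, W)
        return self.classifier(self.backbone(x))
```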
1) Fused-MBConv Block: Many state-of-the-art (SOTA) lightweight transformer models, such as CoAtNet [25] and MobileViT [36], use the inverted residual block for efficiency. The inverted residual block first widens the channels with a 1 × 1 expansion convolution and then uses a 3 × 3 depthwise convolution to capture local information. Finally, it uses a 1 × 1 convolution to project the channel dimension back to the original size, so that the input and output can be added. Although depthwise convolutions have fewer parameters and floating-point operations (FLOPs) than regular convolutions, they are slow in the early stages, because they cannot fully utilize modern accelerators [57]. Fused-MBConv replaces the 1 × 1 expansion convolution and the 3 × 3 depthwise convolution with a single 3 × 3 expansion convolution. As illustrated in Fig. 1, Fused-MBConv consists of a 3 × 3 expansion layer, a squeeze-and-excitation (SE) module [58], and a 1 × 1 projection layer. The expansion ratio of the 3 × 3 expansion convolution is set to 2, and the 1 × 1 convolution reduces the channel dimension by the same ratio. The input and output of the Fused-MBConv block are connected when they have the same number of channels. Compared with the original ViT, a convolutional tokenizer that adopts Fused-MBConv in the early stages is more effective at encoding spatial information.
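Under these definitions, the Fused-MBConv block can be sketched in PyTorch as follows; the BatchNorm/SiLU placement and the SE reduction ratio are assumptions borrowed from EfficientNetV2-style blocks [57], not settings stated in the text.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-excitation [58]: global pool, bottleneck, channel gate."""
    def __init__(self, ch, r=4):                          # reduction ratio r is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.SiLU(),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)                           # channelwise reweighting

class FusedMBConv(nn.Module):
    """Fused-MBConv: a single 3x3 expansion conv (ratio 2) replacing the
    1x1 expansion + 3x3 depthwise pair, then SE and a 1x1 projection."""
    def __init__(self, in_ch, out_ch, expand=2):
        super().__init__()
        mid = in_ch * expand
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, padding=1, bias=False),  # 3x3 expansion layer
            nn.BatchNorm2d(mid), nn.SiLU(),
            SE(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False),            # 1x1 projection layer
            nn.BatchNorm2d(out_ch))
        self.use_skip = (in_ch == out_ch)   # shortcut only when channel counts match

    def forward(self, x):
        y = self.body(x)
        return x + y if self.use_skip else y
```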
2) Position Lightweight Multihead Self-Attention: The computational complexity of self-attention in transformers is quadratic in the spatial size of the inputs. Using self-attention modules to process high-resolution images would inevitably cause low computational efficiency and insufficient memory. To alleviate this problem, we propose the PLMSA module, which constructs relationships between each feature vector and eigenvector clusters. An eigenvector cluster is a compact feature vector collected from a subset of the feature vectors in the input tensor. The clusters are implemented using the spatial pyramid pooling-fast technique [59], which reduces the computational burden and provides multiscale contextual information.

As illustrated in Fig. 2, the input token $X \in \mathbb{R}^{w \times h \times d}$ is reshaped into $\hat{X} \in \mathbb{R}^{n \times d}$, where $w$, $h$, and $d$ represent the width, height, and channels of $X$, respectively, and $n = wh$. Meanwhile, we use 2 × 2 max pooling with stride 2 to reduce the spatial size of $K_p$ and $V_p$ before the self-attention operation. Specifically, we feed $X \in \mathbb{R}^{w \times h \times d}$ into two successive max-pooling layers to generate $X_1^p \in \mathbb{R}^{(w/2) \times (h/2) \times d}$ and $X_2^p \in \mathbb{R}^{(w/4) \times (h/4) \times d}$, respectively. $X_1^p$ and $X_2^p$ are separately reshaped into $\mathbb{R}^{(n/4) \times d}$ and $\mathbb{R}^{(n/16) \times d}$ and then concatenated to form $X_p \in \mathbb{R}^{(5n/16) \times d}$. In addition, we inject positional information into each attention block by introducing a relative bias $B_p$ to the attention maps, and the corresponding position lightweight self-attention (PLSA) is defined as follows:

$$\mathrm{PLSA} = \mathrm{Softmax}\left(\frac{Q_p K_p^{T}}{\sqrt{d_k}} + B_p\right) V_p \tag{7}$$

where $Q_p = \hat{X} W_Q^p \in \mathbb{R}^{n \times d}$, $K_p = X_p W_K^p \in \mathbb{R}^{(5n/16) \times d}$, $V_p = X_p W_V^p \in \mathbb{R}^{(5n/16) \times d}$, and $B_p \in \mathbb{R}^{n \times (5n/16)}$. Finally, the PLMSA
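A single-head PyTorch sketch of the PLSA operation in (7) is given below. The module name is ours, the directly learned bias table stands in for the relative bias $B_p$, and the multihead split, normalization, and output projection of the full PLMSA module are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PLSA(nn.Module):
    """Position lightweight self-attention, Eq. (7), single head.
    Keys/values come from two max-pooled copies of the input (strides 2
    and 4), shrinking the key/value length from n = w*h to 5n/16.
    Assumes a fixed input size with w and h divisible by 4."""
    def __init__(self, dim, w, h):
        super().__init__()
        n = w * h
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bias = nn.Parameter(torch.zeros(n, n // 4 + n // 16))  # B_p

    def forward(self, x):                   # x: (B, d, h, w)
        b, d = x.shape[:2]
        xh = x.flatten(2).transpose(1, 2)   # X_hat: (B, n, d)
        x1 = self.pool(x)                   # (B, d, h/2, w/2) -> n/4 tokens
        x2 = self.pool(x1)                  # (B, d, h/4, w/4) -> n/16 tokens
        xp = torch.cat([x1.flatten(2), x2.flatten(2)], dim=2).transpose(1, 2)
        q, k, v = self.wq(xh), self.wk(xp), self.wv(xp)
        attn = q @ k.transpose(1, 2) / d**0.5 + self.bias   # (B, n, 5n/16)
        return F.softmax(attn, dim=-1) @ v                  # (B, n, d)
```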
Fig. 4. Indian Pines dataset. (a) Land-cover type and sample settings.
(b) False-color image. (c) Spatial distribution of training samples.
TABLE I
DETAILS OF THE LIT ARCHITECTURE
Fig. 7. Effect of window size on OA. (a) Salinas dataset. (b) DFC 2018
dataset.
TABLE II
COMPARING THE PERFORMANCE OF LIT TRAINED WITH SAMPLES GENERATED BY CRS AND CMS SAMPLING STRATEGIES

TABLE III
ABLATION ANALYSIS OF THE PROPOSED LIT ON THE SALINAS DATASET

TABLE IV
SENSITIVITY ANALYSIS OF THE MODEL LAYOUT IN TERMS OF OA (%) ON INDIAN PINES, SALINAS, AND DFC 2018 DATASETS

TABLE V
COMPARISON OF TRAINABLE PARAMETERS (PARAMS), FLOPS, TRAINING (TRAIN), AND INFERENCE (INFER) TIME, AND OA BETWEEN PATCH-BASED AND FCN-BASED FRAMEWORKS ON THE SALINAS DATASET
TABLE VI
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE INDIAN PINES DATASET
Fig. 8. Classification maps of different methods on the Indian Pines dataset. (a) GT. (b) SVM. (c) SVM-EPF. (d) SSRN. (e) SSSAN. (f) SMRN. (g) SSFCN.
(h) ENL-FCN. (i) SpectralFormer. (j) ConvNeXt. (k) MobileViT. (l) LiT.
classification. Both learning frameworks were trained on the same training set and tested on the same test set. The number of trainable parameters (Params) and FLOPs were used to measure model complexity and theoretical computational overhead, respectively. Meanwhile, training time and inference time were used to represent actual efficiency. The corresponding results on the Salinas dataset are summarized in Table V. Table V shows that the number of parameters in both frameworks is almost equal. The FCN-based framework is less efficient than the patch-based framework in terms of FLOPs and training time. However, the FCN-based framework achieved higher accuracy (89.03% versus 86.40%) and faster inference speed (1.93 versus 22.84 s) than the patch-based classification. Although the FCN-based framework consumed more training time than the patch-based framework, its actual computation is faster, because the time-consuming training process is performed offline, while inference speed is the main factor determining whether a method is practical. Therefore, the patch-based framework is a better choice when the amount of data is small, while the FCN-based framework is more suitable for large-scale applications (e.g., classification of satellite images) or real-time applications (e.g., those on mobile devices).
TABLE VII
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE SALINAS DATASET

TABLE VIII
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE DFC 2018 DATASET
E. Comparison With Other Methods

1) Quantitative Evaluation: Tables VI–VIII summarize the mean and standard deviation of the OA, AA, and κ, as well as the average PA for each class. The best performance in each row is highlighted in bold.

As a conventional classifier, SVM exhibits the worst results, especially on the DFC 2018 dataset. SVM considers raw pixel vectors as features and feeds them directly into classifiers, which cannot excavate the discriminative information contained in the data. SVM-EPF dramatically improves the performance of SVM by applying edge-preserving filtering. With the introduction of spatial information and the strong learning capability of DL techniques, patch-based networks, such as SSRN, SSSAN, and SMRN, also significantly outperform SVM. Although these patch-based methods achieve remarkable accuracy, their limited patch size restricts the spatial context and tends to lose some important information. FCN-based networks (i.e., SSFCN and ENL-FCN) only achieved on-par results with patch-based networks (i.e., SSRN, SSSAN, and SMRN) under the proposed sampling strategy.
Fig. 9. Classification maps of different methods on the Salinas dataset. (a) GT. (b) SVM. (c) SVM-EPF. (d) SSRN. (e) SSSAN. (f) SMRN. (g) SSFCN.
(h) ENL-FCN. (i) SpectralFormer. (j) ConvNeXt. (k) MobileViT. (l) LiT.
For example, on the Salinas dataset, SSFCN and ENL-FCN achieved 87.58% and 85.34%, respectively, while SSRN, SSSAN, and SMRN achieved 86.48%, 86.65%, and 85.74%, respectively. These FCN-based networks mainly use a stack of many small spatial convolutions (e.g., 3 × 3) to enlarge the receptive field. According to the effective receptive field (ERF) theory, the size of the ERF is proportional to $O(K\sqrt{L})$, where $K$ is the kernel size and $L$ is the depth, i.e., the number of layers [65]; for instance, stacking $L = 16$ layers of $3 \times 3$ convolutions yields an ERF on the order of only $3\sqrt{16} = 12$ pixels. Thus, the ERF of FCN-based networks is still limited, which may restrict their performance gains. Although the transformer is good at modeling global dependencies, SpectralFormer is inferior to the CNN-based and GCN-based models, since it lacks the inductive biases built into CNNs and still follows patch-based classification. As for the SOTA backbones, ConvNeXt performs well on all three datasets, while MobileViT's results are not satisfactory. Although MobileViT employs both local information and global connectivity to reason about the relationships between image contents, it is still heavyweight for HSI datasets.

The proposed LiT exhibited a significant improvement over all the comparison algorithms, because it uses transformers to capture long-range dependencies and convolution to gather local information. Both long-range connectivity and local information contribute to reasoning about the relationships between image contexts. For example, LiT achieved a maximum OA of 89.03% on the Salinas dataset, outperforming SVM, SSRN, SSSAN, SMRN, SSFCN, ENL-FCN, SpectralFormer, ConvNeXt, and MobileViT by 9.49%, 2.55%, 2.38%, 3.29%, 1.45%, 3.69%, 5.47%, 1.01%, and 3.63%, respectively. Even for some challenging classes with similar spectral and textural features (e.g., stubble and grapes_untrained on the Salinas dataset), the proposed LiT still achieved better accuracy by using its dynamic TwinMSA modules to capture complicated relational interactions between different positions. These results powerfully demonstrate the practicality and value of our LiT.

2) Visual Evaluation: Figs. 8–10 show the classification maps produced by the comparison methods and the corresponding GT images. We can see that these qualitative comparisons are in agreement with the quantitative comparisons presented in Tables VI–VIII.
Fig. 10. Classification maps of different methods on the DFC 2018 dataset. (a) GT. (b) SVM. (c) SVM-EPF. (d) SSRN. (e) SSSAN. (f) SMRN. (g) SSFCN.
(h) ENL-FCN. (i) SpectralFormer. (j) ConvNeXt. (k) MobileViT. (l) LiT.
As shown in Figs. 8(c), 9(c), and 10(c), the classification maps of SVM exhibit considerable salt-and-pepper noise, because SVM only considers spectral information. Due to the similar spectral information between certain classes (e.g., grapes_untrained and vinyard_untrained on the Salinas dataset), distinguishing these easily confused categories using spectral information alone is difficult. SVM-EPF not only makes object edges sharper but also significantly mitigates the salt-and-pepper noise problem of SVM, especially on the Indian Pines dataset. The patch-based networks, such as SSRN, SSSAN, and SMRN, also produced better visual performance than SVM by incorporating the spatial features of neighboring pixels. Classification maps generated by patch-based networks contain fewer noisy points and are smoother. However, the boundaries in these classification maps tend to be distorted, since patch-based classification approaches assume that each pixel within a patch contributes equally to classifying the center pixel. This assumption is valid for most pixels in
TABLE IX
COMPARISON OF PARAMS, FLOPS, TRAINING (TRAIN), AND INFERENCE (INFER) TIME OF DIFFERENT METHODS ON THE SALINAS DATASET
TABLE X
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE SALINAS DATASET USING A RANDOM SAMPLING STRATEGY
[21] Z. Zhong, J. Li, Z. Luo, and M. Chapman, "Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 847–858, Feb. 2018.
[22] Y. Xu, B. Du, and L. Zhang, "Beyond the patchwise classification: Spectral–spatial fully convolutional networks for hyperspectral image classification," IEEE Trans. Big Data, vol. 6, no. 3, pp. 492–506, Sep. 2020.
[23] Z. Zheng, Y. Zhong, A. Ma, and L. Zhang, "FPGA: Fast patch-free global learning framework for fully end-to-end hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 8, pp. 5612–5626, Aug. 2020.
[24] C. Wang, L. Zhang, W. Wei, and Y. Zhang, "Toward effective hyperspectral image classification using dual-level deep spatial manifold representation," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5505614.
[25] Z. Dai, H. Liu, Q. V. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," in Advances in Neural Information Processing Systems, vol. 34. Red Hook, NY, USA: Curran Associates, 2021, pp. 3965–3977.
[26] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," 2021, arXiv:2107.06263.
[27] E. Sanderson and B. J. Matuszewski, "FCN-transformer feature fusion for polyp segmentation," in Proc. Annu. Conf. Med. Image Underst. Anal. Cham, Switzerland: Springer, 2022, pp. 892–907.
[28] W. Wang et al., "InternImage: Exploring large-scale vision foundation models with deformable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 14408–14419.
[29] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[30] Z. Zhong, Y. Li, L. Ma, J. Li, and W.-S. Zheng, "Spectral–spatial transformer network for hyperspectral image classification: A factorized architecture search framework," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5514715.
[31] X. He, Y. Chen, and Q. Li, "Two-branch pure transformer for hyperspectral image classification," IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[32] H. Yu, Z. Xu, K. Zheng, D. Hong, H. Yang, and M. Song, "MSTNet: A multilevel spectral–spatial transformer network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5532513.
[33] X. Huang, M. Dong, J. Li, and X. Guo, "A 3-D-Swin transformer-based hierarchical contrastive learning method for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5411415.
[34] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[35] L. Sun, G. Zhao, Y. Zheng, and Z. Wu, "Spectral–spatial feature tokenization transformer for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5522214.
[36] S. Mehta and M. Rastegari, "MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer," in Proc. Int. Conf. Learn. Represent. (ICLR), Jan. 2022.
[37] W. Wang, L. Liu, T. Zhang, J. Shen, J. Wang, and J. Li, "Hyper-ES2T: Efficient spatial–spectral transformer for the classification of hyperspectral remote sensing images," Int. J. Appl. Earth Observ. Geoinf., vol. 113, Sep. 2022, Art. no. 103005.
[38] W. Qi, C. Huang, Y. Wang, X. Zhang, W. Sun, and L. Zhang, "Global–local 3-D convolutional transformer network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5510820.
[39] T. Lu, M. Liu, W. Fu, and X. Kang, "Grouped multi-attention network for hyperspectral image spectral–spatial classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5507912.
[40] E. Ouyang, B. Li, W. Hu, G. Zhang, L. Zhao, and J. Wu, "When multigranularity meets spatial–spectral attention: A hybrid transformer for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 4401118.
[41] B. Zu, Y. Li, J. Li, Z. He, H. Wang, and P. Wu, "Cascaded convolution-based transformer with densely connected mechanism for spectral–spatial hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5513119.
[42] X. Yang, W. Cao, Y. Lu, and Y. Zhou, "Hyperspectral image transformer classification networks," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5528715.
[43] Z. Pan, B. Zhuang, J. Liu, H. He, and J. Cai, "Scalable vision transformers with hierarchical pooling," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 367–376.
[44] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, "Escaping the big data paradigm with compact transformers," 2021, arXiv:2104.05704.
[45] Y. Chen et al., "Mobile-former: Bridging MobileNet and transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5260–5269.
[46] K. Li, R. Yu, Z. Wang, L. Yuan, G. Song, and J. Chen, "Locality guidance for improving vision transformers on tiny datasets," in Proc. Eur. Conf. Comput. Vis. (ECCV), vol. 61, Nov. 2022, pp. 110–127.
[47] L. Zhu, X. Wang, Z. Ke, W. Zhang, and R. W. Lau, "BiFormer: Vision transformer with bi-level routing attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 10323–10333.
[48] X. Liu, H. Peng, N. Zheng, Y. Yang, H. Hu, and Y. Yuan, "EfficientViT: Memory efficient vision transformer with cascaded group attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 14420–14430.
[49] H. Huang, X. Zhou, J. Cao, R. He, and T. Tan, "Vision transformer with super token sampling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 22690–22699.
[50] W. Li, G. Wu, F. Zhang, and Q. Du, "Hyperspectral image classification using deep pixel-pair features," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.
[51] G. Sun et al., "Deep fusion of localized spectral features and multi-scale spatial features for effective classification of hyperspectral images," Int. J. Appl. Earth Observ. Geoinf., vol. 91, Sep. 2020, Art. no. 102157.
[52] H. Lee and H. Kwon, "Going deeper with contextual CNN for hyperspectral image classification," IEEE Trans. Image Process., vol. 26, no. 10, pp. 4843–4855, Oct. 2017.
[53] S. Hao, W. Wang, Y. Ye, T. Nie, and L. Bruzzone, "Two-stream deep architecture for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2349–2361, Apr. 2018.
[54] Y. Shen et al., "Efficient deep learning of nonlocal features for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 7, pp. 6029–6043, Jul. 2021.
[55] Q. Zhu et al., "A spectral–spatial-dependent global learning framework for insufficient and imbalanced hyperspectral image classification," IEEE Trans. Cybern., vol. 52, no. 11, pp. 11709–11723, Nov. 2022.
[56] C. Yang et al., "Lite vision transformer with enhanced self-attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11988–11998.
[57] M. Tan and Q. Le, "EfficientNetV2: Smaller models and faster training," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10096–10106.
[58] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[59] G. Jocher et al., "YOLOv5 by Ultralytics," Oct. 2020.
[60] J. Fu, J. Liu, J. Jiang, Y. Li, Y. Bao, and H. Lu, "Scene segmentation with dual relation-aware attention network," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 6, pp. 2547–2560, Jun. 2021.
[61] J. Liang, J. Zhou, Y. Qian, L. Wen, X. Bai, and Y. Gao, "On the sampling strategy for evaluation of spectral–spatial methods in hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 862–880, Feb. 2017.
[62] X. Kang, S. Li, and J. A. Benediktsson, "Spectral–spatial hyperspectral image classification with edge-preserving filtering," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2666–2677, May 2014.
[63] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, "A ConvNet for the 2020s," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11966–11976.
[64] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2017, arXiv:1711.05101.
[65] X. Ding, X. Zhang, J. Han, and G. Ding, "Scaling up your kernels to 31×31: Revisiting large kernel design in CNNs," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11963–11975.
[66] X. Zhang, A. Zhang, G. Sun, and Y. Yao, "Multiscale convolution network with region-based max voting for hyperspectral images classification," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Sep. 2020, pp. 64–67.
[67] J. Chen et al., "Run, don't walk: Chasing higher FLOPS for faster neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 12021–12031.
Xuming Zhang received the B.Sc. and M.Sc. degrees from the China University of Petroleum (East China), Qingdao, China, in 2018 and 2021, respectively. She is currently pursuing the Ph.D. degree in geography with Nanjing University, Nanjing, China.

Her research interests include hyperspectral image processing, multisensor data fusion, high-resolution remote sensing processing, deep learning, and its applications to semantic segmentation and classification.

Yuanchao Su (Senior Member, IEEE) received the B.S. and M.Sc. degrees from the Xi'an University of Science and Technology, Xi'an, China, in 2012 and 2015, respectively, and the Ph.D. degree from Sun Yat-sen University, Guangzhou, China, in 2019.

From 2013 to 2015, he was an Exchange Postgraduate with the Optical Laboratory, Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China. From 2018 to 2019, he was a Visiting Researcher with the Advanced Imaging and Collaborative Information Processing Group, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA. In 2019, he joined the Department of Remote Sensing, College of Geomatics, Xi'an University of Science and Technology, where he is currently a Lecturer and leads the Hyperspectral Information and Intelligent Computation Group. Since 2021, he has been a Visiting Researcher with the Key Laboratory of Computational Optical Imaging Technology, Aerospace Information Research Institute, Chinese Academy of Sciences. His research interests include hyperspectral unmixing, hyperspectral classification, neural networks, and deep learning.

Dr. Su is a Senior Member of the IEEE Geoscience and Remote Sensing Society. He serves as a Reviewer for many international journals, including the IEEE TRANSACTIONS ON CYBERNETICS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS, and the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS.

Lorenzo Bruzzone (Fellow, IEEE) received the Laurea (M.S.) degree (summa cum laude) in electronic engineering and the Ph.D. degree in telecommunications from the University of Genoa, Genoa, Italy, in 1993 and 1998, respectively.

He is the Founder and the Director of the Remote Sensing Laboratory, Department of Information Engineering and Computer Science, University of Trento, Trento, Italy, where he is currently a Full Professor of telecommunications. He is a principal investigator of many research projects. Among the others, he is a Principal Investigator of the Radar for Icy Moon Exploration (RIME) instrument in the framework of the JUpiter ICy moons Explorer (JUICE) mission of the European Space Agency (ESA) and the Science Lead of the High Resolution Land Cover Project in the framework of the Climate Change Initiative of ESA. His research interests include the areas of remote sensing, radar and synthetic aperture radar (SAR), signal processing, machine learning, and pattern recognition.

Dr. Bruzzone has been a member of the Administrative Committee of the IEEE Geoscience and Remote Sensing Society (GRSS) since 2009, where he has been the Vice President of professional activities since 2019. He ranked first place in the Student Prize Paper Competition of the 1998 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Seattle, in July 1998. He was a recipient of many international and national honors and awards, including the recent IEEE GRSS 2015 Outstanding Service Award, the 2017 and 2018 IEEE IGARSS Symposium Prize Paper Awards, and the 2019 WHISPERS Outstanding Paper Award. Since 2003, he has been the Chair of the International Society for Optical Engineering (SPIE) Conference on Image and Signal Processing for Remote Sensing. He has been a Distinguished Speaker of the IEEE Geoscience and Remote Sensing Society between 2012 and 2016. He is the Cofounder of the IEEE International Workshop on the Analysis of Multi-Temporal Remote-Sensing Images (MultiTemp) series and is a member of the Permanent Steering Committee of this series of workshops. He has been the Founder of the IEEE Geoscience and Remote Sensing Magazine, for which he was the Editor-in-Chief between 2013 and 2017. He was a guest coeditor of many special issues of international journals. He is an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.