A Lightweight Transformer Network For Hyperspectral Image Classification
Abstract—Transformer is a powerful tool for capturing long-range dependencies and has shown impressive performance in hyperspectral image (HSI) classification. However, such power comes with a heavy memory footprint and a huge computation burden. In this article, we propose two types of lightweight self-attention modules (a channel lightweight multihead self-attention (CLMSA) module and a position lightweight multihead self-attention (PLMSA) module) to reduce both memory and computation while associating each pixel or channel with global information. Moreover, we discover that transformers are ineffective in explicitly extracting local and multiscale features due to the fixed input size and tend to overfit when dealing with a small number of training samples. Therefore, a lightweight transformer (LiT) network, built with the proposed lightweight self-attention modules, is presented. LiT adopts convolutional blocks to explicitly extract local information in early layers and employs transformers to capture long-range dependencies in deep layers. Furthermore, we design a controlled multiclass stratified (CMS) sampling strategy to generate appropriately sized input data, ensure balanced sampling, and reduce the overlap of feature extraction regions between training and test samples. With appropriate training data, convolutional tokenization, and lightweight transformer blocks, LiT mitigates overfitting and enjoys both high computational efficiency and good performance. Experimental results on several HSI datasets verify the effectiveness of our design.

Index Terms—Deep learning (DL), hyperspectral image (HSI) classification, transformer.

Manuscript received 15 June 2023; accepted 14 July 2023. Date of publication 21 July 2023; date of current version 7 August 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 42101321 and Grant 42001319, in part by the Scientific Research Program of the Education Department of Shaanxi Province under Grant 21JK0762, and in part by the University–Industry Collaborative Education Program of Ministry of Education of China under Grant 220802313200859. (Corresponding author: Qingjiu Tian.)

Xuming Zhang and Qingjiu Tian are with the International Institute for Earth System Science and the Jiangsu Provincial Key Laboratory of Geographic Information Science and Technology, Nanjing University, Nanjing 210023, China (e-mail: [email protected]; [email protected]).

Yuanchao Su is with the Department of Remote Sensing, College of Geomatics, Xi'an University of Science and Technology, Xi'an 710054, China (e-mail: [email protected]).

Lianru Gao and Xingfa Gu are with the Key Laboratory of Computational Optical Imaging Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China (e-mail: [email protected]).

Lorenzo Bruzzone is with the Department of Information Engineering and Computer Science, University of Trento, 38050 Trento, Italy (e-mail: [email protected]).

Digital Object Identifier 10.1109/TGRS.2023.3297858

I. INTRODUCTION

HYPERSPECTRAL images (HSIs) contain tens or even hundreds of narrow and continuous spectral bands ranging from the visible to the infrared [1]. The abundant spectral information in HSIs enables object identification and detection at fine-grained spectral scales, making HSIs useful in a wide range of applications, such as forest monitoring [2], medical imaging [3], and urban development observation [4]. One of the fundamental tasks in these applications is classification, which involves assigning a specific category label to each pixel.

Traditional classifiers focus only on spectral information, including the support vector machine (SVM) [5], random forest [6], and logistic regression [7]. Complex scenarios and the spectral heterogeneity of objects require modeling spatial correlations. This can be achieved by different approaches, such as Gabor filtering [8], morphological profiles [9], the gray-level co-occurrence matrix [10], and 3-D wavelets [11]. As an alternative, kernel-based methods [12], [13] have also been introduced to capture the spatial information of HSIs. However, these traditional approaches rely heavily on prior knowledge and are limited to shallow features, resulting in poor generalization and robustness [14].

Deep learning (DL), an automatic feature learning technique, can learn richer representations than traditional approaches. Several DL techniques have been applied to HSI classification, such as deep belief networks [15], convolutional neural networks (CNNs) [16], graph convolutional networks (GCNs) [17], [18], and transformers [19]. Among these DL techniques, CNNs are the most widely used due to their inherent properties of local connectivity and weight sharing. These properties impose strong constraints on convolution weights that hardcode inductive biases into networks, thus leading to more sample-efficient and parameter-efficient models [20].
Many works [14], [21], [22], [23] have attempted to bring the power of CNNs to HSI classification. The existing methods mainly fall into two broad categories: patch-based classification and fully convolutional network (FCN)-based segmentation frameworks. Spectral–spatial DL networks, such as the spectral–spatial residual network (SSRN) [21], the dual-level deep spatial manifold representation network (SMRN) [24], and the spectral–spatial self-attention network (SSSAN) [14], follow the design rule of patch-based classification to facilitate feature learning and classifier training. Features are extracted from the spatial patch centered on the sample pixel and further processed to assign a specific category to the center pixel. However, redundant computation on overlapping regions between adjacent patches is inevitable, which severely hampers large-scale applications. In contrast, FCNs exhibit better accuracy–speed trade-offs. FCNs feed data into the network and perform feature extraction and pixel-to-pixel classification. Examples of these models are the spectral–spatial FCN (SSFCN) [22] and encoder–decoder architectures [23].
Although these CNN-based architectures can learn information from a larger receptive field by stacking multiple layers, they still lack global connectivity. Moreover, convolutional filter weights are usually fixed after training and cannot be dynamically adapted to different inputs [25].

Recently, transformers have shown promising results in various visual tasks, such as image classification [26] and semantic segmentation [27]. As a core component of transformers, self-attention has the properties of long-range modeling and adaptive spatial aggregation. Benefiting from flexible self-attention, transformers can learn more robust and accurate representations than CNNs from extensive data [28]. The vision transformer (ViT) [29] is the first pure transformer backbone for vision tasks. It replaces the inductive biases inherent in convolution with global processing driven by self-attention, demonstrating that large-scale training can trump inductive biases. Several studies have investigated the use of transformers for HSI classification. SpectralFormer [19] generates groupwise spectral embeddings by learning local sequence information from adjacent bands of HSIs. A spectral–spatial transformer network [30] employs consecutive spatial and spectral transformer blocks to learn spectral–spatial information from input patches. To fully exploit the abundant spectral information in HSIs, a two-branch pure transformer [31] uses a spectral transformer and a spatial transformer to learn spectral sequences and spatial features, respectively. A multilevel spectral–spatial transformer network [32] processes HSIs into sequences, learns feature representations with a pure transformer encoder, and then processes multilevel features with decoders to produce classification maps. Huang et al. [33] extended the original Swin transformer [34] to a 3-D structure and proposed a 3-D Swin transformer to explore the rich spatial–spectral information of HSIs. However, these transformer architectures are still inferior to their CNN counterparts when trained on small-scale datasets, as they lack proper inductive biases and, thus, require substantial data and computational resources to compensate [25]. Consequently, CNNs are still the preferred models for HSI classification, since they require less time, data, and memory for training, whereas they do not enjoy long-range modeling [35].

Hybrid architectures combining transformers and convolutions have received much attention in constructing lightweight, high-performance models [36]. Hyper-ES2T [37] embeds convolution layers before each spatial–spectral transformer block. In [38], 3-D convolution is embedded in a two-branch transformer to capture global–local dependencies in both the spectral and spatial domains. The grouped multi-attention network (GMA-Net) [39] is also a two-branch architecture, where one branch is responsible for spectral–spatial feature learning using CNN and multiattention modules, and the other for pixelwise spectral feature learning using convolutions. Some works [35], [40], [41] first employ convolution to perform shallow feature extraction and then use transformers to capture the global relationships between different tokens. The hyperspectral image transformer (HiT) [42] embeds convolution operations into the transformer structure. It uses 3-D convolution layers to produce local spatial–spectral information and then uses depthwise and pointwise convolution operations to encode spatial–spectral representations along the height, width, and spectral dimensions, respectively.

Although these transformer-based networks have shown impressive performance in HSI classification, the high dimensionality, together with limited labeled samples, still makes the classification task very challenging. There are still three main aspects that need to be addressed.

1) The existing transformer-based models mainly follow patch-based classification, which not only results in redundant computation but also hinders long-range spatial dependency modeling. Thus, an excellent approach to build effective architectures is to follow the FCN-based framework.
2) Self-attention in transformers incurs a heavy memory footprint and a huge computation burden, which limits real-time applications.
3) Transformers are ineffective in explicitly extracting local and multiscale features and are prone to overfitting with small numbers of training samples.

To overcome the above three drawbacks, we propose a lightweight transformer (LiT) network and a controlled multiclass stratified (CMS) sampling strategy. Specifically, LiT is proposed to merge and exploit the advantages of transformers (i.e., long-range dependence and input-adaptive weighting) and CNNs (i.e., local spatial modeling and inductive bias) for end-to-end pixel-to-pixel classification by following the FCN-based framework. LiT has four stages that generate feature maps at different scales. The first two stages deploy convolutional blocks to explicitly extract local information, and the last two stages employ transformer blocks to capture long-range dependencies. In LiT, two lightweight self-attention modules are designed to reduce both computation and memory while associating each pixel/channel with global information. The CMS sampling strategy is designed to provide appropriately sized inputs and ensure balanced sampling while reducing the overlap of feature extraction regions between training and test samples. The proposed LiT combined with the CMS sampler can mitigate overfitting. Superior performance on three HSI datasets demonstrates the effectiveness of our approaches.

The main contributions of this study are summarized as follows.

1) Extending transformer-based research to FCN-based HSI classification by introducing LiT, which incorporates the merits of transformers and CNNs to efficiently model image hierarchies at local and long range, achieving high performance and fast inference speed.
2) Developing two simple yet effective self-attention modules, the position lightweight multihead self-attention (PLMSA) module and the channel lightweight multihead self-attention (CLMSA) module, to construct compact long-range dependencies in the spatial and spectral dimensions, respectively, resulting in performance boosts with less computational and memory cost.
3) Designing a CMS sampling strategy to provide appropriately sized input data and ensure balanced sampling. It can also significantly reduce the overlap of feature extraction regions between training and test samples, providing a more objective benchmark for evaluating method performance. An illustrative sampling sketch follows this list.
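To make the sampling goals in 3) concrete, the following is a minimal sketch of a multiclass stratified sampler in Python/NumPy. It only illustrates the stated goals (balanced per-class draws whose feature-extraction windows overlap as little as possible); the per-class quota, the minimum inter-sample distance, and the function name are our assumptions, not the CMS settings detailed in Section III.

```python
import numpy as np

def stratified_sample(label_map, n_per_class=50, min_dist=16, seed=0):
    """Illustrative stratified sampler (not the exact CMS algorithm):
    draws a balanced number of pixels per class while rejecting
    candidates whose feature-extraction windows would overlap those of
    already-selected training pixels."""
    rng = np.random.default_rng(seed)
    train = []                                   # chosen (row, col) pairs
    for c in np.unique(label_map):
        if c == 0:                               # 0 = unlabeled background
            continue
        rows, cols = np.nonzero(label_map == c)
        picked = 0
        for i in rng.permutation(len(rows)):
            r, q = int(rows[i]), int(cols[i])
            # Chebyshev distance >= min_dist keeps windows of roughly
            # min_dist x min_dist pixels around training samples disjoint.
            if all(max(abs(r - r2), abs(q - q2)) >= min_dist for r2, q2 in train):
                train.append((r, q))
                picked += 1
            if picked == n_per_class:
                break
    return np.array(train)  # the remaining labeled pixels act as the test set
```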
The remainder of this study is organized as follows. Section II reviews transformer-related work and classical CNN-based networks for HSI classification. Section III describes the proposed LiT network and CMS sampling strategy in detail. Extensive experiments with ablation studies are performed and discussed in Section IV. Finally, some concluding remarks and a brief outlook on future research are presented in Section V.

II. RELATED WORKS

A. Vision Transformer

ViT [29] provides an alternative design paradigm to CNNs. It reshapes the input image into a sequence of flattened patches, which are then projected into a sequence of patch embeddings using a trainable linear projection. These patch embeddings append a class token and then incorporate position embeddings to obtain a sequence of input tokens $z_0$.

The transformer encoder, consisting of $L$ transformer layers, is applied to $z_0$ to learn interpatch representations. A transformer layer consists of a multihead self-attention (MSA) block (1) followed by a feed-forward network (FFN) (2). Layer normalization (LN) is applied before each block, and a shortcut connection is added after each block. Given the input $z_{l-1}$, a transformer layer can be written as follows:
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \tag{1}$$

$$z_l = \mathrm{FFN}(\mathrm{LN}(z'_l)) + z'_l \tag{2}$$

where $l = 1, \ldots, L$. The MSA module is defined by considering $h$ "heads." Specifically, the input $x_{\mathrm{in}}$ is divided equally into $h$ "heads" by channel (3), the self-attention function is applied to each "head" (4), and the $h$ resulting sequences are concatenated (5)

$$x_{\mathrm{in}} \rightarrow \left[x_{\mathrm{in}}^{1}, \ldots, x_{\mathrm{in}}^{h}\right] \tag{3}$$

$$h_i\left(x_{\mathrm{in}}^{i}\right) = \mathrm{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k^{i}}}\right) V_i \tag{4}$$

$$\mathrm{MSA}(x_{\mathrm{in}}) = \mathrm{Concat}\left(h_1\left(x_{\mathrm{in}}^{1}\right), \ldots, h_h\left(x_{\mathrm{in}}^{h}\right)\right) \tag{5}$$

where $Q_i = x_{\mathrm{in}}^{i} W_Q^{i}$, $K_i = x_{\mathrm{in}}^{i} W_K^{i}$, and $V_i = x_{\mathrm{in}}^{i} W_V^{i}$; $W_Q^{i}$, $W_K^{i}$, and $W_V^{i}$ are the learnable projection matrices that project $x_{\mathrm{in}}^{i}$ into different feature spaces, $i = 1, \ldots, h$. The FFN consists of two linear layers separated by an activation function, i.e.,

$$Y = f(Z W_1) W_2 \tag{6}$$

where $W_1 \in \mathbb{R}^{c \times \gamma c}$, $W_2 \in \mathbb{R}^{\gamma c \times c}$, $f(\cdot)$ denotes an activation function, and $Y$ is the output of the FFN. $c$ is the channel dimension of $Z$, and $\gamma$ is the dimension expansion ratio, usually set to 4. The bias term is omitted for simplicity.
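For concreteness, the following is a minimal PyTorch sketch of one such pre-LN transformer layer implementing (1)–(6). The class name and hyperparameters are ours, and the output projection after head concatenation is a common addition not written out in (5).

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Pre-LN transformer layer per (1)-(6): MSA and FFN blocks, each
    preceded by LayerNorm and wrapped in a shortcut connection."""
    def __init__(self, dim, heads=4, gamma=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dk = heads, dim // heads
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # W_Q, W_K, W_V for all heads
        self.proj = nn.Linear(dim, dim, bias=False)      # output projection (common addition)
        self.ffn = nn.Sequential(                        # (6): Y = f(Z W1) W2
            nn.Linear(dim, gamma * dim, bias=False),
            nn.GELU(),
            nn.Linear(gamma * dim, dim, bias=False),
        )

    def forward(self, z):                                # z: (batch, n_tokens, dim)
        b, n, d = z.shape
        x = self.ln1(z)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (3): split channels into heads.
        q, k, v = (t.view(b, n, self.heads, self.dk).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.dk**0.5  # (4): scaled dot-product
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n, d)       # (5): concatenate heads
        z = z + self.proj(out)                           # (1): shortcut after MSA
        return z + self.ffn(self.ln2(z))                 # (2): shortcut after FFN
```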
ViT is computationally intensive, challenging to optimize, and requires extensive data to avoid overfitting due to the absence of inductive bias [25]. Various design strategies have been explored to incorporate the advantages of CNNs into transformer models to improve performance. To generate hierarchical representations and reduce computational cost, the hierarchical visual transformer (HVT) [43] gradually pools visual tokens to reduce the sequence length as the layers go deeper, which is similar to downsampling in CNNs. CoAtNet [25] improves the generalization and capacity of the model by naturally unifying depthwise convolution and self-attention through simple relative attention. Some works [25], [44] apply convolution before transformer layers to generate richer tokens and preserve local information. Mobile-former [45] is a bridge-connected parallel architecture of MobileNet and transformer, combining the advantages of MobileNet for local processing and transformer for global interaction. CNNs meet transformers (CMT) [26] embeds depthwise convolution into transformer blocks to enhance local information. Li et al. [46] introduce locality guidance provided by a trained CNN to accelerate convergence and improve the performance of ViTs on tiny datasets. MobileViT [36] is a general and lightweight ViT for mobile devices, in which the MobileViT block replaces local convolutional processing with global processing using transformers, allowing it to have both ViT- and CNN-like properties.

Furthermore, many efficient attention mechanisms [47], [48], [49] have been developed to boost performance and save computational cost. One way is to restrict self-attention to small windows, as in the Swin transformer [34]. In addition, BiFormer [47] designs a novel bi-level routing attention to save computation and memory, where each query attends to a small subset of the most semantically relevant key–value pairs. EfficientViT [48] presents a cascaded group attention module that feeds attention heads with different splits of the full features to reduce computational redundancy. A super token attention mechanism [49] is designed to promote effective and efficient global context modeling at the early stages of a network.

However, these transformer-based networks are still heavyweight for the HSI classification task. Combining the strengths of CNNs and transformers to build ViT models for HSI classification remains an open question.

B. CNN for HSI Classification

CNNs have become the dominant technique in HSI classification in recent years because of their strong representation ability, light weight, and easy optimization. In [16], patches centered on sample pixels are generated and fed into a sequence of convolution and pooling layers, followed by linear layers for feature extraction and classification. A CNN with pixel-pair features was proposed in [50] to improve feature learning. Some networks use parallel filters with different kernel sizes [51] or short connections [52] to promote multiscale information learning. Two-branch CNN-based architectures [14], [53] employ a 2-D CNN and other algorithms (e.g., a 1-D CNN or stacked autoencoders) to learn spatial and spectral information, respectively. The 3-D CNNs have been introduced to extract spectral–spatial features for HSI classification, considering the 3-D structural characteristics of HSIs [21]. These patch-based classification methods have difficulty achieving fast inference speed due to the redundant computation of overlapping regions between adjacent patches, which limits practical applications. FCN-based segmentation networks [22], [23], [54], [55] mitigate the redundant computation by performing pixel-to-pixel classification. For example, SSFCN [22] and a deep FCN with an efficient nonlocal module (ENL-FCN) [54] take an entire HSI as input and perform feature extraction without reducing the spatial dimensions. FreeNet [23] and the spectral–spatial-dependent global learning (SSDGL) [55] networks follow the encoder–decoder architecture to perform pixel-to-pixel classification.
Fig. 1. Overview illustration of the proposed LiT network for HSI classification.
III. METHODOLOGY

This section introduces the proposed approaches: the LiT network and the CMS sampling strategy.

A. LiT Network

This study develops a simple yet effective network, LiT, which aims to improve the feature representation for HSI classification. In ViT, linear projections poorly model the structural information present in patches. To alleviate this problem, we use convolutional tokenization to generate richer tokens and preserve more local information. Our LiT has four stages that produce a hierarchical representation of the data, as illustrated in Fig. 1. The first two stages deploy Fused-MBConv blocks to introduce inductive bias and extract local features. The last two stages employ a sequence of transformer blocks to model long-range dependencies. The CNN stages are implemented before the transformer stages based on the prior knowledge that convolution is good at encoding local information, which is essential for processing low-level features [56]. Considering the redundancy of spectral information in HSIs, a dimensionality reduction convolution is applied at the end of each stage to aggregate local spectral features. This preserves more details without introducing additional parameters. In this way, LiT improves context aggregation for better identification and refines coarse results. The output of stage 4 is fed into a 1 × 1 convolutional layer for classification. In the transformer block, a twin MSA (TwinMSA) module is designed to promote spatial–spectral information learning and reduce the high computational and memory cost of traditional self-attention. As shown in Fig. 2, the TwinMSA consists of PLMSA and CLMSA modules. Details of the Fused-MBConv block and the PLMSA and CLMSA modules are presented below.
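A minimal PyTorch skeleton of this four-stage layout is sketched below. The stage widths, block counts, and the simple stand-in blocks are illustrative assumptions (the actual layout is specified in Table I); the Fused-MBConv and PLSA sketches that follow would replace the stand-ins.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):     # stand-in for a Fused-MBConv block (Sec. III-A1)
    def __init__(self, ch):
        super().__init__(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())

class AttnBlock(nn.Sequential):     # stand-in for a TwinMSA transformer block (Sec. III-A2)
    def __init__(self, ch):
        super().__init__(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.SiLU())

class LiTSkeleton(nn.Module):
    """Four-stage hybrid layout from the text: conv stages 1-2 for local
    features, transformer stages 3-4 for long-range dependencies, a
    dimensionality-reduction conv closing each stage to aggregate local
    spectral features, and a 1x1 conv classifier on the stage-4 output."""
    def __init__(self, bands=200, n_classes=16, dims=(64, 64, 96, 96)):
        super().__init__()
        layers, in_ch = [], bands
        for i, ch in enumerate(dims):
            block = ConvBlock(in_ch) if i < 2 else AttnBlock(in_ch)
            layers += [block, nn.Conv2d(in_ch, ch, 1)]   # block, then spectral reduction
            in_ch = ch
        self.backbone = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(in_ch, n_classes, 1)

    def forward(self, x):            # x: (B, bands, H, W) -> (B, n_classes, H, W)
        return self.classifier(self.backbone(x))
```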
1) Fused-MBConv Block: Many state-of-the-art (SOTA) lightweight transformer models, such as CoAtNet [25] and MobileViT [36], use the inverted residual block for efficiency. The inverted residual block first widens the channels with a 1 × 1 expansion convolution and then uses a 3 × 3 depthwise convolution to capture local information. Finally, it uses a 1 × 1 convolution to project the channel dimension back to the original size, so that the input and output can be added. Although depthwise convolutions have fewer parameters and floating-point operations (FLOPs) than regular convolutions, they are slow in the early stages, because they cannot fully utilize modern accelerators [57]. Fused-MBConv replaces the 1 × 1 expansion convolution and the 3 × 3 depthwise convolution with a single 3 × 3 expansion convolution. As illustrated in Fig. 1, Fused-MBConv consists of a 3 × 3 expansion layer, a squeeze-and-excitation (SE) module [58], and a 1 × 1 projection layer. The expansion ratio of the 3 × 3 expansion convolution is set to 2, and the 1 × 1 convolution reduces the channel dimension by the same ratio. The input and output of the Fused-MBConv block are connected when they have the same number of channels. Compared with the original ViT, a convolutional tokenizer that adopts Fused-MBConv in the early stages is more effective at encoding spatial information.
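Under these definitions, the Fused-MBConv block can be sketched in PyTorch as follows; the BatchNorm/SiLU placement and the SE reduction ratio are assumptions borrowed from EfficientNetV2-style blocks [57], not settings stated in the text.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-excitation [58]: global pool, bottleneck, channel gate."""
    def __init__(self, ch, r=4):                          # reduction ratio r is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.SiLU(),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)                           # channelwise reweighting

class FusedMBConv(nn.Module):
    """Fused-MBConv: a single 3x3 expansion conv (ratio 2) replacing the
    1x1 expansion + 3x3 depthwise pair, then SE and a 1x1 projection."""
    def __init__(self, in_ch, out_ch, expand=2):
        super().__init__()
        mid = in_ch * expand
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, padding=1, bias=False),  # 3x3 expansion layer
            nn.BatchNorm2d(mid), nn.SiLU(),
            SE(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False),            # 1x1 projection layer
            nn.BatchNorm2d(out_ch))
        self.use_skip = (in_ch == out_ch)   # shortcut only when channel counts match

    def forward(self, x):
        y = self.body(x)
        return x + y if self.use_skip else y
```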
2) Position Lightweight Multihead Self-Attention: The computational complexity of self-attention in transformers is quadratic in the spatial size of the inputs. Using self-attention modules to process high-resolution images would inevitably cause low computational efficiency and insufficient memory. To alleviate this problem, we propose the PLMSA module, which constructs relationships between each feature vector and eigenvector clusters. An eigenvector cluster is a compact feature vector collected from a subset of the feature vectors in the input tensor. The clusters are implemented using the spatial pyramid pooling-fast technique [59], which reduces the computational burden and provides multiscale contextual information.

As illustrated in Fig. 2, the input token $X \in \mathbb{R}^{w \times h \times d}$ is reshaped into $\hat{X} \in \mathbb{R}^{n \times d}$, where $w$, $h$, and $d$ represent the width, height, and channels of $X$, respectively, and $n = wh$. Meanwhile, we use 2 × 2 max pooling with stride 2 to reduce the spatial size of $K_p$ and $V_p$ before the self-attention operation. Specifically, we feed $X \in \mathbb{R}^{w \times h \times d}$ into two successive max-pooling layers to generate $X_1^p \in \mathbb{R}^{(w/2) \times (h/2) \times d}$ and $X_2^p \in \mathbb{R}^{(w/4) \times (h/4) \times d}$, respectively. $X_1^p$ and $X_2^p$ are separately reshaped into $\mathbb{R}^{(n/4) \times d}$ and $\mathbb{R}^{(n/16) \times d}$ and then concatenated to form $X_p \in \mathbb{R}^{(5n/16) \times d}$. In addition, we inject positional information into each attention block by introducing a relative bias $B_p$ to the attention maps, and the corresponding position lightweight self-attention (PLSA) is defined as follows:

$$\mathrm{PLSA} = \mathrm{Softmax}\left(\frac{Q_p K_p^{T}}{\sqrt{d_k}} + B_p\right) V_p \tag{7}$$

where $Q_p = \hat{X} W_Q^p \in \mathbb{R}^{n \times d}$, $K_p = X_p W_K^p \in \mathbb{R}^{(5n/16) \times d}$, $V_p = X_p W_V^p \in \mathbb{R}^{(5n/16) \times d}$, and $B_p \in \mathbb{R}^{n \times (5n/16)}$. Finally, the PLMSA
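A single-head PyTorch sketch of the PLSA operation in (7) is given below. The module name is ours, the directly learned bias table stands in for the relative bias $B_p$, and the multihead split, normalization, and output projection of the full PLMSA module are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PLSA(nn.Module):
    """Position lightweight self-attention, Eq. (7), single head.
    Keys/values come from two max-pooled copies of the input (strides 2
    and 4), shrinking the key/value length from n = w*h to 5n/16.
    Assumes a fixed input size with w and h divisible by 4."""
    def __init__(self, dim, w, h):
        super().__init__()
        n = w * h
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bias = nn.Parameter(torch.zeros(n, n // 4 + n // 16))  # B_p

    def forward(self, x):                   # x: (B, d, h, w)
        b, d = x.shape[:2]
        xh = x.flatten(2).transpose(1, 2)   # X_hat: (B, n, d)
        x1 = self.pool(x)                   # (B, d, h/2, w/2) -> n/4 tokens
        x2 = self.pool(x1)                  # (B, d, h/4, w/4) -> n/16 tokens
        xp = torch.cat([x1.flatten(2), x2.flatten(2)], dim=2).transpose(1, 2)
        q, k, v = self.wq(xh), self.wk(xp), self.wv(xp)
        attn = q @ k.transpose(1, 2) / d**0.5 + self.bias   # (B, n, 5n/16)
        return F.softmax(attn, dim=-1) @ v                  # (B, n, d)
```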
Fig. 4. Indian Pines dataset. (a) Land-cover type and sample settings.
(b) False-color image. (c) Spatial distribution of training samples.
TABLE I
DETAILS OF THE LIT ARCHITECTURE
Fig. 7. Effect of window size on OA. (a) Salinas dataset. (b) DFC 2018
dataset.
TABLE II
COMPARING THE PERFORMANCE OF LIT TRAINED WITH SAMPLES GENERATED BY CRS AND CMS SAMPLING STRATEGIES

TABLE III
ABLATION ANALYSIS OF THE PROPOSED LIT ON THE SALINAS DATASET

TABLE IV
SENSITIVITY ANALYSIS OF THE MODEL LAYOUT IN TERMS OF OA (%) ON INDIAN PINES, SALINAS, AND DFC 2018 DATASETS

TABLE V
COMPARISON OF TRAINABLE PARAMETERS (PARAMS), FLOPS, TRAINING (TRAIN), AND INFERENCE (INFER) TIME, AND OA BETWEEN PATCH-BASED AND FCN-BASED FRAMEWORKS ON THE SALINAS DATASET
TABLE VI
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE INDIAN PINES DATASET
Fig. 8. Classification maps of different methods on the Indian Pines dataset. (a) GT. (b) SVM. (c) SVM-EPF. (d) SSRN. (e) SSSAN. (f) SMRN. (g) SSFCN.
(h) ENL-FCN. (i) SpectralFormer. (j) ConvNeXt. (k) MobileViT. (l) LiT.
classification. Both learning frameworks were trained on the same training set and tested on the same test set. The number of trainable parameters (Params) and FLOPs were used to measure model complexity and theoretical computational overhead, respectively. Meanwhile, training time and inference time were used to represent actual efficiency. The corresponding results on the Salinas dataset are summarized in Table V. Table V shows that the number of parameters in both frameworks is almost equal. The FCN-based framework is less efficient than the patch-based framework in terms of FLOPs and training time. However, the FCN-based framework achieved higher accuracy (89.03% versus 86.40%) and faster inference speed (1.93 versus 22.84 s) than the patch-based classification. Although the FCN-based framework consumed more training time than the patch-based framework, its actual computation is faster, because the time-consuming training process is performed offline, while inference speed is the main factor determining whether a method is practical. Therefore, the patch-based framework is a better choice when the amount of data is small, while the FCN-based framework is more suitable for large-scale applications (e.g., classification of satellite images) or real-time applications (e.g., those on mobile devices).
TABLE VII
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE SALINAS DATASET

TABLE VIII
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE DFC 2018 DATASET
E. Comparison With Other Methods

1) Quantitative Evaluation: Tables VI–VIII summarize the mean and standard deviation of the OA, AA, and κ, as well as the average PA for each class. The best performance in each row is highlighted in bold.

As a conventional classifier, SVM exhibits the worst results, especially on the DFC 2018 dataset. SVM considers raw pixel vectors as features and feeds them directly into classifiers, which cannot excavate the discriminative information contained in the data. SVM-EPF dramatically improves the performance of SVM by applying edge-preserving filtering. With the introduction of spatial information and the strong learning capability of DL techniques, patch-based networks, such as SSRN, SSSAN, and SMRN, also significantly outperform SVM. Although these patch-based methods achieve remarkable accuracy, their limited patch size restricts the spatial context and tends to lose some important information. FCN-based networks (i.e., SSFCN and ENL-FCN) only achieved on-par results with patch-based networks (i.e., SSRN, SSSAN, and SMRN) under the proposed sampling strategy.
Fig. 9. Classification maps of different methods on the Salinas dataset. (a) GT. (b) SVM. (c) SVM-EPF. (d) SSRN. (e) SSSAN. (f) SMRN. (g) SSFCN.
(h) ENL-FCN. (i) SpectralFormer. (j) ConvNeXt. (k) MobileViT. (l) LiT.
For example, on the Salinas dataset, SSFCN and ENL-FCN achieved 87.58% and 85.34%, respectively, while SSRN, SSSAN, and SMRN achieved 86.48%, 86.65%, and 85.74%, respectively. These FCN-based networks mainly use a stack of many small spatial convolutions (e.g., 3 × 3) to enlarge the receptive field. According to the effective receptive field (ERF) theory, the size of the ERF is proportional to $O(K\sqrt{L})$, where $K$ is the kernel size and $L$ is the depth, i.e., the number of layers [65]; for instance, stacking $L = 16$ layers of $3 \times 3$ convolutions yields an ERF on the order of only $3\sqrt{16} = 12$ pixels. Thus, the ERF of FCN-based networks is still limited, which may restrict their performance gains. Although the transformer is good at modeling global dependencies, SpectralFormer is inferior to the CNN-based and GCN-based models, since it lacks the inductive biases built into CNNs and still follows patch-based classification. As for the SOTA backbones, ConvNeXt performs well on all three datasets, while MobileViT's results are not satisfactory. Although MobileViT employs both local information and global connectivity to reason about the relationships between image contents, it is still heavyweight for HSI datasets.

The proposed LiT exhibited a significant improvement over all the comparison algorithms, because it uses transformers to capture long-range dependencies and convolution to gather local information. Both long-range connectivity and local information contribute to reasoning about the relationships between image contexts. For example, LiT achieved a maximum OA of 89.03% on the Salinas dataset, outperforming SVM, SSRN, SSSAN, SMRN, SSFCN, ENL-FCN, SpectralFormer, ConvNeXt, and MobileViT by 9.49%, 2.55%, 2.38%, 3.29%, 1.45%, 3.69%, 5.47%, 1.01%, and 3.63%, respectively. Even for some challenging classes with similar spectral and textural features (e.g., stubble and grapes_untrained on the Salinas dataset), the proposed LiT still achieved better accuracy by using its dynamic TwinMSA modules to capture complicated relational interactions between different positions. These results powerfully demonstrate the practicality and value of our LiT.

2) Visual Evaluation: Figs. 8–10 show the classification maps produced by the comparison methods and the corresponding GT images. We can see that these qualitative comparisons are in agreement with the quantitative comparisons presented in Tables VI–VIII.
Fig. 10. Classification maps of different methods on the DFC 2018 dataset. (a) GT. (b) SVM. (c) SVM-EPF. (d) SSRN. (e) SSSAN. (f) SMRN. (g) SSFCN.
(h) ENL-FCN. (i) SpectralFormer. (j) ConvNeXt. (k) MobileViT. (l) LiT.
As shown in Figs. 8(c), 9(c), and 10(c), the classification maps of SVM exhibit considerable salt-and-pepper noise, because SVM only considers spectral information. Due to the similar spectral information between certain classes (e.g., grapes_untrained and vinyard_untrained on the Salinas dataset), distinguishing these easily confused categories using spectral information alone is difficult. SVM-EPF not only makes object edges sharper but also significantly mitigates the salt-and-pepper noise problem of SVM, especially on the Indian Pines dataset. The patch-based networks, such as SSRN, SSSAN, and SMRN, also produced better visual performance than SVM by incorporating the spatial features of neighboring pixels. Classification maps generated by patch-based networks contain fewer noisy points and are smoother. However, the boundaries in these classification maps tend to be distorted, since patch-based classification approaches assume that each pixel within a patch contributes equally to classifying the center pixel. This assumption is valid for most pixels in
TABLE IX
COMPARISON OF PARAMS, FLOPS, TRAINING (TRAIN), AND INFERENCE (INFER) TIME OF DIFFERENT METHODS ON THE SALINAS DATASET
TABLE X
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON THE SALINAS DATASET USING A RANDOM SAMPLING STRATEGY
[21] Z. Zhong, J. Li, Z. Luo, and M. Chapman, "Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 847–858, Feb. 2018.
[22] Y. Xu, B. Du, and L. Zhang, "Beyond the patchwise classification: Spectral–spatial fully convolutional networks for hyperspectral image classification," IEEE Trans. Big Data, vol. 6, no. 3, pp. 492–506, Sep. 2020.
[23] Z. Zheng, Y. Zhong, A. Ma, and L. Zhang, "FPGA: Fast patch-free global learning framework for fully end-to-end hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 8, pp. 5612–5626, Aug. 2020.
[24] C. Wang, L. Zhang, W. Wei, and Y. Zhang, "Toward effective hyperspectral image classification using dual-level deep spatial manifold representation," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5505614.
[25] Z. Dai, H. Liu, Q. V. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," in Advances in Neural Information Processing Systems, vol. 34. Red Hook, NY, USA: Curran Associates, 2021, pp. 3965–3977.
[26] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," 2021, arXiv:2107.06263.
[27] E. Sanderson and B. J. Matuszewski, "FCN-transformer feature fusion for polyp segmentation," in Proc. Annu. Conf. Med. Image Underst. Anal. Cham, Switzerland: Springer, 2022, pp. 892–907.
[28] W. Wang et al., "InternImage: Exploring large-scale vision foundation models with deformable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 14408–14419.
[29] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[30] Z. Zhong, Y. Li, L. Ma, J. Li, and W.-S. Zheng, "Spectral–spatial transformer network for hyperspectral image classification: A factorized architecture search framework," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5514715.
[31] X. He, Y. Chen, and Q. Li, "Two-branch pure transformer for hyperspectral image classification," IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[32] H. Yu, Z. Xu, K. Zheng, D. Hong, H. Yang, and M. Song, "MSTNet: A multilevel spectral–spatial transformer network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5532513.
[33] X. Huang, M. Dong, J. Li, and X. Guo, "A 3-D-Swin transformer-based hierarchical contrastive learning method for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5411415.
[34] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[35] L. Sun, G. Zhao, Y. Zheng, and Z. Wu, "Spectral–spatial feature tokenization transformer for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5522214.
[36] S. Mehta and M. Rastegari, "MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer," in Proc. Int. Conf. Learn. Represent. (ICLR), Jan. 2022.
[37] W. Wang, L. Liu, T. Zhang, J. Shen, J. Wang, and J. Li, "Hyper-ES2T: Efficient spatial–spectral transformer for the classification of hyperspectral remote sensing images," Int. J. Appl. Earth Observ. Geoinf., vol. 113, Sep. 2022, Art. no. 103005.
[38] W. Qi, C. Huang, Y. Wang, X. Zhang, W. Sun, and L. Zhang, "Global–local 3-D convolutional transformer network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5510820.
[39] T. Lu, M. Liu, W. Fu, and X. Kang, "Grouped multi-attention network for hyperspectral image spectral–spatial classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5507912.
[40] E. Ouyang, B. Li, W. Hu, G. Zhang, L. Zhao, and J. Wu, "When multigranularity meets spatial–spectral attention: A hybrid transformer for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 4401118.
[41] B. Zu, Y. Li, J. Li, Z. He, H. Wang, and P. Wu, "Cascaded convolution-based transformer with densely connected mechanism for spectral–spatial hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5513119.
[42] X. Yang, W. Cao, Y. Lu, and Y. Zhou, "Hyperspectral image transformer classification networks," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5528715.
[43] Z. Pan, B. Zhuang, J. Liu, H. He, and J. Cai, "Scalable vision transformers with hierarchical pooling," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 367–376.
[44] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, "Escaping the big data paradigm with compact transformers," 2021, arXiv:2104.05704.
[45] Y. Chen et al., "Mobile-former: Bridging MobileNet and transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5260–5269.
[46] K. Li, R. Yu, Z. Wang, L. Yuan, G. Song, and J. Chen, "Locality guidance for improving vision transformers on tiny datasets," in Proc. Eur. Conf. Comput. Vis. (ECCV), vol. 61, Nov. 2022, pp. 110–127.
[47] L. Zhu, X. Wang, Z. Ke, W. Zhang, and R. W. Lau, "BiFormer: Vision transformer with bi-level routing attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 10323–10333.
[48] X. Liu, H. Peng, N. Zheng, Y. Yang, H. Hu, and Y. Yuan, "EfficientViT: Memory efficient vision transformer with cascaded group attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 14420–14430.
[49] H. Huang, X. Zhou, J. Cao, R. He, and T. Tan, "Vision transformer with super token sampling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 22690–22699.
[50] W. Li, G. Wu, F. Zhang, and Q. Du, "Hyperspectral image classification using deep pixel-pair features," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.
[51] G. Sun et al., "Deep fusion of localized spectral features and multi-scale spatial features for effective classification of hyperspectral images," Int. J. Appl. Earth Observ. Geoinf., vol. 91, Sep. 2020, Art. no. 102157.
[52] H. Lee and H. Kwon, "Going deeper with contextual CNN for hyperspectral image classification," IEEE Trans. Image Process., vol. 26, no. 10, pp. 4843–4855, Oct. 2017.
[53] S. Hao, W. Wang, Y. Ye, T. Nie, and L. Bruzzone, "Two-stream deep architecture for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2349–2361, Apr. 2018.
[54] Y. Shen et al., "Efficient deep learning of nonlocal features for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 7, pp. 6029–6043, Jul. 2021.
[55] Q. Zhu et al., "A spectral–spatial-dependent global learning framework for insufficient and imbalanced hyperspectral image classification," IEEE Trans. Cybern., vol. 52, no. 11, pp. 11709–11723, Nov. 2022.
[56] C. Yang et al., "Lite vision transformer with enhanced self-attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11988–11998.
[57] M. Tan and Q. Le, "EfficientNetV2: Smaller models and faster training," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10096–10106.
[58] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[59] G. Jocher et al., "YOLOv5 by Ultralytics," Oct. 2020.
[60] J. Fu, J. Liu, J. Jiang, Y. Li, Y. Bao, and H. Lu, "Scene segmentation with dual relation-aware attention network," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 6, pp. 2547–2560, Jun. 2021.
[61] J. Liang, J. Zhou, Y. Qian, L. Wen, X. Bai, and Y. Gao, "On the sampling strategy for evaluation of spectral–spatial methods in hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 862–880, Feb. 2017.
[62] X. Kang, S. Li, and J. A. Benediktsson, "Spectral–spatial hyperspectral image classification with edge-preserving filtering," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2666–2677, May 2014.
[63] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, "A ConvNet for the 2020s," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11966–11976.
[64] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2017, arXiv:1711.05101.
[65] X. Ding, X. Zhang, J. Han, and G. Ding, "Scaling up your kernels to 31×31: Revisiting large kernel design in CNNs," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11963–11975.
[66] X. Zhang, A. Zhang, G. Sun, and Y. Yao, "Multiscale convolution network with region-based max voting for hyperspectral images classification," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Sep. 2020, pp. 64–67.
[67] J. Chen et al., "Run, don't walk: Chasing higher FLOPS for faster neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 12021–12031.
Xuming Zhang received the B.Sc. and M.Sc. degrees from the China University of Petroleum (East China), Qingdao, China, in 2018 and 2021, respectively. She is currently pursuing the Ph.D. degree in geography with Nanjing University, Nanjing, China.

Her research interests include hyperspectral image processing, multisensor data fusion, high-resolution remote sensing processing, deep learning, and its applications to semantic segmentation and classification.

Yuanchao Su (Senior Member, IEEE) received the B.S. and M.Sc. degrees from the Xi'an University of Science and Technology, Xi'an, China, in 2012 and 2015, respectively, and the Ph.D. degree from Sun Yat-sen University, Guangzhou, China, in 2019.

From 2013 to 2015, he was an Exchange Postgraduate with the Optical Laboratory, Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China. From 2018 to 2019, he was a Visiting Researcher with the Advanced Imaging and Collaborative Information Processing Group, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA. In 2019, he joined the Department of Remote Sensing, College of Geomatics, Xi'an University of Science and Technology, where he is currently a Lecturer and leads the Hyperspectral Information and Intelligent Computation Group. Since 2021, he has been a Visiting Researcher with the Key Laboratory of Computational Optical Imaging Technology, Aerospace Information Research Institute, Chinese Academy of Sciences. His research interests include hyperspectral unmixing, hyperspectral classification, neural networks, and deep learning.

Dr. Su is a Senior Member of the IEEE Geoscience and Remote Sensing Society. He serves as a Reviewer for many international journals, including the IEEE TRANSACTIONS ON CYBERNETICS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS, and the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS.

Lorenzo Bruzzone (Fellow, IEEE) received the Laurea (M.S.) degree (summa cum laude) in electronic engineering and the Ph.D. degree in telecommunications from the University of Genoa, Genoa, Italy, in 1993 and 1998, respectively.

He is the Founder and the Director of the Remote Sensing Laboratory, Department of Information Engineering and Computer Science, University of Trento, Trento, Italy, where he is currently a Full Professor of telecommunications. He is a principal investigator of many research projects. Among the others, he is a Principal Investigator of the Radar for Icy Moon Exploration (RIME) instrument in the framework of the JUpiter ICy moons Explorer (JUICE) mission of the European Space Agency (ESA) and the Science Lead of the High Resolution Land Cover Project in the framework of the Climate Change Initiative of ESA. His research interests include the areas of remote sensing, radar and synthetic aperture radar (SAR), signal processing, machine learning, and pattern recognition.

Dr. Bruzzone has been a member of the Administrative Committee of the IEEE Geoscience and Remote Sensing Society (GRSS) since 2009, where he has been the Vice President of professional activities since 2019. He ranked first place in the Student Prize Paper Competition of the 1998 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Seattle, in July 1998. He was a recipient of many international and national honors and awards, including the recent IEEE GRSS 2015 Outstanding Service Award, the 2017 and 2018 IEEE IGARSS Symposium Prize Paper Awards, and the 2019 WHISPERS Outstanding Paper Award. Since 2003, he has been the Chair of the International Society for Optical Engineering (SPIE) Conference on Image and Signal Processing for Remote Sensing. He has been a Distinguished Speaker of the IEEE Geoscience and Remote Sensing Society between 2012 and 2016. He is the Cofounder of the IEEE International Workshop on the Analysis of Multi-Temporal Remote-Sensing Images (MultiTemp) series and is a member of the Permanent Steering Committee of this series of workshops. He has been the Founder of the IEEE Geoscience and Remote Sensing Magazine, for which he was the Editor-in-Chief between 2013 and 2017. He was a guest coeditor of many special issues of international journals. He is an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.