Spatial Gated Multi-Layer Perceptron For Land Use and Land Cover Mapping
This article has been accepted for publication in IEEE Geoscience and Remote Sensing Letters. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/LGRS.2024.3354175
Abstract—Due to its capacity to recognize detailed spectral differences, hyperspectral data have been used extensively for precise Land Use Land Cover (LULC) mapping. However, recent multi-modal methods have shown superior classification performance over algorithms that use a single data set. On the other hand, Convolutional Neural Networks (CNNs) are models extensively utilized for the hierarchical extraction of features. Vision transformers (ViTs), through a self-attention mechanism, have recently achieved superior modeling of global contextual information compared to CNNs. However, to harness their image classification strength, ViTs require substantial training datasets. In cases where the available training data are limited, current advanced multi-layer perceptrons (MLPs) can provide viable alternatives to both deep CNNs and ViTs. In this paper, we developed the SGU-MLP, a deep learning algorithm that effectively combines MLPs and spatial gating units (SGUs) for precise LULC mapping using multi-modal multi-spectral, LiDAR, and hyperspectral data. Results illustrate the superiority of the developed SGU-MLP classification algorithm over several CNN and CNN-ViT-based models, including HybridSN, ResNet, iFormer, EfficientFormer, and CoAtNet; the SGU-MLP model consistently outperformed these benchmark algorithms. The code will be made publicly available at https://github.com/aj1365/SGUMLP

Index Terms—Attention mechanism, image classification, spatial gating unit (SGU), vision transformers.

I. INTRODUCTION

urban sprawl, is essential for understanding its environmental consequences, as well as promoting the adoption of more sustainable forms of urban expansion. Hyperspectral (HS) data have been utilized widely for accurate LULC mapping due to their ability to distinguish subtle spectral differences [2]. However, recent research on multi-modal models, such as the multi-modal fusion transformer (MFT) network, has proven their superior classification performance compared to models that utilize only hyperspectral data [3].

It has been shown that, due to the complex characteristics of HS data, conventional machine learning models, such as random forests, struggle to accurately classify HS imagery (HSI) [2]. Furthermore, traditional models do not take spatial information into account. Additionally, hyperspectral imaging often involves a naturally nonlinear interaction between the corresponding ground classes and the acquired spectral information [2]. On the other hand, deep learning models have been used increasingly for HS classification in recent years. In particular, Convolutional Neural Networks (CNNs) are widely used because of their ability for automatic hierarchical feature extraction. To address the limitation of CNNs in capturing global contextual information, vision transformers (ViTs) have been successfully employed for HSI classification [4]. ViTs use self-attention mechanisms to obtain global contextual information more effectively than CNNs, significantly increasing the accuracy of HS classification [3].
consequently, minimizing the necessity for extensive training data.

This letter introduces the SGU-MLP in Section II, illustrates the experiments and analyses the results in Section III, and highlights the concluding remarks in Section IV.

Fig. 1: Graphical representation of the spatial gated multi-layer perceptron framework for land use and land cover classification. The MLP-Mixer layer includes two MLPs to extract spatial information. ⊙ represents channel-wise concatenation.

II. PROPOSED CLASSIFICATION FRAMEWORK

As illustrated in Fig. 1, the SGU-MLP is developed for image classification using a small number of training samples. For efficient application of multi-scale representation in the classification task, we incorporated a computationally light and straightforward depth-wise CNN-based architecture. As presented in Fig. 2, the MLP-Mixer layer of the developed model includes two different types of layers: (i) MLPs applied across image patches to extract spatial information, and (ii) MLPs applied individually to extract per-location features from the image inputs. In addition, in each MLP block, the SGU is utilized to enable the developed algorithm to effectively learn intricate spatial relationships among the tokens of the input data.
A. Depth-wise Convolution Block (DWC):

The DWC architecture is light and straightforward and is based on CNNs. With so many trainable parameters and the limited available training data, a higher probability of overfitting exists during the training process. Hence, to address the overfitting issue and capture multi-scale feature information, we incorporated three depth-wise convolutions in parallel. These convolutions have 20 output channels each, with kernel (k) sizes of 1 × 1, 3 × 3, and 5 × 5, respectively. Feature maps X with a size of 9 × 9 × d are the input to the DWC block, which produces the output D_Z, where d is the number of bands:

D_Z = DWConv2D_{k×k}(X),  k = 1, 3, 5    (1)

The output maps of the three depth-wise CNNs are added and fed to the MLP-Mixer blocks.
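To make this block concrete, a minimal TensorFlow/Keras sketch is given below. The text only specifies three parallel depth-wise convolutions with 20 output channels and 1 × 1, 3 × 3, and 5 × 5 kernels whose outputs are summed; realizing the 20-channel output with separable (depth-wise plus point-wise) convolutions, and all layer and variable names, are our assumptions rather than the authors' released implementation.

```python
# Illustrative sketch of the DWC block (Eq. (1)), assuming TensorFlow/Keras.
# Three parallel depth-wise convolutions (kernels 1x1, 3x3, 5x5) with 20 output
# channels each; their output maps are summed and passed on to the MLP-Mixer.
import tensorflow as tf

def dwc_block(x, channels=20, kernel_sizes=(1, 3, 5)):
    """x: patch tensor of shape (batch, 9, 9, d); returns (batch, 9, 9, channels)."""
    branches = [
        tf.keras.layers.SeparableConv2D(channels, k, padding="same")(x)
        for k in kernel_sizes
    ]
    return tf.keras.layers.Add()(branches)  # sum of the three multi-scale branches

# Example: d = 30 stacked bands (hypothetical, e.g., PCA-reduced HSI plus auxiliary channels).
patch = tf.keras.Input(shape=(9, 9, 30))
features = dwc_block(patch)  # -> shape (None, 9, 9, 20)
```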
B. Spatial gating unit (SGU):

The SGU is designed to extract complex spatial interactions across tokens. Unlike current ViT models, the SGU does not necessitate the use of positional embeddings; the positional information is instead obtained through spatial depth-wise convolutions [6], similar to the inverted bottlenecks employed in MobileNetV2 [7]. Considering the dense layer D (i.e., the input feature) in the MLP block, as illustrated in Fig. 1, the SGU uses a linear projection layer that benefits from a contraction operation across the spatial dimension of the cross-token interactions, defined by:

f_{W,b}(D) = W D + b    (2)

where W ∈ R^{n×n} is a matrix whose size equals the input sequence length, and n and b denote the sequence length and the token biases, respectively. It should be highlighted that the spatial projection matrix W does not depend on the input data, in contrast to self-attention models, where W(D) is created dynamically from D. The SGU can be formulated as:

S(D) = D · f_{W,b}(D)    (3)

where element-wise multiplication is represented by (·). The SGU equation can be improved by dividing D into D_1 and D_2 along the channel dimension. Thus, the SGU can be reformulated as:

S(D) = D_1 · f_{W,b}(D_2)    (4)

The output map of the DWC block is flattened and fed to the MLP-Mixer layer. Considering a dense layer of size 256 × 256, D_1 and D_2 both have a size of 256 × 128; f_{W,b}(D_2) has a size of 256 × 128, and so does S(D).
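For clarity, a minimal sketch of the SGU of Eqs. (2) to (4) is given below, assuming a TensorFlow/Keras implementation. The tensor layout (batch, tokens, channels) and the use of a Dense layer over the token axis to realize the learned spatial projection W are our assumptions.

```python
# Illustrative sketch of the spatial gating unit (Eqs. (2)-(4)), assuming Keras.
# The input D of shape (batch, n, c) is split into D1 and D2 along the channel
# axis; D2 is projected across the token (spatial) dimension with an
# input-independent matrix W and bias b, and the result gates D1 element-wise.
import tensorflow as tf

class SpatialGatingUnit(tf.keras.layers.Layer):
    def __init__(self, seq_len, **kwargs):
        super().__init__(**kwargs)
        # Dense over the token axis implements f_{W,b}(D2) = W D2 + b, W in R^{n x n}.
        self.spatial_proj = tf.keras.layers.Dense(seq_len)

    def call(self, d):
        d1, d2 = tf.split(d, num_or_size_splits=2, axis=-1)  # channel split
        d2 = tf.transpose(d2, perm=[0, 2, 1])                # (batch, c/2, n)
        d2 = self.spatial_proj(d2)                           # Eq. (2), token projection
        d2 = tf.transpose(d2, perm=[0, 2, 1])                # (batch, n, c/2)
        return d1 * d2                                       # Eq. (4), element-wise gating

# Example: 81 tokens (a 9 x 9 patch) with 256 channels -> gated output of 128 channels.
sgu = SpatialGatingUnit(seq_len=81)
gated = sgu(tf.random.normal((2, 81, 256)))  # -> shape (2, 81, 128)
```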
Fig. 2: The MLP-Mixer layer: layer normalization, token-mixing and channel-mixing MLP blocks, and skip connections.
C. Multi-layer Perceptron Mixer Block (MLP-Mixer):

In current advanced deep vision architectures, layers combine features in one or more of the following ways: first, at a given spatial location; second, among various spatial locations; or third, both operations simultaneously, with k × k convolutions (for k > 1) and pooling operations (i.e., the second operation) incorporated in CNNs. Convolutions with kernel size 1 × 1 perform only the first operation, whereas convolutions with larger kernels accomplish both the first and second operations. Self-attention layers in ViTs and other attention-based structures include the first and second operations, while models based on MLPs perform only the first operation. The objective of the MLP-Mixer architecture is to separate cross-location (height and width mixing) operations from per-location (channel-mixing) operations, as presented in Fig. 2 [5]. A series of E non-overlapping image patches X from the output feature D_Z of the DWC block is the input to the MLP-Mixer and is projected to a given hidden dimension C, resulting in a two-dimensional table M ∈ R^{E×C}. The output features of the DWC block are first flattened and then fed to the MLP-Mixer layers. Given an input image of size H × W and patches of size F × F, the number of patches is E = H×W/F², and all resulting patches are projected with the same projection matrix. For instance, considering an input image of size 9 × 9, the reshaped feature has a size of 9 × 9 = 81. As we set the dimension of the token-mixing MLP to 256, the output feature map has a dimension of 81 × 256. The MLP-Mixer consists of several layers of identical size (i.e., 4 layers), where each layer has two MLP blocks. The first, token-mixing, MLP block is applied to the columns of the table M (i.e., it is applied to the transposed input M^T), while the second, channel-mixing, MLP block is applied to the rows of M. Each MLP block contains two fully connected layers, and a non-linearity is applied independently to each row of the input tensor. As such, each MLP-Mixer layer can be formulated as:

U_{ι,i} = M_{ι,i} + W_2 ξ(W_1 LN(M)_{ι,i}),   i = 1, ..., C    (5)

Y_{j,ι} = U_{j,ι} + W_4 ξ(W_3 LN(U)_{j,ι}),   j = 1, ..., E    (6)

where ξ denotes the element-wise non-linearity function and LN denotes layer normalization. Notably, the MLP-Mixer has linear computational complexity, which distinguishes it from vision transformers with quadratic computational complexity and, consequently, exhibits a high level of computational efficiency.
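The following minimal sketch shows one such mixer layer implementing Eqs. (5) and (6) in TensorFlow/Keras. The hidden widths of the two MLP blocks are assumptions (the letter only fixes the token-mixing dimension at 256 and stacks four such layers), and the SGU that the SGU-MLP additionally inserts into the MLP blocks is omitted here for brevity.

```python
# Illustrative sketch of one MLP-Mixer layer (Eqs. (5)-(6)), assuming Keras.
# M has shape (batch, E, C): E patches (tokens) and C channels. Token mixing is
# applied to the columns of M (i.e., to the transposed table), channel mixing
# to its rows; both use two dense layers with a GELU non-linearity (xi), layer
# normalization (LN), and skip connections.
import tensorflow as tf

def mlp_block(x, hidden_dim, out_dim):
    x = tf.keras.layers.Dense(hidden_dim, activation="gelu")(x)  # W1/W3 and xi
    return tf.keras.layers.Dense(out_dim)(x)                     # W2/W4

def mixer_layer(m, num_patches, channels, token_dim=256, channel_dim=256):
    # Eq. (5): cross-location (token) mixing on LN(M)^T with a skip connection.
    y = tf.keras.layers.LayerNormalization()(m)
    y = tf.keras.layers.Permute((2, 1))(y)       # (batch, C, E)
    y = mlp_block(y, token_dim, num_patches)
    y = tf.keras.layers.Permute((2, 1))(y)       # (batch, E, C)
    u = m + y
    # Eq. (6): per-location (channel) mixing on LN(U) with a skip connection.
    y = tf.keras.layers.LayerNormalization()(u)
    y = mlp_block(y, channel_dim, channels)
    return u + y

# Example: a 9 x 9 patch gives E = 81 tokens; with C = 256 channels, each of the
# four stacked mixer layers preserves the 81 x 256 token table.
tokens = tf.keras.Input(shape=(81, 256))
mixed = mixer_layer(tokens, num_patches=81, channels=256)
```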
III. EXPERIMENTAL RESULTS

A. Experimental Data

Houston dataset: This dataset was captured over the University of Houston campus and the neighboring urban area. It consists of co-registered hyperspectral and multi-spectral data containing 144 and 8 bands, respectively, with 349 × 1905 pixels. More information can be found in [8].

Berlin dataset: This dataset has a size of 797 × 220 pixels and contains 244 spectral bands over Berlin. The Sentinel-1 dual-Pol (VV-VH) single-look complex (SLC) product represents the SAR data. The processed SAR data have a size of 1723 × 476 pixels. The HS data are interpolated with the nearest neighbor algorithm, as for the Houston dataset, to provide the same image size as the SAR data [9].

Augsburg dataset: This scene over the city of Augsburg, Germany, includes three distinct datasets: a spaceborne HS dataset, a dual-Pol PolSAR image, and a digital surface model (DSM). All image spatial resolutions were down-scaled to a single 30 m ground sampling distance (GSD). The scene comprises four features derived from the dual-Pol (VV-VH) SAR image and 180 spectral bands for the HS dataset, with 332 × 485 pixels [10].
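As a rough illustration of the data preparation implied above (nearest-neighbour interpolation of the HS cube to the SAR grid and PCA reduction of the spectral bands before patch extraction, cf. Fig. 1), the following sketch is provided. The choice of SciPy and scikit-learn, the function names, and the number of retained components are assumptions, not details given in the letter.

```python
# Illustrative sketch of the preprocessing described above: nearest-neighbour
# resampling of the HS cube to the co-registered SAR grid (e.g., Berlin:
# 797 x 220 -> 1723 x 476) and PCA reduction of the spectral bands.
import numpy as np
from scipy.ndimage import zoom
from sklearn.decomposition import PCA

def resample_and_reduce(hsi, target_hw, n_components=30):
    """hsi: (H, W, B) cube; target_hw: (H', W') size of the co-registered image."""
    factors = (target_hw[0] / hsi.shape[0], target_hw[1] / hsi.shape[1], 1.0)
    hsi_nn = zoom(hsi, factors, order=0)               # order=0 -> nearest neighbour
    flat = hsi_nn.reshape(-1, hsi_nn.shape[-1])
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(hsi_nn.shape[0], hsi_nn.shape[1], n_components)

# Small synthetic example with the Berlin band count (244); the real image sizes
# would be 797 x 220 resampled to the 1723 x 476 SAR grid.
cube = np.random.rand(80, 22, 244).astype(np.float32)
features = resample_and_reduce(cube, (172, 48))        # -> (172, 48, 30)
```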
B. Classification Results

The developed SGU-MLP classification model outperformed the other CNN and CNN-ViT-based algorithms, HybridSN, CoAtNet, EfficientFormer, iFormer, and ResNet, by about 11, 14, 15, 15, and 19 percentage points, respectively, in terms of average accuracy, as demonstrated in Table III.
TABLE I: Classification results for the Augsburg dataset in terms of F-1 score, where κ = Kappa index, OA = Overall Accuracy, and AA = Average Accuracy.

Class                HybridSN   ResNet   iFormer   EfficientFormer   CoAtNet   SGU-MLP
Forest               0.88       0.83     0.91      0.88              0.85      0.93
Residential          0.89       0.83     0.89      0.90              0.87      0.96
Industrial           0.43       0.15     0.35      0.40              0.22      0.59
Low Plants           0.87       0.88     0.88      0.88              0.98      0.96
Allotment            0.13       0.10     0.13      0.11              0.09      0.27
Commercial           0.04       0.05     0.10      0.11              0.16      0.29
Water                0.35       0.19     0.21      0.25              0.19      0.55
OA×100               82.28      79.07    82.82     82.72             81.32     91.13
AA×100               55.76      43.57    52.96     52.81             49.90     65.75
κ×100                74.85      69.34    75.37     75.24             73.12     87.24
Training time (min)  6          3        34        7                 13        4
TABLE II: Classification results for the Berlin dataset in terms of F-1 score, where κ = Kappa index, OA = Overall Accuracy, and AA = Average Accuracy.

Class                HybridSN   ResNet   iFormer   EfficientFormer   CoAtNet   SGU-MLP
Forest               0.71       0.64     0.69      0.73              0.65      0.72
Residential          0.80       0.81     0.82      0.81              0.76      0.81
Industrial           0.49       0.39     0.35      0.32              0.32      0.39
Low Plants           0.59       0.35     0.72      0.70              0.59      0.70
Soil                 0.65       0.72     0.70      0.67              0.75      0.72
Allotment            0.44       0.28     0.34      0.29              0.30      0.44
Commercial           0.45       0.25     0.29      0.24              0.29      0.27
Water                0.65       0.53     0.49      0.38              0.28      0.50
OA×100               66.81      63.70    68.60     68.17             63.14     70.56
AA×100               62.67      58.23    62.84     60.05             60.53     65.89
κ×100                55.84      47.61    55.28     54.32             49.21     57.85
Training time (min)  6          3        29        7                 13        4
TABLE III: Classification results for the Houston dataset in terms of F-1 score, where κ = Kappa index, OA = Overall Accuracy, and AA = Average Accuracy.

Class                HybridSN   ResNet   iFormer   EfficientFormer   CoAtNet   SGU-MLP
Healthy Grass        0.85       0.88     0.86      0.89              0.90      0.90
Stressed Grass       0.84       0.90     0.87      0.87              0.88      0.90
Synthetic Grass      0.84       0.78     0.50      0.58              0.72      0.97
Tree                 0.87       0.89     0.92      0.91              0.93      0.92
Soil                 0.96       0.94     0.93      0.95              0.85      1.00
Water                0.73       0.71     0.29      0.39              0.25      0.33
Residential          0.69       0.72     0.68      0.60              0.79      0.79
Commercial           0.69       0.39     0.68      0.56              0.60      0.81
Road                 0.70       0.57     0.75      0.77              0.82      0.85
Highway              0.58       0.52     0.45      0.54              0.54      0.83
Railway              0.70       0.54     0.67      0.57              0.67      0.82
Parking Lot 1        0.74       0.42     0.48      0.71              0.55      0.97
Parking Lot 2        0.94       0.61     0.72      0.78              0.58      0.86
Tennis Court         0.84       0.77     0.74      0.73              0.56      1.00
Running Track        0.64       0.82     0.83      0.61              0.92      0.95
OA×100               75.62      68.16    71.03     71.66             72.67     85.34
AA×100               76.44      71.42    72.86     70.69             75.62     87.25
κ×100                73.59      65.49    68.71     69.25             70.56     84.17
Training time (min)  4          2        20        5                 10        3

Fig. 5: Classification maps over the Houston dataset using a) the study image, b) CoAtNet, c) EfficientFormer, d) HybridSN, e) iFormer, f) ResNet, and g) the SGU-MLP.
C. Ablation study

An ablation study was performed to better understand the contribution and significance of the different parts of the developed SGU-MLP classification algorithm. As seen in Table IV, the inclusion of the DWC block and the SGU block increased the classification accuracy of the MLP-Mixer model by approximately 2 and 3 percentage points, respectively, in terms of average accuracy for the Augsburg dataset. The highest classification accuracy was achieved by including both the DWC and SGU blocks, with an average accuracy of 65.75%, increasing the classification accuracy of the MLP-Mixer algorithm by about 5 percentage points.

In the Berlin dataset, as illustrated in Table V, the inclusion of the SGU block and the DWC block increased the classification accuracy of the MLP-Mixer algorithm by about 1 and 2 percentage points, respectively, in terms of the Kappa index. By incorporating both the DWC and SGU blocks, the highest classification accuracy was attained, with a Kappa index of 57.85%; this increased the accuracy of the MLP-Mixer classifier by approximately 3 percentage points.

As demonstrated in Table VI, the inclusion of the DWC block and the SGU block increased the accuracy of the MLP-Mixer algorithm by approximately 2 and 1 percentage points, respectively, in terms of average accuracy for the Houston dataset. By including both the DWC and SGU blocks, the MLP-Mixer's classification accuracy was increased by approximately 7 percentage points, to 87.25%.

TABLE IV: Classification results for the Augsburg dataset in terms of F-1 score, where κ = Kappa index, OA = Overall Accuracy, and AA = Average Accuracy.

Class        MLP            SGU + MLP      DWC + MLP      SGU-MLP
Forest       0.87           0.92           0.91           0.93
Residential  0.91           0.93           0.92           0.96
Industrial   0.36           0.52           0.55           0.59
Low Plants   0.95           0.96           0.95           0.98
Allotment    0.20           0.20           0.21           0.27
Commercial   0.15           0.20           0.18           0.29
Water        0.55           0.57           0.54           0.55
OA×100       87.64 ± 0.61   88.90 ± 0.45   89.48 ± 0.52   91.13 ± 0.30
AA×100       60.96 ± 1.16   62.36 ± 1.59   63.59 ± 1.25   65.75 ± 0.42
κ×100        82.12 ± 0.90   84.01 ± 0.66   84.83 ± 0.78   87.24 ± 0.41

TABLE V: Classification results for the Berlin dataset in terms of F-1 score, where κ = Kappa index, OA = Overall Accuracy, and AA = Average Accuracy.

Class        MLP            SGU + MLP      DWC + MLP      SGU-MLP
Forest       0.74           0.72           0.72           0.72
Residential  0.81           0.80           0.82           0.81
Industrial   0.40           0.39           0.40           0.39
Low Plants   0.68           0.66           0.68           0.70
Soil         0.71           0.67           0.67           0.72
Allotment    0.44           0.43           0.44           0.44
Commercial   0.25           0.26           0.24           0.27
Water        0.50           0.44           0.45           0.50
OA×100       68.43 ± 0.83   69.12 ± 0.65   70.03 ± 0.17   70.56 ± 0.58
AA×100       65.16 ± 0.52   65.20 ± 0.50   64.70 ± 1.04   65.89 ± 0.26
κ×100        55.25 ± 0.93   56.06 ± 0.75   56.95 ± 0.25   57.85 ± 0.58

TABLE VI: Classification results for the Houston dataset in terms of F-1 score, where κ = Kappa index, OA = Overall Accuracy, and AA = Average Accuracy.

Class            MLP            SGU + MLP      DWC + MLP      SGU-MLP
Healthy Grass    0.89           0.90           0.90           0.90
Stressed Grass   0.90           0.91           0.90           0.90
Synthetic Grass  0.43           0.97           0.98           0.97
Tree             0.89           0.94           0.94           0.92
Soil             0.96           1.00           1.00           1.00
Water            0.46           0.17           0.22           0.33
Residential      0.79           0.78           0.80           0.79
Commercial       0.66           0.69           0.81           0.81
Road             0.82           0.84           0.81           0.85
Highway          0.62           0.59           0.62           0.83
Railway          0.73           0.83           0.80           0.82
Parking Lot 1    0.75           0.93           0.94           0.97
Parking Lot 2    0.79           0.69           0.88           0.86
Tennis Court     0.88           1.00           1.00           1.00
Running Track    0.82           0.96           0.95           0.95
OA×100           78.27 ± 1.53   82.45 ± 0.92   84.22 ± 0.81   85.34 ± 0.91
AA×100           80.53 ± 1.46   85.03 ± 0.68   86.38 ± 0.73   87.25 ± 0.68
κ×100            76.53 ± 1.65   81.08 ± 0.98   82.99 ± 0.88   84.17 ± 0.96

D. Computation cost

As illustrated in Table I, the proposed model required the least computation cost in terms of training time (4 min) on the Augsburg benchmark compared to the other ViT-based models: iFormer (34 min), CoAtNet (13 min), and EfficientFormer (7 min). Moreover, on the Berlin dataset, the SGU-MLP algorithm, with a required training time of 4 min, demonstrated better computational efficiency than the other ViTs: iFormer (29 min), CoAtNet (13 min), and EfficientFormer (7 min) (see Table II). In addition, as seen in Table III, on the Houston benchmark, the computational complexity of the SGU-MLP model was much lower in terms of training time (3 min) compared to the other implemented ViTs, including iFormer (20 min), CoAtNet (10 min), and EfficientFormer (5 min). It is worth mentioning that an RTX 2070 Max-Q GPU and an Intel Core i7 CPU were utilized. The optimizer, loss function, batch size, and learning rate were set to Adam, sparse categorical cross-entropy, 100, and 0.001, respectively, in all of the implemented models.
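For reproducibility, a minimal sketch of this shared training configuration in TensorFlow/Keras is given below; the model object, the data arrays, and the number of epochs are hypothetical placeholders, since only the optimizer, loss, batch size, and learning rate are reported in the letter.

```python
# Illustrative sketch of the shared training set-up reported above (Adam,
# sparse categorical cross-entropy, batch size 100, learning rate 0.001),
# assuming TensorFlow/Keras. `model`, the data arrays, and the epoch count are
# hypothetical placeholders.
import tensorflow as tf

def compile_and_train(model, x_train, y_train, x_val, y_val, epochs=100):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model.fit(x_train, y_train, batch_size=100, epochs=epochs,
                     validation_data=(x_val, y_val))
```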
IV. CONCLUSION

In this study, we developed the SGU-MLP algorithm, based on advanced MLP models and a spatial gating unit, for land use and land cover mapping; it demonstrated superior classification accuracy compared to several CNN and CNN-ViT-based models. The obtained results illustrated that the utilized MLP-Mixer architecture could obtain greater cross-location (height and width) and per-location (channel) information compared to the current advanced ViTs.
Additionally, the SGU increased the classification accuracy by efficiently capturing complex spatial interactions across image tokens. Moreover, the SGU-MLP algorithm was demonstrated to be much more computationally efficient in terms of training time than the other implemented ViT-based models: iFormer, EfficientFormer, and the state-of-the-art CoAtNet.
REFERENCES

[1] J. Yang, A. Guo, Y. Li, Y. Zhang, and X. Li, "Simulation of landscape spatial layout evolution in rural-urban fringe areas: a case study of Ganjingzi district," GIScience & Remote Sensing, vol. 56, no. 3, pp. 388–405, 2019. [Online]. Available: https://doi.org/10.1080/15481603.2018.1533680
[2] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, "Deep learning for hyperspectral image classification: An overview," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6690–6709, 2019.
[3] S. K. Roy, A. Deria, D. Hong, B. Rasti, A. Plaza, and J. Chanussot, "Multimodal fusion transformer for remote sensing image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–20, 2023.
[4] H. Yan, E. Zhang, J. Wang, C. Leng, A. Basu, and J. Peng, "Hybrid Conv-ViT network for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 20, pp. 1–5, 2023.
[5] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy, "MLP-Mixer: An all-MLP architecture for vision," in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 24261–24272.
[6] H. Liu, Z. Dai, D. So, and Q. V. Le, "Pay attention to MLPs," in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 9204–9215. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2021/file/4cc05b35c2f937c5bd9e7d41d3686fff-Paper.pdf
[7] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[8] C. Debes, A. Merentitis, R. Heremans, J. Hahn, N. Frangiadakis, T. van Kasteren, W. Liao, R. Bellens, A. Pižurica, S. Gautama, W. Philips, S. Prasad, Q. Du, and F. Pacifici, "Hyperspectral and LiDAR data fusion: Outcome of the 2013 GRSS data fusion contest," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2405–2418, 2014.
[9] A. Okujeni, S. van der Linden, and P. Hostert, "Berlin-Urban-Gradient dataset 2009: An EnMAP preparatory flight campaign," 2016.
[10] D. Hong, J. Hu, J. Yao, J. Chanussot, and X. X. Zhu, "Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 178, pp. 68–80, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0924271621001362
[11] S. K. Roy, G. Krishna, S. R. Dubey, and B. B. Chaudhuri, "HybridSN: Exploring 3-D-2-D CNN feature hierarchy for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 17, no. 2, pp. 277–281, 2019.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[13] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, pp. 11106–11115, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/17325
[14] Y. Li, G. Yuan, Y. Wen, J. Hu, G. Evangelidis, S. Tulyakov, Y. Wang, and J. Ren, "EfficientFormer: Vision transformers at MobileNet speed," 2022.
[15] Z. Dai, H. Liu, Q. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," in Advances in Neural Information Processing Systems, vol. 34, 2021.