Article
Vision Transformers for Remote Sensing Image Classification
Yakoub Bazi 1, * , Laila Bashmal 1 , Mohamad M. Al Rahhal 2 , Reham Al Dayil 1 and Naif Al Ajlan 1
1 Computer Engineering Department, College of Computer and Information Sciences, King Saud University,
Riyadh 11543, Saudi Arabia; [email protected] (L.B.); [email protected] (R.A.D.);
[email protected] (N.A.A.)
2 Applied Computer Science Department, College of Applied Computer Science, King Saud University,
Riyadh 11543, Saudi Arabia; [email protected]
* Correspondence: [email protected]; Tel.: +966-101469629
easier to deploy to satisfy the requirements of rapid monitoring, assessment, and mapping.
They can work at lower altitudes compared to the piloted aircraft, which provides spatial
resolution at the centimeters level. They can fly any time the weather permits, leading to
improvements in temporal resolution. As the spatial resolution increases, images are likely
to contain noisy and outlying descriptors. A recent study [5] proposed a convolutional
neural network (CNN) to classify images captured by a camera mounted on a UAV. Al-Najjar et al. [6] proposed a CNN model to classify a digital surface model alongside UAV images. Liu et al. [7] combined CNNs with object-based image analysis (OBIA) for land
cover classification using multiview data. The author in [8] proposed a two-branch neural
network to assign multiple class labels to UAV imagery.
The topic of scene classification has been an active research field lately to face the
challenging problem of effectively interpreting remote-sensing images. It is the task of taking an image and correctly labeling it with a predefined class, as shown in Figure 1. Scene
classification is an important task for many applications, such as land management [9],
urban planning [10], and modeling wildfires [11].
The early works on scene classification were based on handcrafted features designed by human experts, including local binary patterns (LBP) [12], histogram of oriented
gradients (HOG) [13], and the scale-invariant feature transform (SIFT) [14]. Conventional
scene-classification methods depend on encoding handcrafted features with different
models such as the bag-of-words (BoWs) [15], Fisher vectors (FV) [16], or the vector of
locally aggregated descriptors (VLAD) [17].
On the other hand, deep learning methods such as Deep Belief Networks (DBNs) [18] and stacked auto-encoders [19] have achieved remarkable success in several applications, including remote-sensing image classification. In particular, CNNs have surpassed traditional methods in many applications [20–22]. A key advantage of these methods is that they provide an end-to-end solution requiring minimal feature engineering. Other approaches based on recurrent neural networks (RNNs) [23], generative adversarial networks (GANs) [24,25], graph convolutional networks (GCNs) [26], and long short-term memory (LSTM) [27] have also been introduced. In a recent contribution, the authors
considered remote-sensing scene classification as a multiple-instance learning (MIL) prob-
lem [28]. They proposed a multiple-instance densely connected network to highlight
the local semantics relevant to the scene label. The method enhances the capability of
local semantic representation by effectively discarding useless information. Yu et al. [29]
proposed the attention GAN, which integrates GANs with the attention mechanism to
enhance the representation power of the discriminator for aerial scene classification. The
authors in [30] introduced a simple fine-tuning method using an auxiliary classification
loss. They showed how to combat the vanishing gradient problem using an auxiliary loss
function. Sun et al. [31] proposed a gated bidirectional network for feature fusion. Liu
et al. [32] combined the feature maps from intermediate and fully connected layers and
input them to the classifier for classification. Yu et al. [33] combined two pretrained CNNs
with the two-stream fusion technique to classify high-resolution aerial scenes. Cheng
et al. [34] proposed a metric-learning regularization term on CNN features, optimizing a new discriminative objective function that makes the learned features more discriminative.
Xue et al. [35] proposed a method using three deep networks to extract deep features from
the image separately. Then these features were fused together to create a single feature
vector for classification.
Besides CNNs, a new type of deep-learning model called the Transformer has been proposed and has gained popularity in computer vision. Transformers rely on a simple but powerful mechanism called attention, which lets the model focus on the most relevant parts of the input. Currently, they are considered state-of-the-art models for sequential data, in particular for natural language processing (NLP) tasks such as machine translation [36], language modeling [37], and speech recognition [38]. The architecture
of the Transformer developed by Vaswani et al. [39] is based on the encoder–decoder
model, which transforms a given sequence of elements into another sequence. The main motivation for Transformers was to enable parallel processing of the words in a sentence, which is not possible in LSTMs or RNNs because they process the words of a sentence one by one.
Inspired by the success of Transformers in NLP, recent research has tried to apply Transformers directly to images. This is a challenging task because self-attention requires every pixel to attend to all other pixels, which is very costly given the huge number of pixels in an image. Researchers have tried several approaches
to apply Transformers to images. Some works combined CNN architectures with self-
attention. For example, Bello et al. [40] enhanced CNNs by replacing some convolutional
layers with self-attention layers, which led to improvement in image classification. How-
ever, this method faced high computational cost because the large size of the image causes
an enormous growth in the time complexity of self-attention. Wang et al. [41] proposed a
method to generate powerful features by selectively focusing on critical parts or locations
of the image, then processing them sequentially. Wu et al. [42] used the Transformer on top
of the CNN; first they extracted feature maps using a CNN, then fed them to stacked visual
Transformers to process visual tokens and compute the output. Ramachandran et al. [43] were the first to use self-attention as a stand-alone building block for vision tasks rather than as a simple augmentation on top of convolutional layers. They set up a fully attentional model by replacing all convolutional layers with self-attention layers. Chen et al. [44] proposed a method that applies Transformers to raw images after reducing their resolution and reshaping them into long, text-like sequences of pixels.
In a very recent contribution, and different from previous works, Dosovitskiy et al. [45] applied a standard Transformer directly to images by splitting the image into patches rather than operating on individual pixels, and then feeding the sequence of embeddings for those patches to the Transformer. The image patches are treated like tokens in NLP applications.
models led to very competitive results on the ImageNet dataset. In this work, we will
exploit these pretrained models for transferring knowledge to the case of remote-sensing
imagery. Indeed, to the best of our knowledge in remote-sensing scene-classification
tasks, convolutional architectures remain dominant and Transformers have not yet been
widely used as the model of choice for classification. For instance, He et al. [46] adapted a model derived from the bidirectional encoder representations from Transformers (BERT) [47], originally developed for natural language processing, to the context of hyperspectral images. The
method is based on several multihead self-attention layers. Each head encodes the semantic
context-aware representation to obtain discriminative features that are needed for accurate
pixel-level classification.
In this paper, we propose an extensive evaluation of the model proposed in [45]
for the classification of remote-sensing images. To this end, the images under analysis
are divided into patches, which are then converted into a sequence of tokens by flattening and linear embedding. The
position embedding is added to these patches to preserve the position information. The obtained sequence is then fed to several multihead attention layers to generate the final representation. During classification, the first token of the sequence is fed as input to a softmax classification layer. To increase the classification performance, we explore several data augmentation strategies, such as CutMix and Cutout, to generate additional data for training. In the experiments, we show that we can compress the network by pruning half of its layers while maintaining competitive classification accuracies.
The remainder of the paper is organized as follows: Section 2 describes the main
methods based on Transformers. In Section 3, we present the experimental results on
three well-known datasets. Section 4 provides a discussion about the results and presents
comparisons with state-of-the-art methods. Then we finally conclude and show future
directions in Section 5.
is encoded and appended to the patch representations. The resulting embedded sequence
of patches with the class token, z_0, is given in Equation (1):

z_0 = [v_class; x_1·E; x_2·E; . . . ; x_n·E] + E_pos,   E ∈ R^((p^2·c)×d),   E_pos ∈ R^((n+1)×d)    (1)
It has been shown in [45] that 1-D and 2-D positional encodings produce nearly
identical results. Therefore, a simple 1-D positional encoding is used to preserve the
positional information of the flattened patches.
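As a concrete illustration of Equation (1), the following sketch (our own illustrative PyTorch code, not the original implementation) builds the token sequence z_0 for a 224 × 224 RGB image with 16 × 16 patches:

```python
import torch

p, c, d = 16, 3, 768                       # patch size, channels, embedding dimension
n = (224 // p) ** 2                        # n = 196 patches

E = torch.nn.Linear(p * p * c, d)          # linear patch embedding E
v_class = torch.nn.Parameter(torch.zeros(1, 1, d))       # learnable class token
E_pos = torch.nn.Parameter(torch.zeros(1, n + 1, d))     # 1-D positional embedding E_pos

img = torch.randn(1, c, 224, 224)
# Split the image into n flattened patches of length p*p*c.
patches = img.unfold(2, p, p).unfold(3, p, p)            # (1, c, 14, 14, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n, p * p * c)

# z_0 = [v_class; x_1 E; ...; x_n E] + E_pos  ->  shape (1, n + 1, d)
z0 = torch.cat([v_class, E(patches)], dim=1) + E_pos
print(z0.shape)                            # torch.Size([1, 197, 768])
```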
Figure 2. The Vision Transformer architecture: (a) the main architecture of the model; (b) the Transformer encoder module;
(c) the multihead self-attention (MSA) head, and (d) the self-attention (SA) head.
The MSA block in the encoder is the central component of the Transformer. It has
the role of determining the relative importance of a single patch embedding with respect
to the other embeddings in the sequence. This block has four layers: the linear layer, the
self-attention layer, the concatenation layer, which concatenates the outputs of the multiple
attention heads, and a final linear layer, as shown in Figure 2c.
At a high level, attention computes a weighted sum over all values of the sequence z, where the weights (the attention weights) measure the pairwise relevance of the elements. The self-attention (SA) head learns the attention weights by computing a scaled dot-product over queries, keys, and values. Figure 2d shows the details of the computation that takes place in the SA block. For each element in the input sequence, three vectors are generated, Q (query), K (key), and V (value), by multiplying the element with three learned matrices U_QKV (Equation (5)). To determine the relevance of an element to the other elements in the sequence, the dot product is calculated between the Q vector of this element and the K vectors of the other elements. The results determine the relative importance of the patches in the sequence. The results of the dot-product are then scaled and fed into a softmax (Equation (6)). The scaled dot-product operation performed by the SA block is similar to the standard dot-product, but it incorporates the dimension of the key D_K as a scaling factor. Finally, the value vector of each patch embedding is multiplied by the output of the softmax to emphasize the patches with high attention scores (Equation (7)). The full operation is given by the following equations:
A = softmax(Q·K^T / √D_K),   A ∈ R^(n×n)    (6)

SA(z) = A·V    (7)
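As an illustration of Equations (6) and (7), the following minimal PyTorch sketch implements a single SA head (the projection matrices U_q, U_k, and U_v stand for the learned U_QKV matrices of Equation (5); the variable names are ours):

```python
import torch
import torch.nn.functional as F

def self_attention(z, U_q, U_k, U_v):
    """One SA head: scaled dot-product attention, Equations (6) and (7)."""
    Q, K, V = z @ U_q, z @ U_k, z @ U_v           # queries, keys, values
    D_K = K.shape[-1]
    A = F.softmax(Q @ K.T / D_K ** 0.5, dim=-1)   # attention weights, A in R^(n x n)
    return A @ V                                  # SA(z) = A·V

# Toy usage: a sequence of 197 tokens (196 patch tokens + the class token), d = 768.
z = torch.randn(197, 768)
U_q, U_k, U_v = (torch.randn(768, 64) for _ in range(3))
print(self_attention(z, U_q, U_k, U_v).shape)     # torch.Size([197, 64])
```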
The MSA block computes the scaled dot-product attention separately for h heads using the previous operation, but instead of a single set of Query, Key, and Value projections, multiple sets are used (one per head). The results of all the attention heads are concatenated together and then projected through a feed-forward layer with learnable weights W to the desired dimension. This operation is expressed by the following equation:

MSA(z) = [SA_1(z); SA_2(z); . . . ; SA_h(z)]·W    (8)
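A compact sketch of this multihead block is given below: it runs h heads in parallel, concatenates their outputs, and applies the learnable projection W (again illustrative code under our own naming, not the original implementation):

```python
import torch

class MultiHeadSelfAttention(torch.nn.Module):
    """Runs h scaled dot-product heads in parallel, concatenates their outputs,
    and projects them back to dimension d with the learnable weights W."""

    def __init__(self, d=768, h=12):
        super().__init__()
        assert d % h == 0
        self.h, self.d_k = h, d // h
        self.qkv = torch.nn.Linear(d, 3 * d)    # joint Q, K, V projections for all heads
        self.W = torch.nn.Linear(d, d)          # output projection W

    def forward(self, z):                       # z: (batch, n, d)
        b, n, d = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        # Split into h heads of dimension d_k: (batch, h, n, d_k).
        q, k, v = (t.view(b, n, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        A = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (A @ v).transpose(1, 2).reshape(b, n, d)    # concatenate the heads
        return self.W(out)

msa = MultiHeadSelfAttention()
tokens = torch.randn(2, 197, 768)               # a batch of 2 token sequences
print(msa(tokens).shape)                        # torch.Size([2, 197, 768])
```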
Table 1. Parameter statistics for the Base, Large and Huge variants of Vision Transformer.
Model        Number of Layers    Hidden Size D    MLP Size    Heads    Number of Parameters
ViT-Base     12                  768              3072        12       86 M
ViT-Large    24                  1024             4096        16       307 M
ViT-Huge     32                  1280             5120        16       632 M
The experimental results on Vision Transformers of different sizes have shown that using relatively deeper models is important to reach higher accuracy. Moreover, choosing a small patch dimension increases the sequence length n, which in turn improves the overall accuracy of the model. Another important finding is that attention heads at the earlier layers of the Vision Transformer can attend to image regions at large distances. This ability increases with the depth of the model. This is different from CNN-based models, in which earlier layers can only detect local information and global information
can only be detected at the higher layers of the network. This property of the Vision
Transformer is crucial for detecting the relevant features for classification.
3. Experimental Results
3.1. Dataset Description
In our experiments, three well-known remote-sensing datasets are used for evaluation:
Merced land-use dataset [57], Aerial image dataset (AID) [58], and the Optimal-31 [41]
dataset. The characteristics of these three datasets are listed in Table 2, and samples from
each dataset are shown in Figure 4.
Table 2. The characteristics of the three datasets.

Dataset       Number of Classes    Number of Images per Class    Image Size    Year
Merced        21                   100                           256 × 256     2010
AID           30                   220–420                       600 × 600     2017
Optimal-31    31                   60                            256 × 256     2019
Figure 4. Some example images from: (a) the Merced dataset; (b) the AID dataset; (c) the Optimal-31 dataset.
Merced dataset: This dataset was released in 2010, and contains 2100 RGB images of
21 land-use scene classes. Each class consists of 100 images of size 256 × 256 pixels with
0.3 m resolution. The images were extracted from the United States Geological Survey
National Map.
Aerial image dataset (AID): The AID dataset is a large-scale dataset of 10,000 aerial scene images published in 2017 by Wuhan University. The dataset contains 30 different classes with 220 to 420 images per class. The images were cropped from Google Earth imagery, measure 600 × 600 pixels, and have a spatial resolution varying from about 8 m to 0.5 m.
Optimal-31 dataset: This dataset was captured from Google Earth imagery covering
31 scene classes. Each class contains 60 images of size 256 × 256 pixels in the RGB color
space. The pixel resolution for the images is 0.3 m.
optimized it with the Adam method and set the learning rate to 0.0003. We initially fixed the image size to 224 × 224 and the patch size to 16 × 16, obtaining a sequence of 196 tokens.
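For reference, the sequence length follows directly from the image and patch sizes, and the optimizer matches the reported settings; the snippet below is a sketch with a placeholder model, not the actual training script:

```python
import torch

image_size, patch_size = 224, 16
num_tokens = (image_size // patch_size) ** 2     # (224 / 16)^2 = 196 patch tokens
print(num_tokens)                                # 196 (plus one class token)

# Adam optimizer with the reported learning rate of 0.0003,
# attached here to a placeholder classification head for illustration.
model = torch.nn.Linear(768, 21)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
```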
For comparison purposes, we evaluated the performance of the method in terms of
the standard overall accuracy (OA), which represents the number of correctly classified
images over the total number of images.
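In code, the OA metric reduces to a few lines (a generic sketch, independent of the actual evaluation script):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """OA = number of correctly classified images / total number of images."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

print(overall_accuracy([0, 1, 2, 1], [0, 1, 1, 1]))   # 0.75
```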
We conducted all the experiments on an HP Omen Station with the following specification: an Intel Core i9-7920X central processing unit (CPU) @ 2.9 GHz, 32 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti graphical processing unit (GPU) with 11 GB of GDDR5X memory. All code was implemented using PyTorch, an open-source deep neural network library written in Python.
Table 3. Classification results on the Merced (30% train set), AID (10% train set), and Optimal-31 (50% train set) datasets. Image size: 224 × 224.

                                        With Augmentation
Dataset       Clean Images    Standard    CutMix    Cutout    Hybrid
Merced        94.55           96.32       96.66     95.44     96.73
AID           89.31           92.06       90.50     91.62     91.76
Optimal31     88.27           91.43       92.44     92.25     92.97
Average       90.71           93.27       93.20     93.30     93.82
The results in Table 3 clearly show the effectiveness of using data augmentation as
widely known in the literature. In general, all strategies provided close results, but for
the Merced and Optimal31 datasets, using a hybrid data augmentation yielded slightly
better results with accuracy of 96.73% and 92.97%, respectively. Standard augmentation
performed slightly better than the other techniques for the AID dataset with 92.06%. Overall, as the results on all three datasets suggest, combining data augmentation strategies tends to give the best behavior.
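For reference, the sketch below illustrates the two patch-level augmentations in simplified form, Cutout [51] (masking a random square region) and CutMix [53] (pasting a region from another image and mixing the labels by area); the region sizes and sampling are simplified assumptions, not necessarily the exact settings used in our experiments:

```python
import torch

def cutout(img, size=56):
    """Cutout: zero out a random square region of an image tensor (C, H, W)."""
    img = img.clone()
    _, h, w = img.shape
    y, x = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(0, y - size // 2), min(h, y + size // 2)
    x1, x2 = max(0, x - size // 2), min(w, x + size // 2)
    img[:, y1:y2, x1:x2] = 0.0
    return img

def cutmix(img_a, img_b, label_a, label_b, lam=0.5):
    """CutMix: paste a region of img_b into img_a and mix one-hot labels by area."""
    img = img_a.clone()
    _, h, w = img.shape
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    y = torch.randint(h - cut_h + 1, (1,)).item()
    x = torch.randint(w - cut_w + 1, (1,)).item()
    img[:, y:y + cut_h, x:x + cut_w] = img_b[:, y:y + cut_h, x:x + cut_w]
    lam_adj = 1 - (cut_h * cut_w) / (h * w)               # area ratio kept from img_a
    label = lam_adj * label_a + (1 - lam_adj) * label_b   # soft label
    return img, label
```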
Figure 5 shows the evolution of the loss function during training with and without
data augmentation. It can be seen that training the model on the original images made the
loss converge more smoothly and faster. In contrast, when the model is trained on the augmented images, the loss oscillates after the warm-up iterations and takes longer to converge.
Figure 5. Loss function without data augmentation for: (a) Merced, (b) AID, and (c) Optimal31, and with data augmentation
for: (d) Merced, (e) AID, and (f) Optimal31.
Figure 6. The class attention maps obtained from different encoder layers for the Merced dataset.
Figure 7. The class attention maps obtained from different encoder layers for the AID dataset.
Figure 7 shows four images from the AID dataset along with the output of three different encoder layers (layers 1, 6, and the last one). As can be seen, for the beach class the
network at layer 1 mostly focuses on the sea regions. Then, the next encoder layers learn
to gradually shift the attention to the beach line while gradually ignoring the unrelated
regions. In addition, we observed that the attention maps provided by layer 6 are visually similar to those provided at the last layer. We also observed a similar behavior for the
river class where the network concentrates on the river region at layer 6 and the attention
slightly improves in the last layer. For the stadium image, as the encoder gets deeper, it learns to localize the discriminative parts corresponding to the stadium class.
Finally, for the tank class, we observed that the network concentrates on unrelated objects
in the first layer but was able to concentrate on the tank objects at layer 6. This means that
the attention improves as the encoder goes deeper.
From a quantitative point of view, Figure 8 shows the classification accuracies obtained
at each layer of the encoder for the Merced, AID, and Optimal 31 datasets. In general, we
can see that deeper encoders tend to perform better, and the classification performance increases consistently with the number of layers. The figure shows that using encoders
with at least 5 layers is sufficient to reach 90% classification accuracy in all datasets. The
subsequent layers from 6 to 12 improve the accuracy by 2%. This indicates that the earlier
layers in the Vision Transformer model play the key role in extracting the discriminative
representation that is required for classification.
Figure 8. Relative change in model classification accuracy with respect to the encoder layers for the (a) Merced, (b) AID,
and (c) Optimal-31 datasets.
The average results of the three datasets show that pruning the model up to layer 10
gives the best performance, with average accuracy of 93.88% compared to 93.82% with the
full model. Therefore, for scene classification, the last layers of the Vision Transformer model can be removed without affecting its performance.
More specifically, for the Optimal31 dataset the best classification accuracy can be
obtained from the last layer with accuracy of 92.97%. However, it is interesting to observe
that the highest accuracy can be obtained from earlier layers for the other two datasets. For
example, an encoder with 10 layers gives the best classification accuracy for the Merced
dataset with an accuracy of 97.89%. For the AID dataset, layers 8 and 12 equally give the best results with 91.76%. These results are consistent with the qualitative results obtained from the attention maps. In the next section, we will show that using only 50% of the layers can yield competitive classification accuracies.
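As a concrete illustration of this pruning scheme, the snippet below truncates the encoder of a pretrained ViT-Base/16; it assumes a timm-style implementation that exposes the encoder blocks as a sequential container, and is a sketch rather than the exact code used in our experiments:

```python
import torch
import timm   # assumption: the pretrained ViT is loaded through the timm library

# ViT-Base/16; the exact pretrained checkpoint name may vary across timm versions.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=21)

# Keep only the first 6 of the 12 encoder layers (50% pruning).
model.blocks = torch.nn.Sequential(*list(model.blocks)[:6])

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)          # torch.Size([1, 21])
```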
4. Discussion
We further investigate the effect of varying the image size on the performance of
the model. To this end, we repeat the experiments using images with two different sizes,
224 × 224 and 384 × 384. Indeed, the Vision Transformer models were pretrained on the ImageNet dataset with an image size of 384 × 384.
The overall accuracies and running times of the experiments are summarized in Table 4. The results clearly show an increase in accuracy when the model is trained with the larger image size. However, increasing the size considerably raises the training time. On average, using larger images improved the result by 0.93% but doubled the training time from 32 to 67 min. For the Optimal31 dataset, this extra cost brings only a slight improvement in accuracy of 0.03%.
Table 4. Overall accuracies and training times obtained using different image sizes on the different remote-sensing datasets.
Finally, we compare the results of our method with the state-of-the-art results reported so far in the literature. These methods are: the attention recurrent convolutional network (ARCNet) [41], in which multiple attentional features are generated using a CNN-LSTM architecture; GoogLeNet features classified with an SVM classifier [58]; the gated bidirectional network with hierarchical feature aggregation (GBNet) [31]; and multilayer stacked covariance pooling (MSCP) [59], in which features from different layers of a pretrained CNN are combined using covariance pooling and classified with an SVM. In addition, we add the results of fine-tuned VGG16 and GoogLeNet models [60] and models fine-tuned with an auxiliary classifier [30].
Table 5 shows detailed comparisons for the Merced, AID, and Optimal-31 datasets,
respectively. Besides these three datasets, we compare our results on the well-known
NWPU dataset, which is composed of 45 classes containing 31,500 remote sensing images.
Depending on the data splits reported in the literature, we set the training–testing split
differently for each dataset. We term the proposed method V16_21k (224 × 224) and V16_21k (384 × 384), denoting a Vision Transformer that splits images into 16 × 16 patches, is pretrained on the ImageNet-21k dataset, and is fine-tuned with images of size 224 × 224 and 384 × 384, respectively. The results in Table 5 show that the network yields interesting results for
all datasets. In particular, the configuration with large image size and smaller patch size
achieves superior performance. In terms of computation time, the network takes 153 min for Merced, 347 min for AID, 220 min for Optimal31, and 465 min for NWPU. Furthermore,
Table 5 shows that the network yields very competitive results after pruning 50% of its
layers.
Table 5. Detailed comparisons with state-of-the-art methods on the Merced, AID, Optimal-31, and NWPU datasets.

Method                          Merced (50% Train)    AID (20% Train)    Optimal31 (80% Train)    NWPU (10% Train)
ARCNet-VGG16 [41] 96.81 ± 0.14 88.75 ± 0.40 92.70 ± 0.35 -
ARCNet-AlexNet [41] - - 85.75 ± 0.35 -
ARCNet-ResNet [41] - - 91.28 ± 0.45 -
GoogLeNet+SVM [58] 92.70 ± 0.60 83.44 ± 0.40 - -
GBNet + global feature [31] 97.05 ± 0.19 92.20 ± 0.23 93.28 ± 0.27 -
VGG-16+MSCP [59] 98.36 ± 0.58 91.52 ± 0.21 - -
Fine-tuning VGG16 [31] 96.57 ± 0.38 89.49 ± 0.34 89.52 ± 0.26 87.15 ± 0.45
Fine-tuning GoogLeNet [60] - - 82.57 ± 0.12 82.57 ± 0.12
Inception-v3-aux [30] 97.63 ± 0.20 93.52 ± 0.21 94.13 ± 0.35 89.32 ± 0.33
GoogLeNet-aux [30] 97.90 ± 0.34 93.25 ± 0.33 93.11 ± 0.55 89.22 ± 0.25
EfficientNetB0-aux [30] 98.01 ± 0.45 93.69± 0.11 93.97 ± 0.13 89.96 ± 0.27
EfficientNetB3-aux [30] 98.22 ± 0.49 94.19 ± 0.15 94.51 ± 0.75 91.08 ± 0.14
Proposed V32_21k [384 × 384] 97.74 ± 0.10 95.51 ± 0.57 94.62 ± 0.38 92.81 ± 0.17
Proposed V16_21k [224 × 224] 98.14 ± 0.47 94.97 ±0.01 95.07 ± 0.12 92.60 ± 0.10
Proposed V16_21k [384 × 384] 98.49 ± 0.43 95.86 ± 0.28 95.56 ± 0.18 93.83 ± 0.46
Proposed V16_21k [384 × 384] [pruning 50%] 97.90 ± 0.10 94.27 ± 1.41 95.30 ± 0.58 93.05 ± 0.46
5. Conclusions
In this work, we have proposed a method for classifying remote-sensing images based
on Vision Transformers. Different from CNNs, the model is able to capture long-range de-
pendencies among patches via an attention module. The proposed method was evaluated
on four public remote-sensing image datasets, and the experimental results demonstrated the effectiveness of this new type of network in improving the classification accuracies
compared to state-of-the-art methods. Moreover, we showed that using a combination
of data augmentation techniques can help in further boosting the classification accuracy.
To reduce the size of the model, we presented a simple model-compression solution that
prunes the network layers. For future developments, we suggest investigating alternative
approaches for compressing the Transformer and generating lightweight models.
Author Contributions: Y.B. designed and implemented the method, and wrote the paper. L.B.,
M.M.A.R., R.A.D., and N.A.A. contributed to the analysis of the experimental results and paper
writing. All authors have read and agreed to the published version of the manuscript.
Funding: The authors extend their appreciation to the Researchers Supporting Project number
(RSP-2020/69), King Saud University, Riyadh, Saudi Arabia.
Acknowledgments: The authors extend their appreciation to the Researchers Supporting Project
number (RSP-2020/69), King Saud University, Riyadh, Saudi Arabia.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Hu, Q.; Wu, W.; Xia, T.; Yu, Q.; Yang, P.; Li, Z.; Song, Q. Exploring the use of google earth imagery and object-based methods in
land use/cover mapping. Remote Sens. 2013, 5, 6026–6042. [CrossRef]
2. Toth, C.; Jóźków, G. Remote sensing platforms and sensors: A survey. ISPRS J. Photogramm. Remote Sens. 2016, 115, 22–36.
[CrossRef]
3. Hoogendoorn, S.P.; Van Zuylen, H.J.; Schreuder, M.; Gorte, B.; Vosselman, G. Microscopic traffic data collection by remote sensing.
Transp. Res. Rec. 2003, 1855, 121–128. [CrossRef]
4. Valavanis, K.P. Advances in Unmanned Aerial Vehicles: State of the Art and the Road to Autonomy; Springer Science & Business Media:
Berlin, Germany, 2008; ISBN 978-1-4020-6114-1.
5. Sheppard, C.; Rahnemoonfar, M. Real-time scene understanding for UAV imagery based on deep convolutional neural networks.
In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28
July 2017; pp. 2243–2246.
6. Al-Najjar, H.A.H.; Kalantar, B.; Pradhan, B.; Saeidi, V.; Halin, A.A.; Ueda, N.; Mansor, S. Land cover classification from fused
DSM and UAV images using convolutional neural networks. Remote Sens. 2019, 11, 1461. [CrossRef]
7. Liu, T.; Abd-Elrahman, A.; Zare, A.; Dewitt, B.A.; Flory, L.; Smith, S.E. A fully learnable context-driven object-based model for
mapping land cover using multi-view data from unmanned aircraft systems. Remote Sens. Environ. 2018, 216, 328–344. [CrossRef]
8. Bazi, Y. Two-branch neural network for learning multi-label classification in UAV imagery. In Proceedings of the IGARSS 2019—
2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 2443–2446.
9. Skidmore, A.K.; Bijker, W.; Schmidt, K.; Kumar, L. Use of remote sensing and GIS for sustainable land management. ITC J. 1997, 3,
302–315.
10. Xiao, Y.; Zhan, Q. A review of remote sensing applications in urban planning and management in China. In Proceedings of the
2009 Joint Urban Remote Sensing Event, Shanghai, China, 20–22 May 2009; pp. 1–5.
11. Daldegan, G.A.; Roberts, D.A.; de Ribeiro, F.F. Spectral mixture analysis in google earth engine to model and delineate fire
scars over a large extent and a long time-series in a rainforest-savanna transition zone. Remote Sens. Environ. 2019, 232, 111340.
[CrossRef]
12. Ahonen, T.; Hadid, A.; Pietikainen, M. Face description with local binary patterns: Application to face recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2006, 28, 2037–2041. [CrossRef]
13. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: San Diego, CA,
USA, 2005; Volume 1, pp. 886–893.
14. Li, Q.; Qi, S.; Shen, Y.; Ni, D.; Zhang, H.; Wang, T. Multispectral image alignment with nonlinear scale-invariant keypoint and
enhanced local feature matrix. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1551–1555. [CrossRef]
15. Sivic, J.; Russell, B.C.; Efros, A.A.; Zisserman, A.; Freeman, W.T. Discovering objects and their location in images. In Proceedings
of the Tenth IEEE International Conference on Computer Vision (ICCV’05), Beijing, China, 17–21 October 2005; Volume 1,
pp. 370–377.
16. Huang, L.; Chen, C.; Li, W.; Du, Q. Remote sensing image scene classification using multi-scale completed local binary patterns
and fisher vectors. Remote Sens. 2016, 8, 483. [CrossRef]
17. Imbriaco, R.; Sebastian, C.; Bondarev, E.; de With, P.H.N. Aggregated deep local features for remote sensing image retrieval.
Remote Sens. 2019, 11, 493. [CrossRef]
18. Diao, W.; Sun, X.; Zheng, X.; Dou, F.; Wang, H.; Fu, K. Efficient saliency-based object detection in remote sensing images using
deep belief networks. IEEE Geosci. Remote Sens. Lett. 2016, 13, 137–141. [CrossRef]
19. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth
Obs. Remote Sens. 2014, 7, 2094–2107. [CrossRef]
20. Nogueira, K.; Miranda, W.O.; Santos, J.A.D. Improving spatial feature representation from aerial scenes by using convolutional
networks. In Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil, 26–29
August 2015; pp. 289–296.
21. Marmanis, D.; Datcu, M.; Esch, T.; Stilla, U. Deep learning earth observation classification using imagenet pretrained networks.
IEEE Geosci. Remote Sens. Lett. 2016, 13, 105–109. [CrossRef]
22. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Convolutional neural networks for large-scale remote-sensing image classifica-
tion. IEEE Trans. Geosci. Remote Sens. 2017, 55, 645–657. [CrossRef]
23. Lakhal, M.I.; Çevikalp, H.; Escalera, S.; Ofli, F. Recurrent neural networks for remote sensing image classification. IET Comput.
Vis. 2018, 12, 1040–1045. [CrossRef]
24. Zhu, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Generative adversarial networks for hyperspectral image classification. IEEE
Trans. Geosci. Remote Sens. 2018, 56, 5046–5063. [CrossRef]
25. Feng, J.; Yu, H.; Wang, L.; Cao, X.; Zhang, X.; Jiao, L. Classification of hyperspectral images based on multiclass spatial–spectral
generative adversarial networks. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5329–5343. [CrossRef]
26. Mou, L.; Lu, X.; Li, X.; Zhu, X.X. Nonlocal graph convolutional networks for hyperspectral image classification. IEEE Trans.
Geosci. Remote Sens. 2020, 1–12. [CrossRef]
27. Hu, W.; Li, H.; Pan, L.; Li, W.; Tao, R.; Du, Q. Spatial–spectral feature extraction via deep ConvLSTM neural networks for
hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4237–4250. [CrossRef]
28. Bi, Q.; Qin, K.; Li, Z.; Zhang, H.; Xu, K.; Xia, G.-S. A multiple-instance densely-connected ConvNet for aerial scene classification.
IEEE Trans. Image Process. 2020, 29, 4911–4926. [CrossRef]
29. Yu, Y.; Li, X.; Liu, F. Attention GANs: Unsupervised deep feature learning for aerial scene classification. IEEE Trans. Geosci.
Remote Sens. 2020, 58, 519–531. [CrossRef]
30. Bazi, Y.; Al Rahhal, M.M.; Alhichri, H.; Alajlan, N. Simple yet effective fine-tuning of deep CNNs using an auxiliary classification
loss for remote sensing scene classification. Remote Sens. 2019, 11, 2908. [CrossRef]
31. Sun, H.; Li, S.; Zheng, X.; Lu, X. Remote sensing scene classification by gated bidirectional network. IEEE Trans. Geosci. Remote
Sens. 2019, 1–15. [CrossRef]
32. Liu, Y.; Liu, Y.; Ding, L. Scene classification based on two-stage deep feature fusion. IEEE Geosci. Remote Sens. Lett. 2018, 15,
183–186. [CrossRef]
33. Yu, Y.; Liu, F. A Two-Stream Deep Fusion Framework for High-Resolution Aerial Scene Classification. Available online:
https://www.hindawi.com/journals/cin/2018/8639367/ (accessed on 20 November 2020).
34. Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification
via learning discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821. [CrossRef]
35. Xue, W.; Dai, X.; Liu, L. Remote sensing scene classification based on multi-structure deep features fusion. IEEE Access 2020, 8,
28746–28755. [CrossRef]
36. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning deep transformer models for machine translation. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Fortezza da Florence, Italy, 28 July–2
August 2019; pp. 1810–1822.
37. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-
length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Fortezza da Florence,
Italy, 28 July–2 August 2019; pp. 2978–2988.
38. Chen, N.; Watanabe, S.; Villalba, J.A.; Zelasko, P.; Dehak, N. Non-autoregressive transformer for speech recognition. IEEE Signal
Process. Lett. 2020, 28, 121–125. [CrossRef]
39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv.
Neural Inf. Process. Syst. 2017, 30, 5998–6008.
40. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention Augmented Convolutional Networks. In Proceedings of the 2019
IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3285–3294.
41. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans.
Geosci. Remote Sens. 2019, 57, 1155–1167. [CrossRef]
42. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Tomizuka, M.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image
representation and processing for computer vision. arXiv 2020, arXiv:2006.03677.
43. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. arXiv
2019, arXiv:1906.05909.
44. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the
37th International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 1691–1703.
45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
46. He, J.; Zhao, L.; Yang, H.; Zhang, M.; Li, W. HSI-BERT: Hyperspectral image classification using the bidirectional encoder
representation from transformers. IEEE Trans. Geosci. Remote Sens. 2020, 58, 165–178. [CrossRef]
47. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Long and Short Papers. Volume 1, pp. 4171–4186.
48. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning augmentation strategies from data. In
Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA,
15–21 June 2019; IEEE: Long Beach, CA, USA, 2019; pp. 113–123.
49. Jackson, P.T.; Atapour-Abarghouei, A.; Bonner, S.; Breckon, T.P.; Obara, B. Style Augmentation: Data Augmentation via Style
Randomization. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, 16–20 June 2019.
50. Bowles, C.; Chen, L.; Guerrero, R.; Bentley, P.; Gunn, R.; Hammers, A.; Dickie, D.A.; Hernández, M.V.; Wardlaw, J.; Rueckert, D.
GAN augmentation: Augmenting training data using generative adversarial networks. arXiv 2018, arXiv:1810.10863.
51. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552.
52. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond empirical risk minimization. arXiv 2018, arXiv:1710.09412.
53. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization strategy to train strong classifiers with localizable
features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2
November 2019; pp. 6022–6031.
54. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
55. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and
huffman coding. arXiv 2016, arXiv:1510.00149.
56. Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized Convolutional Neural Networks for Mobile Devices. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4820–4828.
57. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPA-
TIAL International Conference on Advances in Geographic Information Systems—GIS ’10, San Jose, CA, USA, 2–5 November
2010; p. 270.
58. Xia, G.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial
scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [CrossRef]
59. He, N.; Fang, L.; Li, S.; Plaza, A.; Plaza, J. Remote sensing scene classification using multilayer stacked covariance pooling. IEEE
Trans. Geosci. Remote Sens. 2018, 56, 6899–6910. [CrossRef]
60. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105,
1865–1883. [CrossRef]