
Computers and Electronics in Agriculture 226 (2024) 109370


A lightweight CNN-Transformer network for pixel-based crop mapping using time-series Sentinel-2 imagery

Yumiao Wang a,b, Luwei Feng c, Weiwei Sun a,*, Lihua Wang a, Gang Yang a, Binjie Chen a

a Department of Geography and Spatial Information Techniques, Ningbo University, Ningbo 315211, China
b Institute of East China Sea, Ningbo University, Ningbo, Zhejiang 315211, China
c School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China

A R T I C L E  I N F O

Keywords:
Crop mapping
Convolutional neural network
Transformer
Pixel-based classification
Temporal Sentinel-2 data

A B S T R A C T

Deep learning approaches have provided state-of-the-art performance in crop mapping. Recently, several studies have combined the strengths of two dominant deep learning architectures, Convolutional Neural Networks (CNNs) and Transformers, to classify crops using remote sensing images. Despite their success, many of these models utilize patch-based methods that require extensive data labeling, as each sample contains multiple pixels with corresponding labels. This leads to higher costs in data preparation and processing. Moreover, previous methods rarely considered the impact of missing values caused by clouds and non-observation in remote sensing data. Therefore, this study proposes a lightweight multi-stage CNN-Transformer network (MCTNet) for pixel-based crop mapping using time-series Sentinel-2 imagery. MCTNet consists of several successive modules, each containing a CNN sub-module and a Transformer sub-module to extract important features from the images. An attention-based learnable positional encoding (ALPE) module is designed in the Transformer sub-module to capture the complex temporal relations in the time-series data with different missing rates. Arkansas and California in the U.S. are selected to evaluate the model. Experimental results show that the MCTNet has a lightweight advantage with the fewest parameters and memory usage while achieving superior performance compared to eight advanced models. Specifically, MCTNet obtained an overall accuracy (OA) of 0.968, a kappa coefficient (Kappa) of 0.951, and a macro-averaged F1 score (F1) of 0.933 in Arkansas, and an OA of 0.852, a Kappa of 0.806, and an F1 of 0.829 in California. The results highlight the importance of each component of the model, particularly the ALPE module, which enhanced the Kappa of MCTNet by 4.2 % in Arkansas and improved the model's robustness to missing values in remote sensing data. Additionally, visualization results demonstrated that the features extracted from the CNN and Transformer sub-modules are complementary, explaining the effectiveness of the MCTNet.

* Corresponding author.
E-mail addresses: [email protected] (Y. Wang), [email protected] (L. Feng), [email protected] (W. Sun), [email protected] (L. Wang), [email protected] (G. Yang), [email protected] (B. Chen).
https://doi.org/10.1016/j.compag.2024.109370
Received 31 March 2024; Received in revised form 26 July 2024; Accepted 21 August 2024; Available online 28 August 2024
0168-1699/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

Reliable crop maps are essential for monitoring agricultural land use and aiding decision-makers in issuing effective policies to guide sustainable agricultural development (Bargiel, 2017). Remote sensing has become the primary means of crop mapping due to its extensive range of observation, regular revisit time, and low cost. In particular, after the successful launch of the Sentinel-2 satellites, it has become easier to access optical data with fine temporal and spatial resolutions. Time-series optical imagery characterizes the dynamic growth patterns of crops, providing additional valuable information for crop classification (Ashourloo et al., 2019; Yuan et al., 2022).

Traditional crop mapping approaches using time-series remote sensing data can be broadly categorized into three types: threshold-based, statistics-based, and machine learning methods. Threshold-based methods primarily consider the optimal seasonal threshold to separate different crop types based on phenological characteristics and expert experience (Chen et al., 2023; Huang et al., 2022). Statistics-based approaches focus on statistically linking phenology and biophysical changes during crop growth by constructing predefined mathematical functions or models (Achanccaray et al., 2017; Asad and Bais, 2020). However, the aforementioned approaches have limited modeling
capacity and can be affected by inter-annual and regional discrepancies (Blickensdörfer et al., 2022; Qiu et al., 2018). Machine learning has the advantage of effectively utilizing a wider range of features. Various machine learning models have been successfully applied to crop mapping, such as the support vector machine (SVM) and random forest (RF) (Feng et al., 2019; Maponya et al., 2020). However, several drawbacks exist in traditional machine learning models. First, feature engineering is heavily relied upon to select important information and remove redundant information (Xu et al., 2020). Second, time-series spectral features are often treated as a collection of independent features, ignoring the temporal correlations (Inglada et al., 2017; Interdonato et al., 2019).

With the fast development of artificial intelligence, deep learning models have exhibited immense potential in crop mapping (Zhong et al., 2019). Superior to traditional machine learning methods, deep learning can automatically learn discriminating features from complex information, eliminating the effort to select features manually (Schmidhuber, 2015). Among the various deep learning approaches, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers have been widely used to classify crops (Rußwurm and Körner, 2020). CNNs apply convolution operations to extract features. Since convolution has a fixed kernel size, which leads to a local receptive field, CNNs are highly efficient at extracting local information but have insufficient ability to capture global information (Wang et al., 2022a). RNNs are specifically designed for time-series data modeling and sequentially process each time step through recurrent cells. In particular, long short-term memory (LSTM) is a widely used RNN method for land cover classification using time-series images (Sun et al., 2019). Unlike RNNs, Transformers utilize the multi-head self-attention mechanism and positional encoding to process all time steps of sequential data simultaneously (Vaswani et al., 2017). This mechanism is inherently flexible and can understand the relationships between distant parts of the input sequence to obtain both local and global information. Different from the fixed receptive field of CNNs, Transformers can adapt their attention dynamically and capture long-range dependencies in sequential data, which is essential for understanding the overall temporal dynamics and growth patterns of crops across an entire season. For example, Yuan and Lin (2021) proposed a Transformer-based model for crop mapping and achieved better performance than RNNs. Additionally, several studies have proposed more efficient models based on Transformers for mining time-series images, such as the temporal attention encoder (TAE) (Garnot et al., 2020) and the lightweight temporal attention encoder (LTAE) (Garnot and Landrieu, 2020).

Since CNNs excel at extracting local features while Transformers are adept at capturing long-term dependencies, several recent studies have combined the strengths of both approaches to enhance crop mapping performance (Liu et al., 2022; Niu et al., 2022; Tang et al., 2024; Wang et al., 2022a; Xiang et al., 2023). However, these hybrid models have primarily focused on patch-based crop mapping in small regions. Although patch-based models can utilize additional spatial information compared to pixel-based models, the cost of patch-based sampling is higher (Feng et al., 2023), limiting their application in large areas. Moreover, most of these models used an encoder-decoder structure commonly applied in language translation and segmentation tasks. The encoder-decoder structure can increase the volume of the model, making it unsuitable for lightweight applications. In contrast, pixel-based crop mapping is a typical classification problem that only requires an encoder to extract features for classification (Li et al., 2020). Zhang et al. (2023) integrated a CNN and a modified Transformer to design a global-local temporal attention encoder (GL-TAE) for pixel-based crop mapping. In their approach, time-series multi-spectral data is transformed into a 2D matrix, enabling the CNN to extract local temporal patterns. These local features are then merged with features extracted by the Transformer to enhance crop mapping. The results demonstrate that this hybrid model significantly outperforms both the CNN and the Transformer models.

However, previous CNN-Transformer models simply integrated features extracted from the CNN and Transformer while giving limited consideration to missing data. Missing values in time-series images are often inevitable due to factors such as cloud cover and non-observation. These missing values can disrupt the inherent temporal patterns of crops, potentially affecting the performance of models, especially those that consider sequential information, such as Transformers. Although Rußwurm and Körner (2020) found that the self-attention mechanism of Transformers reduces the impact of missing values in time-series remote sensing data for crop mapping to a certain extent, it is not specifically designed to address missing values, and thus performance improvements are necessary. Several studies have applied interpolation techniques to eliminate missing values (Wang et al., 2022b; You et al., 2021). However, performing interpolation demands extensive computational resources for multiple bands at large temporal-spatial scales. Moreover, there remains an inherent uncertainty regarding the fidelity of the reconstructed data to the actual values, which raises concerns about the reliability of interpolation in applications where the accuracy of temporal information is paramount.

Therefore, we propose a lightweight multi-stage CNN-Transformer network (MCTNet) for pixel-based crop mapping using time-series Sentinel-2 imagery. MCTNet consists of several successive CNN-Transformer fusion (CTFusion) modules with pooling operations and a multilayer perceptron (MLP) classifier. Each CTFusion has a CNN sub-module and a Transformer sub-module. The CNN sub-module is designed with a 1D convolutional network to extract local spectral information along the time dimension. The Transformer sub-module is designed based on the encoder of the original Transformer to capture the long-range dependencies of the time-series data and extract features from a global perspective. Besides, we designed an attention-based learnable positional encoding (ALPE) module in the first Transformer sub-module to capture the temporal correlation in the time-series data under different missing rates. We selected California and Arkansas in the U.S., which have distinct planting structures, as study areas to verify the effectiveness of the model. At the same time, eight advanced models were selected for comparative verification. The main contributions of our research are:

(1) We propose the MCTNet that integrates the advantages of the CNN and Transformer with a lightweight architecture for pixel-based crop mapping.

(2) ALPE is designed to reduce the impact of missing values and provide reasonable positional encoding, which can improve the model's ability to handle time-series data under complex missing conditions.

(3) Compared with various state-of-the-art models, the proposed model exhibits superior performance in key areas, including parameter size, classification accuracy, and the quality of mapping results.

2. Materials and methods

2.1. Study area

Two states in the U.S. were chosen as the study areas: California and Arkansas (Fig. 1 (a)). California has an extremely diverse natural landscape, including grasslands, deserts, rainforests, mountains, and oceans. The agricultural system in this state is quite complex, with small plots and a variety of crop types (Zhong et al., 2011). Arkansas is an important agricultural state in the U.S., with over 40 % of the land dedicated to agricultural production (Martin et al., 2021). It is a major producer of crops such as rice, soybeans, corn, cotton, and wheat, which are extensively cultivated on large farms. Due to the differences in major crops and cultivation methods between California and Arkansas, the crop mapping models can be thoroughly validated. Besides, two distinct mapping areas were selected in each state to assess the practical application capability of the models (Fig. 1 (b) and Fig. 1 (c)). These areas were selected for their considerable geographical separation and comprehensive coverage of various crop types, which allows for better validation of the model's validity.

Fig. 1. Study area. (a) Arkansas and California. (b) Mapping areas in Arkansas. (c) Mapping areas in California (The information on crop types is from CDL).

2.2. Datasets

Three types of datasets were collected in this study, including ground truth of crop type, land cover data, and remote sensing images. Table 1 provides a summary of each data source, and detailed descriptions are given in the following sections.

2.2.1. Ground truth
The Cropland Data Layer (CDL) is one of the most widely used crop maps published by the U.S. Department of Agriculture (USDA). CDL describes the distribution of more than 100 crops in the U.S., completely covering Arkansas and California. It has an annual update frequency, high spatial resolution (30 m), and high classification accuracy (Boryan et al., 2011). Moreover, it offers a confidence layer that indicates the predicted confidence level for each pixel. Therefore, this study used the CDL data to generate labeled samples for model training and testing.

2.2.2. ESA WorldCover 2021
The ESA WorldCover 2021 is a comprehensive global land cover map for 2021, offering a fine spatial resolution of 10 m. This pioneering product, derived from Sentinel-1 and Sentinel-2 data, provides a detailed classification of the earth's surface into 11 land cover classes. Since this study focuses on crop mapping, ESA WorldCover 2021 was used to mask non-cropland areas, thereby streamlining sample collection and practical mapping applications.

2.2.3. Remote sensing images
Sentinel-2 is a multi-spectral imaging system provided by the European Space Agency (ESA) and designed for high-resolution imaging of the Earth's surface. It comprises two satellites that can be combined to provide full coverage of the Earth's surface every five days. Sentinel-2 has 13 bands ranging from the visible to the shortwave infrared, with spatial resolutions varying up to a maximum of 10 m (Table 1). Among these bands, Band 1, Band 9, and Band 10 have an inadequate resolution (60 m) and contribute less to the fine classification of crops; thus, we selected the remaining ten bands as spectral features, including three visible bands, one near-infrared (NIR) band, four red-edge bands, and two shortwave infrared (SWIR) bands. Besides, Sentinel-2 Level-2A surface reflectance images (bottom-of-atmosphere) were applied in this study because these images are atmospherically corrected and orthorectified, contributing to more accurate and reliable monitoring of agricultural conditions.

Table 1
A summary of the data used in this study.

Category | Data Name | Variable Name | Spatial Resolution | Temporal Resolution | Source
Ground truth of crop type | Cropland Data Layer | Crop type | 30 m | Yearly | USDA
Land cover map | ESA WorldCover 2021 | Land cover | 10 m | − | ESA
Remote sensing images | Sentinel-2 Level-2A | Band 2 (Blue) | 10 m | 5-day | ESA
 | | Band 3 (Green) | 10 m | |
 | | Band 4 (Red) | 10 m | |
 | | Band 5 (Red Edge-1) | 20 m | |
 | | Band 6 (Red Edge-2) | 20 m | |
 | | Band 7 (Red Edge-3) | 20 m | |
 | | Band 8 (NIR) | 10 m | |
 | | Band 8A (Red Edge-4) | 20 m | |
 | | Band 11 (SWIR 1) | 20 m | |
 | | Band 12 (SWIR 2) | 20 m | |

2.2.4. Data preprocessing
For remote sensing observations, we collected all Sentinel-2 images over the study areas from the year 2021. Subsequently, we eliminated cloud-affected pixels and computed the median value of the remaining observations at ten-day intervals, resulting in a total of 36 temporal sequences. It is noted that there are still missing data in the sequences, and we used 0 to mark the missing data. We used the CDL map from 2021 to collect labeled crop samples. In particular, we set a 95 % confidence threshold to filter the CDL map to improve the quality of sampling and used the ESA WorldCover 2021 to mask non-cropland areas. Then, we randomly sampled 10,000 points in each study area and extracted the corresponding crop type from the CDL map, as well as the associated spectral features from the time-series images. As a result, each labeled sample has 360 spectral features, consisting of 10 spectral bands with 36 temporal observations. The distribution of the crop samples is shown in Table 2. Specifically, crop types that constitute less than 5 % of the total number of samples were merged into a category labeled as "others". The NDVI time-series curve of each crop was generated by averaging the NDVI values of all the corresponding crop samples (Fig. 2). All the necessary steps for data collection and preprocessing were executed using the Google Earth Engine.
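The compositing logic above can be summarized with a short sketch. The snippet below is a minimal NumPy illustration of the ten-day median compositing and the zero-marking of empty windows, assuming per-pixel, cloud-masked observations have already been extracted; the actual pipeline was implemented in Google Earth Engine, and the function and variable names here are hypothetical.

```python
import numpy as np

def ten_day_median_composite(reflectance, doy, n_windows=36, n_bands=10):
    """Composite cloud-free observations of one pixel into 10-day medians.

    reflectance: (n_obs, n_bands) cloud-masked surface reflectance values
    doy: (n_obs,) day-of-year of each observation
    Returns a (n_windows, n_bands) array; windows with no valid
    observation are marked with 0, as described in Section 2.2.4.
    """
    composite = np.zeros((n_windows, n_bands), dtype=np.float32)
    for w in range(n_windows):
        in_window = (doy >= w * 10) & (doy < (w + 1) * 10)
        if in_window.any():
            composite[w] = np.median(reflectance[in_window], axis=0)
    return composite
```

Flattening the resulting (36, 10) array with `composite.reshape(-1)` yields the 360-feature vector of each labeled sample.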
2.3. Multi-stage CNN-Transformer network (MCTNet)

The primary objective of the proposed MCTNet is to integrate the
strengths of CNNs and Transformers for pixel-based crop mapping. CNNs are particularly effective at extracting local features, while Transformers are proficient at capturing long-term dependencies. By combining these capabilities, MCTNet is expected to construct a more comprehensive representation of crop growth patterns, thereby enhancing the accuracy of crop mapping. Additionally, the ALPE module is specifically designed to handle missing data effectively, ensuring both robustness and accuracy of the model even in the presence of incomplete data. The MCTNet uses several successive CTFusion modules with pooling operations to extract the multi-level local and global information of the time-series data. The structure of the proposed MCTNet with three stages is shown in Fig. 3. Given time-series images, a pixel sample can obtain time-series multi-spectral features (Input 1) and a set of time-series data indicating whether the current temporal data is missing (Input 2). Input 1 is converted to a 2D matrix for model processing. Specifically, the time-series features are extracted by the CTFusion modules sequentially and then are fed to an MLP classifier to predict the crop type of the sample. A CTFusion is composed of two parallel components, a Transformer sub-module and a CNN sub-module. In particular, the traditional positional encoding was replaced by an ALPE module in the first Transformer sub-module. The ALPE module utilizes Input 2 to generate an enhanced positional representation. A detailed description of the modules in the proposed model is provided in the following.

Fig. 3. Architecture of the MCTNet with three stages.

Table 2
Numbers of the samples in the study areas.

Study area | Class Name | Number of Samples | Training | Validation | Testing
Arkansas | Soybeans | 4677 | 240 | 60 | 4377
Arkansas | Rice | 2423 | 240 | 60 | 2123
Arkansas | Corn | 1522 | 240 | 60 | 1222
Arkansas | Cotton | 762 | 240 | 60 | 462
Arkansas | Others | 616 | 240 | 60 | 316
Arkansas | All | 10,000 | 1200 | 300 | 8500
California | Grapes | 2054 | 240 | 60 | 1754
California | Rice | 2037 | 240 | 60 | 1737
California | Alfalfa | 974 | 240 | 60 | 674
California | Almonds | 783 | 240 | 60 | 483
California | Pistachios | 640 | 240 | 60 | 340
California | Others | 3512 | 240 | 60 | 3212
California | All | 10,000 | 1440 | 360 | 8200

Fig. 2. NDVI time-series profiles of the crops in (a) Arkansas and (b) California.

2.3.1. Transformer sub-module
The Transformer sub-module is designed based on the encoder of the original Transformer to capture the contextual information of the time-series data. The architecture of the sub-module is shown in Fig. 4. It removes the input embedding of the original Transformer and mainly consists of two key components: multi-head self-attention and positional encoding.

Fig. 4. Architecture of the Transformer sub-module.

The self-attention component computes attention scores for each temporal feature, enabling the model to focus on relevant context while suppressing noise. Multi-head self-attention utilizes H sets of learned attention weights, permitting the model to capture diverse relationships and dependencies within the data. The input of the self-attention component consists of query (Q), key (K), and value (V) matrices, which are calculated from the module input and three different transformation matrices. Q and K are used to calculate an attention weight matrix W ∈ [0,1]^(T×T) (where T is the number of time steps), and the output of self-attention is produced by W and V:

Attention(Q, K, V) = WV = softmax(QK^T / √d_k) V   (1)

where d_k is the number of columns of Q.
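As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of a single attention head; the random projection matrices stand in for the three learned transformations mentioned above, and the shapes (36 time steps, 10 bands) follow the data description in Section 2.2.4.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, T, d); w_q, w_k, w_v: (d, d_k) transformation matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # query, key, value
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, T, T)
    w = torch.softmax(scores, dim=-1)              # attention weights in [0, 1]
    return w @ v                                   # Eq. (1): WV

x = torch.randn(32, 36, 10)    # a batch of pixel samples: 36 steps, 10 bands
w_q, w_k, w_v = (torch.randn(10, 10) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)             # (32, 36, 10)
```

Multi-head self-attention simply runs H such heads with separate transformation matrices and concatenates their outputs.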

Positional information is crucial in sequential data. In the original Transformer, absolute positional encoding using a fixed sinusoidal encoding mechanism with a predefined wavelength is used to represent the order of the temporal input (Vaswani et al., 2017). The encoded position can be calculated as:

PE(t_i)_p = sin(t_i / 10000^(2k/d_m)), if p = 2k
PE(t_i)_p = cos(t_i / 10000^(2k/d_m)), if p = 2k + 1   (2)

where t_i is the temporal order and d_m is the encoding size.
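Eq. (2) can be reproduced with a few lines of NumPy. The sketch below assumes t_i is simply the index of the ten-day composite; the resulting matrix matches the 36 × 10 positional matrix used by the ALPE module described next.

```python
import numpy as np

def absolute_positional_encoding(n_steps=36, d_m=10):
    pe = np.zeros((n_steps, d_m))
    t = np.arange(n_steps)[:, None]          # temporal order t_i
    p = np.arange(0, d_m, 2)[None, :]        # even positions p = 2k
    angle = t / np.power(10000.0, p / d_m)   # t_i / 10000^(2k/d_m)
    pe[:, 0::2] = np.sin(angle)              # p = 2k
    pe[:, 1::2] = np.cos(angle)              # p = 2k + 1
    return pe                                # (36, 10) positional matrix
```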
The absolute positional encoding is added to the input to enable the model to attend to both content and position. However, the absolute positions lack flexibility and treat all positions equally regardless of the contextual information. Besides, gaps often occur in the time-series data due to clouds, and the absolute positional encoding cannot help the model understand which temporal data is unreliable. Therefore, we designed an attention-based learnable positional encoding (ALPE) module (Fig. 5) to solve the above issues. This module first uses absolute positional encoding as the initial positional vector to transform the time feature from one dimension to two dimensions to meet the input format. The two-dimensional positional vector is masked by Input 2 (mentioned in Section 2.3) and then processed by a 1D convolution layer. It is worth noting that the mask is only used in the first stage of the model. After that, the Efficient Channel Attention (ECA) module is applied to the
processed positional vector. ECA is an efficient attention module that has few parameters and can bring a clear performance gain in 1D CNNs (Wang et al., 2020). The final position matrix is expected to learn the complex relationships from the time-series features and the importance of different temporal positions under different missing rates.

ALPE(t) = ECA(Conv1D(PE(t) × mask))   (3)

where t represents the time series. For example, given a sequential input with 36 time steps and 10 features, the absolute positional encoding of its time series t is PE(t) ∈ R^(36×10), which is a two-dimensional vector. The positions of the missing values in PE(t) are masked by zero to reduce the impact of missing values on the subsequent positional encoding learning. Then, the positional vector is processed by a convolution layer and the ECA module to form the final positional vector ALPE(t) ∈ R^(36×10).

Fig. 5. Attention-based learnable positional encoding module.
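A hedged PyTorch sketch of Eq. (3) is given below. The ECA block follows the general recipe of Wang et al. (2020) — global average pooling followed by a small 1D convolution across channels — but the exact kernel sizes and layer settings of the authors' implementation are not reported, so they are assumptions here.

```python
import torch
import torch.nn as nn

class ECA1D(nn.Module):
    """Efficient Channel Attention for (batch, channels, T) features."""
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k_size, padding=k_size // 2, bias=False)

    def forward(self, x):                      # x: (B, C, T)
        y = x.mean(dim=-1, keepdim=True)       # global average pool -> (B, C, 1)
        y = self.conv(y.transpose(-1, -2)).transpose(-1, -2)  # conv across channels
        return x * torch.sigmoid(y)            # re-weight each channel

class ALPE(nn.Module):
    """Eq. (3): ALPE(t) = ECA(Conv1D(PE(t) x mask))."""
    def __init__(self, pe, k_size=3):
        super().__init__()                     # pe: (T, d) sinusoidal matrix
        self.register_buffer("pe", pe)
        self.conv = nn.Conv1d(pe.size(1), pe.size(1), k_size, padding=k_size // 2)
        self.eca = ECA1D()

    def forward(self, mask):                   # mask: (B, T), 0 where data missing
        p = self.pe.unsqueeze(0) * mask.unsqueeze(-1)  # zero out missing steps
        p = self.conv(p.transpose(1, 2))       # 1D convolution along time
        return self.eca(p).transpose(1, 2)     # (B, T, d) learned positions
```

For instance, `ALPE(torch.tensor(absolute_positional_encoding(), dtype=torch.float32))` would build the module for the 36 × 10 case discussed above.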
ALPE incorporates learnable parameters into the standard positional encoding, allowing the encoding to adapt based on the input data. These parameters are trained to minimize the negative impact of missing values, effectively modifying the positional information to better represent the temporal structure of the data. By refining the positional encoding, the ALPE module improves the model's ability to capture the relative relationships between time steps, even in the presence of missing data. This enhanced representation allows the attention mechanism to distribute attention weights more accurately, leading to better information propagation across time steps and increased robustness to missing values.

2.3.2. CNN sub-module and MLP classifier

The CNN sub-module is designed with a very simple structure (Fig. 6). It consists of two consecutive 1D convolution layers that operate along the time dimension of the input. Each convolution layer is followed by a batch normalization layer. The output of the two layers is processed by a ReLU activation function. Besides, a connection directly skips over the two convolutional layers and adds the input of the sub-module to the output of the two layers. This shortcut connection enables the gradient to flow more easily during training and helps to avoid the vanishing gradient problem. The MLP classifier is made up of a linear layer with a Softmax activation. The input of the MLP is the extracted features from the last CTFusion module. Specifically, the output of the last CTFusion module is processed by a global max pooling operation along the channel dimension to transform it into a one-dimensional vector. Then the vector is fed to the MLP classifier to predict the crop type.

Fig. 6. Architecture of the CNN sub-module.
NVIDIA RTX3090 24G GPU. In the training process, the batch size and
2.4. Experimental setting
number of epochs of the deep learning models were set as 32 and 200,
respectively. The hyperparameters of the models for tuning, their can­
There are 10,000 samples in Arkansas and California, respectively.
didates, and their final value in each study area are listed in Table 3. The
For each state, the samples were split into training, validation, and
other hyperparameters of the models were selected with default values.
testing sets. Specifically, 300 samples per crop type were randomly
To quantitatively evaluate the models, the confusion matrix was first
selected to compose the training and validation datasets (with a rate of
computed to show the distribution of all the predicted responses. Then,
8:2), and the rest samples were used for testing. The details of the sample
overall accuracy (OA), kappa coefficient (Kappa), and macro-averaged
partition are shown in Table 2. The training samples were used to train
F1 score were calculated based on the matrix to comprehensively
the models, and the validation samples were used for selecting hyper­
measure the overall performance of the models. In particular, the macro-
parameters, including the number of stages (n_stage), the number of
averaged F1 score represents the average F1 score of all classes, with
heads (n_head) in the Transformer sub-module, the kernel size of the
each F1 score being the harmonic mean of precision and recall for its
CNN sub-module, the learning rate (lr), and the optimizer. To fully
respective class. The macro-averaged F1 usually serves as a crucial
assess the effectiveness of the MCTNet, we compared it with eight
metric for evaluating a model’s effectiveness in imbalanced datasets. For
advanced models, including SVM, RF, LSTM, Transformer, Residual
the sake of brevity, this metric will be referred to as “F1″ in subsequent
Neural Network (ResNet), TAE, LTAE, and GL-TAE. Particularly, ResNet
discussions. The formulas of the three metrics are shown in Eq. (4)-(6).
is a deep CNN with many variants, and this study adopted the ResNet-
18, which has 18 convolutional layers. Since the original Transformer 1 ∑n

is not suitable for classification tasks, this study adopted the modified OA = xii (4)
N i=1
Transformer model of Rußwurm and Körner (2020), which only contains
the encoder part of the original Transformer. 2∑ n
Pi Ri
All models were implemented in a Python environment, in which F1 = (5)
n i=1 Pi + Ri
SVM and RF were constructed with the Scikit-Learn library, and the deep

6
Y. Wang et al. Computers and Electronics in Agriculture 226 (2024) 109370

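Eqs. (4)-(6) can be computed directly from the confusion matrix; the following NumPy sketch mirrors the notation above, with C[i, j] counting samples of true class i predicted as class j.

```python
import numpy as np

def oa_kappa_f1(C):
    """C: (n, n) confusion matrix; returns OA, Kappa, macro-averaged F1."""
    C = np.asarray(C, dtype=float)
    N, n = C.sum(), C.shape[0]
    oa = np.trace(C) / N                                # Eq. (4)
    true_i, pred_i = C.sum(axis=1), C.sum(axis=0)       # x_{i+} and x_{+i}
    chance = (pred_i * true_i).sum()
    kappa = (N * np.trace(C) - chance) / (N ** 2 - chance)   # Eq. (6)
    P = np.diag(C) / np.maximum(pred_i, 1)              # per-class precision
    R = np.diag(C) / np.maximum(true_i, 1)              # per-class recall
    f1 = (2.0 / n) * np.sum(P * R / np.maximum(P + R, 1e-12))  # Eq. (5)
    return oa, kappa, f1
```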
Table 3
The hyperparameters of different models in Arkansas and California.

Model | Hyperparameter candidates | Arkansas | California
MCTNet | n_stage = [1, 2, 3, 4, 5]; n_head = [1, 2, 5, 10]; kernel size = [1, 3, 5]; lr = [0.0001, 0.001, 0.01, 0.1]; optimizer = [SGD, Adam] | n_stage = 3; n_head = 5; kernel size = 3; lr = 0.001; optimizer = Adam | n_stage = 3; n_head = 5; kernel size = 3; lr = 0.001; optimizer = Adam
GL-TAE | dm = [64, 128, 256, 512]; n_head = [2, 4, 8, 16, 32]; lr = [0.0001, 0.001, 0.01, 0.1]; optimizer = [SGD, Adam] | dm = 256; n_head = 8; lr = 0.001; optimizer = Adam | dm = 256; n_head = 16; lr = 0.001; optimizer = Adam
LTAE | dm = [64, 128, 256, 512]; n_head = [2, 4, 8, 16, 32]; lr = [0.0001, 0.001, 0.01, 0.1]; optimizer = [SGD, Adam] | dm = 256; n_head = 8; lr = 0.001; optimizer = Adam | dm = 256; n_head = 8; lr = 0.001; optimizer = Adam
TAE | dm = [64, 128, 256, 512]; n_head = [2, 4, 8, 16, 32]; lr = [0.0001, 0.001, 0.01, 0.1]; optimizer = [SGD, Adam] | dm = 128; n_head = 4; lr = 0.001; optimizer = Adam | dm = 128; n_head = 4; lr = 0.001; optimizer = Adam
Transformer | dm = [64, 128, 256, 512]; n_head = [2, 4, 8, 16, 32]; n_layers = [1, 2, 4, 6, 8]; lr = [0.0001, 0.001, 0.01, 0.1]; optimizer = [SGD, Adam] | dm = 128; n_head = 4; n_layers = 2; lr = 0.001; optimizer = Adam | dm = 128; n_head = 4; n_layers = 2; lr = 0.001; optimizer = Adam
ResNet | lr = [0.0001, 0.001, 0.01, 0.1]; optimizer = [SGD, Adam] | lr = 0.001; optimizer = Adam | lr = 0.001; optimizer = Adam
LSTM | hidden units = [64, 128, 256, 512]; lr = [0.0001, 0.001, 0.01, 0.1]; optimizer = [SGD, Adam] | hidden units = 256; lr = 0.001; optimizer = Adam | hidden units = 256; lr = 0.001; optimizer = Adam
RF | n_estimators = [100, 200, 400]; max_depth = [5, 10, 15, 20] | n_estimators = 200; max_depth = 20 | n_estimators = 200; max_depth = 20
SVM | C = [0.5, 1, 1.5, 2, 2.5, 3]; kernel = [Linear, RBF] | C = 1; kernel = RBF | C = 1; kernel = RBF

3. Results

3.1. Ablation experiment

To validate the effectiveness of the components, we assessed the impact of the ALPE module, CNN sub-module, and Transformer sub-module on the performance of the MCTNet. Specifically, we removed the three modules separately to build three new comparison models, namely MCTNet_noALPE, MCTNet_noCnn, and MCTNet_noTrans. It is worth noting that the MCTNet_noALPE model used the absolute positional encoding of the original Transformer. Besides, we also removed the data missing mask of the ALPE to construct MCTNet_noMask to assess its effectiveness.

Table 4
Ablation study of the MCTNet with three evaluation metrics.

Study area | Model | OA | Kappa | F1
Arkansas | MCTNet | 0.968 | 0.951 | 0.933
Arkansas | MCTNet_noMask | 0.943 | 0.914 | 0.895
Arkansas | MCTNet_noALPE | 0.938 | 0.909 | 0.888
Arkansas | MCTNet_noCnn | 0.938 | 0.907 | 0.876
Arkansas | MCTNet_noTrans | 0.925 | 0.888 | 0.860
California | MCTNet | 0.852 | 0.806 | 0.829
California | MCTNet_noMask | 0.840 | 0.792 | 0.822
California | MCTNet_noALPE | 0.841 | 0.792 | 0.810
California | MCTNet_noCnn | 0.825 | 0.773 | 0.792
California | MCTNet_noTrans | 0.803 | 0.744 | 0.784

Table 4 shows the overall performance of the models in the two study areas. It is clear that the MCTNet outperformed the other models significantly in both regions, illustrating that each module contributes to improving classification performance. In Arkansas, the MCTNet obtained the best performance with 0.968 OA, 0.951 Kappa, and 0.933 F1, followed by MCTNet_noMask. The MCTNet_noALPE and MCTNet_noCnn performed similarly, with an F1 that is approximately 4 % less than that of MCTNet. The MCTNet_noTrans produced the lowest accuracy, indicating that the Transformer sub-module had the greatest impact on the accuracy of the model. In California, the accuracy of the models decreased significantly compared to the results in Arkansas. This may be attributed to the small crop plots and numerous types of crops in California, which increase the difficulty of classification. The MCTNet still exhibited the best performance, followed by the MCTNet_noMask, MCTNet_noALPE, MCTNet_noCnn, and MCTNet_noTrans.

3.2. Model performance comparison

The results of the classification accuracy of the nine models in the two regions are displayed in Table 5. A major finding is that the MCTNet achieved the best performance while the LSTM, SVM, and TAE yielded the worst results in both regions. LSTM relies on sequential data to learn temporal dynamics, but missing data can impact its ability to capture temporal dependencies. SVM is also sensitive to noise and outliers in the dataset (Sabzekar and Hasheminejad, 2021). RF yielded significantly superior accuracy compared to LSTM, SVM, and TAE. This is because RF can exhibit robustness through majority voting with its numerous trees. Apart from the proposed MCTNet, the GL-TAE also performed well in both study areas. This performance can be attributed to the fact that both MCTNet and GL-TAE integrate CNN and self-attention architectures. Despite their differing self-attention mechanisms, this integration facilitates the advantageous combination of global and local attention, which is critical for effective crop mapping. Generally, the MCTNet outperformed the other comparison models remarkably in all metrics.

Table 5
Classification accuracy of different models.

Study area | Model | OA | Kappa | F1
Arkansas | MCTNet | 0.968 | 0.951 | 0.933
Arkansas | GL-TAE | 0.952 | 0.928 | 0.905
Arkansas | LTAE | 0.941 | 0.912 | 0.884
Arkansas | TAE | 0.911 | 0.869 | 0.854
Arkansas | ResNet | 0.945 | 0.917 | 0.894
Arkansas | Transformer | 0.956 | 0.934 | 0.909
Arkansas | LSTM | 0.907 | 0.862 | 0.843
Arkansas | RF | 0.936 | 0.904 | 0.887
Arkansas | SVM | 0.910 | 0.867 | 0.846
California | MCTNet | 0.852 | 0.806 | 0.829
California | GL-TAE | 0.843 | 0.796 | 0.820
California | LTAE | 0.838 | 0.788 | 0.805
California | TAE | 0.787 | 0.728 | 0.767
California | ResNet | 0.837 | 0.787 | 0.821
California | Transformer | 0.813 | 0.759 | 0.789
California | LSTM | 0.799 | 0.739 | 0.766
California | RF | 0.823 | 0.770 | 0.793
California | SVM | 0.786 | 0.724 | 0.762

To better analyze the classification results of different models, the
normalized confusion matrices are shown in Figs. 7 and 8. Obviously, the MCTNet exhibited a notably lower rate of misclassifications compared to other methods. In Arkansas, the MCTNet demonstrated superior performance in accurately classifying corn, soybeans, and rice. It is worth noting that the MCTNet generated very high accuracy on corn (0.991 OA) and soybeans (0.990 OA), while the other models tended to misclassify the two crops to different degrees. According to Fig. 2, the maturity stage of corn is significantly earlier than that of soybeans in Arkansas. The MCTNet may be able to capture this difference, yielding very few misclassified samples for corn and soybeans. Moreover, the other models erroneously classified a large number of rice samples as cotton or others, while the MCTNet maintained a high classification performance. In California, due to the complicated environment and heterogeneous crop categories, the classification accuracy of all models experienced a remarkable decline. Nevertheless, the MCTNet still achieved better classification accuracy on almonds and others compared to the other models. The broad range of crop types included in others greatly increases the difficulty of classification, while the proposed model still achieved an increase in OA of 5.4 % compared to the GL-TAE.

The parameter sizes, processing time, and memory of the comparison models are shown in Table 6. Specifically, we conducted inference and training time evaluations 10 times to ensure a robust comparison, adopting the median values as the final results. The training time covers the whole process of model training, and the inference time is the total time to calculate the results of the testing data with the trained model. The MCTNet has the fewest number of parameters and the lowest memory usage. Compared with the GL-TAE, the MCTNet has nearly half the number of parameters. However, the training and inference times of the MCTNet are not superior to those of other models. This may be due to the sequential stages in MCTNet, which affect the efficiency of parallel computing.

Fig. 7. Confusion matrices of different models in Arkansas.

Fig. 8. Confusion matrices of different models in California.

Table 6
The parameter size, processing time, and memory of the crop mapping models.

Model | Parameter Size (Arkansas) | Training (s) | Inference (s) | Memory (MB) | Parameter Size (California) | Training (s) | Inference (s) | Memory (MB)
MCTNet | 55,059 | 75.261 | 1.093 | 0.740 | 55,140 | 89.191 | 1.107 | 0.740
GL-TAE | 105,289 | 54.258 | 0.924 | 1.306 | 121,834 | 70.336 | 0.967 | 1.500
LTAE | 95,301 | 50.799 | 0.902 | 1.190 | 95,334 | 64.366 | 0.843 | 1.191
TAE | 161,605 | 53.844 | 0.918 | 1.960 | 161,638 | 67.053 | 0.957 | 1.960
Transformer | 1,325,204 | 88.771 | 1.106 | 15.545 | 1,325,332 | 103.876 | 1.198 | 15.547
ResNet | 2,775,237 | 73.381 | 0.953 | 31.800 | 2,775,494 | 88.586 | 0.974 | 31.800
LSTM | 276,741 | 48.514 | 0.716 | 3.250 | 276,998 | 57.486 | 0.736 | 3.250
RF | − | 1.530 | 1.239 | 4.969 | − | 2.164 | 1.267 | 8.686
SVM | − | 2.220 | 1.339 | 1.917 | − | 2.730 | 1.510 | 3.026

3.3. Crop mapping performance

The crop mapping results of the models are shown in Fig. 9 and Fig. 10. It should be noted that TAE, LSTM, and SVM were excluded from mapping due to their inferior accuracy. Residual maps were created to illustrate the distribution of errors and correct classifications, with white for correct classifications and black for errors. For quantitative analysis, the mapping accuracy of each residual map was determined by calculating the ratio of correctly classified pixels to the total pixel count, as detailed in Table 7.

Table 7 demonstrates that MCTNet exhibited superior performance in the majority of mapping areas, underscoring the effectiveness of the proposed model in practical applications. However, in California's mapping area 2, LTAE surpassed MCTNet. Additionally, in California's mapping area 1, RF achieved greater accuracy compared to GL-TAE and LTAE. These results contrast with the classification accuracy results based on testing data presented in Table 5. A possible explanation for this discrepancy is the random selection of samples from the CDL. These samples may not be entirely representative of the real situation, leading to differences between a model's performance in actual mapping and on the testing data. A model's excellent performance on the testing data only suggests a higher likelihood of superior performance in actual mapping compared to suboptimal models, but it does not offer a guaranteed assurance of superior performance.

Fig. 9. Crop maps from (a) MCTNet, (b) GL-TAE, (c) LTAE, (d) ResNet, (e) Transformer, (f) RF in mapping area 1 (1) and mapping area 2 (3) in Arkansas. (2) and (4) represent the residual maps in mapping area 1 and mapping area 2, respectively.

Fig. 10. Crop maps from (a) MCTNet, (b) GL-TAE, (c) LTAE, (d) ResNet, (e) Transformer, (f) RF in mapping area 1 (1) and mapping area 2 (3) in California. (2) and (4) represent the residual maps in mapping area 1 and mapping area 2, respectively.

Table 7
The mapping accuracy of different models.

State | Mapping Area | MCTNet | GL-TAE | LTAE | ResNet | Transformer | RF
Arkansas | 1 | 0.893 | 0.888 | 0.883 | 0.871 | 0.877 | 0.875
Arkansas | 2 | 0.854 | 0.847 | 0.850 | 0.827 | 0.852 | 0.811
California | 1 | 0.897 | 0.887 | 0.886 | 0.872 | 0.893 | 0.895
California | 2 | 0.873 | 0.871 | 0.892 | 0.851 | 0.868 | 0.869

A detailed analysis of the predicted maps of Arkansas (Fig. 9) shows that all models perform well except for the central region of mapping area 1. However, some pixels that should have been classified as cotton and other crops were inaccurately identified as soybeans. In mapping area 2, where cotton, soybean, and corn are predominant, all models reflected the overall crop distribution pattern. MCTNet demonstrated superior accuracy with fewer misclassifications, particularly in the central area. Fig. 10 demonstrates that the crop fields in California are much smaller
and more fragmented than those in Arkansas. In mapping area 1, all predicted maps identified the alfalfa correctly but failed to map the other crops very accurately. Mapping area 2 is planted with many grapes, along with scattered almonds, pistachios, and rice. The mapping results of mapping area 2 were similar to those of mapping area 1, with high recognition accuracy for widely planted crops (grapes) but low accuracy for small-scale crops. Notably, MCTNet yielded fewer errors in certain localized areas. Nonetheless, all models produced more salt-and-pepper errors in California than in Arkansas, which may be attributed to fragmented and heterogeneous landscapes.

3.4. Cross-temporal performance

In real applications, it is often impossible to collect training data for specific years, making the evaluation of the transferability and generalizability of models crucial. To explore cross-temporal performance, we conducted an additional experiment to assess the performance of models across different years. Specifically, we collected 5000 samples from each study area in 2020 and 2022, and tested the performance of models trained on 2021 data using these new datasets. The results, shown in Fig. 11, indicate that the proposed MCTNet consistently outperforms other models, achieving the highest scores in most scenarios, particularly in Arkansas 2020 and 2022, and California 2020. Although MCTNet ranks second to RF in California 2022, it remains highly competitive. In contrast, the comparison models exhibit significant performance variations across different scenarios. For example, while the RF model maintains good performance across years in California, it shows inferior performance in Arkansas. This indicates that the proposed model effectively and precisely captures the temporal patterns of crops, demonstrating superior robustness and stability in handling unseen temporal data.

Fig. 11. Cross-temporal performance of different models ("_" represents the best model).

4. Discussion

4.1. Model interpretation

The results in Section 3 substantiated the efficacy of the MCTNet by leveraging the integration of the CNN and the Transformer to enhance feature extraction capabilities. To further investigate the contribution of the CNN and Transformer sub-modules in the proposed model, gradient-weighted class activation mapping (Grad-CAM) (Selvaraju et al., 2017) was applied to visualize the features extracted by the sub-modules. Specifically, the Grad-CAM on the last convolutional layer of the CNN sub-module and the last normalized layer of the Transformer sub-module were selected for visualization. To explore the effects of the multiple stages in the MCTNet, the visualization results were also produced for each stage.

Fig. 12 (a1) − (a3) demonstrate the Grad-CAM results of the CNN sub-modules using the corn samples in Arkansas. In the spectral dimension, the red edge and NIR bands, known to be sensitive to the physiological state of plants (Sharifi, 2020), exhibit significant importance. In the temporal domain, the features near DOY 180 obtain more attention. This period corresponds to the tasseling and silking stages of corn, during which the plants undergo significant structural and color changes. From a remote sensing perspective, these developmental stages are characterized by notable alterations in the spectral reflectance of the corn canopy, making them easily detectable. Additionally, as illustrated in Fig. 2, the occurrence of these stages in corn is significantly earlier compared to other crops. Consequently, the spectral features during these stages are important for crop identification and thus receive heightened attention. Moreover, we found that the important features become more concentrated from the first stage to the last stage. This is because the features learned by deep CNN models exhibit hierarchical characteristics; that is, the shallow layers learn abstract features while the deeper layers learn specific and invariant features that serve for classification (Zeiler and Fergus, 2013). Fig. 12 (c1) − (c3) also show a gradual concentration phenomenon in California with grape samples. Different from corn, the important features of grapes are slightly away from the peak value date (Fig. 2). Deep learning models are often viewed as 'black boxes' with limited interpretability, and the optimal combination of features might not include the peak date, especially when the peak is not markedly evident. In contrast, the Grad-CAM results of the Transformer sub-modules (Fig. 12 (b1) − (b3) and Fig. 12 (d1) − (d3)) are different from those of the CNN sub-modules. The important features of the Transformer sub-modules become more scattered along the temporal domain from the first stage to the last stage. This is because CNN models
focus on local features while Transformer models can capture long-range feature dependencies. The Transformer sub-module can classify crops by learning a specific pattern formed by different temporal features. Meanwhile, we observed that the important features extracted by the CNN and Transformer sub-modules exhibited certain complementarity, which can theoretically enhance the accuracy of crop classification.

Fig. 12. Grad-CAM visualization results for (a) corn in Arkansas by CNN sub-module, (b) corn in Arkansas by Transformer sub-module, (c) grape in California by CNN sub-module, (d) grape in California by Transformer sub-module. (1) to (3) represent the three stages in MCTNet. Yellow color indicates high-weight features, while blue color means low-weight features. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

To further investigate how the extracted features affect the classification, t-SNE (Van der Maaten and Hinton, 2008), a widely used dimensionality reduction approach, was applied to visualize the distributions of the extracted features of the sub-modules in each stage. Fig. 13 (a1) – (a3) demonstrate the visualized results of the CNN sub-modules in Arkansas for three stages, respectively. Obviously, in the first stage, the distribution of the extracted features is highly concentrated and forms several indistinguishable clusters. In the subsequent stages, these clusters gradually aggregate into a few large groups, with each large group representing a type of crop. The Silhouette scores also prove that the clustering performance becomes better in the latter stages. The visualized plots of the Transformer sub-modules are similar to those of the CNN sub-modules. Slightly differently, the Transformer sub-modules form more and smaller clusters in the first stage. Fig. 14 displays similar visualized results in California, which further demonstrates the superiority of the multiple stages in improving crop discrimination.
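For reference, the t-SNE and Silhouette analysis described here can be reproduced with scikit-learn along the following lines; `features` and `labels` are placeholders for the extracted sub-module outputs and crop types, not the study's actual arrays.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

features = np.random.rand(500, 64)          # placeholder extracted features
labels = np.random.randint(0, 5, size=500)  # placeholder crop labels

embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
print("silhouette:", silhouette_score(embedded, labels))
```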

Fig. 13. The extracted feature distributions of (a) CNN sub-module, (b) Transformer sub-module from (1) the first stage to (3) the last stage in Arkansas.

Fig. 14. The extracted feature distributions of (a) CNN sub-module, (b) Transformer sub-module from (1) the first stage to (3) the last stage in California.


4.2. The impact of missing data on model performance

To investigate the impact of varying degrees of missingness in time-series data on model accuracy, we calculated the missing rates for each sample in the testing dataset in Arkansas and then assessed the robustness of the MCTNet and MCTNet_noALPE models using samples with missing rates less than 10 %, 15 %, 20 %, 25 %, 30 %, 35 %, and 40 %, respectively. We believe it is important to maintain realistic observations, so we did not follow previous studies (Wang et al., 2022b; You et al., 2021) in using linear interpolation and Savitzky-Golay smoothing to process time-series data to eliminate missing values. To verify whether these steps are necessary, we used the data that underwent interpolation and smoothing to generate two new models (named MCTNet1 and MCTNet_noALPE1). The model performance and the distribution of missing rates of the data are presented in Fig. 15. It is obvious that the data with less than 10 % missingness constituted less than 20 % of the total, which implies that the vast majority of the data contained missing values with different rates. As the missing rate increased, the accuracy of all models gradually declined. The MCTNet achieved the best performance at every rate, and its OA decreased slightly, from 0.973 to 0.968. By contrast, the OA of MCTNet_noALPE dropped from 0.963 to 0.950. This demonstrates that MCTNet, owing to the contribution of ALPE, was less sensitive to data missingness than MCTNet_noALPE.

It is noteworthy that MCTNet significantly outperformed MCTNet1 across all missing scenarios, indicating that interpolation and smoothing might alter the original distribution of the data, leading to a reduction in model accuracy. In contrast, the accuracy of MCTNet_noALPE1 was persistently higher than that of MCTNet_noALPE, especially when data missingness is substantial. This discrepancy arose because MCTNet_noALPE lacks the ALPE module to discern the locations of missing values. In cases where there are extensive missing values, the model becomes confused and cannot extract valid features to identify crops. The interpolation and smoothing can reduce many missing values and restore the original data distribution to a certain extent, thereby improving the recognition accuracy of the model. Overall, MCTNet is significantly superior to other models, and it does not require interpolation and smoothing, which can save a lot of computation.

Fig. 15. The OA of the crop classification models using samples with different missing rates in Arkansas. (The red curve represents the percentage of the data that is less than a certain missing rate.) (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

We further investigated the different impacts of random and continuous missing data during the growing season on model performance. Specifically, we used corn samples in Arkansas as an example. The growing season of corn is approximately from DOY 120 to DOY 270, as shown in Fig. 2. We collected all corn samples in Arkansas that had no missing data during this period, totaling 1,361 samples. For each sample, we created time series data with missing points by deleting a specific rate of the original data both randomly and consecutively. We then tested the processed samples using MCTNet and MCTNet_noALPE to assess the impacts of different missing conditions on the models' performance.
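The two deletion schemes can be sketched as follows; for simplicity, the snippet masks steps over the whole series rather than restricting deletion to the DOY 120-270 window, and deleted steps are set to 0 in line with Section 2.2.4.

```python
import numpy as np

def delete_random(sample, rate, rng):
    """sample: (T, bands); deleted steps are set to 0."""
    out = sample.copy()
    drop = rng.choice(sample.shape[0], size=int(rate * sample.shape[0]),
                      replace=False)
    out[drop] = 0.0
    return out

def delete_continuous(sample, rate, rng):
    """Remove one contiguous block of time steps of the same total length."""
    out = sample.copy()
    n = int(rate * sample.shape[0])
    start = rng.integers(0, sample.shape[0] - n + 1)
    out[start:start + n] = 0.0
    return out

rng = np.random.default_rng(0)
x = np.random.rand(36, 10).astype(np.float32)   # one pixel time series
x_rand = delete_random(x, 0.2, rng)
x_cont = delete_continuous(x, 0.2, rng)
```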
The results are shown in Fig. 16. As the missing rate increases, the OA decreases for all models. Notably, random missing data (solid bars) consistently results in higher performance compared to continuous missing data (striped bars). This is because random missing data are spread throughout the growing season, allowing the model to capture some information during critical growth stages. In contrast, continuous missing data creates prolonged gaps, causing the model to miss key phenological information and leading to a greater decline in performance. Additionally, MCTNet consistently outperforms MCTNet_noALPE across all missing data rates, demonstrating the ALPE module's effectiveness in improving model robustness against missing data.

Fig. 16. Performance of the crop classification models using corn samples with different missing rates during the growing season ("random" denotes random missing data, "continuous" denotes continuous missing data).

To explore the impact of the ALPE in detail, we randomly selected a corn sample and extracted the corresponding attention weight matrices from the MCTNet and MCTNet_noALPE. These matrices are from the first head of the first Transformer sub-module of the models. We used bipartite graphs, a widely used visualization tool for such matrices, to display the attention distribution, where the strength of the attention is indicated by the opacity of the edges and nodes, as described by Rußwurm and Körner (2020). Specifically, Fig. 17(a) plots the time-series NIR values of the sample, indicating that the period between DOY 180 and DOY 300 represents the phenological period of the corn. However, the values for DOY 30, 40, 50, 60, 70, 110 and 360 are missing. Fig. 17 (b) and (c) show the attention distributions of MCTNet and MCTNet_noALPE, respectively. It is worth noting that attention weights lower than 90 % were filtered out for enhanced clarity. The attention distribution of MCTNet demonstrates that the attentions are focused on the phenological period of the crop and are almost unaffected by the missing values. In contrast, the attention distribution of MCTNet_noALPE is more dispersed, with the attention being attracted
by the missing values. This figure demonstrates the differences in sensitive to the missing values and achieved superior performance
attention distribution, highlighting how the ALPE module adjusts the without data interpolation and smoothing. Moreover, MCTNet achieved
attention weights to better handle missing data and enhance the model’s superior performance in cross-temporal scenarios, indicating its high
ability to capture relevant temporal dependencies. potential in practical applications. Further work aims to improve the
model’s productivity and generalizability across different regions.
4.3. Limitations and future work
CRediT authorship contribution statement
Although the proposed model achieved superior performance in crop
mapping, there are several limitations for improvement. First, the Yumiao Wang: Writing – original draft, Methodology, Funding
MCTNet model may produce salt-and-pepper noise in classification re­ acquisition, Conceptualization. Luwei Feng: Writing – review & editing,
sults. This noise results from pixel-wise decision-making, which can be Methodology, Funding acquisition, Formal analysis. Weiwei Sun: Re­
pronounced in heterogeneous, small, and broken agricultural landscapes sources, Funding acquisition, Conceptualization. Lihua Wang: Investi­
(like California (Fig. 10)). Further studies will consider the spatial in­ gation, Formal analysis, Data curation. Gang Yang: Visualization,
formation of the input pixel to improve mapping accuracy. For example, Software, Funding acquisition. Binjie Chen: Writing – review & editing,
as suggested by Interdonato et al. (2019), incorporating unlabeled pixels Formal analysis.
surrounding the input pixel in the modeling could enhance performance.
Another limitation of our study is the exclusive use of optical data, whereas recent studies have shown the effectiveness of SAR and hyperspectral data in crop monitoring and classification (Farmonov et al., 2023; Sharifi and Hosseingholizadeh, 2020). We plan to integrate multi-modal remote sensing data in future research to enhance the robustness and accuracy of crop type classification. Lastly, although the current study areas cover two U.S. states with distinct agricultural characteristics, it remains necessary to evaluate the performance of the proposed model in other countries and regions.
5. Conclusions

This study proposed a lightweight CNN-Transformer network, MCTNet, for pixel-based crop mapping using time-series Sentinel-2 images. MCTNet effectively harnesses the advantages of both CNN and Transformer to obtain a comprehensive representation of crop growth patterns. Compared with eight advanced models, MCTNet exhibited better performance in most crop classification cases with the fewest parameters. The evaluation of the impact of missing data demonstrated that the proposed ALPE made MCTNet less sensitive to missing values, enabling superior performance without data interpolation or smoothing. Moreover, MCTNet achieved superior performance in cross-temporal scenarios, indicating its high potential in practical applications. Further work aims to improve the model's productivity and generalizability across different regions.

CRediT authorship contribution statement

Yumiao Wang: Writing – original draft, Methodology, Funding acquisition, Conceptualization. Luwei Feng: Writing – review & editing, Methodology, Funding acquisition, Formal analysis. Weiwei Sun: Resources, Funding acquisition, Conceptualization. Lihua Wang: Investigation, Formal analysis, Data curation. Gang Yang: Visualization, Software, Funding acquisition. Binjie Chen: Writing – review & editing, Formal analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 42201354, No. 42271340), the Zhejiang Provincial Natural Science Foundation of China (No. LQ22D010007), the Public Projects of Ningbo City (No. 2022S101, No. 2023S102), the Ningbo Science and Technology Innovation 2025 Major Special Project (No. 2021Z107, No. 2022Z032), and the China Postdoctoral Science Foundation (2023M742679).


References

Achanccaray, P., Feitosa, R.Q., Rottensteiner, F., Sanches, I.D., Heipke, C., 2017. Spatial-temporal conditional random field based model for crop recognition in tropical regions. In: 2017 IEEE Int. Geosci. Remote Sens. Sympos. (IGARSS). IEEE, pp. 3007–3010.

Asad, M.H., Bais, A., 2020. Weed detection in canola fields using maximum likelihood classification and deep convolutional neural network. Inform. Process. Agric. 7, 535–545.

Ashourloo, D., Shahrabi, H.S., Azadbakht, M., Aghighi, H., Nematollahi, H., Alimohammadi, A., Matkan, A.A., 2019. Automatic canola mapping using time series of Sentinel 2 images. ISPRS J. Photogramm. Remote Sens. 156, 63–76.

Bargiel, D., 2017. A new method for crop classification combining time series of radar images and crop phenology information. Remote Sens. Environ. 198, 369–383.

Blickensdörfer, L., Schwieder, M., Pflugmacher, D., Nendel, C., Erasmi, S., Hostert, P., 2022. Mapping of crop types and crop sequences with combined time series of Sentinel-1, Sentinel-2 and Landsat 8 data for Germany. Remote Sens. Environ. 269, 112831.

Boryan, C., Yang, Z., Mueller, R., Craig, M., 2011. Monitoring US agriculture: the US Department of Agriculture, National Agricultural Statistics Service, Cropland Data Layer program. Geocarto Int. 26, 341–358.

Chen, H., Li, H., Liu, Z., Zhang, C., Zhang, S., Atkinson, P.M., 2023. A novel Greenness and Water Content Composite Index (GWCCI) for soybean mapping from single remotely sensed multispectral images. Remote Sens. Environ. 295, 113679.

Farmonov, N., Amankulova, K., Szatmári, J., Sharifi, A., Abbasi-Moghadam, D., Nejad, S.M.M., Mucsi, L., 2023. Crop type classification by DESIS hyperspectral imagery and machine learning algorithms. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 16, 1576–1588.

Feng, F., Gao, M., Liu, R., Yao, S., Yang, G., 2023. A deep learning framework for crop mapping with reconstructed Sentinel-2 time series images. Comput. Electron. Agric. 213, 108227.

Feng, S., Zhao, J., Liu, T., Zhang, H., Zhang, Z., Guo, X., 2019. Crop type identification and mapping using machine learning algorithms and Sentinel-2 time series data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 12, 3295–3306.

Garnot, V.S.F., Landrieu, L., 2020. Lightweight temporal self-attention for classifying satellite images time series. In: Advanced Analytics and Learning on Temporal Data: 5th ECML PKDD Workshop, AALTD 2020, Ghent, Belgium, September 18, 2020, Revised Selected Papers 6. Springer, pp. 171–181.

Garnot, V.S.F., Landrieu, L., Giordano, S., Chehata, N., 2020. Satellite image time series classification with pixel-set encoders and temporal self-attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12325–12334.

Huang, Y., Qiu, B., Chen, C., Zhu, X., Wu, W., Jiang, F., Lin, D., Peng, Y., 2022. Automated soybean mapping based on canopy water content and chlorophyll content using Sentinel-2 images. Int. J. Appl. Earth Obs. Geoinf. 109, 102801.

Inglada, J., Vincent, A., Arias, M., Tardy, B., Morin, D., Rodes, I., 2017. Operational high resolution land cover map production at the country scale using satellite image time series. Remote Sens. (Basel) 9, 95.

Interdonato, R., Ienco, D., Gaetano, R., Ose, K., 2019. DuPLO: a DUal view point deep learning architecture for time series classificatiOn. ISPRS J. Photogramm. Remote Sens. 149, 91–104.

Li, Z., Chen, G., Zhang, T., 2020. A CNN-transformer hybrid approach for crop classification using multitemporal multisensor images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 13, 847–858.

Liu, M., Chai, Z., Deng, H., Liu, R., 2022. A CNN-transformer network with multiscale context aggregation for fine-grained cropland change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 15, 4297–4306.

Maponya, M.G., van Niekerk, A., Mashimbye, Z.E., 2020. Pre-harvest classification of crop types using a Sentinel-2 time-series and machine learning. Comput. Electron. Agric. 169, 105164.

Martin, E.R., Godwin, I.A., Cooper, R.I., Aryal, N., Reba, M.L., Bouldin, J.L., 2021. Assessing the impact of vegetative cover within Northeast Arkansas agricultural ditches on sediment and nutrient loads. Agric. Ecosyst. Environ. 320, 107613.

Niu, B., Feng, Q., Chen, B., Ou, C., Liu, Y., Yang, J., 2022. HSI-TransUNet: a transformer based semantic segmentation model for crop mapping from UAV hyperspectral imagery. Comput. Electron. Agric. 201, 107297.

Qiu, B., Huang, Y., Chen, C., Tang, Z., Zou, F., 2018. Mapping spatiotemporal dynamics of maize in China from 2005 to 2017 through designing leaf moisture based indicator from normalized multi-band drought index. Comput. Electron. Agric. 153, 82–93.

Rußwurm, M., Körner, M., 2020. Self-attention for raw optical satellite time series classification. ISPRS J. Photogramm. Remote Sens. 169, 421–435.

Sabzekar, M., Hasheminejad, S.M.H., 2021. Robust regression using support vector regressions. Chaos Solitons Fractals 144, 110738.

Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. 61, 85–117.

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.

Sharifi, A., 2020. Remotely sensed vegetation indices for crop nutrition mapping. J. Sci. Food Agric. 100, 5191–5196.

Sharifi, A., Hosseingholizadeh, M., 2020. Application of Sentinel-1 data to estimate height and biomass of rice crop in Astaneh-ye Ashrafiyeh, Iran. J. Ind. Soc. Remote Sens. 48, 11–19.

Sun, Z., Di, L., Fang, H., 2019. Using long short-term memory recurrent neural network in land cover classification on Landsat and Cropland Data Layer time series. Int. J. Remote Sens. 40, 593–614.

Tang, P., Chanussot, J., Guo, S., Zhang, W., Qie, L., Zhang, P., Fang, H., Du, P., 2024. Deep learning with multi-scale temporal hybrid structure for robust crop mapping. ISPRS J. Photogramm. Remote Sens. 209, 117–132.

Van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inf. Proces. Syst. 30.

Wang, H., Chen, X., Zhang, T., Xu, Z., Li, J., 2022a. CCTNet: coupled CNN and transformer network for crop segmentation of remote sensing images. Remote Sens. (Basel) 14, 1956.

Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q., 2020. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11534–11542.

Wang, Y., Feng, L., Sun, W., Zhang, Z., Zhang, H., Yang, G., Meng, X., 2022b. Exploring the potential of multi-source unsupervised domain adaptation in crop mapping using Sentinel-2 images. GISci. Remote Sens. 59, 2247–2265.

Xiang, J., Liu, J., Chen, D., Xiong, Q., Deng, C., 2023. CTFuseNet: a multi-scale CNN-transformer feature fused network for crop type segmentation on UAV remote sensing imagery. Remote Sens. (Basel) 15, 1151.

Xu, J., Zhu, Y., Zhong, R., Lin, Z., Xu, J., Jiang, H., Huang, J., Li, H., Lin, T., 2020. DeepCropMapping: a multi-temporal deep learning approach with improved spatial generalizability for dynamic corn and soybean mapping. Remote Sens. Environ. 247, 111946.

You, N., Dong, J., Huang, J., Du, G., Zhang, G., He, Y., Yang, T., Di, Y., Xiao, X., 2021. The 10-m crop type maps in Northeast China during 2017–2019. Sci. Data 8, 41.

Yuan, Y., Lin, L., 2021. Self-supervised pretraining of transformers for satellite image time series classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 14, 474–487.

Yuan, Y., Lin, L., Liu, Q., Hang, R., Zhou, Z.-G., 2022. SITS-Former: a pre-trained spatio-spectral-temporal representation model for Sentinel-2 time series classification. Int. J. Appl. Earth Obs. Geoinf. 106, 102651.

Zeiler, M.D., Fergus, R., 2013. Visualizing and understanding convolutional networks. arXiv.

Zhang, W., Zhang, H., Zhao, Z., Tang, P., Zhang, Z., 2023. Attention to both global and local features: a novel temporal encoder for satellite image time series classification. Remote Sens. (Basel) 15, 618.

Zhong, L., Hawkins, T., Biging, G., Gong, P., 2011. A phenology-based approach to map crop types in the San Joaquin Valley, California. Int. J. Remote Sens. 32, 7777–7804.

Zhong, L., Hu, L., Zhou, H., 2019. Deep learning based multi-temporal crop classification. Remote Sens. Environ. 221, 430–443.
