Article
MFINet: Multi-Scale Feature Interaction Network for Change
Detection of High-Resolution Remote Sensing Images
Wuxu Ren 1, Zhongchen Wang 1, Min Xia 1,* and Haifeng Lin 2
Abstract: Change detection is widely used in the field of building monitoring. In recent years, the
progress of remote sensing image technology has provided high-resolution data. However, unlike
other tasks, change detection focuses on the difference between dual-input images, so the interaction
between bi-temporal features is crucial. Existing methods, however, have not fully tapped the
potential of multi-scale bi-temporal features to interact layer by layer. Therefore, this paper proposes a
multi-scale feature interaction network (MFINet). The network realizes the information interaction of
multi-temporal images by inserting a bi-temporal feature interaction layer (BFIL) between backbone
networks at the same level, guides the attention to focus on the difference region, and suppresses
the interference. At the same time, a bi-temporal feature fusion layer (BFFL) is used at the end
of the encoding layer to extract subtle difference features. By introducing a transformer decoding
layer that improves the recovery of feature resolution, the ability of the network to accurately
capture the details and contour information of buildings is further improved. The F1 score of our model
on the public dataset LEVIR-CD reaches 90.12%, which shows better accuracy and generalization
performance than many state-of-the-art change detection models.
Keywords: remote sensing images; change detection; transformer; self-attention mechanism; CNN

Citation: Ren, W.; Wang, Z.; Xia, M.; Lin, H. MFINet: Multi-Scale Feature Interaction Network for Change Detection of High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 1269. https://doi.org/10.3390/rs16071269

Academic Editor: Mohammad Awrangjeb

Received: 26 February 2024; Revised: 28 March 2024; Accepted: 2 April 2024; Published: 4 April 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

With the development of earth observation technology and geographic information technology, remote sensing images have become more and more abundant and diverse. The widespread use of satellites, aircraft, and other sensors enables us to capture information on the Earth's surface, including features of terrain, land cover, vegetation, buildings, and other geographical objects [1]. This remote sensing technology can also obtain data in different spectral ranges, including infrared and ultraviolet spectra, which helps us understand surface features more comprehensively [2–4].

With the development of remote sensing technology and the acceleration of urbanization, the problem of change detection has become more complex. It has become an urgent challenge to detect change areas quickly and accurately from the massive amount of land cover remote sensing image data [5]. In this context, the research of building change detection technology has become crucial, as shown in Figure 1. Its main goal is to accurately identify and locate regions where semantic changes have occurred from a pair of time-series remote sensing images, that is, the true change region, and to suppress the influence of the pseudo-change region [6]. This technology has broad application prospects in many fields, including environmental monitoring [7], climate research [8], disaster assessment [9], agricultural management [10], urban planning [11], and water resource management.
(Figure 1: change detection diagram. A bi-temporal image pair (Img1, Img2) is mapped to a change map, with actual change regions distinguished from pseudo-change regions.)
Change detection methods in remote sensing imagery can be divided into pixel-
level [12], feature-level [13], and object-level [14] methods according to the granularity of
the change object. Tracing the development of remote sensing image change detection, the technology can be divided into four stages.
In the first stage, remote sensing technology faced constraints imposed by satellite and
optical instrument limitations, resulting in low data quality. The hallmark of this period
was the adoption of straightforward algebraic calculations or direct pixel comparisons to
generate change detection results. For example, principal component analysis (PCA) [15]
was extensively employed. Celik [16] proposed an unsupervised change detection
method utilizing PCA and k-means clustering. This approach involved segmenting differ-
ential images into non-overlapping blocks, projecting pixels into the feature vector space,
and employing k-means clustering for detection. Another notable example from this stage
is change vector analysis (CVA). Liu et al. [17] introduced a novel multi-scale morphological
compressed change vector analysis method. This method expanded on the spectral-based
compressed change vector analysis approach by jointly analyzing spectral–spatial change
information. It utilized morphological analysis to construct reconstructed spectral change
vector features, preserving more geometric details.
In the second stage, machine learning methods such as support vector machines
(SVMs) [18] and decision trees [19] were introduced. Volpi et al. [20] conducted research
using histogram statistics as fundamental detection features, followed by the application
of SVMs for land-use change detection. Im et al. [19] combined image neighborhood
correlation analysis with change detection methods based on decision tree classification.
Change detection methods based on machine learning algorithms have the capability to
automatically extract features from large-scale remote sensing data, exhibiting excellent
sensitivity to complex and subtle changes. However, these methods generally face the
problem of high computational overhead.
In the third stage, object-level change detection emerged as a departure from pixel-
level change detection. Unlike focusing solely on changes in individual pixels, object-level
change detection emphasizes detecting changes at the level of target objects or entities.
Wang et al. [21] presented an object-based change detection approach,
which integrates spectral, shape, and texture features, employing multiple supervised
classifiers. The accuracy of change detection in urban environments was improved through
the utilization of a weighted voting ensemble strategy. Tan et al. [22] introduced an
object-based multi-feature change detection method, which uses multiple features and
random forests to select features. Object-level methods usually include steps such as object
extraction, feature representation, matching, and context modeling to obtain more accurate
change information. However, these methods can only extract low-level features from images, which are strongly affected by factors such as radiometric differences.
In the fourth stage, recent years have witnessed significant advancements in computer
vision technology, with deep learning providing promising solutions to change detection
problems. Traditional methods for change detection often rely on manually designed
features and rules. Faced with the ever-growing volume of high-resolution remote sens-
ing data, the performance of these methods gradually becomes limited. Deep learning
techniques, particularly the application of convolutional neural network (CNN) [23] and
transformer [24] models, have injected new vitality into change detection [25]. The promi-
nence of deep learning methods in the field of change detection arises from their ability to
learn features from data without the need for manual feature extraction, thereby enhancing
adaptability to change patterns [26]. Through deep learning, the model can automati-
cally capture the contextual information, textural features, and semantic information in
the image.
Existing deep learning-based change detection methods lack interactive expression
between bi-temporal images during the encoding phase, resulting in the isolation of bi-
temporal information and the limited discernibility of actual change regions. Furthermore,
in the decoding phase, the use of excessively high sampling rates and the absence of skip
connections with the encoding module prevent effective multi-scale information fusion.
This lack of fusion, along with poor communication of contextual information, hinders
the layer-wise restoration of image features. Consequently, this leads to numerous false
positives and negatives at segmented edges in the detected images [27]. Our proposed
method aims to enhance the bi-temporal interaction during the feature extraction phase of
Siamese models. It combines the advantages of local feature extraction from CNNs and the
global feature extraction capabilities of transformers. We optimize the overall information
recovery capability during the model’s upsampling process to achieve high-precision,
high-generalization change detection. The main contributions of our work are as follows:
1. A remote sensing image change detection network based on a multi-scale feature inter-
action structure named MFINet is proposed to solve the problem of insufficient attention to changed targets caused by the lack of bi-temporal interaction in change detection tasks.
In the overall structure, we use a combination of a CNN encoder and a transformer
decoder to make full use of the CNN’s local perception and the transformer’s global
receptive field to effectively understand different levels of multi-source information.
2. A bi-temporal feature interaction layer (BFIL) is proposed to act as a medium for
multi-level feature interaction, enhance the semantic information exchange between
the same-level features of the Siamese network, and enhance the multi-temporal
information communication at different time nodes. It is conducive to the model
to discover the actual change regions and suppress the interference of the pseudo-
change region.
3. In order to strengthen the model’s perception of the fine-grained difference between
the bi-temporal deep processing features, we propose the bi-temporal feature fu-
sion layer (BFFL), which integrates rich bi-temporal deep features before image size
restoration by constructing bi-temporal homologous global guidance features.
2. Related Work
2.1. CNN-Based Change Detection Methods
CNNs are favored because of their inductive bias and generalization. Zhan et al. [28]
first introduced a CNN into SiameseNet as a solution for change detection. The twin
network reuses the same codec structure for two temporal images, learns the bi-temporal
image features in an equal way, and obtains the change information. Daudt et al. [29]
introduced a twin fully convolutional network (FCN) into the end-to-end remote sensing
image change detection task, and proposed three different network architectures. FC-EF
uses the method of splicing dual-phase images as input, while FC-Siam-conc and FC-Siam-
diff use a twin FCN structure. Peng et al. [30], based on the UNet++ encoder–decoder
structure, used global and fine information to generate feature maps with high spatial
accuracy. Then, the fusion strategy of multiple auxiliary outputs was used to combine
the change maps of different semantic levels to generate the final change map with high
accuracy. In summary, many researchers have directly transplanted classical models in
semantic segmentation, such as UNet and FCN, to the field of change detection. However,
the change detection task is bi-temporal, which is different from the single temporality of
semantic segmentation. These models often form a twin structure by copying the existing
codec structure, which lacks bi-temporal interaction and is difficult to adapt to change
detection datasets with large time spans. Therefore, Zhang et al. [31] proposed IFNet,
which adopts a two-stream architecture to interact with information twice, and then uses a
deep supervised difference discriminant network (DDN) for change detection. In order
to improve the integrity of the output change map and the internal compactness of the
object, IFNet fuses the multi-level deep features of the original image with the image
difference features through the attention mechanism. Yin et al. [32] proposed SAGNet,
which interspersed the bi-temporal interaction scheme between the coding levels. Through
the hybrid layer and the backbone network combined with the bi-temporal contextual
information, the bi-temporal feature distribution is more similar, and the automatic domain
adaptation between the two time domains is realized to a certain extent. Although the above
methods are all based on CNNs, they mainly focus on the local perception of convolution
kernels, and it is difficult to effectively model remote contextual information in bi-temporal
images, which greatly limits their performance.
3. Methodology
3.1. Overall Structure
The overall structure of the MFINet is shown in Figure 3. The network mainly includes
two stages. The first stage is the encoding stage responsible for feature extraction. There are
three members, including the multi-scale encoding layer of the backbone network ResNet18,
the bi-temporal feature interaction layer, and the bi-temporal feature fusion layer. The main
function of the bi-temporal feature interaction layer is to receive the output of each layer
of the twin ResNet18. These outputs are features extracted from remote sensing images
taken at two different time points, including changes in the target area and background
information. The bi-temporal feature interaction layer allows the network to periodically
focus on pixels at different time points and assign weights according to their importance.
This helps identify and capture the change area and the correlation between the
bi-temporal images. The structure of the bi-temporal feature fusion layer is dual-input
single-output, receiving the deepest information from the twin ResNet18, helping the
network to refine the underlying features of low-resolution high channels, so as to explore
the channel information that is beneficial to distinguish the change area. The second stage
is the decoding stage responsible for feature size recovery. There are two components,
including the transformer decoding layer and the classifier before output. In order to
combine shallow detail information and deep semantic information, the difference feature
maps of the two-way encoding blocks are given to the corresponding decoding blocks by
skip connection. In addition, a classifier is used as a post-processing module at the end of
the decoding layer to achieve binary classification.
Figure 3. The overall structure diagram of the multi-scale feature interaction network; the internal
structure of the transformer layer and the classifier are displayed in the green dotted box.
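To make the data flow described above concrete, the following PyTorch-style sketch outlines one possible reading of the pipeline. The class name MFINetSketch, the placeholder interaction and fusion modules, the decoder layout, and the use of absolute-difference skip features are assumptions made purely for illustration, not the authors' released implementation; refined sketches of the BFIL and BFFL follow later.

# A minimal, hypothetical sketch of the MFINet data flow: a weight-sharing ResNet18
# encoder, per-level bi-temporal interaction, deep fusion, and a decoder that restores
# resolution through skip connections. The BFIL/BFFL stand-ins are placeholders only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class MFINetSketch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        r = resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)   # shared stem
        self.layers = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        chs = [64, 128, 256, 512]
        self.bfils = nn.ModuleList(nn.Identity() for _ in chs)          # placeholder BFILs
        self.bffl = nn.Conv2d(2 * chs[-1], chs[-1], 1)                  # placeholder BFFL
        self.dec = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chs[i] + chs[i - 1], chs[i - 1], 3, padding=1),
                          nn.BatchNorm2d(chs[i - 1]), nn.GELU())
            for i in range(3, 0, -1))
        self.classifier = nn.Conv2d(chs[0], num_classes, 1)

    def forward(self, t1, t2):
        f1, f2 = self.stem(t1), self.stem(t2)                 # weight-sharing twin branches
        skips = []
        for layer, bfil in zip(self.layers, self.bfils):
            f1, f2 = layer(f1), layer(f2)
            f1, f2 = bfil(f1), bfil(f2)                       # bi-temporal interaction
            skips.append(torch.abs(f1 - f2))                  # difference map for skip connection
        x = self.bffl(torch.cat([f1, f2], dim=1))             # deep bi-temporal fusion
        for dec, skip in zip(self.dec, reversed(skips[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))              # combine with the shallower level
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        return self.classifier(x)                             # per-pixel change/no-change logits

out = MFINetSketch()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 2, 256, 256])

The bi-temporal feature interaction layer exchanged between the encoder levels is described next.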
Specifically, let the input single-temporal feature from temporal n be f_n ∈ R^(C×H×W). The feature is first mapped by three identical linear transformer layers. In each layer, the original pixel matrix first compresses the channels to fuse the multi-channel features and is then flattened into a self-attention vector sequence. According to the roles they will be assigned later, these three generated sequences are called the query vector Q_n ∈ R^((C/2)×L), the key vector K_n ∈ R^((C/2)×L), and the value vector V_n ∈ R^((C/2)×L). The process of generating the sequences can be expressed by the following formula:

Q_n = Reshape(Linear(f_n)), K_n = Reshape(Linear(f_n)), V_n = Reshape(Linear(f_n)), (1)
where C represents the number of channels in the feature map and the three vectors, H represents the height of the feature map, W represents the width of the feature map, and L = H × W represents the length of the generated vector sequences. Linear(·) represents the linear layer used to change the channel dimension, and Reshape(·) represents the operation of flattening the matrix into a vector sequence. After obtaining the three vector sequences, the transposed key vector is matrix-multiplied with the query vector, which computes the similarity score between each query and key; the scores are converted by the softmax activation function into the weight A_n, which is used to weight the value vector. This self-attention weighting can be expressed by the following dot-product formulas:
A_n = softmax(K_n^T Q_n), (2)
f_n' = V_n A_n. (3)
Taking n = 1 and n = 2 as examples, we introduce the structure of the bi-temporal feature interaction layer. In this layer, information is allowed to interact and pass between the two temporal phases to better understand image changes.
First, we generate separate query vectors for n = 1 and n = 2. These query vectors represent specific information at the two time points. Then, we exchange the query vectors, applying the query vector of n = 1 to n = 2 and vice versa. In this way, the information interaction between the two temporal phases is realized. Next, we use the exchanged query vector together with the corresponding temporal key vector to calculate the similarity scores of elements across the two temporal phases. These similarity scores determine the correlation of different elements between the two phases for information transmission. Finally, we use these similarity scores as the weights of the self-attention mechanism and exchange them once more to weight the value vectors of the other phase. This produces two outputs, f_1' and f_2', that fuse multi-temporal information. The following two sets of dot-product formulas express the above interaction process:
A_1 = softmax(K_1^T Q_2), (4)
A_2 = softmax(K_2^T Q_1), (5)
f_1' = V_2 A_1, (6)
f_2' = V_1 A_2. (7)
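As a concrete illustration of Equations (1)-(7), the following sketch implements the query exchange in PyTorch; the softmax dimension, the use of 1 × 1 convolutions as the linear layers, and the projection back to the original channel count are not specified in the paper and are chosen here only for illustration.

# A sketch of the bi-temporal feature interaction layer (BFIL) following Eqs. (1)-(7).
# The softmax dimension, the 1x1-convolution linear layers, and the output projection
# back to the original channel count are assumptions, not confirmed details.
import torch
import torch.nn as nn

class BFIL(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c2 = channels // 2
        self.to_q = nn.Conv2d(channels, c2, 1)   # query branch of Eq. (1)
        self.to_k = nn.Conv2d(channels, c2, 1)   # key branch of Eq. (1)
        self.to_v = nn.Conv2d(channels, c2, 1)   # value branch of Eq. (1)
        self.proj = nn.Conv2d(c2, channels, 1)   # assumed: restore the channel count

    def _qkv(self, f):
        # flatten H x W into a sequence of length L = H * W: shape (B, C/2, L)
        return self.to_q(f).flatten(2), self.to_k(f).flatten(2), self.to_v(f).flatten(2)

    def forward(self, f1, f2):
        b, c, h, w = f1.shape
        q1, k1, v1 = self._qkv(f1)
        q2, k2, v2 = self._qkv(f2)
        # exchanged queries: each branch is attended using the other branch's query
        a1 = torch.softmax(k1.transpose(1, 2) @ q2, dim=1)   # Eq. (4), shape (B, L, L)
        a2 = torch.softmax(k2.transpose(1, 2) @ q1, dim=1)   # Eq. (5)
        f1_new = (v2 @ a1).view(b, c // 2, h, w)             # Eq. (6)
        f2_new = (v1 @ a2).view(b, c // 2, h, w)             # Eq. (7)
        return self.proj(f1_new), self.proj(f2_new)

f1, f2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
o1, o2 = BFIL(64)(f1, f2)
print(o1.shape, o2.shape)  # torch.Size([1, 64, 32, 32]) torch.Size([1, 64, 32, 32])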
The existing feature interaction methods often directly perform bi-temporal interaction
at the feature level. For example, the FC-CD series methods [29] interact with features
through pixel-level subtraction and channel cascade. This method easily leads to semantic
information confusion, making it difficult for the model to distinguish the similarities and
differences between the two image groups. By exchanging attention queries and key-value information between the two temporal phases, the BFIL bridges the feature information of the other branch while retaining the single-temporal features. The self-association and the guidance from the parallel branch enhance the model's global attention across the time domain to a certain extent and suppress the interference of pseudo-changes.
In the bi-temporal feature fusion layer, the two deep features are first aggregated into a global guidance weight:

A_g = σ(Conv_1×1(GELU(AvgPool(Concat[f_1, f_2])))), (8)

where f_1 and f_2 represent the input bi-temporal features. Concat[·] represents the channel cascade. AvgPool(·) represents average pooling. GELU(·) represents the GELU activation function [38]. Conv_1×1 represents the 1 × 1 convolution operation. σ(·) represents the Sigmoid
activation function. This weight is used to adjust the original features of the two branches. Performing pixel-wise subtraction between the attention-weighted, channel-compressed feature of one temporal and the original feature of the other helps to deeply mine the latent difference features. Finally, the obtained bi-temporal features are integrated through a channel cascade to improve the information richness of the bi-temporal features.
The formula of the feature fusion operation is expressed as follows:
f_out = Concat[Conv_1×1(A_g ⊙ f_1) − f_2, Conv_1×1(A_g ⊙ f_2) − f_1], (9)

where ⊙ denotes pixel-wise multiplication.
This layer combines global attention and simple difference operations so that our
fusion layer can capture the subtle differences of the transformation more carefully and
comprehensively, thereby improving the accuracy of change detection. Existing fusion
feature algorithms, such as bilateral guided aggregation layer [39] and ensemble channel
attention module, rely on high-channel fusion of multi-scale features, resulting in huge
computational overhead. BFFL is more flexible and efficient, and can better adapt to
transformations in complex scenes. Through the targeted operation of deep features, our
method shows stronger discrimination when dealing with high-channel deep features, thus
providing a more powerful feature expression for change detection tasks.
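Under the same caveat, a compact sketch of the fusion operation is given below; the use of global average pooling, the exact ordering inside the guidance branch (see Equation (8) above), and the channel sizes are assumptions for illustration rather than confirmed implementation details.

# A sketch of the bi-temporal feature fusion layer (BFFL) following the operations of
# Equations (8) and (9); pooling granularity and channel sizes are assumptions.
import torch
import torch.nn as nn

class BFFL(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # global guidance weight A_g: Concat -> AvgPool -> GELU -> Conv1x1 -> Sigmoid
        self.guide = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # assumed global average pooling
            nn.GELU(),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )
        self.reduce1 = nn.Conv2d(channels, channels, 1)  # per-branch 1x1 convs of Eq. (9)
        self.reduce2 = nn.Conv2d(channels, channels, 1)

    def forward(self, f1, f2):
        a_g = self.guide(torch.cat([f1, f2], dim=1))      # guidance weight, shape (B, C, 1, 1)
        d1 = self.reduce1(a_g * f1) - f2                  # weighted T1 minus original T2
        d2 = self.reduce2(a_g * f2) - f1                  # weighted T2 minus original T1
        return torch.cat([d1, d2], dim=1)                 # channel cascade, Eq. (9)

f1, f2 = torch.randn(1, 512, 8, 8), torch.randn(1, 512, 8, 8)
print(BFFL(512)(f1, f2).shape)  # torch.Size([1, 1024, 8, 8])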
(Figure: structure of the bi-temporal feature fusion layer. The T1 and T2 features f_1 and f_2 are concatenated channel-wise and passed through average pooling, GELU, and 1 × 1 convolutions to form the global guidance weight, which is multiplied pixel-wise with each branch before the difference and concatenation operations of Equation (9).)
4. Experiment
4.1. Datasets
4.1.1. LEVIR-CD
As shown in Figure 6, the dataset uses large-scale and high-resolution remote sensing
images obtained by Google Earth, and the target changes include various types of buildings
in urban and rural areas such as homes and warehouses. Containing multiple sets of
image data, the time span between different groups varies, and the introduction of seasonal
changes and changes caused by illumination can effectively verify the network’s ability to
focus on target changes. The details of the dataset are shown in Table 1.
Figure 6. LEVIR-CD diagram. Each column of (a–e) represents a sample. The first and second rows
show the bi-temporal remote sensing images, and the third row shows the ground truth.
4.1.2. GZ-CD
As shown in Figure 7, the dataset captures Guangzhou in 2006 and 2019 using 19 pairs
of remote sensing images obtained from Google Earth. The target changes in the dataset
include various types of buildings. It is worth noting that GZ-CD contains a small number
of samples, so the degree to which the network relies on large amounts of labeled data can be assessed by comparing its performance here with that on the other datasets [40]. The details of the dataset are
shown in Table 1.
Figure 7. GZ-CD diagram. Each column of (a–e) represents a sample. The first and second rows show
the bi-temporal remote sensing images, and the third row shows the ground truth.
4.1.3. Lebedev Dataset
As shown in Figure 8, the target changes in this dataset included man-made objects, such as roads, cars, and buildings, and natural objects, such as individual trees and forests. Significant seasonal differences led to large brightness
changes, which made it difficult for the network to distinguish between target changes and
background changes [41]. The details of the dataset are shown in Table 1.
Figure 8. Lebedev dataset diagram. Each column of (a–e) represents a sample. The first and second
rows show the bi-temporal remote sensing images, and the third row shows the ground truth.
lr_base × (1 − epoch / max_epoch) (14)
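For reference, the schedule of Equation (14) is a linear (polynomial, power 1) decay of the base learning rate over the training epochs, as in the following minimal sketch; lr_base and max_epoch denote the configured base rate and the total number of epochs.

# Linear decay of the learning rate with the training epoch, as in Equation (14).
def poly_lr(lr_base: float, epoch: int, max_epoch: int) -> float:
    return lr_base * (1 - epoch / max_epoch)

print(poly_lr(1e-3, 50, 100))  # 0.0005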
Five typical indicators were used to evaluate the performance of change detection; for all of them, higher values are better. Four of them evaluate the changed targets: Precision (P), Recall (R), Intersection over Union (IoU), and the F1 score; one evaluates the overall classification accuracy: Overall Accuracy (OA). Formally, the five indicators are defined as
P = TP / (TP + FP), (15)
R = TP / (TP + FN), (16)
F1 = 2 / (P^(-1) + R^(-1)), (17)
IoU = TP / (TP + FP + FN), (18)
OA = (TP + TN) / (TP + TN + FP + FN), (19)
where TP, TN, FP, and FN represent the quantities of true positives, true negatives, false
positives, and false negatives, respectively.
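The following helper, written only for illustration (it omits zero-division guards), computes the five indicators of Equations (15)-(19) from a binary prediction and its ground-truth mask.

# Compute the five evaluation indicators from binary prediction and ground-truth arrays.
import numpy as np

def change_detection_metrics(pred, gt):
    """pred, gt: binary arrays where 1 marks changed pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp)                      # Eq. (15)
    recall = tp / (tp + fn)                         # Eq. (16)
    f1 = 2 / (1 / precision + 1 / recall)           # Eq. (17)
    iou = tp / (tp + fp + fn)                       # Eq. (18)
    oa = (tp + tn) / (tp + tn + fp + fn)            # Eq. (19)
    return dict(P=precision, R=recall, F1=f1, IoU=iou, OA=oa)

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
print(change_detection_metrics(pred, gt))  # P=0.667, R=0.667, F1=0.667, IoU=0.5, OA=0.667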
Method | LEVIR-CD F1 (%) | LEVIR-CD IoU (%) | GZ-CD F1 (%) | GZ-CD IoU (%)
Backbone | 86.54 | 78.41 | 82.70 | 71.45
Backbone + BFIL | 87.53 | 80.95 | 84.09 | 73.97
Backbone + BFIL + BFFL | 88.11 | 81.93 | 84.90 | 74.19
Backbone + BFIL + BFFL + Dec. (CNN) | 89.96 | 82.12 | 85.59 | 74.44
Backbone + BFIL + BFFL + Dec. (Transformer) | 90.12 | 82.33 | 86.08 | 74.87
1. The influence of BFIL: It is difficult for a simple twin CNN network to discover the shared and distinct features of the bi-temporal inputs, and this bi-temporal mutual understanding degrades as the network deepens.
Therefore, we added a BFIL to the backbone network to strengthen the interactive
attributes of bi-temporal features, and used the attention weight as an interactive
means. The experimental results show that the BFIL can help the network to improve
the accurate detection of changing targets in the coding stage. For LEVIR-CD, F1
increased by 0.99% and IoU increased by 2.54%. For GZ-CD, F1 increased by 1.39%
and IoU increased by 2.52%.
2. The influence of BFFL: The fusion of deep bi-temporal features is a stringent test of a module's lightweight design and its ability to extract difference features. Simple pixel subtraction or channel cascade easily confuses features, while the BFFL
reduces the occurrence of feature confusion through multiple residual connections.
The experimental results show that the BFFL bi-temporal feature fusion significantly
increases the segmentation accuracy of the changed region features. For LEVIR-CD,
F1 increased by 0.58% and IoU increased by 0.98%. For GZ-CD, F1 increased by 0.81%
and IoU increased by 0.22%.
3. The influence of decoder selection: We compared two kinds of decoder methods.
One is ResNet18, which is consistent with the encoder, and the other is the swin
transformer used in our model. In terms of experimental results, the improvement in
indicators in the changing region is limited. The F1 for LEVIR-CD increased by 0.16%,
and IoU increased by 0.21%. For GZ-CD, F1 increased by 0.49% and IoU increased
by 0.43%.
encoder structure of Unet++_MSOF and SNUNet [42], IFNet uses channel attention and
spatial attention to optimize the feature weight distribution in the process of multi-scale
skip connections. SAGNet and SAFNet [43] add a bi-temporal interaction layer between the
encoding layers to communicate the semantic information of the twin branches. Secondly,
there are models combining transformers and self-attention mechanisms, such as STANet,
which models spatio-temporal relationships through multi-scale pooling and self-attention
mechanisms. DASNet [44] introduces a dual attention mechanism to capture long-distance
dependencies and enhance feature representation to improve the recognition performance
of the model. BIT uses a CNN in the initial feature extraction, and uses a transformer
encoder and decoder to correlate bi-temporal information in the form of sequences in the
middle and late stages. These methods have achieved competitive performance on various
change detection datasets. Figures 9–11 qualitatively show the prediction graphs of each
method on three datasets, where different colors are assigned to identify the correctness or
inaccuracy of the detection, including TP (white), TN (black), FP (red), and FN (green).
Figure 9. The qualitative performance visualization of different methods on LEVIR-CD. (a–d) denote
the prediction results of all comparison methods for different samples. In the color classification,
the true positive is white, the true negative is black, the false positive is red, and the false negative
is green.
Figure 10. The qualitative performance visualization of different methods on GZ-CD. (a–d) denote
the prediction results of all comparison methods for different samples. In the color classification,
the true positive is white, the true negative is black, the false positive is red, and the false negative
is green.
Figure 11. The qualitative performance visualization of different methods on the Lebedev dataset.
(a–d) denote the prediction results of all comparison methods for different samples. In the color
classification, the true positive is white, the true negative is black, the false positive is red, and the
false negative is green.
Table 3. The comparison results of different comparison models in the LEVIR-CD test set (bold
numbers represent the optimal results).
distinguish between true and false changes. Secondly, for Figure 10c, both temporal images contain
buildings, but the small buildings in Img1 are removed in Img2. This bi-temporal image is
a typical case to test the bi-temporal interaction ability of the model. All the comparison
models visually missed the small, white building on the right side, and MFINet successfully
achieved accurate detection. This shows that MFINet has advantages in capturing small
details in the spatio-temporal changes of images, which helps to better understand and
utilize temporal information. Finally, in Figure 10d, the target size involved is small and
easily ignored. Relying on the advantages of global feature extraction, the transformer-
based method can identify the approximate area of small targets, but it is also accompanied
by more missed detection. In contrast, our proposed model achieves the lowest missed
detection rate, further confirming the superiority of MFINet in small target detection. These
visualization results provide an intuitive confirmation of our model performance, and also
highlight the advantages of MFINet over other models in different scenarios and challenges.
Table 4. The comparison results of different comparison models in the GZ-CD test set (bold numbers
represent the optimal results).
Table 5. The comparison results of different comparison models in the Lebedev dataset test set (bold
numbers represent the optimal results).
4.5. Discussion
4.5.1. Comprehensive Efficiency Analysis of the Models
This paper aims to achieve high-precision detection while reducing computational
complexity. Therefore, for LEVIR-CD, we conducted a comprehensive analysis and com-
parison of the network from multiple perspectives, including floating-point operations
(FLOPs), number of parameters (Params), inference time, and F1 score. FLOPs are reported in multiply-accumulate operations (MACs). We randomly selected 1000 images of 256 × 256 pixels in
the validation set for the inference operation, and averaged all the results to evaluate the
inference time of the model. The specific results are shown in Table 6. MFINet performed
well on multiple performance indicators. Although the FC-CD series had a slight advantage in computational cost, MFINet required far fewer FLOPs and Params than the other transformer-based models while achieving the highest F1 value. This shows that MFINet greatly
reduces the computational burden while achieving high performance, and provides a more
efficient solution for practical applications. However, it is worth noting that because our
model uses GELU as the activation function many times in the bi-temporal feature fusion
layer and decoder, the inference time does not show an advantage over other comparison
models. Although it is competitive in computational cost, it also suggests that we can
consider the choice of activation function when further optimizing the model to further
improve the inference speed. In general, MFINet achieves excellent detection results with
less computational cost, and is more friendly to hardware devices. This provides a more
feasible choice for actual deployment, especially in resource-constrained environments.
MFINet shows potential in high-performance target detection.
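As a hedged illustration of how such measurements can be reproduced, the sketch below counts parameters and averages the inference time over repeated forward passes; the model, run count, and input size are placeholders, and counting FLOPs/MACs would additionally require an external profiler (for example, ptflops or thop), which is omitted here.

# Sketch: count parameters and measure average single-image inference time.
# On a GPU, add torch.cuda.synchronize() around the timed loop for accurate timings.
import time
import torch

def profile(model, n_runs=100, size=256, device="cpu"):
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6      # Params in millions
    x1 = torch.randn(1, 3, size, size, device=device)
    x2 = torch.randn(1, 3, size, size, device=device)
    with torch.no_grad():
        model(x1, x2)                                                # warm-up pass
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x1, x2)
        ms = (time.perf_counter() - start) / n_runs * 1000           # average ms per image pair
    return params_m, ms

# Example with the hypothetical MFINetSketch defined earlier:
# params_m, ms = profile(MFINetSketch())
# print(f"Params: {params_m:.2f} M, inference: {ms:.2f} ms")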
5. Conclusions
The multi-scale feature interaction network proposed in this paper provides an inno-
vative solution for remote sensing image change detection tasks. Unlike existing models that depend on very deep encoders, our model achieves efficient information
interaction for multi-temporal remote sensing images through lightweight encoding and
bi-temporal feature interaction. At the same time, the transformer decoding layer is in-
troduced in the decoding stage of the network architecture, which effectively improves
the recovery effect of the feature size, and makes the network capture the details and
contour information of the building more accurately in the output stage. The model shows
high change area detection accuracy and overall image prediction accuracy on datasets of
different scales, and the computational overhead is far lower than that of similar models. It
shows strong generalization ability and is suitable for remote sensing images of different
scenes and time scales.
Author Contributions: Conceptualization, W.R. and M.X.; methodology, W.R. and Z.W.; software,
W.R. and Z.W.; validation, W.R. and Z.W.; formal analysis, H.L.; investigation, W.R.; resources, M.X.
and H.L.; data curation, W.R.; writing—original draft preparation, W.R.; writing—review and editing,
M.X.; visualization, Z.W.; supervision, M.X.; project administration, M.X.; funding acquisition, W.R.
All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported in part by the National Natural Science Foundation of PR China
(42075130).
Data Availability Statement: The data and the code of this study are available from the corresponding
author upon request.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Ding, L.; Xia, M.; Lin, H.; Hu, K. Multi-Level Attention Interactive Network for Cloud and Snow Detection Segmentation. Remote
Sens. 2024, 16, 112. [CrossRef]
2. Peng, X.; Zhong, R.; Li, Z.; Li, Q. Optical Remote Sensing Image Change Detection Based on Attention Mechanism and Image
Difference. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7296–7307. [CrossRef]
3. Marin, C.; Bovolo, F.; Bruzzone, L. Building Change Detection in Multitemporal Very High Resolution SAR Images. IEEE Trans.
Geosci. Remote Sens. 2015, 53, 2664–2682. [CrossRef]
4. Wang, Z.; Xia, M.; Weng, L.; Hu, K.; Lin, H. Dual Encoder–Decoder Network for Land Cover Segmentation of Remote Sensing
Image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2372–2385. [CrossRef]
5. Fang, S.; Li, K.; Li, Z. Changer: Feature Interaction is What You Need for Change Detection. IEEE Trans. Geosci. Remote Sens. 2023,
61, 5610111. [CrossRef]
6. Diakogiannis, F.; Waldner, F.; Caccetta, P. Looking for change? Roll the Dice and demand Attention. Remote Sens. 2021, 13, 3707.
[CrossRef]
7. Willis, K.S. Remote sensing change detection for ecological monitoring in United States protected areas. Biol. Conserv. 2015,
182, 233–242. [CrossRef]
8. Jin, H.; He, W.; Liu, Q.; Wang, J.; Feng, G. The applicability of research on moving cut data-approximate entropy on abrupt
climate change detection. Theor. Appl. Climatol. 2016, 124, 475–486. [CrossRef]
9. Qiao, H.; Wan, X.; Wan, Y.; Li, S.; Zhang, W. A novel change detection method for natural disaster detection and segmentation
from video sequence. Sensors 2020, 20, 5076. [CrossRef]
10. Lunetta, R.S.; Knight, J.F.; Ediriwickrema, J.; Lyon, J.G.; Worthy, L.D. Land-cover change detection using multi-temporal MODIS
NDVI data. In Geospatial Information Handbook for Water Resources and Watershed Management; CRC Press: Boca Raton, FL, USA,
2022; Volume II, pp. 65–88.
11. Zhang, Z.; Liu, F.; Zhao, X.; Wang, X.; Shi, L.; Xu, J.; Yu, S.; Wen, Q.; Zuo, L.; Yi, L.; et al. Urban Expansion in China Based on
Remote Sensing Technology: A Review. Chin. Geogr. Sci. 2018, 28, 727–743. [CrossRef]
12. Rokni, K.; Ahmad, A.; Solaimani, K.; Hazini, S. A new approach for surface water change detection: Integration of pixel level
image fusion and image classification techniques. Int. J. Appl. Earth Obs. Geoinf. 2015, 34, 226–234. [CrossRef]
13. Wiratama, W.; Lee, J.; Sim, D. Change detection on multi-spectral images based on feature-level U-Net. IEEE Access 2020,
8, 12279–12289. [CrossRef]
14. Xu, L.; Jing, W.; Song, H.; Chen, G. High-resolution remote sensing image change detection combined with pixel-level and
object-level. IEEE Access 2019, 7, 78909–78918. [CrossRef]
15. Maćkiewicz, A.; Ratajczak, W. Principal components analysis (PCA). Comput. Geosci. 1993, 19, 303–342. [CrossRef]
16. Celik, T. Unsupervised Change Detection in Satellite Images Using Principal Component Analysis and k-Means Clustering. IEEE
Geosci. Remote Sens. Lett. 2009, 6, 772–776. [CrossRef]
17. Liu, S.; Du, Q.; Tong, X.; Samat, A.; Bruzzone, L.; Bovolo, F. Multiscale Morphological Compressed Change Vector Analysis for
Unsupervised Multiple Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 4124–4137. [CrossRef]
18. Bovolo, F.; Bruzzone, L.; Marconcini, M. A novel approach to unsupervised change detection based on a semisupervised SVM
and a similarity measure. IEEE Trans. Geosci. Remote Sens. 2008, 46, 2070–2082. [CrossRef]
19. Im, J.; Jensen, J.R. A change detection model based on neighborhood correlation image analysis and decision tree classification.
Remote Sens. Environ. 2005, 99, 326–340. [CrossRef]
20. Volpi, M.; Tuia, D.; Bovolo, F.; Kanevski, M.; Bruzzone, L. Supervised change detection in VHR images using contextual
information and support vector machines. Int. J. Appl. Earth Obs. Geoinf. 2013, 20, 77–85. [CrossRef]
21. Wang, X.; Liu, S.; Du, P.; Liang, H.; Xia, J.; Li, Y. Object-Based Change Detection in Urban Areas from High Spatial Resolution
Images Based on Multiple Features and Ensemble Learning. Remote Sens. 2018, 11, 276. [CrossRef]
22. Tan, K.; Zhang, Y.; Wang, X.; Chen, Y. Object-Based Change Detection Using Multiple Classifiers and Multi-Scale Uncertainty
Analysis. Remote Sens. 2019, 10, 359. [CrossRef]
23. Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017,
29, 2352–2449. [CrossRef] [PubMed]
24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need.
Adv. Neural Inf. Process. Syst. 2017, 30.
25. Liu, M.; Chai, Z.; Deng, H.; Liu, R. A CNN-transformer network with multiscale context aggregation for fine-grained cropland
change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4297–4306. [CrossRef]
26. Ren, H.; Xia, M.; Weng, L.; Hu, K.; Lin, H. Dual-Attention-Guided Multiscale Feature Aggregation Network for Remote Sensing
Image Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4899–4916. [CrossRef]
27. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS
2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022. [CrossRef]
28. Zhan, Y.; Fu, K.; Yan, M.; Sun, X.; Wang, H.; Qiu, X. Change Detection Based on Deep Siamese Convolutional Network for Optical
Aerial Images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1845–1849. [CrossRef]
29. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th
IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067.
30. Peng, D.; Zhang, Y.; Guan, H. End-to-End Change Detection for High Resolution Satellite Images Using Improved UNet++.
Remote Sens. 2019, 11, 1382. [CrossRef]
31. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change
detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [CrossRef]
32. Yin, H.; Weng, L.; Li, Y.; Xia, M.; Hu, K.; Lin, H.; Qian, M. Attention-guided siamese networks for change detection in high
resolution remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2023, 117, 103206. [CrossRef]
33. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection.
Remote Sens. 2020, 12, 1662. [CrossRef]
34. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE
Trans. Geosci. Remote Sens. 2022, 60, 1–13. [CrossRef]
35. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection With Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60,
1–14. [CrossRef]
36. Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change detection on remote sensing images using dual-branch multilevel intertemporal
network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [CrossRef]
37. Chen, C.P.; Hsieh, J.W.; Chen, P.Y.; Hsieh, Y.K.; Wang, B.S. SARAS-net: Scale and relation aware siamese network for change
detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023;
Volume 37, pp. 14187–14195.
38. Hendrycks, D.; Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv
2016, arXiv:1610.02136.
39. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic
segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [CrossRef]
40. Peng, D.; Bruzzone, L.; Zhang, Y.; Guan, H.; Ding, H.; Huang, X. SemiCDNet: A Semisupervised Convolutional Neural Network
for Change Detection in High Resolution Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5891–5906. [CrossRef]
41. Lebedev, M.A.; Vizilter, Y.V.; Vygolov, O.V.; Knyaz, V.A.; Rubis, A.Y. Change detection in remote sensing images using conditional
adversarial networks. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, XLII-2, 565–571. [CrossRef]
42. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images. IEEE
Geosci. Remote Sens. Lett. 2022, 19, 1–5. [CrossRef]
43. Yin, H.; Ma, C.; Weng, L.; Xia, M.; Lin, H. Bitemporal Remote Sensing Image Change Detection Network Based on Siamese-
Attention Feedback Architecture. Remote Sens. 2023, 15, 4186. [CrossRef]
44. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual Attentive Fully Convolutional Siamese
Networks for Change Detection in High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14,
1194–1206. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.