
International Journal of Applied Earth Observations and Geoinformation 101 (2021) 102348

Contents lists available at ScienceDirect

International Journal of Applied Earth Observations and Geoinformation

journal homepage: www.elsevier.com/locate/jag

ADS-Net: An attention-based deeply supervised network for remote sensing image change detection

Decheng Wang a, Xiangning Chen a,*, Mingyong Jiang a, Shuhan Du a, Bijie Xu a, Junda Wang b

a Space Information Academy, Space Engineering University, Beijing 101416, China
b Beijing Satellite Navigation Center, Beijing 100094, China

A R T I C L E  I N F O

Keywords:
Change detection
Attention mechanism
Deep supervision network
Difference feature

A B S T R A C T

Change detection technology is an important key to analyzing remote sensing data and is of great significance for accurate comprehension of the earth's surface changes. With the continuous development and progress of deep learning technology, fully convolutional neural networks are gradually being applied in remote sensing change detection tasks. Present methods mainly encounter the problems of simple network structure, poor detection of small change areas, and poor robustness, since they cannot completely capture the relationships and differences between the features of bi-temporal images. To solve such problems, we propose an attention mechanism-based deep supervision network (ADS-Net) for the change detection of bi-temporal remote sensing images. First, an encoding-decoding fully convolutional network is designed with a dual-stream structure. Features of various levels are extracted from the bi-temporal images in the encoding stage; then, in the decoding stage, the feature maps of different levels are fed into the branches of a deep supervision network to reconstruct the change map. Ultimately, to obtain the final change detection map, the prediction results of each branch in the deep supervision network are fused with various weights. To highlight the characteristics of change, we propose an adaptive attention mechanism combining spatial and channel features to capture the relationships of changes at different scales and achieve more accurate change detection. ADS-Net has been tested on the challenging LEVIR-CD and SVCD remote sensing image change detection datasets. The results of quantitative analysis and qualitative comparison indicate that ADS-Net offers better effectiveness and robustness than other state-of-the-art change detection methods.

1. Introduction

Remote sensing change detection is the technology of utilizing remote sensing images covering the same surface area in different periods, integrating them with corresponding features and remote sensing imaging mechanisms, and analyzing the changes of objects' locations, statuses, and features in the area (Bruzzone and Bovolo, 2013). The present work aims to find the change information of objects of interest and to filter out irrelevant change information and unchanging information as interference factors. With the advent of high-resolution optical sensors (e.g., WorldView-3, GeoEys-1, QuickBird and Gaofen-2), bi-temporal remote sensing data can be easily obtained, expanding the application range of high-resolution bi-temporal image change detection. At present, change detection, as one of the main issues in Earth observation, is extensively used in many fields, such as urban planning, military reconnaissance, environmental monitoring, and disaster assessment (Ji et al., 2019; Alcantarilla et al., 2018; Qiao et al., 2020; Ye et al., 2021).

Traditional remote sensing change detection methods are mainly classified into three categories: (1) image arithmetic-based methods; (2) image transformation-based methods; and (3) object classification-based methods. In the image arithmetic-based methods, arithmetic operations are directly performed on the corresponding pixels of two images to create a difference map. The spatial resolution of remote sensing images was low in the early development of change detection technology; hence, to detect ground surface changes, the spectral characteristics of single pixels are generally used. Common methods are image difference (Singh, 1986), image ratio (Howarth and Wickware, 2007), and multi-operator fusion (Zheng et al., 2013). Since such methods are typically susceptible to noise, it is difficult to obtain the changes of objects of interest; an appropriate threshold must be found to separate the changed and unchanged pixels, which often leaves the change detection effect poor.

* Corresponding author.
E-mail addresses: [email protected] (D. Wang), [email protected] (X. Chen), [email protected] (M. Jiang), [email protected] (S. Du), [email protected] (B. Xu), [email protected] (J. Wang).

https://doi.org/10.1016/j.jag.2021.102348
Received 20 January 2021; Received in revised form 2 April 2021; Accepted 20 April 2021
Available online 30 April 2021
0303-2434/© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

The image transformation-based change detection methods transform the remote sensing image into a specific feature space while emphasizing the changed pixels and suppressing the unchanged pixels (Wang et al., 2018). The main idea is to use algorithms including independent component analysis (ICA) (Yi-Quan et al., 2016), iteratively reweighted multivariate alteration detection (IRMAD) (Wang et al., 2015), change vector analysis (CVA) (Singh and Talwar, 2015), principal component analysis (PCA) (Wei et al., 2016), and slow feature analysis (SFA) (Wu et al., 2014) to analyze and transform the features of the bi-temporal images; changes are then detected based on the transformed features. These methods improved the accuracy of remote sensing change detection to a certain extent. However, choosing the most appropriate image transformation-based method for a specific area is a very difficult process, and the different algorithms above cannot be applied to all change detection tasks owing to their specific characteristics. The change detection methods based on object classification use the spectrum, texture, structure, and geometric features of bi-temporal images for similarity analysis. In object classification-based change detection methods (Han et al., 2020), a post-classification strategy is usually used: first, the objects of interest are extracted from the bi-temporal images, then compared and analyzed to generate the final change map. Although good change detection effects have been achieved, the process is relatively complex, and the effect of change detection depends on the accuracy of the object classification.

Table 1
Summary of traditional and deep learning change detection methods.

Category | Example studies
Traditional, image arithmetic-based methods | Singh (1986); Howarth and Wickware (2007); Zheng et al. (2013)
Traditional, image transformation-based methods | Yi-Quan et al. (2016); Wang et al. (2015); Singh and Talwar (2015); Zhao (2011); Bovolo and Member (2012); Thonfeld and Feilhauer (2016); Wei et al. (2016); Wu et al. (2014)
Traditional, object classification-based methods | Han et al. (2020); Xin et al. (2018); Zhang et al. (2018); Tan et al. (2019)
Deep learning change detection methods | Ji et al. (2019); Wang et al. (2018); Peng and Guan (2019); Daudt (2018); Zhang et al. (2020); Chen and Shi (2020); Liu et al. (2019)
With the development of deep learning technology, it has become easier to utilize convolutional neural networks (CNNs) to extract high-level features of images. The good generalization capability of high-level features is very helpful for detecting change information. Current CNN-based remote sensing change detection models have achieved high accuracy rates and better effects than the traditional methods (Zhan et al., 2017). Ji et al. (Ji et al., 2019) proposed a CNN-based change detection framework to locate changed building instances and changed building pixels from high-resolution aerial images. The building extraction network is run through two extensively used structures: Mask R-CNN for object instance segmentation and a multi-scale fully convolutional network for pixel-based semantic segmentation. This method reached an average precision (AP) of 63% at the object (building instance) level. The region-based Faster R-CNN method was applied by Wang et al. (Wang et al., 2018) for the change detection of high-resolution remote sensing images. Compared to the traditional methods and other deep learning-based change detection methods, it eliminated numerous erroneous changes and achieved higher detection accuracy; the Kappa coefficient reached 71.16% and 79.42% on the two datasets, respectively. Peng et al. (Peng and Guan, 2019) proposed a novel end-to-end CD method based on the encoder-decoder architecture UNet++ for semantic segmentation. It concatenates the registered image pairs as the input of the enhanced UNet++ network and generates the final change detection map. Its F1-Score reached 87.56%, achieving the best performance among all the comparative SOTA methods. Daudt et al. (Daudt, 2018) proposed two Siamese extensions of fully convolutional networks; during the training process, the image difference and the image concatenation features were respectively fused, representing better performance and faster detection speed. Then, a deeply supervised image fusion network (DSIFN) was proposed by Zhang et al. (Zhang et al., 2020) based on the Siamese network. The highly representative deep features of the bi-temporal images are extracted through a fully convolutional dual-stream structure, and the extracted deep features are fed into a deeply supervised difference discrimination network for change detection. The F1-Score of DSIFN reached 90.3%, which is superior to other benchmark methods and yields changed areas with complete boundaries and high internal compactness. Chen et al. (Chen and Shi, 2020) presented a novel spatial-temporal attention neural network based on a Siamese structure and designed a CD self-attention mechanism to calculate attention weights between any two pixels at various times and positions. They improved the F1-score of the baseline model from 83.9% to 87.3% with acceptable computational overhead. A new type of deep neural network architecture was proposed by Liu et al. (Liu et al., 2019) based on information transmission and an attention mechanism. In the design of the DNN structure, an information transmission module was introduced for information interaction and transmission, and the mechanism is utilized to give corresponding attention weights to the bi-temporal image features. The F1-Score of this network is 7.4% higher than the original CNN. To facilitate retrieval, we summarize the relevant literature of the above-mentioned change detection methods in Table 1.

The change detection methods based on deep learning have better performance than the traditional methods. They can learn change characteristics from the sample label information in a supervised manner and detect the changes of the regions and features of interest. However, there are some limitations in the structure and function of existing change detection networks. Thus, the corresponding solutions are proposed in the present paper to the following problems:

(1) The rich information of pixels in high-resolution remote sensing images is not completely utilized. As the resolution of remote sensing images increases, the requirements for change detection pixel segmentation become higher. Most of the existing methods apply the classic semantic segmentation networks U-Net and U-Net++ ((Peng and Guan, 2019); (Wiratama et al., 2020)), where the change detection effect is not good. In this paper, we try to focus on the important information in the spatial and channel domains and use the attention mechanism to concatenate the two modules. We combine features of various scales to construct a deep supervision network and fuse low-level and high-level features with various weights to improve the change detection precision.

(2) Insufficient feature fusion results in a poor training effect of the change detection network. The present change detection feature fusion methods are generally divided into pre-fusion and post-fusion (Wiratama and Sim, 2019). Pre-fusion denotes concatenating the two images, or feeding the difference of the bi-temporal images into the network, for feature extraction and change detection. Early feature maps cannot present the deep information of a single original image, are more sensitive to noise, and easily lead to the accumulation of errors. Post-fusion is based on using two identical networks to process each bi-temporal image separately and then passing the respectively extracted features through an integrated network for change detection (Liu et al., 2019). However, the change maps are greatly impacted by the quality of the training images in the process of discriminating the difference. In this paper, we adopt a mid-layer fusion method, as sketched below: after extracting each layer's features, the bi-temporal features in the encoding stage are concatenated with the output of the former layer in the decoding part, and the bi-temporal feature maps are concatenated with their difference maps as input for each decoding layer. In this way, we obtain a sufficient and effective fusion of the bi-temporal features.

Fig. 1. The proposed ADS-Net architecture.

In this paper, we propose an attention-based deeply supervised network (ADS-Net) for remote sensing image change detection. First, to extract the multi-layer features of the bi-temporal images, a dual-stream fully convolutional neural network is used. Then, a fusion attention module of channel and spatial features is added to the decoding part of the network, and the features of each layer are merged to obtain the prediction maps of the multiple supervision modules. Ultimately, the prediction results of each branch supervision network are weighted and fused to form the ultimate prediction result, realizing the function of deep supervision.
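As a minimal illustration of the mid-layer fusion described in problem (2), the following sketch assumes PyTorch NCHW tensors; the exact fusion operator (absolute difference versus signed difference) is our reading of the description, and the names are illustrative:

```python
import torch

# Bi-temporal feature maps from one encoder level: (batch, channels, H, W).
f1 = torch.randn(4, 64, 64, 64)  # features of the T1 image
f2 = torch.randn(4, 64, 64, 64)  # features of the T2 image

# Mid-layer fusion: concatenate both feature maps with their absolute
# difference along the channel axis, so a decoding layer sees the original
# bi-temporal features and an explicit change signal at the same time.
fused = torch.cat([f1, f2, torch.abs(f1 - f2)], dim=1)  # (4, 192, 64, 64)
```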


Fig. 2. The internal structure diagram of the attention fusion module.

The rest of the paper is organized as follows. Section 2 introduces the proposed method in detail. The change detection experiments and results analysis are presented in Section 3. The discussion is presented in Section 4, and the paper is concluded in Section 5.

2. Methodology

In this section, we present the proposed ADS-Net method from four aspects: the basic network structure, the attention fusion module, deep supervision fusion, and the loss function.

2.1. Basic network structure

Fig. 1 represents the proposed attention-based deeply supervised network (ADS-Net) for remote sensing change detection. The network structure consists of encoding and decoding parts. First, in the encoding part, a pair of Siamese networks (Daudt, 2018) is utilized to extract the features of the registered bi-temporal images; the feature pairs of different sizes are then concatenated with their differences and input to the decoding stage, where the change map is reconstructed through deconvolution operations. Ultimately, the reconstructed change maps of the various decoding layers are weighted and fused to form the ultimate change detection result.

In the encoding stage, both networks contain convolution modules (conv1, conv2, …, conv5) extracting the features of the bi-temporal images (T1, T2) respectively, obtaining 5 pairs of feature layers of different sizes. The first two convolution modules, conv1 and conv2, each comprise two convolutional layers and a max pooling layer. Conv3 and conv4 each contain three convolutional layers and a max pooling layer. Conv5 includes three convolutional layers. It is worth noting that after each convolutional layer in the network, a ReLU activation function, BatchNorm regularization, and a Dropout operation are added to guarantee strongly expressive feature layers while preventing overfitting. The layers in different colors in Fig. 1 represent the feature layers of different scales output by each convolution module. Furthermore, to effectively obtain different information and relieve data pressure, the weights and structure of the T1 and T2 networks are shared; hence, the bi-temporal image features are converted into the same space for comparison.

Four decoding modules are designed in the decoding stage to reconstruct the change maps from the feature maps of different sizes; the various change maps obtained by each module are then weighted and fused to obtain the ultimate change detection result (w2, w3, w4, w5 represent the fusion weights of the respective modules). First, we concatenate the extracted feature pair with its difference as input and strengthen the change information via the designed attention module. Then, we carry out convolution, BatchNorm regularization, ReLU activation, and Dropout operations to aggregate the features. Ultimately, a deconvolution operation is performed to expand the size of the feature maps. By concatenating the results with the feature maps of the corresponding size in the encoding part, they are passed to the next decoding step to realize the ultimate reconstruction of the feature map.

The implementation steps of the ADS-Net architecture are as follows (a code sketch follows the list):

(1) The 256 × 256 × 3 bi-temporal images are fed into the T1 and T2 networks respectively. After five convolution and down-sampling modules, a concatenation operation is performed on the two 16 × 16 × 256 feature modules, and a decoding network follows. To reduce the number of network parameters, we share the weights of the T1 and T2 networks.

(2) Correspondingly, the conv5 bi-temporal features are subtracted to obtain the difference features. The T1 and T2 features and their difference features are merged, and the proposed attention module is inserted to aggregate the features. An up-sampling operation is then performed to enlarge the size of the feature map. As shown in Fig. 1, before each level of up-sampling, the features of the previous level are merged with the corresponding T1 and T2 encoding features and then passed to the up-sampling layer through the attention module. Finally, a change map is generated consistent with the original image size.

(3) The encoding layer with five convolution modules may not be able to extract all types of change features; hence, we start from the conv2 module and add four decoding networks of different depths to form a deep supervision strategy. Each supervisory network has a structure similar to the main network introduced in (2), except for the different network depths. To form a final change map with higher accuracy, their predicted change maps are fused with different weights.
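The following minimal sketch illustrates steps (1) and (2): a weight-sharing encoder applied to both inputs, followed by one difference-concatenation and up-sampling step. It is a simplified illustration under assumed channel widths and depth, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Weight-sharing dual-stream encoder: the same module processes T1 and
    T2, so bi-temporal features are mapped into a common space (step (1))."""
    def __init__(self, in_ch=3):
        super().__init__()
        def block(cin, cout, convs):
            layers = []
            for i in range(convs):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.conv1 = block(in_ch, 32, 2)
        self.conv2 = block(32, 64, 2)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []
        for conv in (self.conv1, self.conv2):
            x = conv(x)
            feats.append(x)   # keep per-level features for skip connections
            x = self.pool(x)  # halve the spatial size after each module
        return feats

encoder = SharedEncoder()
t1, t2 = torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256)
f1, f2 = encoder(t1), encoder(t2)  # shared weights: the same module twice

# One decoding step (step (2)): fuse the deepest pair with its difference,
# then upsample toward the original resolution via a deconvolution.
merged = torch.cat([f1[-1], f2[-1], torch.abs(f1[-1] - f2[-1])], dim=1)
up = nn.ConvTranspose2d(merged.shape[1], 64, 2, stride=2)(merged)
```

In the full network, the attention module of Section 2.2 would be applied to `merged` before up-sampling, and this step would repeat once per encoding level.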


2.2. Adaptive attention fusion module

In the field of computer vision, attention lets a machine, like a human, selectively focus on the important part of the visible information while ignoring other irrelevant information; this mechanism is termed the attention mechanism (Li et al., 2019). In the remote sensing change detection task, we mostly care about the areas where the bi-temporal images have changed; therefore, enhancing the characteristics of the changed areas helps to improve detection efficiency. In recent years, numerous researchers have added different types of attention modules to change detection networks, comprising spatial attention and channel attention (Zhang et al., 2020; Chen and Shi, 2020; Zhang et al., 2021). Spatial attention has the role of increasing the distance between the changed and unchanged pixels. The role of channel attention is to amplify the channels associated with changes in ground features and suppress irrelevant channels.

In the process of change detection, not all high-dimensional features are helpful for difference discrimination (Saha et al., 2019; Jie and Samuel, 2019), and irrelevant features may make change detection more difficult (Woo et al., 2018). Thus, we propose an adaptive attention fusion module to enhance useful information and suppress irrelevant information. As shown in Fig. 2, the attention fusion module proposed in this paper is a dual-stream attention mechanism. We put the incoming feature maps through channel attention and spatial attention operations respectively and integrate the results in an element-wise way to obtain the improved feature maps, which can better reconstruct the subsequent change map. Remarkably, the convolution kernel sizes of the spatial and channel attention are determined adaptively from the feature map.

Channel attention module (CAM): First, average pooling is performed on each channel of the input concatenated feature map (H × W × C); that is, the elements of each channel are averaged to create a C × 1 × 1 vector (C is the number of channels). Then, a one-dimensional convolution (conv1d) with kernel size k_1 is performed on this vector. The result of the convolution is normalized to a weight coefficient with a value between (0, 1) through the sigmoid activation function. Ultimately, the obtained channel attention weight coefficient is element-wise multiplied with each spatial element of the original feature map, and the globally enhanced channel attention feature map M_C is obtained, for which the calculation expression is as follows:

M_C = σ(conv1d(Avgpool(F))) ⊗ F   (1)

Where F is the merged feature map input to the CAM module, ⊗ represents element-wise multiplication, and σ(·) is the sigmoid activation function, whose expression is as follows:

σ(z) = 1 / (1 + e^(−z))   (2)

For the one-dimensional convolution kernel size k_1, we adopt the strategy of adaptively determining its value based on the number of channels C. The functional relationship between C and k_1 is constructed as follows:

C = f(k_1)   (3)

In general, a linear mapping is the simplest mapping; however, the expressive power of linear relationships is limited. Normally, the channel number C is an exponential power of 2 (Wang et al., 2020); thus, formula (3) can be expressed as:

C = f(k_1) = 2^(a·k_1 − b)   (4)

Hence, given the specific value of the channel number C, the size of the convolution kernel k_1 can be expressed as:

k_1 = |(log_2(C) + b) / a|_odd   (5)

Where |M|_odd denotes the odd number closest to M. In the present paper, we conducted several experiments and found that the CAM has the best performance when a and b are set to 2 and 1, respectively. In high-dimensional features with more channels, the larger convolution kernel has a wider receptive field, and in low-level features with fewer channels, the smaller convolution kernel concentrates on a more compact receptive field. Thus, adaptive determination of the convolution kernel size based on the number of channels is achieved, which serves to construct the channel attention module well.

Spatial attention module (SAM): First, max pooling is performed at each pixel position of the input concatenated feature map (H × W × C); in other words, the element at each position takes the maximum value over its channels to create a 1 × W × H map (W and H are respectively the numbers of feature map columns and rows). Then, a two-dimensional convolution (conv2d) with kernel size k_2 is applied, and the convolution result is normalized to a weight coefficient with a value between (0, 1) through the sigmoid activation function. Ultimately, the obtained spatial attention weight coefficient is element-wise multiplied with each channel of the original feature map to obtain the locally enhanced spatial attention feature map M_S, which is calculated as follows:

M_S = σ(conv2d(Maxpool(F))) ⊗ F   (6)

Where F represents the combined feature map input to the SAM module and σ(·) denotes the sigmoid activation function of Formula (2). The two-dimensional convolution kernel size k_2 is determined similarly to k_1: we adopt the strategy of adaptively determining its value based on the size of the feature map (W and H). Since the bi-temporal images and their feature maps used in the experimental process are all square, i.e., the numbers of rows and columns are equal, W = H. We build the functional relationship between W and k_2 as follows:

W = g(k_2)   (7)

The size of the input bi-temporal images in this paper is 256 × 256, which is halved after each pooling; hence, the size of the feature map at each stage is always an exponential power of 2, and k_2 is determined in the same way as k_1:

k_2 = |(log_2(W) + b) / a|_odd   (8)

In this paper, through numerous experiments, we found that the SAM has the best performance when a and b are set to 2 and 3, respectively. A relatively large convolution kernel is utilized on low-dimensional feature maps with larger sizes, and a relatively small convolution kernel is used on high-dimensional feature maps with smaller sizes to emphasize the more prominent change features. The spatial attention module is thus constructed with the convolution kernel size adaptively determined by the size of the feature map.

Ultimately, the enhanced feature maps output by the CAM and SAM are added element-wise to obtain the final attention module output, as follows:

M = M_C ⊕ M_S   (9)

Since average pooling is utilized in the CAM and max pooling is used in the SAM to reduce dimensionality, the change information obtained by the two modules is enhanced globally and locally respectively; thus, the two modules are combined to achieve better detection results.
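A compact sketch of this fusion module, implementing Eqs. (1)-(9) with the adaptive kernel sizes of Eqs. (5) and (8) (a = 2, b = 1 for the CAM; a = 2, b = 3 for the SAM). The padding choices and the tie-breaking of the nearest-odd rounding are our assumptions:

```python
import math
import torch
import torch.nn as nn

def odd_kernel(n, a, b):
    """Nearest odd integer to (log2(n) + b) / a, as in Eqs. (5) and (8);
    ties are rounded upward here, which is our assumption."""
    k = int(round((math.log2(n) + b) / a))
    return k if k % 2 == 1 else k + 1

class AdaptiveAttentionFusion(nn.Module):
    """Sketch of the dual-stream attention fusion of Section 2.2: channel
    attention (Eq. (1)) and spatial attention (Eq. (6)) computed on the same
    input and merged element-wise (Eq. (9))."""

    def __init__(self, channels, width):
        super().__init__()
        k1 = odd_kernel(channels, a=2, b=1)  # CAM kernel from channel count C
        k2 = odd_kernel(width, a=2, b=3)     # SAM kernel from map width W (= H)
        self.conv1d = nn.Conv1d(1, 1, k1, padding=k1 // 2, bias=False)
        self.conv2d = nn.Conv2d(1, 1, k2, padding=k2 // 2, bias=False)

    def forward(self, f):
        b, c, h, w = f.shape
        # CAM: average-pool each channel, 1-D convolve, squash to (0, 1).
        ca = f.mean(dim=(2, 3)).view(b, 1, c)
        ca = torch.sigmoid(self.conv1d(ca)).view(b, c, 1, 1)
        m_c = ca * f                          # globally enhanced feature map
        # SAM: max over channels, 2-D convolve, squash to (0, 1).
        sa = f.max(dim=1, keepdim=True).values
        sa = torch.sigmoid(self.conv2d(sa))
        m_s = sa * f                          # locally enhanced feature map
        return m_c + m_s                      # element-wise fusion, Eq. (9)

attn = AdaptiveAttentionFusion(channels=256, width=16)
out = attn(torch.randn(2, 256, 16, 16))  # output shape equals input shape
```

For C = 256, Eq. (5) gives (8 + 1) / 2 = 4.5, whose nearest odd value is 5, so a 5-tap 1-D kernel is used on the channel vector in this configuration.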
2.3. Deep supervision based on weighted fusion

Deep supervision refers to the method of adding auxiliary classifiers to the middle hidden layers of a deep neural network as network branches to supervise the backbone network. It is utilized to address the vanishing gradient and slow convergence problems in deep neural network training (Lee, 2015). The core idea of deep supervision is to provide integrated direct supervision for the hidden layers, instead of only providing supervision at the output layer, and to propagate this supervision back to the earlier layers. By introducing an accompanying objective function for each hidden layer to realize this integrated direct hidden-layer supervision, these accompanying objective functions can be considered additional soft constraints in the learning procedure.


We adopt a deep supervision strategy based on weighted fusion. As shown in Fig. 1, four networks of different depths are constructed to conduct change detection together in the decoding stage. The change map of the main network is made from the features of the conv5 convolution module after decoding. The remaining three branch networks utilize the features obtained by the conv2, conv3, and conv4 convolution modules, respectively, to produce detection results. Low-dimensional image features are normally extracted by the shallower convolutional layers, while the deeper convolutional layers extract high-dimensional features. Thus, to make full use of the features of different dimensions, we fuse each branch's output with the reconstructed change map of the main network through various weights to form the ultimate change detection result.

Each supervision network must be fused with a corresponding weight, and the choice of weight is proportional to the network's performance in the change detection task. Specifically, in each training epoch, the F1-Score of each network's output is calculated, and the proportion of a given network's F1-Score to the sum of all networks' F1-Scores is taken as the weight of that network. The F1-Score is an indicator for measuring the accuracy of a binary classification model; it considers both the precision and the recall of the classification model, as a harmonic average of the two, and is often used in change detection tasks to assess the consistency of the predicted change map with the actual change map. The calculation formula of the F1-Score is introduced along with the other evaluation metrics in Section 3. The weight W_i of the i-th supervision network is calculated as follows:

W_i = F_i / Σ_{j=2}^{5} F_j   (10)

Where F_i represents the F1 score of the i-th supervised network's detection results in the current training epoch, and Σ_{j=2}^{5} F_j denotes the sum of the F1 scores of all four networks' detection results in the current training epoch.

Thus, after each training epoch, the respective weights can be determined based on each branch network's performance, and the weighted fusion is carried out. The proposed deep supervision strategy based on weighted fusion enhances deep network fusion by overcoming the vanishing gradient problem; moreover, the network can learn more meaningful features from low to high levels, effectively improving the effectiveness and performance of the change detection network.
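A minimal sketch of this F1-weighted fusion (Eq. (10)); the branch F1 scores and map shapes here are illustrative:

```python
import torch

def fuse_branch_predictions(pred_maps, f1_scores):
    """Weighted fusion of the branch change maps (Eq. (10)): each branch's
    weight is its F1 score from the current epoch divided by the sum of all
    branch F1 scores. `pred_maps` holds per-branch probability maps."""
    weights = torch.tensor(f1_scores)
    weights = weights / weights.sum()          # W_i = F_i / sum_j F_j
    return sum(w * p for w, p in zip(weights, pred_maps))

# Example: predictions of De-conv2 ... De-conv5 with their epoch F1 scores.
preds = [torch.rand(1, 1, 256, 256) for _ in range(4)]
change_map = fuse_branch_predictions(preds, [0.82, 0.85, 0.87, 0.90])
```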
2.4. Loss function

In satellite image remote sensing change detection tasks, there is mostly a serious imbalance in the number of changed and unchanged pixels. Indeed, the unchanged pixels far outnumber the changed pixels, leading to a serious category imbalance problem during the training of the deep neural network; hence, it is difficult to train the network from a local optimum to the global optimum. In recent years, this kind of problem has attracted a great deal of attention. For example, it was proposed (Lin et al., 2017; Cui, 2019; Zhang et al., 2020) to construct new loss functions that focus further on the under-represented sample categories, thereby enhancing classification accuracy.

To effectively alleviate the poor network training caused by sample imbalance, we apply the Focal Loss function (Lin et al., 2017) to the binary classification task of remote sensing change detection. The Focal Loss function reduces the weight of the numerous simple negative samples in training, while further considering the mining of difficult samples. The Focal Loss function is a modification of the cross-entropy loss function. The formula of the binary classification cross-entropy loss is as follows:

L = −y·log_2(y_p) − (1 − y)·log_2(1 − y_p) = { −log_2(y_p), y = 1; −log_2(1 − y_p), y = 0 }   (11)

Where y represents the sample label and y_p denotes the output value after the activation function, with a value between (0, 1). For positive samples, the higher the output probability, the smaller the loss; for negative samples, the smaller the output probability, the smaller the loss. Therefore, the loss decreases slowly in the iterative process over numerous simple samples and may not be optimized well. Focal Loss adds two factors, γ and α, to the original formula to make it concentrate more on difficult and misclassified samples, as follows:

FL = { −α·(1 − y_p)^γ·log_2(y_p), y = 1; −(1 − α)·y_p^γ·log_2(1 − y_p), y = 0 }   (12)

where γ > 0 decreases the loss of easy-to-classify samples and adjusts the rate at which simple samples are down-weighted, and α represents the balance factor, with a value between (0, 1), to balance the uneven ratio of positive and negative samples. During the experiments, we found that with a too large value of γ, the loss value becomes very small throughout the training process, which is not conducive to model convergence and increases the computational cost; hence, the value of γ is set to 2. The value of α is determined based on the ratio of positive and negative samples. We mark the changed pixels as positive samples and the unchanged pixels as negative samples. In the present paper, using the two datasets LEVIR-CD (Chen and Shi, 2020) and the Season-varying Change Detection Dataset (SVCD) (Lebedev et al., 2018), we crop each bi-temporal image of the training set to 256 × 256 as input to decrease the pressure on computer GPU memory. We discard the unchanged bi-temporal image pairs after cutting and determine the proportion of positive sample pixels in the total annotated images of the final training set. The average positive sample proportion of each bi-temporal image in the two datasets is 0.1 and 0.05 respectively; therefore, the value of α for the LEVIR-CD dataset is 0.1, and the value of α for the SVCD dataset is 0.05.

We adopt a deep supervision strategy, for which the loss value is calculated by both the main network and each branch network, and the four loss values are added to give the loss value of the entire network. According to Formula (13), L_all is the overall loss of the network, FL_5 represents the loss value of the main network, and FL_2, FL_3 and FL_4 are the loss values of the three branch networks, respectively.

L_all = FL_2 + FL_3 + FL_4 + FL_5   (13)
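A sketch of the focal loss of Eq. (12) and the combined deep supervision loss of Eq. (13), keeping the paper's base-2 logarithm; the numeric clamping is our addition for stability:

```python
import torch

def binary_focal_loss(y_pred, y_true, alpha=0.1, gamma=2.0, eps=1e-7):
    """Sketch of Eq. (12): alpha = 0.1 for LEVIR-CD and 0.05 for SVCD,
    gamma = 2. `y_pred` holds sigmoid outputs in (0, 1)."""
    y_pred = y_pred.clamp(eps, 1.0 - eps)
    pos = -alpha * (1 - y_pred) ** gamma * torch.log2(y_pred)       # y = 1
    neg = -(1 - alpha) * y_pred ** gamma * torch.log2(1 - y_pred)   # y = 0
    return torch.where(y_true > 0.5, pos, neg).mean()

# Deep supervision total loss (Eq. (13)): sum the focal losses of the main
# network (FL5) and the three branch networks (FL2, FL3, FL4).
preds = [torch.rand(1, 1, 256, 256) for _ in range(4)]
label = (torch.rand(1, 1, 256, 256) > 0.9).float()
loss_all = sum(binary_focal_loss(p, label) for p in preds)
```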

3. Experiments and analysis

Here, we verify the effectiveness of the proposed method through experiments. First, the two remote sensing change detection datasets used in our experiments are introduced, and the image preprocessing method is described. Secondly, the relevant evaluation metrics for the quantitative analysis of change detection are presented, and the relevant parameter settings of the experiments are described. Then, several excellent change detection methods are introduced for comparison. Ultimately, we conduct a comprehensive analysis and comparison of the experimental results, and the effectiveness of our attention mechanism and deep supervision strategy is verified through ablation experiments.

3.1. Datasets and preprocessing

To fully verify the proposed ADS-Net change detection algorithm, two different types of datasets are utilized in the experiments. The first dataset is LEVIR-CD, proposed by (Chen and Shi, 2020), which collects 637 pairs of Google Earth images with a resolution of 0.5 m and a size of 1024 × 1024 pixels through the Google Earth API. These images were taken in 20 various areas in several cities in Texas, USA, from 2002 to 2018. The dataset contains numerous changes resulting from seasonal and lighting variation, which helps to train a more effective change detection model that focuses on the changes of interest while reducing the impact of other irrelevant changes. In the present paper, the LEVIR-CD dataset is cropped into image pairs with a size of 256 × 256 pixels to alleviate the pressure on computer GPU memory. The unchanged image pairs are discarded, yielding 1390 pairs of bi-temporal images, of which 1000 pairs are utilized for model training, 200 pairs are used to validate the training process, and 190 pairs are used for testing after training. The LEVIR-CD dataset before and after cropping is shown in Fig. 3.

Fig. 3. The LEVIR-CD dataset before and after cropping: (a) and (b) the bi-temporal image pair; (c) the Ground Truth of changed pixels; the top row is the original image, and the bottom row is the cropped image.
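A sketch of this cropping step under the stated procedure (non-overlapping 256 × 256 tiles, tiles without changed pixels discarded); the array layout and names are illustrative:

```python
import numpy as np

def crop_pairs(img1, img2, label, tile=256):
    """Split a registered bi-temporal pair and its change label into
    non-overlapping tiles, discarding tiles with no changed pixel."""
    h, w = label.shape[:2]
    kept = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            lab = label[y:y + tile, x:x + tile]
            if lab.any():  # keep only tiles with at least one changed pixel
                kept.append((img1[y:y + tile, x:x + tile],
                             img2[y:y + tile, x:x + tile], lab))
    return kept

# A 1024 x 1024 LEVIR-CD pair yields up to 16 tiles before filtering.
a = np.zeros((1024, 1024, 3)); b = np.zeros((1024, 1024, 3))
gt = np.zeros((1024, 1024), dtype=np.uint8); gt[100:200, 300:400] = 1
tiles = crop_pairs(a, b, gt)  # only tiles overlapping the changed block remain
```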

The second dataset is the Season-varying Change Detection Dataset (SVCD) proposed by (Lebedev et al., 2018). It uses 7 pairs of remote sensing images of the same areas changing with the seasons, obtained from Google Earth (DigitalGlobe), with a size of 4725 × 2700 pixels. The images have spatial resolutions of 3–100 cm and contain objects of various sizes (from cars to large building structures) and seasonal alterations of natural objects (from single trees to wide forest areas). The whole images are rotated randomly and cropped into 256 × 256 segments to create our dataset. To shorten the training time and ease the pressure on the GPU, we randomly selected 5847 images as the training set, as well as 1725 and 959 images as the validation set and test set, respectively. Some bi-temporal images of the SVCD dataset are presented in Fig. 4.

Fig. 4. Some scenes of the SVCD dataset; the top row shows the T1-time images, the middle row shows the T2-time images, and the bottom row shows the Ground Truth.

3.2. Evaluation metric and parameter setting

To compare the difference between the labeled map and the predicted change map and to assess the effectiveness of the proposed technique, we utilize four evaluation metrics: Precision (Pr), Recall (Re), F1-Score (F1), and the Kappa coefficient (Kappa). In the change detection task, a higher Precision denotes higher accuracy of the detected changed pixels, and a higher Recall represents a greater ability of the model to find more changed pixels. The F1-Score is a metric for measuring the accuracy of a binary classification model; it considers the precision and recall of the classification model at the same time, as their harmonic average. The Kappa coefficient is a metric utilized for consistency testing. For change detection problems, the so-called consistency is whether the model prediction results are consistent with the actual classification results; thus, we use it for measuring the change detection effect. The overall performance of the model is reflected by the F1-Score and Kappa coefficient, whose values are between (0, 1); larger values reflect better model performance. The four evaluation metrics are computed as follows:

Pr = TP / (TP + FP)   (14)

Re = TP / (TP + FN)   (15)

F1 = 2·Pr·Re / (Pr + Re)   (16)

Kappa = (OA − P) / (1 − P)   (17)

TP is the number of correctly detected changed pixels, TN represents the number of correctly detected unchanged pixels, FP is the number of false alarm pixels, and FN is the number of missed changed pixels. In the Kappa calculation formula, OA denotes the overall accuracy, and P represents the proportion of expected agreement between the ground truth and the predictions with the given class distributions (El Amin et al., 2017). The expressions of OA and P are as follows:

OA = (TP + TN) / N   (18)

P = [(TP + FP)·(TP + FN) + (FN + TN)·(FP + TN)] / N^2   (19)

where N is the total number of pixels.
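For reference, a direct implementation of Eqs. (14)-(19) from the confusion counts (degenerate zero-division cases are not handled in this sketch):

```python
import numpy as np

def change_metrics(pred, gt):
    """Pr, Re, F1 and Kappa from binary prediction and ground truth arrays
    of equal shape with values in {0, 1}, following Eqs. (14)-(19)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    n = tp + tn + fp + fn
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)
    oa = (tp + tn) / n                                              # Eq. (18)
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2   # Eq. (19)
    kappa = (oa - pe) / (1 - pe)                                    # Eq. (17)
    return pr, re, f1, kappa

pred = np.random.randint(0, 2, (256, 256))
gt = np.random.randint(0, 2, (256, 256))
print(change_metrics(pred, gt))
```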

Fig. 6. The ADS-Net change detection results on some scenes of the LEVIR-CD dataset: (a) and (b) the T1 and T2 bi-temporal remote sensing images; (c) the Ground Truth map of actual changes; and (d) the change map predicted by ADS-Net.

In the ADS-Net model, all convolution kernel sizes in the encoding stage are 3 × 3. In the decoding stage, the convolution kernel sizes of the attention modules are adaptively determined based on the input feature map size and the number of channels, and all other convolution kernel sizes are set to 3 × 3. The training period is set to 200 epochs, and the batch size is 16. The learning rate uses a multi-step adjustment strategy (MultiStepLR): the initial learning rate is 0.001, and the learning rate decays to half of its value after every 80 epochs.

The proposed method is implemented in PyTorch with Python 3.6, powered by a workstation with an Intel Core i9-9820X CPU (3.3 GHz, 64 GB RAM) and a single NVIDIA GeForce GTX TITAN X. We utilize the CUDA computing architecture and the cuDNN library for accelerated training.
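The training configuration above can be expressed as the following sketch; the optimizer type is not stated in the paper, so Adam here is an assumption:

```python
import torch

# Initial learning rate 0.001, halved every 80 epochs via MultiStepLR,
# batch size 16, 200 training epochs, as described above.
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # stand-in for ADS-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 160], gamma=0.5)

for epoch in range(200):
    # ... one training pass over the dataset would go here ...
    optimizer.step()   # placeholder step so the example runs
    scheduler.step()   # halves the learning rate at epochs 80 and 160
```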
To observe the model training situation more conveniently, we plot the curves of various indicators on the validation set against the number of epochs during the training process, as shown in Fig. 5. It is observed that on the LEVIR-CD dataset the training speed is faster: when the training reaches 40 epochs, the network has stabilized. On the SVCD dataset, the training speed is slower owing to the more complex types of changes and the much larger number of training samples; hence, the network gradually converges at around 160 epochs.

Fig. 5. The curves of various metrics on the two datasets over the training epochs: (a) the performance on the LEVIR-CD dataset, (b) the performance on the SVCD dataset.

3.3. Comparison methods

To assess the performance of the proposed model, we introduce the following four change detection benchmark methods and compare their performance on the two datasets:

1) FC-Siam-diff (Daudt, 2018): A fully convolutional change detection network based on a Siamese structure, sharing the weights of the two streams in the encoding stage. The differences of the bi-temporal features are fed to the decoding stage as skip connections to reconstruct the change map.

2) FC-Siam-conc (Daudt, 2018): Proposed together with FC-Siam-diff, it is likewise a fully convolutional change detection network based on a Siamese structure. The difference is that the bi-temporal features are concatenated and fed to the decoding stage as skip connections to reconstruct the change map.

3) STA-Net (PAM) (Chen and Shi, 2020): A spatial-temporal attention network based on a Siamese structure, using a change detection self-attention mechanism to model spatial-temporal relationships. The self-attention module is integrated into the feature extraction process to calculate attention weights between any two pixels at various times and positions and uses them to generate more distinguishing features.

4) DSIFN (Zhang et al., 2020): A deeply supervised image fusion network for change detection in high-resolution bi-temporal remote sensing images. The multi-level deep features of the original images and the image difference features are fused via attention modules, and change detection is conducted via a deeply supervised difference discrimination network (DDN) based on full convolution.

Fig. 7. The ADS-Net change detection results on some scenes of the SVCD dataset: (a), (b) the T1 and T2 bi-temporal remote sensing images; (c) the Ground Truth map of actual changes; and (d) the change map predicted by ADS-Net.
3.4. Comparison of experimental results

The change detection results of the proposed ADS-Net method on several scenes of the LEVIR-CD and SVCD datasets are represented in Fig. 6 and Fig. 7. The black pixels represent the unchanged area, and the white pixels denote the changed area. It is observed that both large scenes and small objects in the changing areas of the bi-temporal images can be well detected, and the shape and boundary sharpness of the changing areas are highly consistent with the Ground Truth. Moreover, on the LEVIR-CD dataset (Fig. 6), the influence of noise caused by irrelevant changes such as illumination and seasons can be ignored. On the SVCD dataset (Fig. 7), the changes in land types are well detected, indicating that ADS-Net possesses better change detection performance and robustness.

To further verify the effectiveness of ADS-Net, we trained and tested the above four methods and the ADS-Net method on the two datasets, and then evaluated the effect of the proposed method through comparative qualitative and quantitative analysis.

1) LEVIR-CD dataset:

We chose four typical scenes in the LEVIR-CD dataset and compared the four benchmark methods with ADS-Net. These scenes mainly involve alterations in buildings, while the image texture changes resulting from various seasons and lighting serve as interference factors. The performances of the different methods are represented in Figs. 8-11; the red boxes show local details to facilitate observation and comparison.

Fig. 8. The first scene of the LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

Fig. 9. The second scene of the LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

By comparing the above experimental results, it is found that ADS-Net is very sensitive to alterations in buildings; moreover, it can capture some subtle changes (Figs. 9, 11(h)). Compared to the other methods, it significantly reduces the number of falsely detected pixels (Fig. 8(h)). It can overcome, to the greatest extent, the influence on the model of irrelevant illumination and climate differences between the two times, and accurately detect the changes of objects of interest (Fig. 10(h)).

To further assess the performance of the above methods, the quantitative results of the various evaluation indicators are represented in Table 2. It is found that the proposed ADS-Net achieved optimal values in the Recall, F1-Score, and Kappa indicators. The Kappa metric reached 88.29%, an increase of 1.53% and 1.08% respectively over STA-Net and DSIFN, two excellent remote sensing change detection methods of the past two years. ADS-Net achieved 89.8% in F1-Score, which is 2.48% and 1.53% higher than STA-Net and DSIFN, respectively. Moreover, comparing the two methods FC-Siam-diff and FC-Siam-conc, it is found that concatenation-based feature fusion has a better effect than direct subtraction (Figs. 8-11(d)(e)). The reason is that direct subtraction loses relevant information in the feature maps, while the concatenation operation may increase redundant information but will not reduce the key information related to the change.

Fig. 10. The third scene of the LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

Fig. 11. The fourth scene of the LEVIR-CD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

Compared to FC-Siam-diff, FC-Siam-conc improved by 2.34% and 2.62% on the two indicators of F1-Score and Kappa, respectively. Therefore, concatenated skip connections are also utilized in the decoding stage of our proposed ADS-Net, obtaining a very good result. Compared with FC-Siam-diff and FC-Siam-conc, STA-Net adds a spatial-temporal attention module that emphasizes the characteristics of the changing areas; hence, its performance is greatly improved. The DSIFN method connects channel and spatial attention modules in series and then adopts a deep supervision strategy, yielding a better performance than STA-Net, with increases of 0.95% and 0.45% in F1-Score and Kappa, respectively. ADS-Net adaptively selects the size of the convolution kernels in the attention module and concatenates the channel and spatial attention to increase the weight of the change features; hence, its recall rate is improved compared with DSIFN. Both methods also use a deep supervision strategy: DSIFN introduces losses in the middle layers for backpropagation and weight updates to further enhance the change detection network, whereas ADS-Net fuses the branch network results with various weights at the back end of the network to yield the final result (Figs. 8-11(g)(h)). The comprehensive comparison indicates that ADS-Net has better detection performance.

Table 2
The quantitative evaluation results of different methods on the LEVIR-CD dataset.

Method       | Precision | Recall | F1-Score | Kappa
FC-Siam-diff | 0.8314    | 0.8464 | 0.8281   | 0.8101
FC-Siam-conc | 0.8649    | 0.8593 | 0.8515   | 0.8363
STA-Net      | 0.8382    | 0.9108 | 0.8732   | 0.8676
DSIFN        | 0.9120    | 0.8798 | 0.8827   | 0.8721
ADS-Net      | 0.8967    | 0.9136 | 0.8980   | 0.8829

2) SVCD dataset:

We also chose four typical scenes in the SVCD dataset, where the changes are more complex, including changes in cars and houses and seasonal changes in vegetation. The four benchmark methods were compared with ADS-Net, and the performance of the different methods is shown in Figs. 12-15.

By comparing the above experimental results, it is found that ADS-Net has a strong ability to detect changes in complex ground features. In the red box of Fig. 12(h), the detection result of ADS-Net is closest to the Ground Truth map. The red box in Fig. 13 marks a sample that the models falsely detect; in contrast, only ADS-Net has no false detection in this area. ADS-Net shows its sensitivity to changes in small objects such as the cars in Fig. 14, and the model has a high degree of refinement. Moreover, ADS-Net possesses a more prominent detection effect on ground changes, as shown in Fig. 15(h), clearly distinguishing the boundary between land and grass.

Table 3 represents the quantitative results of the various evaluation indicators of the different methods. The proposed ADS-Net achieved the best performance in three metrics: Precision, F1-Score, and Kappa. The F1-Score and Kappa metrics reached 82.72% and 81.48% respectively; although lower than the detection results on the LEVIR-CD dataset, they are significantly improved compared to the other benchmark methods.

Based on the detection results on the LEVIR-CD and SVCD datasets, ADS-Net has a good recall rate for small areas of change and is robust to color changes resulting from different seasons, climates, and illumination. It can detect some complicated changes well and significantly reduce the false detection rate caused by noise interference. To compare the detection effects of ADS-Net and the other benchmark methods more intuitively, Table 4 represents the percentage increase of the metrics of ADS-Net over each benchmark network on the two datasets.
Table 3 represents the quantitative results of various evaluation mation, and accurately classify possible change pixels, thereby
enhancing the overall detection performance of the model.

Fig. 12. The first scene of the SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

Fig. 13. The second scene of the SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

Fig. 14. The third scene of the SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

3.5. Ablation experiments

Compared to similar methods, the ADS-Net remote sensing change detection method proposed in this paper has better performance, which is mainly due to the attention fusion module and the deep supervision network. To further verify the effect of this work, we performed ablation experiments on the two datasets.

(1) Effectiveness of the attention module

In research on remote sensing change detection, several methods have used the attention mechanism to highlight change information and achieve better detection performance. Therefore, to verify the effectiveness of the attention module proposed in this paper, we ran both the method without the attention module and the method with attention fusion on the LEVIR-CD and SVCD datasets. Fig. 16 represents the comparison of the four evaluation indicators for the two variants.

Through comparison, it is found that after adding the proposed channel and spatial attention fusion module, every metric of the model is improved. The recall improved the most, with increases of 2.69% and 3.26% on the two datasets, respectively. This indicates that the proposed attention module can better capture subtle change information and accurately classify possible change pixels, thereby enhancing the overall detection performance of the model.

(2) Role of the deep supervision network

The proposed deep supervision network integrates the detection results obtained by four networks of different depths and the feature information of different dimensions from low to high levels. To verify the function of the deep supervision network, we tested each branch network. De-conv2, De-conv3, De-conv4, and De-conv5 correspond to the decoding networks of conv2, conv3, conv4, and conv5 respectively, of which De-conv5 is the main network. Fig. 17 represents the F1-Scores of the test results of the main network, each branch network, and the deep supervision network on the LEVIR-CD and SVCD datasets. The deep supervision network is a weighted fusion of the branch network outputs, where the weights are determined based on the F1-Scores of the individual detection results of each branch network (Equation (10)). It is found that as the number of network layers increases, the change detection results become more and more accurate, and the performance of the deep supervision network improves compared to the main network. This indicates that the deep supervision strategy adopted in this paper can improve the change detection effect.

Fig. 15. The fourth scene of the SVCD dataset and change detection results: (a) T1 image, (b) T2 image, (c) Ground Truth map, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) STA-Net (PAM), (g) DSIFN, (h) ADS-Net.

Table 3
The quantitative evaluation results of different methods on the SVCD dataset.

Method       | Precision | Recall | F1-Score | Kappa
FC-Siam-diff | 0.8104    | 0.7093 | 0.7394   | 0.7120
FC-Siam-conc | 0.8377    | 0.7251 | 0.7523   | 0.7375
STA-Net      | 0.8625    | 0.7596 | 0.7871   | 0.7739
DSIFN        | 0.8760    | 0.7992 | 0.8189   | 0.8006
ADS-Net      | 0.8979    | 0.7958 | 0.8272   | 0.8148

Table 4
The percentage improvement of ADS-Net over the benchmark methods.

Dataset  | Metric    | FC-Siam-diff | FC-Siam-conc | STA-Net | DSIFN
LEVIR-CD | Precision | +6.53%       | +3.18%       | +5.85%  | −1.53%
LEVIR-CD | Recall    | +6.72%       | +5.43%       | +0.28%  | +3.38%
LEVIR-CD | F1-Score  | +6.99%       | +4.65%       | +2.48%  | +1.53%
LEVIR-CD | Kappa     | +7.28%       | +4.66%       | +1.53%  | +1.08%
SVCD     | Precision | +8.75%       | +6.02%       | +3.54%  | +2.19%
SVCD     | Recall    | +8.65%       | +7.07%       | +3.62%  | −0.34%
SVCD     | F1-Score  | +8.78%       | +7.49%       | +4.01%  | +0.82%
SVCD     | Kappa     | +10.28%      | +7.73%       | +4.09%  | +1.42%

4. Discussion

In the remote sensing change detection task, the changes of ground types are of concern, and the accuracy and meticulousness of the detected changing areas reflect the practical value of a change detection method. This paper mainly proposes two improvements based on the shortcomings of existing methods. (1) A new adaptive spatial and channel fusion attention mechanism is designed to enhance the changing features in the spatial domain and the channel domain at the same time, with the size of the convolution kernel determined adaptively by the size of the feature map. This module enhances the sensitivity of the network to changing areas and also improves the speed of network training; hence, the changed information is mainly attended to, while the unchanged features are ignored. (2) A layer-by-layer concatenation and deep supervision strategy is designed based on the U-Net architecture. The five convolution modules in the network use skip connections to fuse the bi-temporal features, and the change maps generated by the different up-sampling layers are supervised in the decoding stage. This strategy implements supervised learning of features of different depths, thus improving the accuracy for changes in different types of objects (cars, buildings, roads, etc.).

In the process of network training (Fig. 5), we found that the training of ADS-Net on the LEVIR-CD dataset is very fast, converging in about 30 epochs. However, the training of the network on the SVCD dataset is slower, converging at 160 epochs. This is directly related to the complexity of the dataset: the LEVIR-CD dataset only marks changes in buildings, while the SVCD dataset marks changes in multiple categories such as buildings, cars, and roads, which is more complicated. Therefore, ADS-Net requires a longer time to train on the SVCD dataset, and the accuracy of change detection there is relatively lower. ADS-Net takes less than 0.06 s to predict a 256 × 256 image, so it has a low computational burden and strong practicability.

Although our model achieved a good change detection effect, it still has some limitations. First, to obtain more refined change detection results for complex ground objects, a straightforward approach would be to add convolution modules and deepen the network structure; however, this would make the network parameters huge and the training time longer. Therefore, we are considering weakly supervised learning and pruning strategies for ADS-Net to reduce the network parameters and improve detection efficiency. Furthermore, we used Focal Loss as the loss function of ADS-Net. Although it can alleviate the problem of sample imbalance to a certain extent, it is not necessarily completely suitable for every change detection task. Therefore, we will continue to study loss functions for change detection, enhance the feedback function of the network, and further improve the accuracy of change detection.
same time. The size of the convolution kernel is determined by the size
of feature map adaptively. This module enhances the sensitivity of the


Fig. 16. Verification of the effectiveness of the attention module: (a) results on the LEVIR-CD dataset, (b) results on the SVCD dataset.

Fig. 17. The F1-Score of the deep supervision network and each branch network on the two datasets.

5. Conclusions
In this paper, a deep supervision network (ADS-Net) based on the attention mechanism is proposed for the task of change detection in high-resolution remote sensing images. To emphasize the important change features and suppress the irrelevant features in the spatial and channel dimensions, an adaptive fusion attention mechanism is designed for the concatenated features of bi-temporal images. Moreover, to improve the detection performance for different types of object changes, we proposed a layer-by-layer concatenated deep supervision strategy to generate a more accurate change map. The proposed method was evaluated in detail on two datasets, and ablation experiments confirmed the effectiveness of both the adaptive fusion attention mechanism and the deep supervision strategy. The results indicate that ADS-Net can accurately detect change areas of different complexity and outperforms other state-of-the-art methods on the comprehensive evaluation metrics F1-Score and Kappa coefficient. In future work, we will try to mark multiple types of changes, such as "disappeared", "newly built", and "damaged", and study network structures that can detect multiple change types. Moreover, remote sensing change detection of bi-temporal heterogeneous images is also an important research direction, including change detection between radar and optical images of the same area at different times.

CRediT authorship contribution statement

Decheng Wang: Methodology, Software, Writing - original draft. Xiangning Chen: Supervision. Mingyong Jiang: Validation, Visualization. Shuhan Du: Investigation. Bijie Xu: Validation. Junda Wang: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study is supported by the Preliminary Research of Equipment Program of China (305020506), Experimental Technology Research of China (421414323) and the Military Commission Science and Technology Committee Leading Fund (18-163-00-TS-004-080-01). We would like to express our gratitude to EditSprings (https://www.editsprings.com/) for the expert linguistic services provided.

References

Bruzzone, L., Bovolo, F., 2013. A novel framework for the design of change-detection systems for very-high-resolution remote sensing images. Proceedings of the IEEE 101 (3), 609–630. https://doi.org/10.1109/JPROC.2012.2197169.
Ji, S., Shen, Y., Lu, M., et al., 2019. Building instance change detection from large-scale aerial images using convolutional neural networks and simulated samples. Remote Sensing 11 (11), 1343. https://doi.org/10.3390/rs11111343.
Alcantarilla, P.F., et al., 2018. Street-view change detection with deconvolutional networks. Autonomous Robots 42 (7), 1301–1322. https://doi.org/10.1007/s10514-018-9734-5.
Qiao, H., Wan, X., Wan, Y., et al., 2020. A novel change detection method for natural disaster detection and segmentation from video sequence. Sensors 20 (18). https://doi.org/10.3390/s20185076.
Ye, S., Rogan, J., Zhu, Z., Eastman, J.R., 2021. A near-real-time approach for monitoring forest disturbance using Landsat time series: stochastic continuous change detection. Remote Sensing of Environment 252. https://doi.org/10.1016/j.rse.2020.112167.
Singh, A., 1986. Change detection in the tropical forest environment of northeastern India using Landsat. Remote Sensing and Tropical Land Management 44, 273–1254.
Howarth, P.J., Wickware, G.M., 2007. Procedures for change detection using Landsat digital data. International Journal of Remote Sensing 2 (3), 277–291. https://doi.org/10.1080/01431168108948362.
Zheng, Y., Zhang, X., Hou, B., et al., 2013. Using combined difference image and k-means clustering for SAR image change detection. IEEE Geoscience and Remote Sensing Letters 11 (3), 691–695. https://doi.org/10.1109/LGRS.2013.2275738.
Wang, Q., Yuan, Z., Du, Q., et al., 2018. GETNET: a general end-to-end 2-D CNN framework for hyperspectral image change detection. IEEE Transactions on Geoscience and Remote Sensing 57 (1), 3–13. https://doi.org/10.1109/TGRS.2018.2849692.


Yi-Quan, W., Zhao-Qing, C., Fei-Xiang, T., 2016. Change detection of multi-temporal remote sensing images based on contourlet transform and ICA. Chinese Journal of Geophysics 59 (4), 1284–1292. https://doi.org/10.1002/cjg2.20231.
Wang, B., Choi, S.K., Han, Y.K., et al., 2015. Application of IR-MAD using synthetically fused images for change detection in hyperspectral data. Remote Sensing Letters 6 (7–9), 578–586. https://doi.org/10.1080/2150704X.2015.1062155.
Singh, S., Talwar, R., 2015. Assessment of different CVA based change detection techniques using MODIS dataset. Mausam 66 (1), 77–86.
Zhao, He, 2011. Improving change vector analysis in multi-temporal space to detect land cover changes by using cross-correlogram spectral matching algorithm. In: IEEE Geoscience & Remote Sensing Symposium. https://doi.org/10.1109/IGARSS.2011.6048960.
Bovolo, F., et al., 2012. A framework for automatic and unsupervised detection of multiple changes in multitemporal images. IEEE Transactions on Geoscience and Remote Sensing 50 (6), 2196–2212. https://doi.org/10.1109/TGRS.2011.2171493.
Thonfeld, F., Feilhauer, H., et al., 2016. Robust Change Vector Analysis (RCVA) for multi-sensor very high resolution optical satellite data. International Journal of Applied Earth Observation & Geoinformation. https://doi.org/10.1016/j.jag.2016.03.009.
Wei, H., Jinliang, H., Lihui, W., et al., 2016. Remote sensing image change detection based on change vector analysis of PCA component. Remote Sensing for Land & Resources. https://doi.org/10.6046/gtzyyg.2016.01.04.
Wu, C., Du, B., Zhang, L.P., 2014. Slow feature analysis for change detection in multispectral imagery. IEEE Transactions on Geoscience and Remote Sensing 52, 2858–2874. https://doi.org/10.1109/TGRS.2013.2266673.
Han, Y., Javed, A., Jung, S., et al., 2020. Object-based change detection of very high resolution images by fusing pixel-based change detection results using weighted Dempster-Shafer theory. Remote Sensing 12 (6), 983. https://doi.org/10.3390/rs12060983.
Xin, W., Sicong, L., Peijun, D., et al., 2018. Object-based change detection in urban areas from high spatial resolution images based on multiple features and ensemble learning. Remote Sensing 10 (2), 276. https://doi.org/10.3390/rs10020276.
Zhang, Y., Peng, D., Huang, X., 2018. Object-based change detection for VHR images based on multiscale uncertainty analysis. IEEE Geoscience and Remote Sensing Letters 15 (1), 13–17. https://doi.org/10.1109/LGRS.2017.2763182.
Tan, K., Zhang, Y., Wang, X., et al., 2019. Object-based change detection using multiple classifiers and multi-scale uncertainty analysis. Remote Sensing 11 (3). https://doi.org/10.3390/rs11030359.
Zhan, Y., Fu, K., Yan, M., et al., 2017. Change detection based on deep Siamese convolutional network for optical aerial images. IEEE Geoscience and Remote Sensing Letters 14 (10), 1845–1849. https://doi.org/10.1109/LGRS.2017.2738149.
Wang, Q., Zhang, X., Chen, G., et al., 2018. Change detection based on Faster R-CNN for high-resolution remote sensing images. Remote Sensing Letters 9 (10–12), 923–932. https://doi.org/10.1080/2150704X.2018.1492172.
Peng, D., Zhang, Y., Guan, H., 2019. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sensing 11 (11), 1382. https://doi.org/10.3390/rs11111382.
Daudt, R.C., et al., 2018. Fully convolutional Siamese networks for change detection. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 4063–4067. https://doi.org/10.1109/ICIP.2018.8451652.
Zhang, C., et al., 2020. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing 166, 183–200. https://doi.org/10.1016/j.isprsjprs.2020.06.003.
Chen, H., Shi, Z., 2020. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing 12 (10), 1662. https://doi.org/10.3390/rs12101662.
Liu, R., Cheng, Z., Zhang, L., et al., 2019. Remote sensing image change detection based on information transmission and attention mechanism. IEEE Access. https://doi.org/10.1109/ACCESS.2019.2947286.
Wiratama, W., et al., 2020. Change detection on multi-spectral images based on feature-level U-Net. IEEE Access 8, 12279–12289. https://doi.org/10.1109/ACCESS.2020.2964798.
Wiratama, W., Sim, D., 2019. Fusion network for change detection of high-resolution panchromatic imagery. Applied Sciences 9 (7), 1441. https://doi.org/10.3390/app9071441.
Liu, J., Gong, M., Qin, A.K., et al., 2019. Bipartite differential neural network for unsupervised image change detection. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2019.2910571.
Li, X., Wang, W., Hu, X., et al., 2019. Selective kernel networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR.2019.00060.
Hu, J., Shen, L., Albanie, S., et al., 2019. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2019.2913372.
Zhang, H., Wang, M., Wang, F., et al., 2021. A novel squeeze-and-excitation W-Net for 2D and 3D building change detection with multi-source and multi-feature remote sensing data. Remote Sensing 13 (3), 440. https://doi.org/10.3390/rs13030440.
Saha, S., Bovolo, F., Bruzzone, L., 2019. Unsupervised deep change vector analysis for multiple-change detection in VHR images. IEEE Transactions on Geoscience and Remote Sensing. https://doi.org/10.1109/TGRS.2018.2886643.
Woo, S., Park, J., Lee, J.Y., et al., 2018. CBAM: convolutional block attention module. In: European Conference on Computer Vision (ECCV). https://doi.org/10.1007/978-3-030-01234-2_1.
Wang, Q., Wu, B., Zhu, P., et al., 2020. ECA-Net: efficient channel attention for deep convolutional neural networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR42600.2020.01155.
Lee, C.-Y., et al., 2015. Deeply-supervised nets. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 562–570.
Lin, T.Y., Goyal, P., Girshick, R., et al., 2017. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2999–3007. https://doi.org/10.1109/ICCV.2017.324.
Cui, Y., et al., 2019. Class-balanced loss based on effective number of samples. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9268–9277. https://doi.org/10.1109/CVPR.2019.00949.
Zhang, L., et al., 2020. A class imbalance loss for imbalanced object recognition. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13, 2778–2792. https://doi.org/10.1109/JSTARS.2020.2995703.
Lebedev, M.A., Vizilter, Y.V., Vygolov, O.V., Knyaz, V.A., Rubis, A.Y., 2018. Change detection in remote sensing images using conditional adversarial networks. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. XLII-2, 565–571. https://doi.org/10.5194/isprs-archives-XLII-2-565-2018.
El Amin, A.M., Liu, Q., Wang, Y., 2017. Zoom out CNNs features for optical remote sensing change detection. In: Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017, pp. 812–817. https://doi.org/10.1109/ICIVC.2017.7984667.

